DataProcessingFramework

DPF - a framework for processing and filtering multimodal datasets.

Installation
Overview
Basic usage

Installation

Install with pip:

Install from repository:

Extra requirements:

filters
dev
llava
video_llava
lita

To install extra requirements, run:

pip install ".[dev,filters]"

(replace dev,filters with the extras you need; the quotes keep shells such as zsh from expanding the brackets)

Overview

The framework supports the following features:

Reading datasets
Filtering datasets and calculating metrics using different models and algorithms. The full list of filters is available in the project repository
Efficiently transforming data such as videos and images
Data filtering and transformation pipelines
Converting datasets to other formats
Validating datasets
Working with various file systems (local, S3)

DPF allows you to easily filter datasets and add new metadata. You can use various filters and transformations on your data, create pipelines from them and run them efficiently and quickly. Basic code examples for filtering data are given below:

Basic example

Check out basic usage for more info about DPF's API.

This is a simple example for image deduplication and image aesthetic quality prediction. All filters in DPF extract attributes from the dataset's data and write them into metadata. You can then use these attributes to filter the data according to your needs.
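To make the "extract attributes, then filter on them" idea concrete, here is a minimal sketch in plain Python. It does not use DPF's API: the sample data, the `content_hash` and `toy_score` helpers, and the 0.02 threshold are hypothetical stand-ins (an exact SHA-256 hash in place of a perceptual image hash, a byte average in place of an aesthetic model).

```python
import hashlib

# Hypothetical in-memory "dataset": sample name -> raw bytes standing in for image data.
samples = {
    "a.jpg": b"\x01\x02\x03",
    "b.jpg": b"\x01\x02\x03",      # exact duplicate of a.jpg
    "c.jpg": b"\x09\x08\x07\x06",
}

# Metadata table: one dict per sample, like rows of a table.
metadata = [{"image_name": name} for name in samples]


def content_hash(data: bytes) -> str:
    """Exact-content hash; a stand-in for a perceptual image hash."""
    return hashlib.sha256(data).hexdigest()


def toy_score(data: bytes) -> float:
    """Placeholder quality score in [0, 1]; a real filter would run a model."""
    return sum(data) / (255 * len(data))


# Step 1: "filters" extract attributes and write them into metadata columns.
for row in metadata:
    data = samples[row["image_name"]]
    row["hash"] = content_hash(data)
    row["score"] = toy_score(data)

# Step 2: use the new columns to select data.
# Deduplicate by keeping the first row seen for each hash value.
seen = set()
deduplicated = []
for row in metadata:
    if row["hash"] not in seen:
        seen.add(row["hash"])
        deduplicated.append(row)

# Then keep only rows whose score passes an (arbitrary) threshold.
kept = [row for row in deduplicated if row["score"] >= 0.02]
```

The point of the two-step shape is that the computed columns stay in the metadata, so different thresholds can be tried later without recomputing the attributes.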

Run the simple_example.py file:

Generate captions example

The code below generates synthetic captions for images stored in shards on remote S3-compatible storage and updates the dataset's metadata without downloading the shards:

Before running the example below, install the extra requirements:

pip install DPF[filters,llava]

You can find more examples in the project repository.

Supported data modalities

The framework supports data that has any combination of the following modalities:

Text
Image
Video

Datasets in which a single sample contains several items of the same modality are not supported. For example, datasets with the following modalities are supported: text-video, text-image, image-video, images, etc. Modalities that are not supported include image2image, image-text-image, etc.
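The rule above can be expressed as a small check. This is a sketch only, not part of DPF: the `is_supported` function and the lowercase modality names are assumptions made for illustration.

```python
from collections import Counter

# Modalities named in this README; the lowercase spelling is an assumption.
SUPPORTED_MODALITIES = {"text", "image", "video"}


def is_supported(sample_modalities: list[str]) -> bool:
    """Return True if every modality is known and none appears more than once."""
    counts = Counter(sample_modalities)
    if not counts:
        return False  # a sample with no data at all is not a valid dataset sample
    all_known = set(counts) <= SUPPORTED_MODALITIES
    no_repeats = all(n == 1 for n in counts.values())
    return all_known and no_repeats
```

For instance, `is_supported(["text", "video"])` is true, while `is_supported(["image", "text", "image"])` is false because the image modality appears twice in one sample.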

Supported data formats

The dataset should be stored in one of the following formats:

Files
Shards
Sharded files

More about data formats

Basic usage

Configs

To read a dataset, you must first create a config that describes the dataset and the type of data in it. For each data format, you need to use the appropriate config.

Example for shards format:

Reading a dataset

You can read a dataset using the DatasetReader.from_config method:

Example for sharded files format:

Examples of reading data in other formats

Example reading a dataset directly from S3 storage:

Viewing and updating dataset

A dataset processor provides an interface for interacting with data and modifying it.

More about dataset processor

Filtering dataset

Filters are models or algorithms that calculate metrics for a dataset. Filters process the data and add new columns with the calculated metrics.

More about filters

Transforming dataset

You can transform the data in a dataset with DPF, for example by resizing its videos or images. Use DPF.transforms for these tasks.
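As an illustration of what a resize transform has to compute, the sketch below derives the output size when the shorter side is scaled to a target length and the aspect ratio is preserved. This is plain arithmetic, not the DPF.transforms API: the `fit_min_side` helper and the "shorter side" policy are assumptions chosen for the example.

```python
def fit_min_side(width: int, height: int, target: int) -> tuple[int, int]:
    """Scale (width, height) so the shorter side equals `target`, keeping aspect ratio."""
    if width <= 0 or height <= 0 or target <= 0:
        raise ValueError("width, height and target must be positive")
    scale = target / min(width, height)
    # Round to whole pixels and never return a zero-sized dimension.
    return max(1, round(width * scale)), max(1, round(height * scale))
```

A 1920x1080 frame with a target of 540 comes out as 960x540, and the same frame in portrait orientation (1080x1920) comes out as 540x960.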

More about transforms

Pipelines

Pipelines combine several filters into a single sequence and run the dataset through it. For example:
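The following is a library-agnostic sketch of the pipeline idea in plain Python, not DPF's pipeline API. The names `add_column`, `keep_if`, and `run_pipeline` are hypothetical; each stage either adds a metadata column or drops rows, and the stages run in order.

```python
from typing import Callable

Row = dict
# A stage takes the current rows and returns the rows to pass on.
Stage = Callable[[list[Row]], list[Row]]


def add_column(name: str, fn: Callable[[Row], object]) -> Stage:
    """Build a stage that computes a new metadata column for every row."""
    def stage(rows: list[Row]) -> list[Row]:
        # Copy each row so the input rows are left unmodified.
        return [{**row, name: fn(row)} for row in rows]
    return stage


def keep_if(predicate: Callable[[Row], bool]) -> Stage:
    """Build a stage that drops rows for which the predicate is false."""
    def stage(rows: list[Row]) -> list[Row]:
        return [row for row in rows if predicate(row)]
    return stage


def run_pipeline(rows: list[Row], stages: list[Stage]) -> list[Row]:
    """Apply the stages in order, feeding each stage's output into the next."""
    for stage in stages:
        rows = stage(rows)
    return rows


rows = [
    {"caption": "a cat on a sofa"},
    {"caption": "dog"},
    {"caption": "two birds over the sea at dawn"},
]
stages = [
    add_column("n_words", lambda r: len(r["caption"].split())),
    keep_if(lambda r: r["n_words"] >= 3),
]
result = run_pipeline(rows, stages)
```

Here the one-word caption is dropped, and the two remaining rows carry the new `n_words` column. Keeping stages as independent functions makes it easy to reorder them or reuse one stage in several pipelines.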

More about pipelines

Igor Pavlov
Merge pull request #57 from ai-forever/dev
2 years ago
b468d82
Not verified
