DataProcessingFramework
DPF - a framework for processing and filtering multimodal datasets.
Installation
Install with pip:
Install from repository:
Optional extra requirements are available.
To install extra requirements, run (inserting the extras you need):
Overview
The framework supports the following features:
- Reading datasets
- Filtering datasets and calculating metrics with various models and algorithms (see the full list of filters)
- Efficiently transforming data such as videos and images
- Data filtering and transformation pipelines
- Converting datasets to other formats
- Validating datasets
- Support for various file systems (local, s3)
DPF allows you to easily filter datasets and add new metadata. You can use various filters and transformations on your data, create pipelines from them and run them efficiently and quickly. Basic code examples for filtering data are given below:
Basic example
Check out basic usage for more info about DPF's API.
This is a simple example for image deduplication and image aesthetic quality prediction. All filters in DPF extract attributes from the dataset's data and write them into metadata. You can then use these attributes to filter the data according to your needs.
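The pattern described above, where a filter writes an attribute into the metadata and a later step filters on it, can be sketched without DPF itself. The block below is a minimal illustration, assuming exact byte-level duplicates and a small in-memory list of rows. The helpers `add_hash_column` and `drop_duplicates` are hypothetical names, not part of DPF's API, and DPF's own deduplication filter may use a different hashing scheme.

```python
import hashlib


def add_hash_column(rows, data_key="image_bytes", hash_key="image_hash"):
    """Return new rows with a SHA-256 hex digest of the raw bytes stored under hash_key."""
    return [
        {**row, hash_key: hashlib.sha256(row[data_key]).hexdigest()}
        for row in rows
    ]


def drop_duplicates(rows, hash_key="image_hash"):
    """Keep only the first row seen for each hash value."""
    seen = set()
    unique = []
    for row in rows:
        if row[hash_key] not in seen:
            seen.add(row[hash_key])
            unique.append(row)
    return unique


# Toy in-memory "dataset" (hypothetical data): two samples share identical bytes.
dataset = [
    {"image_name": "a.jpg", "image_bytes": b"\x00\x01\x02"},
    {"image_name": "b.jpg", "image_bytes": b"\x00\x01\x02"},
    {"image_name": "c.jpg", "image_bytes": b"\xff\xfe"},
]

with_hashes = add_hash_column(dataset)       # step 1: write the attribute
deduplicated = drop_duplicates(with_hashes)  # step 2: filter on the attribute
print([row["image_name"] for row in deduplicated])
```

This sketch is independent of the simple_example.py script mentioned in this section; it only shows the "compute an attribute, then filter on it" idea.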
Run the simple_example.py file:
Generate captions example
The code below generates synthetic captions for images stored in shards on remote S3-compatible storage and updates the dataset's metadata without downloading the shards:
Before running the example below, install the extra requirements:
More examples are available in the project repository.
Supported data modalities
The framework supports data that has any combination of the following modalities:
- Text
- Image
- Video
Datasets in which one sample contains more than one item of the same modality are not supported. For example, datasets with the following modalities are supported: text-video, text-image, image-video, images only, etc. Modalities that are not supported include image2image, image-text-image, etc.
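The rule above (any combination of text, image and video, but at most one item of each modality per sample) can be expressed as a small check. This is a sketch for illustration only; `is_supported_combination` is a hypothetical helper and not a DPF function.

```python
from collections import Counter

# The three modalities listed in this section.
SUPPORTED_MODALITIES = {"text", "image", "video"}


def is_supported_combination(modalities):
    """True if every modality is known and none appears more than once per sample."""
    counts = Counter(modalities)
    all_known = set(counts) <= SUPPORTED_MODALITIES
    no_repeats = all(n == 1 for n in counts.values())
    # An empty sample has no data, so it is treated as unsupported here.
    return bool(counts) and all_known and no_repeats


print(is_supported_combination(["text", "video"]))           # text-video
print(is_supported_combination(["image", "image"]))          # image2image
print(is_supported_combination(["image", "text", "image"]))  # image-text-image
```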
Supported data formats
The dataset should be stored in one of the following formats:
- Files
- Shards
- Sharded files
Basic usage
Configs
To read a dataset, you must first create a config that describes the dataset and the type of data in it. For each data format, you need to use the appropriate config.
Example for shards format:
Reading a dataset
You can read a dataset using the following method:
Example for sharded files format:
Examples of reading data in other formats
Example of reading a dataset directly from S3 storage:
Viewing and updating dataset
A dataset processor provides an interface for interacting with data and modifying it.
Filtering dataset
Filters are models or algorithms that calculate metrics for a dataset. Filters process the data and add new columns with the calculated metrics.
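To make the "filter adds a column" idea concrete, here is a minimal plain-Python sketch. It assumes rows stored as dictionaries and uses a trivial word count as a stand-in for a real model. `apply_filter` and `caption_word_count` are hypothetical names chosen for illustration and do not reflect DPF's actual filter classes.

```python
def apply_filter(rows, metric_fn, column_name):
    """Compute metric_fn for each row and store the result in a new column."""
    return [{**row, column_name: metric_fn(row)} for row in rows]


def caption_word_count(row):
    """Stand-in metric: number of whitespace-separated words in the caption."""
    return len(row["caption"].split())


# Toy metadata table (hypothetical data).
dataset = [
    {"image_name": "0.jpg", "caption": "a cat"},
    {"image_name": "1.jpg", "caption": "a small dog running on the beach"},
    {"image_name": "2.jpg", "caption": ""},
]

# The "filter" only adds a metric column; it does not drop anything by itself.
scored = apply_filter(dataset, caption_word_count, "caption_words")

# Selecting rows by the new column is a separate, explicit step.
kept = [row for row in scored if row["caption_words"] >= 2]
print([row["image_name"] for row in kept])
```

Keeping metric calculation separate from row selection means the same column can be reused with different thresholds without recomputing it.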
Transforming dataset
DPF can transform the data in a dataset, for example by resizing its videos or images.
You can use for these tasks.
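As a rough illustration of the arithmetic behind a resize transform, the sketch below computes a target size that keeps the aspect ratio and caps the longer side. It is an assumption about how such a transform could work, not a description of DPF's implementation, and `fit_within` is a hypothetical helper.

```python
def fit_within(width, height, max_side):
    """Scale (width, height) so the longer side is at most max_side, keeping aspect ratio.

    Sizes already within the limit are returned unchanged (no upscaling).
    """
    if width <= 0 or height <= 0 or max_side <= 0:
        raise ValueError("width, height and max_side must be positive")
    longer = max(width, height)
    if longer <= max_side:
        return width, height
    scale = max_side / longer
    # Round to the nearest pixel, but never let a side collapse to zero.
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(1920, 1080, 768))  # landscape frame, downscaled
print(fit_within(640, 480, 768))    # already small enough, unchanged
```

The resulting size could then be passed to whichever image or video library performs the actual resampling.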
Pipelines
A pipeline combines several filters and transformations into a single sequence and processes the dataset with it. For example:
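The general idea of chaining stages can be sketched in plain Python. The `Pipeline` class below is a hypothetical, simplified stand-in written for this illustration; its method names are assumptions and should not be read as DPF's pipeline API.

```python
class Pipeline:
    """Runs named stages in order; each stage maps a list of rows to a list of rows."""

    def __init__(self):
        self.stages = []

    def add_stage(self, name, fn):
        self.stages.append((name, fn))
        return self  # returning self allows chained add_stage calls

    def run(self, rows):
        for name, fn in self.stages:
            rows = fn(rows)
            print(f"{name}: {len(rows)} rows")
        return rows


def add_caption_length(rows):
    """Metric stage: add the caption length in characters as a new column."""
    return [{**r, "caption_len": len(r["caption"])} for r in rows]


def drop_short_captions(rows, min_len=5):
    """Selection stage: keep rows whose caption has at least min_len characters."""
    return [r for r in rows if r["caption_len"] >= min_len]


def drop_duplicate_captions(rows):
    """Selection stage: keep the first row for each distinct caption."""
    seen, unique = set(), []
    for r in rows:
        if r["caption"] not in seen:
            seen.add(r["caption"])
            unique.append(r)
    return unique


# Toy metadata table (hypothetical data).
dataset = [
    {"image_name": "0.jpg", "caption": "a cat on a sofa"},
    {"image_name": "1.jpg", "caption": "dog"},
    {"image_name": "2.jpg", "caption": "a cat on a sofa"},
]

pipeline = (
    Pipeline()
    .add_stage("caption length", add_caption_length)
    .add_stage("drop short captions", drop_short_captions)
    .add_stage("deduplicate captions", drop_duplicate_captions)
)
result = pipeline.run(dataset)
```

Stage order matters here: the selection stage reads the column written by the metric stage before it.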