# DataProcessingFramework

DPF is a framework for processing and filtering multimodal datasets.
## Installation

Install with pip:

```bash
pip install git+https://github.com/ai-forever/DataProcessingFramework
```

Install from the repository:

```bash
git clone https://github.com/ai-forever/DataProcessingFramework
cd DataProcessingFramework
pip install .
```
Extra requirements: `filters`, `dev`, `llava`, `video_llava`, `lita`.

To install extra requirements, run:

```bash
pip install .[dev,filters]
```

(insert the extra requirements you need)
## Overview

The framework supports the following features:

- Reading datasets
- Filtering datasets and calculating metrics using various models and algorithms. The full list of filters can be found in the documentation
- Efficiently transforming data such as videos and images
- Data filtering and transformation pipelines
- Converting datasets to other formats
- Validating datasets
- Support for various file systems (local, S3)
DPF allows you to easily filter datasets and add new metadata. You can apply various filters and transformations to your data, combine them into pipelines, and run them efficiently. Basic code examples for filtering data are given below.
### Basic example

Check out the basic usage section below for more info about DPF's API.

This is a simple example of image deduplication and image aesthetic quality prediction. All filters in DPF extract attributes from the dataset's data and write them into its metadata. You can then use these attributes to filter the data according to your needs.
```python
from DPF import ShardsDatasetConfig, DatasetReader

# creating config for dataset
config = ShardsDatasetConfig.from_path_and_columns(
    'examples/example_dataset',
    image_name_col='image_name',
    text_col='caption'
)

# reading dataset's metadata
reader = DatasetReader()
processor = reader.read_from_config(config)

from DPF.filters.images.hash_filters import PHashFilter

datafilter = PHashFilter(sim_hash_size=8, workers=16)  # creating PHash filter
# calculating PHash; a new column "image_phash_8" will be added
processor.apply_data_filter(datafilter)

print('Dataset length before deduplication:', len(processor))
processor.filter_df(~processor.df['image_phash_8'].duplicated())
print('Dataset length after deduplication:', len(processor))

from DPF.filters.images.aesthetic_improved_filter import ImprovedAestheticFilter

datafilter = ImprovedAestheticFilter(
    weights_folder='../weights',  # weights will be downloaded to this folder
    device='cuda:0',
    workers=16
)
processor.apply_data_filter(datafilter)

print(processor.df)  # printing the new dataset metadata
```
Run the simple_example.py file:

```bash
python simple_example.py
```
### Generate captions example

The code below generates synthetic captions for images in shards on a remote S3-compatible storage and updates the dataset's metadata without downloading the shards.

Before running the example, install the extra requirements:

```bash
pip install DPF[filters,llava]
```
```python
from DPF import S3Connector, DatasetReader, ShardsDatasetConfig

# creating connector for S3 storage
connector = S3Connector(
    key='access_key',
    secret='secret_key',
    endpoint_url='endpoint_url'
)

reader = DatasetReader(connector)

# creating dataset config
config = ShardsDatasetConfig.from_path_and_columns(
    "s3://your-bucket/path/to/shards",
    image_name_col='image_name',
)
# reading a dataset
processor = reader.read_from_config(config, workers=16)

from DPF.filters.images.llava_captioning_filter import LLaVaCaptioningFilter

# creating LLaVA captioner filter
datafilter = LLaVaCaptioningFilter(
    workers=16,
    prompt='short',
    batch_size=16,
    device="cuda:0"
)
print(datafilter.result_columns)  # prints the list of columns that will be added
# applying the filter to the dataset creates new metadata
processor.apply_data_filter(datafilter)

new_column_name = datafilter.result_columns[1]  # name of the new column with generated captions
print(processor.df[new_column_name])  # prints generated image captions

# adding the new metadata to the remote dataset
processor.update_columns([new_column_name], workers=16)
```
You can find more examples in the examples folder of the repository.
## Supported data modalities

The framework supports data with any combination of the following modalities:

- Text
- Image
- Video

Datasets with several items of the same modality in one sample are not supported. For example, the following modality combinations are supported: text-video, text-image, image-video, image, etc. Combinations such as image2image or image-text-image are not supported.
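The modality rule above can be sketched as a small check. This is a hypothetical helper for illustration, not part of DPF's API: a combination is supported only if every modality is one of the three supported ones and none of them repeats.

```python
# Hypothetical helper illustrating the modality rule (not part of DPF's API).
SUPPORTED_MODALITIES = {"text", "image", "video"}

def is_supported_combination(modalities: list[str]) -> bool:
    """Return True if every modality is known and no modality repeats."""
    return (
        all(m in SUPPORTED_MODALITIES for m in modalities)
        and len(set(modalities)) == len(modalities)
    )

print(is_supported_combination(["text", "video"]))   # True: text-video
print(is_supported_combination(["image", "image"]))  # False: image2image
```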
## Supported data formats

A dataset should be stored in one of the following formats:

- Files
- Shards
- Sharded files
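As a rough orientation, the formats differ in how sample files and their metadata are laid out on disk. A sketch of the shards layout is shown below; treat the exact file naming as an assumption and refer to the repository's documentation for the authoritative description:

```
example_dataset/   # shards format
├── 0.tar          # shard archive with image/video files
├── 0.csv          # metadata for the files in 0.tar
├── 1.tar
└── 1.csv
```

In the sharded files format, each shard is a folder of files rather than a tar archive; in the files format, a single csv lists the paths to the data files.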
## Basic usage

### Configs

To read a dataset, you must first create a config that describes the dataset and the type of data in it. Each data format has its own config class.

Example for the shards format:

```python
from DPF import ShardsDatasetConfig

config = ShardsDatasetConfig.from_path_and_columns(
    'examples/example_dataset',   # path to shards
    image_name_col='image_name',  # name of the csv column with image names
    text_col='caption'            # name of the csv column with texts/captions
)
```
### Reading a dataset

You can read a dataset using the DatasetReader.read_from_config method:

```python
from DPF import ShardsDatasetConfig, DatasetReader

config = ShardsDatasetConfig.from_path_and_columns(
    'examples/example_dataset',
    image_name_col='image_name',
    text_col='caption'
)

reader = DatasetReader()
processor = reader.read_from_config(config)
```
Example for the sharded files format:

```python
from DPF import ShardedFilesDatasetConfig, DatasetReader

config = ShardedFilesDatasetConfig.from_path_and_columns(
    'examples/example_video_dataset',
    video_name_col='video_name',
    text_col='caption'
)

reader = DatasetReader()
processor = reader.read_from_config(config)
```
Examples of reading data in other formats can be found in the documentation.

Example of reading a dataset directly from S3 storage:

```python
from DPF import S3Connector, DatasetReader, ShardsDatasetConfig

connector = S3Connector(
    key='access_key',
    secret='secret_key',
    endpoint_url='endpoint_url'
)
reader = DatasetReader(connector)

config = ShardsDatasetConfig.from_path_and_columns(
    "s3://your-bucket/path/to/shards",
    image_name_col='image_name',
)
processor = reader.read_from_config(config, workers=16)
```
### Viewing and updating dataset

A dataset processor provides an interface for interacting with the data and modifying it.
### Filtering dataset

Filters are models or algorithms that calculate metrics for a dataset. Filters process the data and add new columns with the calculated metrics.
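The filter pattern described above can be mimicked with a minimal stand-in in plain Python (illustrative only, not DPF code): a "filter" computes a metric for each sample and writes it into a new metadata column, which can then be used to drop rows, much like the PHash deduplication in the basic example.

```python
# Minimal stand-in for DPF's filter pattern (illustrative, not DPF code).
rows = [
    {"image_name": "a.jpg", "caption": "a cat"},
    {"image_name": "b.jpg", "caption": "a dog"},
    {"image_name": "c.jpg", "caption": "a cat"},
]

def apply_data_filter(rows, column, metric_fn):
    """Add a new metadata column computed from each row, like a DPF filter does."""
    for row in rows:
        row[column] = metric_fn(row)

# toy "hash" metric (the caption itself), standing in for a perceptual hash
apply_data_filter(rows, "caption_hash", lambda r: r["caption"])

# keep only the first row per hash value, analogous to deduplicating on image_phash_8
seen = set()
deduped = []
for row in rows:
    if row["caption_hash"] not in seen:
        seen.add(row["caption_hash"])
        deduped.append(row)

print(len(rows), "->", len(deduped))  # 3 -> 2
```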
### Transforming dataset

You can transform the data in a dataset with DPF, for example, resize the videos or photos it contains. Use the DPF.transforms module for these tasks.
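The transforms themselves require the library and media codecs, but the core resize arithmetic can be sketched standalone. This is a hypothetical helper assuming a min-side resize policy (scale an image so its shorter side reaches a target size while preserving the aspect ratio); it is not DPF's API.

```python
def resize_to_min_side(width: int, height: int, min_side: int) -> tuple[int, int]:
    """Compute new dimensions so the shorter side equals min_side,
    preserving the aspect ratio (rounded to whole pixels)."""
    scale = min_side / min(width, height)
    return round(width * scale), round(height * scale)

print(resize_to_min_side(1920, 1080, 512))  # (910, 512)
```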
### Pipelines

Pipelines let you combine several filters into a single pipeline and process the dataset with it. For example:
```python
from DPF.configs import ShardsDatasetConfig
from DPF.dataset_reader import DatasetReader
from DPF.pipelines import FilterPipeline
from DPF.filters.images.info_filter import ImageInfoFilter
from DPF.filters.images.hash_filters import PHashFilter

reader = DatasetReader()
config = ShardsDatasetConfig.from_path_and_columns(
    "examples/example_dataset",
    image_name_col='image_name',
)
processor = reader.read_from_config(config, workers=4)

pipeline = FilterPipeline("pipeline_example")
pipeline.add_datafilter(
    ImageInfoFilter,
    {'workers': 4},
    processor_run_kwargs={'return_none_on_error': True},
)
pipeline.add_datafilter(PHashFilter, {'workers': 4})
pipeline.add_deduplication(["image_phash_8"])
pipeline.add_shuffle()
pipeline.run(processor)
```