slovo
Slovo - Russian Sign Language Dataset
We introduce Slovo, a large-scale video dataset for the Russian Sign Language recognition task. The dataset is about 16 GB in size and contains 20400 RGB videos for 1000 sign language gestures from 194 signers. Each class has 20 samples. The dataset is divided into a training set and a test set by subject. The training set includes 15300 videos, and the test set includes 5100 videos. The total video recording time is ~9.2 hours. About 35% of the videos are recorded in HD format, and 65% are in FullHD resolution. The average length of a video with a gesture is 50 frames.
For more information see our paper - arXiv.
Downloads
Main download link
Downloads | Size (GB) | Comment |
---|---|---|
Slovo | ~16 | Trimmed HD+ videos by annotations |
Origin | ~105 | Original HD+ videos from mining stage |
360p | ~13 | Original videos resized by the minimum side |
Landmarks | ~1.2 | Mediapipe hand landmark annotations for each frame of trimmed videos |
Also, you can download Slovo from Kaggle.
The annotation file is easy to use and contains several useful columns, as this excerpt shows:
index | attachment_id | user_id | width | height | length | text | train | begin | end |
---|---|---|---|---|---|---|---|---|---|
0 | de81cc1c-... | 1b... | 1440 | 1920 | 14 | привет | True | 30 | 45 |
1 | 3c0cec5a-... | 64... | 1440 | 1920 | 32 | утро | False | 43 | 66 |
2 | d17ca986-... | cf... | 1920 | 1080 | 44 | улица | False | 12 | 31 |
where:
- attachment_id - video file name
- user_id - unique anonymized user ID
- width - video width
- height - video height
- length - video length
- text - gesture class in Russian
- train - train or test boolean flag
- begin - start of the gesture (for original dataset)
- end - end of the gesture (for original dataset)
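As a minimal sketch of how these columns might be consumed, the snippet below parses the three sample rows shown above and splits them by the train flag. The tab delimiter, the inline `SAMPLE` string, and the `load_annotations` helper are assumptions made for illustration; the IDs stay truncated exactly as in the excerpt.

```python
import csv
import io

# The three rows from the excerpt above. IDs are truncated in the source and
# are kept truncated here. The tab delimiter is an assumption.
SAMPLE = (
    "attachment_id\tuser_id\twidth\theight\tlength\ttext\ttrain\tbegin\tend\n"
    "de81cc1c-...\t1b...\t1440\t1920\t14\tпривет\tTrue\t30\t45\n"
    "3c0cec5a-...\t64...\t1440\t1920\t32\tутро\tFalse\t43\t66\n"
    "d17ca986-...\tcf...\t1920\t1080\t44\tулица\tFalse\t12\t31\n"
)


def load_annotations(handle, delimiter="\t"):
    """Hypothetical helper: read annotation rows and convert column types."""
    rows = []
    for row in csv.DictReader(handle, delimiter=delimiter):
        for key in ("width", "height", "length", "begin", "end"):
            row[key] = int(row[key])
        row["train"] = row["train"] == "True"
        rows.append(row)
    return rows


rows = load_annotations(io.StringIO(SAMPLE))
train_rows = [r for r in rows if r["train"]]
test_rows = [r for r in rows if not r["train"]]

# The split is by subject, so no user_id should appear on both sides.
shared_users = {r["user_id"] for r in train_rows} & {r["user_id"] for r in test_rows}
```

For the real annotation file, replace `io.StringIO(SAMPLE)` with an opened file handle (UTF-8 encoded) and adjust the delimiter if it differs.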
For convenience, we have also prepared a compressed version of the dataset, in which all videos are resized by the minimum side. Download link - slovo360p.
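To illustrate what resizing by the minimum side means, here is a small sketch that scales a frame so its shorter side becomes 360 pixels while keeping the aspect ratio. The target of 360 is inferred from the "360p" name, and both the `resize_by_min_side` helper and the rounding rule are assumptions, not the authors' actual preprocessing code.

```python
def resize_by_min_side(width, height, min_side=360):
    """Hypothetical helper: scale (width, height) so the shorter side equals min_side."""
    scale = min_side / min(width, height)
    # Rounding to the nearest integer is an assumption; encoders often
    # additionally require even dimensions.
    return round(width * scale), round(height * scale)


portrait = resize_by_min_side(1440, 1920)   # sizes taken from the annotation excerpt
landscape = resize_by_min_side(1920, 1080)
```

With the sizes from the annotation excerpt, a 1440x1920 video becomes 360x480 and a 1920x1080 video becomes 640x360.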
We also annotated the trimmed videos with MediaPipe and provide the hand keypoints in this annotation file.
Models
We provide several pre-trained models as baselines for Russian sign language recognition. We tested each model with 16, 32, and 48 input frames; the best configuration for each frame count is listed below. The first number in the model name is the number of frames and the second is the frame interval.
Model Name | Model Size (MB) | Metric | ONNX | TorchScript |
---|---|---|---|---|
MViTv2-small-16-4 | 140.51 | 58.35 | weights | weights |
MViTv2-small-32-2 | 140.79 | 64.09 | weights | weights |
MViTv2-small-48-2 | 141.05 | 62.18 | weights | weights |
Swin-large-16-3 | 821.65 | 48.04 | weights | weights |
Swin-large-32-2 | 821.74 | 54.84 | weights | weights |
Swin-large-48-1 | 821.78 | 55.66 | weights | weights |
ResNet-i3d-16-3 | 146.43 | 32.86 | weights | weights |
ResNet-i3d-32-2 | 146.43 | 38.38 | weights | weights |
ResNet-i3d-48-1 | 146.43 | 43.91 | weights | weights |
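The naming scheme in the table above can be turned into frame indices roughly as follows. This is a sketch under stated assumptions: `parse_model_name` and `sample_frame_indices` are hypothetical helpers, and both the starting position and the way clips shorter than the sampling span are handled (repeating the last frame here) may differ from the authors' pipeline.

```python
def parse_model_name(name):
    """Split e.g. 'MViTv2-small-32-2' into (backbone, frames, interval)."""
    backbone, frames, interval = name.rsplit("-", 2)
    return backbone, int(frames), int(interval)


def sample_frame_indices(total_frames, num_frames, interval, start=0):
    """Take every `interval`-th frame, `num_frames` times, starting at `start`.

    Indices past the end of the clip are clamped to the last frame.
    This padding rule is an assumption made for illustration.
    """
    last = total_frames - 1
    return [min(start + i * interval, last) for i in range(num_frames)]


backbone, num_frames, interval = parse_model_name("MViTv2-small-32-2")
# 50 frames is the average gesture length reported for the dataset.
indices = sample_frame_indices(total_frames=50, num_frames=num_frames, interval=interval)
```

Note that 32 frames at an interval of 2 span 64 frames, which is longer than the 50-frame average clip, so some padding or looping strategy is needed in practice.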
SignFlow models
Model Name | Desc | ONNX | Params |
---|---|---|---|
SignFlow-A | 63.3 Top-1 Acc on WLASL-2000 (SOTA) | weights | 36M |
SignFlow-R | Pre-trained on ~50000 samples, has 267 classes, tested with GigaChat (as-is and context-based modes) | weights | 37M |
Demo
usage: demo.py [-h] -p CONFIG [--mp] [-v] [-l LENGTH]

optional arguments:
  -h, --help            show this help message and exit
  -p CONFIG, --config CONFIG
                        Path to config
  --mp                  Enable multiprocessing
  -v, --verbose         Enable logging
  -l LENGTH, --length LENGTH
                        Deque length for predictions

python demo.py -p <PATH_TO_CONFIG>
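The `-l LENGTH` option sets a deque length for predictions. How demo.py uses that deque is not described here, so the following is only one plausible reading: keep the most recent predictions in a fixed-length deque and report the most frequent label. The `PredictionSmoother` class is hypothetical and not part of the repository.

```python
from collections import Counter, deque


class PredictionSmoother:
    """Hypothetical majority vote over the last `length` predicted labels."""

    def __init__(self, length):
        # A deque with maxlen drops the oldest item automatically.
        self.window = deque(maxlen=length)

    def update(self, label):
        self.window.append(label)
        # most_common(1) returns the label with the highest count in the window.
        return Counter(self.window).most_common(1)[0][0]


smoother = PredictionSmoother(length=3)
stream = ["привет", "привет", "утро", "утро"]
smoothed = [smoother.update(label) for label in stream]
```

A longer deque gives steadier output at the cost of reacting more slowly when the signer moves on to the next gesture.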
Authors and Credits
Citation
You can cite the paper using the following BibTeX entry:
@inproceedings{kapitanov2023slovo,
title={Slovo: Russian Sign Language Dataset},
author={Kapitanov, Alexander and Kvanchiani, Karina and Nagaev, Alexander and Petrova, Elizaveta},
booktitle={International Conference on Computer Vision Systems},
pages={63--73},
year={2023},
organization={Springer}
}
Links
License
This work is licensed under a variant of the Creative Commons Attribution-ShareAlike 4.0 International License.
Please see the specific license.