HaGRID - HAnd Gesture Recognition Image Dataset

We introduce HaGRIDv2 (HAnd Gesture Recognition Image Dataset), a large image dataset for hand gesture recognition (HGR) systems. You can use it for image classification or detection tasks. The dataset makes it possible to build HGR systems for video conferencing services (Zoom, Skype, Discord, Jazz, etc.), home automation systems, the automotive sector, and other applications. We have also released an algorithm for dynamic gesture recognition, described in our paper. The model is trained entirely on HaGRIDv2 and recognizes dynamic gestures even though it is trained exclusively on static ones. You can find it in our repository.

HaGRIDv2 is 1.5 TB in size and contains 1,086,158 FullHD RGB images divided into 33 gesture classes plus a new separate `no_gesture` class containing domain-specific natural hand postures. In addition, some images carry a `no_gesture` label when a second, gesture-free hand is in the frame. This extra class contains 2,164 samples. The data were split by subject `user_id` into training (76%), validation (9%) and test (15%) sets, with 821,458 images for training, 99,200 for validation and 165,500 for testing.
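A subject-wise split like the one described above keeps every image of a given `user_id` in a single subset, so no person appears in both training and evaluation data. The sketch below shows one hypothetical way to produce such a split by hashing the `user_id`. The function name `assign_subset` and the hashing scheme are assumptions for illustration, not the procedure used to build the official splits, and the fractions apply to subjects rather than images.

```python
import hashlib


def assign_subset(user_id: str, val_frac: float = 0.09, test_frac: float = 0.15) -> str:
    """Deterministically map a subject to 'train', 'val' or 'test'.

    Hypothetical helper: the default fractions mirror the 76% / 9% / 15%
    proportions quoted above, but the official split was not necessarily
    built this way.
    """
    # Hash the subject id to a stable number in [0, 1).
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000
    if bucket < test_frac:
        return "test"
    if bucket < test_frac + val_frac:
        return "val"
    return "train"


# Made-up (image, user_id) pairs: images of the same subject share a subset.
samples = [("img_a", "user_1"), ("img_b", "user_1"), ("img_c", "user_2")]
split = {image: assign_subset(user) for image, user in samples}
```

Hashing makes the assignment reproducible without storing a lookup table, at the cost of only approximating the target proportions.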

The dataset contains 65,977 unique persons and at least as many unique scenes. The subjects are people over 18 years old. The dataset was collected mainly indoors with considerable variation in lighting, including artificial and natural light. It also includes images taken in extreme conditions, such as facing a window or standing with one's back to it. The subjects had to show gestures at a distance of 0.5 to 4 meters from the camera.

Example of a sample and its annotation:

[image: example]

For more information, see our arXiv paper.

🔥 Changelog

  • 2025/02/27: We release the Dynamic Gesture Recognition algorithm. 🙋
    • Introduced a novel algorithm that enables dynamic gesture recognition while being trained exclusively on static gestures
    • Fully trained on the HaGRIDv2-1M dataset
    • Designed for real-time applications in video conferencing, smart home control, automotive systems, and more
    • Open-source implementation with pretrained models available in the repository
  • 2024/09/24: We release HaGRIDv2. 🙏
    • The HaGRID dataset has been expanded with 15 new gesture classes, including two-handed gestures
    • New class `no_gesture` with domain-specific natural hand postures was added (2,164 samples, split into train/val/test with 1,464, 200 and 500 images, respectively)
    • Extra class `no_gesture` contains 200,390 bounding boxes
    • Added new models for gesture detection, hand detection and full-frame classification
    • Dataset size is 1.5 TB
    • 1,086,158 FullHD RGB images
    • Train/val/test split: (821,458) 76% / (99,200) 9% / (165,500) 15% by subject `user_id`
    • 65,977 unique persons
  • 2023/09/21: We release HaGRID 2.0. ✌️
    • All files for training and testing are combined into one directory
    • The data was further cleaned and new samples were added
    • Multi-GPU training and testing
    • Added new models for detection and full-frame classification
    • Dataset size is 723 GB
    • 554,800 FullHD RGB images (cleaned and updated classes, added diversity by race)
    • Extra class `no_gesture` contains 120,105 samples
    • Train/val/test split: (410,800) 74% / (54,000) 10% / (90,000) 16% by subject `user_id`
    • 37,583 unique persons
  • 2022/06/16: HaGRID (Initial Dataset) 💪
    • Dataset size is 716 GB
    • 552,992 FullHD RGB images divided into 18 classes
    • Extra class `no_gesture` contains 123,589 samples
    • Train/test split: (509,323) 92% / (43,669) 8% by subject `user_id`
    • 34,730 unique persons from 18 to 65 years old
    • The distance is 0.5 to 4 meters from the camera

Installation

Clone the repository and install the required Python packages:

Downloads

Because of its large size, the train dataset is split into 34 archives, one per gesture. Download and unzip them from the following links:

Dataset

| Gesture | Size | Gesture | Size | Gesture | Size |
|---|---|---|---|---|---|
| call | 37.2 GB | peace | 41.4 GB | grabbing | 48.7 GB |
| dislike | 40.9 GB | peace_inverted | 40.5 GB | grip | 48.6 GB |
| fist | 42.3 GB | rock | 41.7 GB | hand_heart | 39.6 GB |
| four | 43.1 GB | stop | 41.8 GB | hand_heart2 | 42.6 GB |
| like | 42.2 GB | stop_inverted | 41.4 GB | holy | 52.7 GB |
| mute | 43.2 GB | three | 42.2 GB | little_finger | 48.6 GB |
| ok | 42.5 GB | three2 | 40.2 GB | middle_finger | 50.5 GB |
| one | 42.7 GB | two_up | 41.8 GB | point | 50.4 GB |
| palm | 43.0 GB | two_up_inverted | 40.9 GB | take_picture | 37.3 GB |
| three3 | 54 GB | three_gun | 50.1 GB | thumb_index | 62.8 GB |
| thumb_index2 | 24.8 GB | timeout | 39.5 GB | xsign | 51.3 GB |
| no_gesture | 493.9 MB | | | | |

Dataset annotations: annotations

HaGRIDv2 512px, a lightweight version of the full dataset with `min_side = 512p`: 119.4 GB
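To illustrate what a minimum side of 512 pixels means for a FullHD frame, the sketch below scales an image size so that its shorter side becomes 512 while keeping the aspect ratio. The helper name `resize_to_min_side` and the rounding rule are assumptions; the exact resizing procedure used to produce the lightweight version is not stated here.

```python
from typing import Tuple


def resize_to_min_side(width: int, height: int, min_side: int = 512) -> Tuple[int, int]:
    """Return (width, height) scaled so that the shorter side equals min_side."""
    scale = min_side / min(width, height)
    return round(width * scale), round(height * scale)


# A landscape FullHD frame (1920x1080) and a portrait one (1080x1920).
landscape = resize_to_min_side(1920, 1080)  # (910, 512)
portrait = resize_to_min_side(1080, 1920)   # (512, 910)
```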

You can also download the data with the Python script.

Run the following command with the `--dataset` key to download the dataset with images. To download annotations for the selected stage, add the `--annotations` key.

After downloading, you can unzip the archive by running the following command:

The structure of the dataset is as follows:

├── hagrid_dataset <PATH_TO_DATASET_FOLDER>
│   ├── call
│   │   ├── 00000000.jpg
│   │   ├── 00000001.jpg
│   │   ├── ...
├── hagrid_annotations
│   ├── train <PATH_TO_JSON_TRAIN>
│   │   ├── call.json
│   │   ├── ...
│   ├── val <PATH_TO_JSON_VAL>
│   │   ├── call.json
│   │   ├── ...
│   ├── test <PATH_TO_JSON_TEST>
│   │   ├── call.json
│   │   ├── ...
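Given the layout above, a short sketch can pair each gesture folder with its annotation file for one stage. The function name `index_dataset` is hypothetical, and the code assumes only what the tree shows: one folder of `.jpg` images per gesture and one `<gesture>.json` file per stage.

```python
from pathlib import Path
from typing import Dict


def index_dataset(dataset_root: Path, annotations_root: Path, stage: str = "train") -> Dict[str, dict]:
    """Map each gesture folder to its images and its annotation file for one stage.

    Assumes the layout <dataset_root>/<gesture>/*.jpg and
    <annotations_root>/<stage>/<gesture>.json.
    """
    index = {}
    for gesture_dir in sorted(p for p in dataset_root.iterdir() if p.is_dir()):
        annotation_file = annotations_root / stage / f"{gesture_dir.name}.json"
        index[gesture_dir.name] = {
            "images": sorted(gesture_dir.glob("*.jpg")),
            # None signals that no annotation file exists for this stage.
            "annotations": annotation_file if annotation_file.is_file() else None,
        }
    return index
```

Calling `index_dataset(Path("hagrid_dataset"), Path("hagrid_annotations"), "val")` would then list, for every gesture, the image paths and the matching validation annotation file, which is a quick way to confirm that a download unpacked completely.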

Models

As baselines, we provide several models pre-trained on HaGRIDv2 with classic backbone architectures for gesture classification, gesture detection and hand detection.

| Gesture Detectors | mAP |
|---|---|
| YOLOv10x | 89.4 |
| YOLOv10n | 88.2 |
| SSDLiteMobileNetV3Large | 72.7 |

In addition, if you need to detect hands, you can use the YOLO detection models pre-trained on HaGRIDv2.

| Hand Detectors | mAP |
|---|---|
| YOLOv10x | 88.8 |
| YOLOv10n | 87.9 |

However, if you need to recognize a single gesture, you can use the pre-trained full-frame classifiers instead of detectors. To use the full-frame models, remove the `no_gesture` class.

| Full Frame Classifiers | F1 Gestures |
|---|---|
| MobileNetV3_small | 86.7 |
| MobileNetV3_large | 93.4 |
| VitB16 | 91.7 |
| ResNet18 | 98.3 |
| ResNet152 | 98.6 |
| ConvNeXt base | 96.4 |

Train

You can use the downloaded pre-trained models; otherwise, select the training parameters in the `configs` folder. To train the model, execute the following command:

Single GPU:

Multi GPU:

where `-g` is a list of GPU ids.

At every step, the current loss, learning rate and other values are logged to TensorBoard. To see all saved metrics and parameters, run the following from a command line (this will open a webpage at `localhost:6006`):

Test

Test your model by running the following command:

Single GPU:

Multi GPU:

where `-g` is a list of GPU ids.

Demo

Demo Full Frame Classifiers

Annotations

The annotations consist of bounding boxes of hands and gestures in COCO format `[top left X position, top left Y position, width, height]` with gesture labels. We provide a `user_id` field that allows you to split the dataset into train / val / test subsets yourself, as well as meta-information containing automatically annotated age, gender and race.

  • Key - image name without extension
  • Bboxes - list of normalized bboxes for each hand `[top left X pos, top left Y pos, width, height]`
  • Labels - list of class labels for each hand, e.g. `like`, `stop`, `no_gesture`
  • United_bbox - the union of the two hand boxes for two-handed gestures ("hand_heart", "hand_heart2", "thumb_index2", "timeout", "holy", "take_picture", "xsign") and 'null' for one-handed gestures
  • United_label - the class label for united_bbox for two-handed gestures and 'null' for one-handed gestures
  • User ID - subject id (useful for splitting the data into train / val subsets)
  • Hand_landmarks - landmarks for each hand, auto-annotated with MediaPipe
  • Meta - meta-information (age, gender and race) automatically annotated with the FairFace and MiVOLO neural networks
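As an illustration of the fields listed above, the following sketch converts normalized boxes to pixel coordinates and computes the box that encloses two hand boxes. The lowercase key names (`bboxes`, `labels`, `user_id`), the helper names `to_pixels` and `unite`, and the record itself are assumptions made up for this example. Whether the dataset's united boxes were produced with exactly this rule is not confirmed here.

```python
from typing import List, Sequence


def to_pixels(bbox: Sequence[float], image_width: int, image_height: int) -> List[int]:
    """Convert a normalized [x, y, w, h] box (top-left origin) to pixel units."""
    x, y, w, h = bbox
    return [round(x * image_width), round(y * image_height),
            round(w * image_width), round(h * image_height)]


def unite(box_a: Sequence[float], box_b: Sequence[float]) -> List[float]:
    """Smallest [x, y, w, h] box that covers both input boxes."""
    left = min(box_a[0], box_b[0])
    top = min(box_a[1], box_b[1])
    right = max(box_a[0] + box_a[2], box_b[0] + box_b[2])
    bottom = max(box_a[1] + box_a[3], box_b[1] + box_b[3])
    return [left, top, right - left, bottom - top]


# Made-up record; the key names are assumed, not read from a real file.
record = {
    "bboxes": [[0.25, 0.5, 0.125, 0.25], [0.5, 0.5, 0.125, 0.25]],
    "labels": ["timeout", "timeout"],
    "user_id": "hypothetical-subject-id",
}
pixel_boxes = [to_pixels(b, 1920, 1080) for b in record["bboxes"]]
# pixel_boxes == [[480, 540, 240, 270], [960, 540, 240, 270]]
united = unite(*record["bboxes"])
# united == [0.25, 0.5, 0.375, 0.25]
```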

Bounding boxes

| Object | Train | Val | Test | Total |
|---|---|---|---|---|
| gesture | 980 924 | 120 003 | 200 006 | 1 300 933 |
| no gesture | 154 403 | 19 411 | 29 386 | 203 200 |
| total boxes | 1 135 327 | 139 414 | 229 392 | 1 504 133 |

Landmarks

| Object | Train | Val | Test | Total |
|---|---|---|---|---|
| Total hands with landmarks | 983 991 | 123 230 | 201 131 | 1 308 352 |

Converters

YOLO

We provide a script to convert annotations to YOLO format. To convert annotations, run the following command:
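As a rough illustration of the box arithmetic behind such a conversion (not the repository's script itself), the sketch below turns a normalized top-left `[x, y, width, height]` box into a YOLO label line, which stores the class index followed by the normalized box center, width and height. The function name `to_yolo_line` and the class order are assumptions.

```python
from typing import List, Sequence


def to_yolo_line(bbox: Sequence[float], label: str, class_names: List[str]) -> str:
    """Format one normalized top-left [x, y, w, h] box as a YOLO label line."""
    x, y, w, h = bbox
    # YOLO expects the box center instead of the top-left corner.
    x_center = x + w / 2
    y_center = y + h / 2
    class_id = class_names.index(label)
    return f"{class_id} {x_center:.6f} {y_center:.6f} {w:.6f} {h:.6f}"


class_names = ["like", "stop", "no_gesture"]  # illustrative subset and order
line = to_yolo_line([0.25, 0.5, 0.125, 0.25], "stop", class_names)
# line == "1 0.312500 0.625000 0.125000 0.250000"
```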

After conversion, you need to change the original definition of `img2labels` to:

COCO

We also provide a script to convert annotations to COCO format. To convert annotations, run the following command:
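For orientation, a COCO detection file groups `images`, `annotations` and `categories`, with boxes stored in absolute pixels. The sketch below builds such a dictionary from normalized boxes. It is not the repository's converter: the function name `to_coco`, the input key names (`bboxes`, `labels`), the `.jpg` file naming and the need to supply image sizes separately are all assumptions.

```python
from typing import Dict, List, Tuple


def to_coco(records: Dict[str, dict], image_sizes: Dict[str, Tuple[int, int]],
            class_names: List[str]) -> dict:
    """Build a minimal COCO detection dictionary from normalized annotations."""
    coco = {
        "images": [],
        "annotations": [],
        "categories": [{"id": i, "name": n} for i, n in enumerate(class_names)],
    }
    for image_id, (name, record) in enumerate(records.items()):
        width, height = image_sizes[name]
        coco["images"].append(
            {"id": image_id, "file_name": f"{name}.jpg", "width": width, "height": height}
        )
        for bbox, label in zip(record["bboxes"], record["labels"]):
            x, y, w, h = bbox
            coco["annotations"].append({
                "id": len(coco["annotations"]),
                "image_id": image_id,
                "category_id": class_names.index(label),
                # COCO stores boxes in absolute pixels: [x, y, width, height].
                "bbox": [x * width, y * height, w * width, h * height],
                "area": w * width * h * height,
                "iscrowd": 0,
            })
    return coco


# Made-up input: one image with a single annotated hand.
records = {"00000000": {"bboxes": [[0.25, 0.5, 0.125, 0.25]], "labels": ["stop"]}}
coco = to_coco(records, {"00000000": (1920, 1080)}, ["like", "stop", "no_gesture"])
```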

License

This work is licensed under a variant of the Creative Commons Attribution-ShareAlike 4.0 International License.

Please see the specific license.

Authors and Credits

Citation

You can cite the papers using the following BibTeX entries:

@misc{nuzhdin2024hagridv21mimagesstatic,
    title={HaGRIDv2: 1M Images for Static and Dynamic Hand Gesture Recognition},
    author={Anton Nuzhdin and Alexander Nagaev and Alexander Sautin and Alexander Kapitanov and Karina Kvanchiani},
    year={2024},
    eprint={2412.01508},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2412.01508},
}

@InProceedings{Kapitanov_2024_WACV,
    author = {Kapitanov, Alexander and Kvanchiani, Karina and Nagaev, Alexander and Kraynov, Roman and Makhliarchuk, Andrei},
    title = {HaGRID -- HAnd Gesture Recognition Image Dataset},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month = {January},
    year = {2024},
    pages = {4572-4581}
}