Commits
Corrects an issue where `self._ex_iterable` was erroneously used instead of `ex_iterable` when Distributed Data Parallel (DDP) and multiple DataLoader workers are used concurrently. The wrong attribute led to the generation of incorrect `shards_indices`, which in turn broke the control flow responsible for worker creation. The fix uses the appropriate iterable, giving an accurate determination of whether a new worker should be instantiated.
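For context, a minimal sketch (not from the commit) of the setup this fix targets: an iterable dataset split across DDP ranks and then iterated by several DataLoader workers. The data files and rank values are hypothetical.

```python
# Sketch of the affected setup: an iterable dataset sharded across DDP ranks
# and consumed by a multi-worker DataLoader.
from datasets import load_dataset
from datasets.distributed import split_dataset_by_node
from torch.utils.data import DataLoader

# Hypothetical streaming dataset; any sharded iterable dataset applies.
ds = load_dataset("json", data_files="data/*.jsonl", split="train", streaming=True)
ds = split_dataset_by_node(ds, rank=0, world_size=2)  # DDP sharding per rank

# With num_workers > 1, each worker must receive distinct shard indices;
# the fix ensures those indices are derived from the correct iterable.
loader = DataLoader(ds, batch_size=8, num_workers=2)
for batch in loader:
    ...
```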
* fix data_files when passing data_dir
* add test
* fix tests
Adds an automatic conversion for the torch formatter when the dataset dtype is uint16 or uint32.
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
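A small sketch of the resulting behavior, assuming (as a hedge, not stated in the commit) that the upcast target is a signed integer dtype such as int64:

```python
# Sketch: a uint16 column is auto-converted to a torch-representable dtype
# when the dataset is formatted as torch (upcast target assumed to be int64).
from datasets import Dataset, Features, Value

ds = Dataset.from_dict({"ids": [1, 2, 3]}, features=Features({"ids": Value("uint16")}))
ds = ds.with_format("torch")
print(ds[0]["ids"].dtype)  # a torch integer dtype rather than an error
```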
* Change the default compression argument for JsonDatasetWriter from None to "infer", to align with pandas' defaults
* Fix incorrect default json compression when writing to buffer
* Fix empty space
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
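A sketch of what the new default means in practice (file names hypothetical):

```python
# Sketch: with compression="infer" as the default, the output file extension
# determines the compression, matching pandas' to_json behavior.
from datasets import Dataset

ds = Dataset.from_dict({"a": [1, 2, 3]})
ds.to_json("data.jsonl.gz")  # gzip inferred from the ".gz" suffix
ds.to_json("data.jsonl")     # no compression inferred
```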
* Use tool.ruff.lint to silence deprecation messages
* Bump ruff to 0.3.0
* Update pre-commit config
* Remove black section from pyproject.toml
* data_files: support fsspec 2023.12.0 glob
* fsspec: unpin version upper bound
* fsspec: pin max version to <=2024.2.0
* data_files: remove unsupported fsspec-specific ** globbing
* data_files: update resolve_pattern ** behavior docstring
* fix split case with either prefix or suffix
Co-authored-by: Quentin Lhoest <lhoest.q@gmail.com>
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
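For illustration, a hedged sketch of the data_files pattern resolution this touches (the file layout is hypothetical):

```python
# Sketch: data_files patterns are resolved with fsspec-style globbing;
# plain "*" matches within a directory level, while the fsspec-specific
# "**" behavior mentioned above is no longer relied upon.
from datasets import load_dataset

ds = load_dataset(
    "csv",
    data_files={"train": "data/train-*.csv", "test": "data/test-*.csv"},  # hypothetical layout
)
```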
Base parquet batch_size on the parquet row group size
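A rough sketch of the idea using pyarrow directly (file name hypothetical); this illustrates the technique, not the library's internal code:

```python
# Sketch: size read batches to the file's row group size rather than a
# fixed default, so each batch maps onto whole row groups.
import pyarrow.parquet as pq

pf = pq.ParquetFile("data.parquet")  # hypothetical file
batch_size = pf.metadata.row_group(0).num_rows if pf.metadata.num_row_groups > 0 else 1024
for batch in pf.iter_batches(batch_size=batch_size):
    ...  # process a RecordBatch aligned with row group boundaries
```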
* Test JSON builder with list of strings
* Make JSON builder support array of strings
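A heavily hedged sketch of the kind of input this enables; the assumption that each top-level string becomes one row is mine, not taken from the commit:

```python
# Sketch: a JSON file whose top level is an array of strings can now be
# loaded by the JSON builder (one row per string assumed, not confirmed).
import json
from datasets import load_dataset

with open("data.json", "w") as f:
    json.dump(["first example", "second example"], f)

ds = load_dataset("json", data_files="data.json", split="train")
print(len(ds))  # expected: 2, under the one-row-per-string assumption
```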
Update the print message for chunked_dataset in the process.mdx batch processing section for clarity and accuracy
* Updated Quickstart Notebook link
* Small fix
* Nit
Co-authored-by: Mario Šaško <mariosasko777@gmail.com>
* Improve error message for gated datasets on load (internal Slack discussion: https://huggingface.slack.com/archives/C02V51Q3800/p1708424971135029)
* Point to dataset page URL
* Harmonise error message
* Undo the changes in `arrow_writer.py` from #6636 (see #6663)
* Add test
* Apply suggestions from code review
* Nits
Co-authored-by: mariosasko <mariosasko777@gmail.com>
* Document usage of the hfh CLI instead of git
* Minor