Hugging Face Makes AI Dataset Streaming 100x Faster

April 20, 20261 min read

TL;DR

New streaming improvements cut download wait times from hours to seconds, letting you train on terabyte datasets without storing them locally.

Hugging Face has dramatically improved its dataset streaming capabilities, cutting request volumes by 99% while doubling data throughput. The enhancements address a critical bottleneck in large-scale AI training where downloading multi-terabyte datasets could previously stall workflows for hours.

The improvements target two key phases: startup initialization and ongoing data streaming. During startup, redundant requests that previously caused IP blocking have been eliminated through smarter caching and parallelization. For continuous streaming, larger request sizes and prefetching mechanisms now deliver data twice as fast.

These optimizations stem from real-world challenges Hugging Face encountered while training its own models. During development of the SmolLM3 and nanoVLM projects, the team discovered their test runs were generating over 100,000 requests per minute—enough to trigger automatic blocking from their own infrastructure.

The technical breakthroughs include enhanced file resolution caching and improved handling of DataLoader worker initialization. By reusing cached results from directory listings and file discovery operations, the system now avoids redundant API calls that previously plagued distributed training setups.

Underpinning these improvements is Hugging Face's Xet storage system, which uses content-defined chunking for efficient deduplication. Unlike traditional cloud storage, Xet transfers duplicate data only once, accelerating both uploads and streaming operations. This infrastructure is accessible through the company's pyspark-huggingface package for Spark integration.

For specialized use cases, Hugging Face has also enhanced custom streaming pipelines. The improvements have been validated in projects like LeRobot for video frame sampling and WebDataset for TAR archive streaming, demonstrating flexibility beyond standard dataset formats.

The company is already leveraging these enhancements in its nanoVLM training pipeline, where streaming now matches local SSD read speeds. This represents a significant shift from previous workflows that required three-hour data transfers before training could begin.

Developers can access these improvements by updating to the latest versions of Hugging Face's datasets and huggingfacehub libraries. The changes maintain full backward compatibility, meaning existing code with streaming=True will automatically benefit from the performance gains without modification.