Hugging Face Launches Mutable Storage Buckets for AI
March 23, 2026 · 3 min read
Managing machine learning artifacts like checkpoints and processed datasets often involves mutable, high-throughput storage needs that traditional version control systems like Git struggle to handle efficiently. Hugging Face has introduced Storage Buckets to address this gap, providing a mutable, S3-like object storage solution integrated directly into the Hub. The feature is designed for scenarios where users need fast writes, in-place overwrites, directory syncs, and removal of stale data without the constraints of a versioned repository. Buckets offer a non-versioned container that lives under user or organization namespaces, with standard permissions and browser accessibility, making them a practical alternative for dynamic AI workloads.
Buckets are built on Xet, Hugging Face's chunk-based storage backend, which fundamentally changes how data is stored by breaking content into chunks and deduplicating across files. This means that when uploading a processed dataset similar to a raw one, or storing successive model checkpoints with frozen parts, many chunks already exist, so Buckets skip the redundant bytes. The result is less bandwidth usage, faster transfers, and more efficient storage, which is particularly beneficial for ML pipelines that constantly produce related artifacts like raw and processed data, checkpoints, and agent traces. For Enterprise customers, billing is based on deduplicated storage, directly reducing costs while improving speed.
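The dedup effect is easy to see in miniature. The sketch below uses fixed-size chunks and SHA-256 digests purely for illustration; Xet's actual chunking is content-defined and considerably more sophisticated, and nothing here reflects the real backend's code. It models two successive checkpoint saves where only one chunk changed, so only that chunk's bytes need transferring the second time.

```python
import hashlib

CHUNK_SIZE = 64 * 1024  # illustrative only; the real backend chunks differently


def bytes_to_upload(new_file: bytes, store: set) -> int:
    """Count how many bytes actually need transferring, given a set of
    chunk digests the remote store already holds."""
    pending = 0
    for i in range(0, len(new_file), CHUNK_SIZE):
        chunk = new_file[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:      # chunk is new: it must be uploaded
            store.add(digest)
            pending += len(chunk)
    return pending


store = set()
# A 1 MiB "checkpoint" made of 16 distinct chunks.
v1 = b"".join(bytes([i]) * CHUNK_SIZE for i in range(16))
first = bytes_to_upload(v1, store)   # every chunk is new

# A second save where only the last chunk's content changed.
v2 = v1[:-CHUNK_SIZE] + bytes([200]) * CHUNK_SIZE
second = bytes_to_upload(v2, store)  # only the changed chunk transfers

print(first, second)  # → 1048576 65536
```

The first save moves the full megabyte; the second moves a single 64 KiB chunk, which is the shape of the savings the post describes for checkpoints with frozen parameters.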
To enhance performance for distributed training and large-scale pipelines, Buckets include a pre-warming feature that allows users to bring data closer to their compute regions. By declaring where data is needed, Buckets ensure it is already in place when jobs start, avoiding cross-region reads that can slow throughput. This is especially useful for training clusters requiring fast access to large datasets or checkpoints and for multi-region setups where different pipeline parts run in different clouds. Hugging Face is partnering with AWS and GCP initially, with plans to add more cloud providers in the future, expanding global storage capabilities.
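Conceptually, pre-warming amounts to declaring where a job will run and replicating ahead of time whatever is not already there. The sketch below is a toy model of that decision, not the Buckets API: the placement map, paths, and region names are all hypothetical.

```python
def prewarm_plan(placements: dict, job_region: str) -> list:
    """Return the artifacts that must be replicated into job_region before a
    job starts. `placements` maps artifact path -> set of regions that
    already hold a copy. (Illustrative model, not a real API.)"""
    return sorted(
        path for path, regions in placements.items()
        if job_region not in regions
    )


# Hypothetical state: the dataset shard lives in one region, the
# checkpoint has already been replicated to two.
placements = {
    "datasets/shards-000.parquet": {"us-east-1"},
    "checkpoints/step-9000/": {"us-east-1", "eu-west-4"},
}

todo = prewarm_plan(placements, "eu-west-4")
print(todo)  # → ['datasets/shards-000.parquet']
```

Running the plan before the training job launches is what avoids the cross-region reads at job start that the post calls out.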
Users can quickly set up and manage Buckets through the CLI, with operations like creating a bucket and syncing directories taking under two minutes. The CLI supports previewing transfers with a dry-run option before execution and allows saving plans for later review. For programmatic integration, Python and JavaScript clients are available, enabling batch uploads, selective downloads, and deletes within training scripts or web applications. Additionally, Buckets work with fsspec-compatible filesystems, allowing libraries like pandas, Polars, and Dask to access Bucket contents using standard filesystem operations without extra setup.
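A dry-run sync boils down to diffing local files against a remote index before moving any bytes. The standalone sketch below shows that planning step using stdlib tools only; the `sync_plan` helper, the shape of `remote_index`, and the fake digests are assumptions for illustration, not the actual CLI's internals.

```python
import hashlib
import tempfile
from pathlib import Path


def sync_plan(local_dir: Path, remote_index: dict) -> dict:
    """Compare a local directory against a remote {path: digest} index and
    report, without transferring anything, which files would be uploaded,
    overwritten, or deleted. (Conceptual sketch of a dry run.)"""
    local = {
        str(p.relative_to(local_dir)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(local_dir.rglob("*")) if p.is_file()
    }
    return {
        "upload": sorted(r for r in local if r not in remote_index),
        "overwrite": sorted(
            r for r in local if r in remote_index and remote_index[r] != local[r]
        ),
        "delete": sorted(set(remote_index) - set(local)),
    }


with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    (root / "a.txt").write_text("new file")
    (root / "b.txt").write_text("changed content")
    # Hypothetical remote state: b.txt exists with a stale digest,
    # stale.txt no longer exists locally.
    remote = {"b.txt": "0000-not-the-real-digest", "stale.txt": "0000"}
    plan = sync_plan(root, remote)

print(plan)
# → {'upload': ['a.txt'], 'overwrite': ['b.txt'], 'delete': ['stale.txt']}
```

Printing the plan instead of executing it is exactly the preview-then-commit workflow the dry-run option provides, and a saved plan is just this structure persisted for later review.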
Buckets serve as a fast, mutable layer for artifacts in motion, while versioned model or dataset repos handle stable deliverables. On the roadmap, Hugging Face plans to support direct transfers between Buckets and repos in both directions, such as promoting final checkpoint weights into a model repo or committing processed shards into a dataset repo. This separation maintains distinct working and publishing layers while fitting into a continuous Hub-native workflow, as tested during a private beta with partners like Jasper, Arcee, IBM, and PixAI, whose feedback helped shape the feature.
Storage Buckets are included in existing Hub storage plans, with free accounts offering starter storage and PRO and Enterprise plans providing higher limits. The primary goal is to fuel data-heavy AI workflows, and while a cold-storage tier for archival needs has been discussed internally, it is not on the short-term roadmap. This launch positions Buckets as a solution for keeping more of the ML workflow in one place on the Hub, offering a familiar S3-style model optimized for AI artifacts and a clear path to final publication.