DeepSeek AI Releases Fire-Flyer File System (3FS): A High-Performance Distributed File System Designed to Address the Challenges of AI Training and Inference Workload

This post first appeared on MarkTechPost.

Feb 28, 2025 - 18:25

The advancement of artificial intelligence has ushered in an era where data volumes and computational requirements are growing at an impressive pace. AI training and inference workloads demand not only significant compute power but also a storage solution that can manage large-scale, concurrent data access. Traditional file systems often fall short when faced with high-throughput data access, which can lead to performance bottlenecks that slow down training cycles and increase latency during inference. In distributed environments, where thousands of compute nodes may need to access data simultaneously, it becomes crucial to have a storage system that offers both low-latency access and reliable scalability. This is especially important for modern AI pipelines that handle vast datasets and real-time data operations.

DeepSeek AI has introduced the Fire-Flyer File System (3FS), a distributed file system crafted specifically to meet the demands of AI training and inference workloads. Designed with modern SSDs and RDMA networks in mind, 3FS offers a shared storage layer that is well-suited for the development of distributed applications. The file system’s architecture moves away from conventional designs by combining the throughput of thousands of SSDs with the network capacity provided by numerous storage nodes. This disaggregated approach enables applications to access storage without being restricted by traditional data locality considerations, allowing for a more flexible and efficient handling of data.

Technical Details and Benefits

3FS combines several design choices. The first is its disaggregated architecture: by pooling the throughput of thousands of SSDs and the network bandwidth of hundreds of storage nodes, 3FS supports large-scale data access without the constraints of traditional, locality-dependent file systems.

Another key feature is the use of Chain Replication with Apportioned Queries (CRAQ) to maintain strong consistency across the system. In CRAQ, writes propagate along a chain of replicas from head to tail, and any replica can serve reads, checking with the tail only when it holds a write that is not yet committed. Many distributed storage systems settle for eventual consistency, which can complicate application logic; CRAQ instead keeps data consistent under high concurrency and when nodes fail. This design choice simplifies application development and helps maintain system reliability.
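
To make the CRAQ idea concrete, here is a minimal in-memory sketch of a three-node chain. It is a toy model of the published CRAQ protocol, not 3FS code: the class names, the `reach` parameter used to pause a write mid-chain, and the key `chunk-0` are all assumptions made for illustration, and the tail commits only when the write is finished rather than on receipt.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One chain node; keeps each version of a key until it is superseded."""
    name: str
    store: dict = field(default_factory=dict)  # key -> {version: [value, is_clean]}

    def apply_write(self, key, version, value):
        # A newly received write is dirty until it is acknowledged.
        self.store.setdefault(key, {})[version] = [value, False]

    def mark_clean(self, key, version):
        versions = self.store[key]
        versions[version][1] = True
        # Older versions are no longer needed once this one is committed.
        for old in [v for v in versions if v < version]:
            del versions[old]

    def latest(self, key):
        version = max(self.store[key])
        value, clean = self.store[key][version]
        return version, value, clean

    def committed_version(self, key):
        return max(v for v, (_, clean) in self.store[key].items() if clean)


class Chain:
    def __init__(self, names):
        self.replicas = [Replica(n) for n in names]
        self.next_version = 1
        self.pending = {}  # version -> (key, value, replicas reached so far)

    def start_write(self, key, value, reach):
        """Propagate a write from the head through the first `reach` replicas."""
        version = self.next_version
        self.next_version += 1
        for replica in self.replicas[:reach]:
            replica.apply_write(key, version, value)
        self.pending[version] = (key, value, reach)
        return version

    def finish_write(self, version):
        """Deliver the write to the remaining replicas, then acknowledge tail to head."""
        key, value, reach = self.pending.pop(version)
        for replica in self.replicas[reach:]:
            replica.apply_write(key, version, value)
        for replica in reversed(self.replicas):
            replica.mark_clean(key, version)

    def write(self, key, value):
        self.finish_write(self.start_write(key, value, reach=len(self.replicas)))

    def read(self, key, replica_index):
        replica = self.replicas[replica_index]
        _, value, clean = replica.latest(key)
        if clean:
            return value  # served locally, no contact with the tail
        # Dirty: ask the tail which version is committed and return that one.
        committed = self.replicas[-1].committed_version(key)
        return replica.store[key][committed][0]


chain = Chain(["head", "middle", "tail"])
chain.write("chunk-0", "v1")
in_flight = chain.start_write("chunk-0", "v2", reach=2)  # has not reached the tail yet
during = [chain.read("chunk-0", i) for i in range(3)]
chain.finish_write(in_flight)
after = [chain.read("chunk-0", i) for i in range(3)]
print(during, after)
```

While the second write is still in flight, every replica answers with the committed value, and once it is acknowledged every replica answers with the new one. Readers therefore never see different replicas disagree.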

In addition, 3FS incorporates stateless metadata services that are supported by a transactional key-value store, such as FoundationDB. By decoupling metadata management from the storage layer, the system not only becomes more scalable but also reduces potential bottlenecks related to metadata operations. This separation of concerns means that as the volume of data grows, the system can manage metadata more efficiently without impacting overall performance.
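
The following sketch illustrates what "stateless metadata service on top of a transactional key-value store" can mean in practice. It is an assumption-laden toy: `ToyKV` is a hypothetical in-memory stand-in with optimistic concurrency control, not the FoundationDB API, and the `MetadataService` methods and file paths are invented for the example rather than taken from 3FS.

```python
class ConflictError(Exception):
    """Raised when a key read by a transaction changed before commit."""


class ToyKV:
    """Hypothetical in-memory stand-in for a transactional key-value store."""
    def __init__(self):
        self.data = {}      # key -> value
        self.versions = {}  # key -> number of commits that touched the key

    def begin(self):
        return Transaction(self)


class Transaction:
    def __init__(self, kv):
        self.kv = kv
        self.reads = {}   # key -> version observed at read time
        self.writes = {}  # key -> new value (None means delete)

    def get(self, key):
        if key in self.writes:
            return self.writes[key]
        self.reads[key] = self.kv.versions.get(key, 0)
        return self.kv.data.get(key)

    def set(self, key, value):
        self.writes[key] = value

    def delete(self, key):
        self.writes[key] = None

    def commit(self):
        # Optimistic concurrency: abort if anything we read has since changed.
        for key, seen in self.reads.items():
            if self.kv.versions.get(key, 0) != seen:
                raise ConflictError(key)
        for key, value in self.writes.items():
            if value is None:
                self.kv.data.pop(key, None)
            else:
                self.kv.data[key] = value
            self.kv.versions[key] = self.kv.versions.get(key, 0) + 1


class MetadataService:
    """Keeps no file-system state of its own; each call is one transaction."""
    def __init__(self, kv):
        self.kv = kv

    def create(self, path, size=0):
        txn = self.kv.begin()
        if txn.get(path) is not None:
            raise FileExistsError(path)
        txn.set(path, {"size": size})
        txn.commit()

    def rename(self, src, dst):
        txn = self.kv.begin()
        inode = txn.get(src)
        if inode is None:
            raise FileNotFoundError(src)
        if txn.get(dst) is not None:
            raise FileExistsError(dst)
        txn.delete(src)
        txn.set(dst, inode)
        txn.commit()  # both changes become visible together, or neither does

    def stat(self, path):
        return self.kv.begin().get(path)


kv = ToyKV()
meta_a, meta_b = MetadataService(kv), MetadataService(kv)  # interchangeable instances
meta_a.create("/data/shard-0", size=4096)
meta_b.rename("/data/shard-0", "/data/shard-0.done")
renamed = meta_a.stat("/data/shard-0.done")
original = meta_a.stat("/data/shard-0")
print(renamed, original)
```

Because the service objects hold nothing but a handle to the store, any instance can serve any request, and more instances can be added without migrating state.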

For inference workloads, 3FS can serve as the backing store for KVCache, a technique that caches the key and value vectors of previously processed tokens in language models so they do not have to be recomputed. Traditional DRAM-based caching can be both expensive and limited in capacity; storing the cache on 3FS provides a cost-effective alternative that delivers high throughput and a much larger capacity. This is particularly valuable in AI applications where repeated access to previously computed data is essential to maintain performance.
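
As a rough illustration of a storage-backed key/value cache for inference, the sketch below stores the cached vectors for a token prefix as a file and reuses them on the next request. Everything here is hypothetical: a temporary directory stands in for a shared file-system mount, `fake_attention_kv` is a placeholder for the real model computation, and the JSON-file layout is chosen for readability rather than taken from 3FS.

```python
import hashlib
import json
import tempfile
from pathlib import Path


class FileBackedKVCache:
    """Stores cached key/value vectors as files under a shared directory."""
    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, token_ids):
        # The cache key is a hash of the exact token prefix.
        digest = hashlib.sha256(json.dumps(token_ids).encode()).hexdigest()
        return self.root / f"{digest}.json"

    def get(self, token_ids):
        path = self._path(token_ids)
        if path.exists():
            return json.loads(path.read_text())
        return None

    def put(self, token_ids, kv):
        self._path(token_ids).write_text(json.dumps(kv))


def fake_attention_kv(token_ids):
    """Placeholder for the expensive per-token key/value computation."""
    return {
        "keys": [[t * 0.5] for t in token_ids],
        "values": [[t * 2.0] for t in token_ids],
    }


def kv_for_prompt(cache, token_ids, stats):
    cached = cache.get(token_ids)
    if cached is not None:
        stats["hits"] += 1
        return cached
    stats["misses"] += 1
    kv = fake_attention_kv(token_ids)
    cache.put(token_ids, kv)
    return kv


stats = {"hits": 0, "misses": 0}
with tempfile.TemporaryDirectory() as mount:  # stand-in for a shared mount point
    cache = FileBackedKVCache(mount)
    first = kv_for_prompt(cache, [1, 2, 3], stats)   # computed, then stored
    second = kv_for_prompt(cache, [1, 2, 3], stats)  # read back from storage
print(stats)
```

Because the cache lives on storage rather than in one server's DRAM, its capacity is bounded by the file system, and in a shared deployment any inference node that can reach the mount could reuse entries written by another.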

Performance Benchmarks and Insights

DeepSeek reports several benchmark results for 3FS. In a read stress test on a cluster of 180 storage nodes, the system achieved an aggregate read throughput of approximately 6.6 TiB/s, even while handling background traffic from training jobs. This benchmark illustrates the system’s capacity to move large volumes of data in a demanding, real-world environment.
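
A quick back-of-the-envelope calculation puts the aggregate figure in per-node terms. It assumes the load is spread evenly over the 180 nodes, which the article does not state, and the 1 PiB dataset is a hypothetical size chosen only to give a sense of scale.

```python
GIB_PER_TIB = 1024
TIB_PER_PIB = 1024

aggregate_tib_per_s = 6.6  # reported aggregate read throughput
nodes = 180                # reported cluster size

# Assumes an even spread of read traffic across all nodes.
per_node_gib_per_s = aggregate_tib_per_s * GIB_PER_TIB / nodes

# Hypothetical example: time to stream a 1 PiB dataset once at the aggregate rate.
seconds_for_one_pib = TIB_PER_PIB / aggregate_tib_per_s

print(f"{per_node_gib_per_s:.1f} GiB/s per node")
print(f"{seconds_for_one_pib / 60:.1f} minutes to read 1 PiB")
```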

Another benchmark focused on sorting performance, using the GraySort test to evaluate how well 3FS handles large-scale data processing. On a cluster of 25 storage nodes and 50 compute nodes, the system sorted 110.5 TiB of data spread over 8,192 partitions in just over 30 minutes, resulting in an average throughput of 3.66 TiB/min. These figures are a strong indicator of 3FS’s ability to handle intensive data tasks efficiently.
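
The article does not describe how the sort was implemented, so the sketch below assumes a common two-phase design for partitioned sorts: route each key to a partition by its leading bits, then sort each partition independently. The function names, key width, and partition count are illustrative only. The last lines check that the reported data size and throughput are consistent with a run of "just over 30 minutes".

```python
import random


def partition_by_prefix(keys, prefix_bits, key_bits):
    """Phase 1: route each key to the partition selected by its top bits."""
    partitions = [[] for _ in range(1 << prefix_bits)]
    shift = key_bits - prefix_bits
    for key in keys:
        partitions[key >> shift].append(key)
    return partitions


def two_phase_sort(keys, prefix_bits=3, key_bits=16):
    partitions = partition_by_prefix(keys, prefix_bits, key_bits)
    # Phase 2: sort each partition on its own. Partitions are already ordered
    # by prefix, so concatenating them yields a globally sorted sequence.
    result = []
    for part in partitions:
        result.extend(sorted(part))
    return result


rng = random.Random(0)
keys = [rng.randrange(1 << 16) for _ in range(1000)]
result = two_phase_sort(keys)
print(result == sorted(keys))

# Consistency check on the reported figures.
data_tib = 110.5
throughput_tib_per_min = 3.66
implied_minutes = data_tib / throughput_tib_per_min
print(f"implied duration: {implied_minutes:.1f} minutes")
```

In a real run each partition can be sorted by a different worker, which is why the number of partitions (8,192 in the reported test) can greatly exceed the number of nodes.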

KVCache results were also reported. During inference tests, KVCache reads reached a peak throughput of 40 GiB/s, which matters for AI systems where reducing latency is critical. Over the same period the system also garbage-collected cache entries, and read performance remained robust while that cleanup ran.

Conclusion

The Fire-Flyer File System (3FS) is DeepSeek AI’s response to the storage challenges of modern AI workflows. By focusing on scalability, consistency, and efficient data access, 3FS provides a robust platform for both training and inference workloads. Its disaggregated architecture makes flexible use of thousands of SSDs and hundreds of storage nodes, while CRAQ keeps data strongly consistent, which simplifies system design and improves overall stability.

The separation of metadata services from the storage layer, together with KVCache support for inference tasks, makes 3FS a strong option for distributed AI storage. The reported benchmarks indicate that the system can manage large data volumes with high throughput. Ultimately, the Fire-Flyer File System is a carefully engineered tool designed to meet the needs of today’s data-intensive AI applications, providing a dependable foundation for continued innovation in the field.


Check out the GitHub Repo. All credit for this research goes to the researchers of this project.
