Introduction to Unisound and Atlas AI Platform
Unisound is an artificial intelligence company focusing on Internet of Things services. Unisound’s AI technology stacks include the perception and expression capabilities of signals, voices, images, and texts, and the cognitive technologies such as knowledge, understanding, analysis, and decision-making, towards a multi-modal AI system. Atlas is the supercomputing platform supporting all kinds of AI applications including model training and reasoning inferencing.
Unisound has built the industry-leading GPU/CPU heterogeneous computing and distributed file system, called Atlas. This platform provides AI applications with high-performance computing and data access capabilities at a massive scale. Based on the Kubernetes open source architecture, the Unisound team has developed the core features and successfully built an AI supercomputing platform with a floating-point processing capacity of more than 10 PFLOPS (100 million times per second). The platform supports the main machine learning frameworks, and developers can efficiently research and develop core applications such as voice, NLP, big data, multimodal, etc. The platform also serves external customers such as SMBs and research institutions with customized computing and storage capabilities.
Problems and Challenges
On the Atlas platform, computation is decoupled from storage. At present, the interconnections among the storage servers, the computing servers, and between the computing and storage servers are 100GB InfiniBand.
We have built our own high-performance distributed file system, named Lustre, as the storage layer of the Atlas platform. Lustre consists of several petabytes of model training datasets. The Lustre distributed file system is compatible with the POSIX interface so that various deep learning frameworks can directly read data from Lustre. The separation between computing and storage enables them to scale independently, making overall architecture more flexible. However, such architecture encountered problems such as slow data access and network bandwidth bottlenecks. The challenges are as follows:
The I/O Bottleneck
As the number of users grows, there is a large increase in the bandwidth, metadata workload, and server workload with unchanged storage resources. In the storage cluster, there are many workloads running on the same single GPU node, competing for I/O resources. Thus, the entire training cycle gets longer, which greatly reduces the efficiency of the research and development.
Massive Small Files
The second challenge is about the training data set itself. In the noise reduction training, certain users would process terabytes of small files, which puts great pressure on the metadata service of the distributed file system, making it inefficient in reading data. The slow read lowers the overall utilization of the GPU, which increases the overall model training time.
Many Data Types
Many applications run on the Atlas platform with different data formats and file sizes, making it nearly impossible to use one single configuration to fit all services. Based on user behaviors, we found that the most used data is for model training, with the rest for model inference and CPU-intensive data generation applications.
Data Redundancy
The last challenge is the dataset replication on the platform. When the same dataset is used by different users in the same group or in different groups, multiple copies are stored, resulting in a waste of storage space.
The Previous Solution
We want to find a solution that is not only cost-effective but also requires the least architectural changes to deal with the I/O bottleneck and offload the metadata server. Below are several ways we explored.
Put Limitation on Bandwidth
Massive concurrent reads will push the bandwidth to reach the limit, causing storage system freezes or failures. We limit the bandwidth by putting limitations on the bandwidth of each client on the computing node and the UID/GID of each user. However, this method is not flexible and cannot fully utilize GPU resources. When there are two large I/O training jobs running on the same node, due to the bandwidth limitations, both training jobs are limited on reads. These limitations make GPU less utilized because it does not benefit from reading in parallel. We found that the utilization rate of GPU in this scenario is only 40%, which means hardware resources are wasted.
Aggregate Small Files to Large Files
We also took steps to deal with pressure put on the metadata by the massive number of small files. We estimated the number of small files by collecting the number inode of and the total storage size of each user to limit the quantity of the small files. Then, we implemented a series of data aggregation tools, allowing users to aggregate small files into large file formats such as lmdb and tfrecord.
Customize the Task Scheduler
To avoid too many jobs running on the same node at the same time, we customized the plug-ins of the task scheduler by adding scheduling policies that identify the resource usage of the computing node so that it can schedule jobs to idle nodes to avoid the I/O competition when running multiple jobs on the same node. However, this solution doesn’t work when all the computing nodes are overloaded because competition is inevitable.
Tiered Cache
In order to fully utilize the idle hardware resources and reduce the pressure on the storage system, we developed the V1.0 cache solution as a temporary solution. This solution can relieve the storage pressure to a certain extent, but the data management is not automated yet, so it can only be a temporary solution until we find an ultimate solution and build the new architecture.
The New Architecture
In 2020, the Unisound team started to deploy Alluxio and conducted a series of tests, including functionality tests and performance benchmarking. We found that Alluxio can meet our needs and solve our pain points effectively at a lower cost.
- Alluxio Fuse provides a POSIX file system interface, which allows users to seamlessly use the distributed cache without changing the applications.
- Alluxio supports different types of storage systems, including distributed file systems, object storage, etc. When we introduce new storage to our platform, Alluxio supports it very well, keeping our entire cache architecture stable.
- Alluxio provides intelligent cache management. Alluxio's tiered cache fully makes use of memory, SDD, or HDD, reducing the cost of data-driven applications with elastic scalability.
- Alluxio supports Kubernetes or containers deployment, which is consistent with our existing technology stack;
- Alluxio provides HA support to ensure the high availability of the distributed cache system
Our previous architecture completely decoupled computing and storage. By introducing Alluxio as a cache layer between computing and storage, users can enjoy fast data access because Alluxio brings the underlying storage to memory or local hard drives on each computing node. The entire platform utilizes the resources of both the distributed file system and the local hard disk.
When deploying Alluxio to production, we encountered problems such as access control and data mounting. Fluid provides a more cloud-native way to use Alluxio and manage data. Just like managing cloud resources, Kubernetes schedules and allocates cached data, which is better than the previous V1.0 cache.
Our ultimate architecture is: Alluxio is responsible for data migration and cache management, moving data from the underlying distributed file system to the local cache of the computing node, providing acceleration to the platform applications; Fluid is responsible for the orchestration of caching and applications. Based on Fluid, the platform can perceive caching and intelligently perform cache management without manual operations.
After implementing the new architecture, we integrated Fluid with our self-developed model training task submission tool, atlasctl, to hide the complexity on the user side as much as possible. Users can create a cache dataset by using atlasctl cache create
and specifying parameters such as cache size and cache media, and so on. This tool allows users to pay more attention to the data and the application itself without having to understand the underlying cache infrastructure.
Configurations settings
We introduced Fluid + Alluxio as a new architecture to our platform. During the deployment, we encountered some problems in different scenarios. We immediately reached out to the community, and the community solved our needs in a timely manner. Here are the main features:
hostpath
andnonroot
Support- Multiple Mount Points Support
- Cache Warmup
- Performance Tuning
For more detailed settings and performance tuning, please refer to the full-length whitepaper.
Test Scenarios
We tested three scenarios according to the size of the data set. The first type is small files, with the size of a single file under 1M. The second is medium files in several hundred gigabytes, and the third is terabytes of large files.
Noise Reduction
This test uses the DLSE model generated by the Pytorch framework. The number of files in the dataset is about 500,000 and the total size is 183 GB. Memory is used as the cache of Alluxio.
We use a single machine with 10 GPU cards for this test. Based on Pytorch's native DDP framework for multi-card communication, we compare the two scenarios: read directly from the distributed file system (Lustre), read from the Alluxio cache (Lustre+Alluxio), and read from Alluxio with cache warmup (Alluxio, warm).
We can see that, during the first read, using Alluxio with cache warmup (Alluxio, warm) is nearly 10x faster compared to reading directly from the distributed file system (Lustre). Because when Alluxio reads the first time, it needs to sync the metadata and cache data at the same time, the real advantage of caching is still not reflected yet. In the second read, since the data have all been brought into the cache, the performance depends on the cache hit rate of Alluxio. From the above test results, we can see that there is a significant speedup.
With a faster data read, we also improved overall GPU utilization. By monitoring the utilization, we found that the GPU usage kept at about 90% using Alluxio cache warmup. At the same time, we can store cached data in the memory, effectively offloading the underlying storage.
Optical Character Recognition (OCR)
The second test is a CRNN-based character recognition model based on the Pytorch framework. For more information, please refer to the full-length whitepaper.
Conclusion
By introducing the new architecture of Fluid + Alluxio, we have gained the following benefits:
- Speedup model training: through the test results, we can see a significant speedup on training jobs. By achieving data locality, the platform can avoid network bottleneck or resource competition, thereby effectively accelerating the data access in the model training process.
- Offload the underlying storage: by enabling local cache, the new architecture brings a better IOPS (Input/Output Operations Per Second), greatly reduces the workload pressure of the underlying storage system, and effectively improves its availability.
- Increase GPU cluster utilization: by efficient I/O read, the platform eliminates the data read bottleneck, and also avoids GPU idling while waiting for data, thus improving the GPU utilization rate of the entire cluster.
- Avoid I/O competition on the same node: the new architecture fully solves our previous pain points of I/O resource competition on the same node, the storage system I/O bottleneck, and the slow model training.
- More efficient cache management: the new architecture is a more cloud-native way to manage the cache. Engineers have changed from simply loading data in memory to managing and monitoring the cache. Kubernetes scheduling can perceive the cache and perform corresponding policy allocations, making jobs more efficient and easier to manage.
Next Steps
Fluid + Alluxio has brought us a lot of benefits. We will keep working closely with the community and contribute to the following work in the future:
- We will continuously provide feedback to the community with more test results in more scenarios, and iterate and optimize the performance of Alluxio.
- We will test more data types, summarize findings, and provide best practices on performance tuning to the community.
- We will add an intelligent cache scheduling feature to Fluid.
About the Authors
Dongdong Lv, Platform Architect at Unisound
- Many years of experience in cloud-native open source communities
- Responsible for ML platform architecture design, deep learning and AI modeling optimizations and accelerations
- Research areas: high-performance computing, distributed file system, distributed caching, etc
Qingsong Liu, Algorithm Researcher at Unisound
- Responsible for the R&D of ML platforms and algorithms
- Research areas: cloud-native architecture, high-performance computing, voice and vision applications, machine learning algorithms, etc.
Blog
We are thrilled to announce the general availability of Alluxio Enterprise for Data Analytics 3.2! With data volumes continuing to grow at exponential rates, data platform teams face challenges in maintaining query performance, managing infrastructure costs, and ensuring scalability. This latest version of Alluxio addresses these challenges head-on with groundbreaking improvements in scalability, performance, and cost-efficiency.
We’re excited to introduce Rapid Alluxio Deployer (RAD) on AWS, which allows you to experience the performance benefits of Alluxio in less than 30 minutes. RAD is designed with a split-plane architecture, which ensures that your data remains secure within your AWS environment, giving you peace of mind while leveraging Alluxio’s capabilities.
PyTorch is one of the most popular deep learning frameworks in production today. As models become increasingly complex and dataset sizes grow, optimizing model training performance becomes crucial to reduce training times and improve productivity.