RedNote Accelerates Model Training & Distribution with Alluxio

About RedNote

RedNote is a popular and rapidly growing e-commerce and social media platform in Asia with more than 150 million daily active users.

The company operates a multi-cloud architecture to support its large-scale operations and diverse user base. RedNote's core services, such as search, recommendations, and community content, are highly dependent on the efficiency of its machine learning (ML) platforms.

RedNote’s Search and Recommendation ML Platform delivers fresh, personalized content to RedNote’s millions of users every day, making it one of the company's most critical systems.

Challenges

RedNote was facing two major challenges with its Search and Recommendation ML Platform:

  1. Missing Nightly 6-hour Model Update SLA
    • Use Case: The recommendation model is fine-tuned nightly using data collected from user interactions throughout the day. The SLA requires that the updated model be ready each morning at 6 AM. The number of daily active users drops when models are not refreshed on time.
    • Data Size: Given RedNote’s large user base, the input datasets are massive: nightly incremental updates reach hundreds of terabytes, and total dataset sizes are in the petabyte range.
    • Challenge:
      RedNote’s cloud object storage bandwidth was being throttled at 1 TB/s, leading to:
      • Training jobs timing out
      • CPU utilization dropping to 30%
      • End-to-end training taking nearly 10 hours to complete
         
  2. Slow Model Distribution Across Multiple Clouds
    • Use Case: The recommendation models, typically terabytes in size, need to be quickly distributed to multiple cloud environments after the daily fine-tuning completes.
    • Challenge:
      • RedNote’s model distribution relied on Alibaba Cloud Disk, which was expensive and slow.

Alluxio's Solution

  1. Accelerated Nightly Training Jobs by 41% to Meet 6-hour SLA
    RedNote deployed a large-scale Alluxio cluster (150-200 nodes) with a 400 TB Alluxio Distributed Cache, resulting in:
    • Elimination of storage bottlenecks and cloud storage throttling
    • 45% increase in CPU utilization
    • Training time reduced to 5.5 hours, a 41% improvement, so models are updated within the 6-hour SLA
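
As a quick sanity check on the figures above, the arithmetic can be sketched in a few lines of Python. This is an illustration only: it assumes the "41% improvement" means a 41% reduction in wall-clock training time, and the function and constant names are hypothetical, not taken from RedNote or Alluxio tooling.

```python
def baseline_hours(new_hours: float, reduction_pct: float) -> float:
    """Duration before the change, given the new duration and the
    percentage reduction (assumed to apply to wall-clock time)."""
    return new_hours / (1.0 - reduction_pct / 100.0)


def meets_sla(duration_hours: float, sla_hours: float) -> bool:
    """True when a job of the given duration finishes within the SLA window."""
    return duration_hours <= sla_hours


# Figures quoted in the case study.
NEW_HOURS = 5.5        # training time with Alluxio
REDUCTION_PCT = 41.0   # stated improvement
SLA_HOURS = 6.0        # nightly model update window

implied = baseline_hours(NEW_HOURS, REDUCTION_PCT)
print(f"Implied baseline: {implied:.1f} h")               # about 9.3 h
print(f"Baseline meets SLA: {meets_sla(implied, SLA_HOURS)}")
print(f"New run meets SLA: {meets_sla(NEW_HOURS, SLA_HOURS)}")
print(f"Headroom: {(SLA_HOURS - NEW_HOURS) * 60:.0f} min")
```

Under that assumption, the implied baseline of roughly 9.3 hours is consistent with the "nearly 10 hours" reported in the Challenges section, and the 5.5-hour run leaves about 30 minutes of headroom inside the 6-hour window.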

  2. Accelerated Model Distribution and Lower Costs
    RedNote stores freshly trained models on Alluxio Distributed Cache, resulting in:
    • 10X faster model download speeds by index servers
    • 80% cost savings for model distribution compared to Alibaba Cloud Disk

Conclusion

Alluxio enabled RedNote to meet its nightly model training and distribution SLAs and lower model distribution costs. By leveraging Alluxio Distributed Cache, RedNote eliminated the storage bottlenecks that had pushed model training time past its SLA and accelerated cross-cloud model distribution. RedNote can now train its Search and Recommendation models efficiently and distribute them across its multi-cloud architecture at lower cost, making Alluxio a vital part of its ML infrastructure.
