Effective Data Engineering in the Cloud World
June 21, 2019

This blog was authored by Dipti Borkar and originally posted on Medium.

The cloud has changed the dynamics of data engineering, as well as the behavior of data engineers, in many ways. This is primarily because an on-premises data engineer only dealt with databases and some parts of the Hadoop stack.

In the cloud, things are a bit different. Data engineers suddenly need to think differently and more broadly. Instead of being purely focused on data infrastructure, you are now almost a full-stack engineer (leaving out the final end application, perhaps). Compute, containers, storage, data movement, performance, networking: skills are increasingly needed across the broader stack. Here are some design concepts and data stack elements to keep in mind.

1. The disaggregated data stack: pick a compute, a catalog, a buffer pool, and a storage layer

Historically, databases were tightly integrated, with all core components built together. Hadoop changed that by co-locating compute and storage across a distributed system instead of a single box or a few boxes. The cloud changed things again. Today, the stack is fully disaggregated, with each core element of the database management system being its own layer. Pick each component wisely.

[Figure: vertically integrated legacy data systems]
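As an illustration, here is a minimal sketch of what picking the layers separately can look like when configuring compute. The metastore address, object-store endpoint, and table names are hypothetical, and it assumes the Hive and S3 connectors are already on the Spark classpath.

```python
# A sketch of a disaggregated stack: Spark SQL as compute, an external
# Hive Metastore as the catalog, and an S3-compatible object store as storage.
# All endpoints and names below are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("disaggregated-stack")
    # catalog layer: an external Hive Metastore instead of an embedded one
    .config("hive.metastore.uris", "thrift://metastore.internal:9083")
    # storage layer: an S3-compatible object store reached via the s3a connector
    .config("spark.hadoop.fs.s3a.endpoint", "https://objectstore.internal:9000")
    .enableHiveSupport()
    .getOrCreate()
)

# compute reads tables whose data lives in object storage, not on its own disks
spark.sql("SELECT count(*) FROM analytics.events").show()
```

Because each layer is addressed through configuration rather than baked in, swapping the query engine, catalog, or store later does not force a rewrite of the others.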

2. Orchestrate, orchestrate, orchestrate

The cloud has both created a need for and enabled mass orchestration, whether it is Kubernetes for containers, Alluxio for data, Istio for APIs, Kafka for events, or Terraform for infrastructure scripting.

Efficiency increases dramatically through abstraction and orchestration. Since a cloud data engineer now has full-stack concerns, orchestration can be a data engineer's best-kept secret.
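As one concrete illustration, the sketch below uses the official Kubernetes Python client to launch a containerized data job. The namespace, job name, and container image are hypothetical placeholders, and it assumes a kubeconfig is available.

```python
# A minimal sketch of container orchestration for a data job using the
# official Kubernetes Python client. Job name, image, and namespace are
# hypothetical placeholders.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running inside a pod

job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="nightly-etl"),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="etl",
                        image="registry.internal/etl-job:latest",
                        command=["python", "run_etl.py"],
                    )
                ],
            )
        )
    ),
)

# Hand the job to the orchestrator; Kubernetes schedules, runs, and retries it.
client.BatchV1Api().create_namespaced_job(namespace="data", body=job)
```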

3. Copying data creates more problems than it solves

Fundamentally, once data lands in the enterprise, it should not be copied around, except of course for backup, recovery, and disaster recovery scenarios. Making this data accessible to as many business units, data scientists, and analysts as possible, while creating as few new copies as possible, is THE data engineering puzzle to solve.

In the legacy DBMS world, this is where a buffer pool helped: it made sure the compute (query engine) always had access to data in a consistent, performant way, in a format suitable for the query engine to process rather than a format optimized for storage. Technologies like Alluxio can dramatically simplify life by bringing data closer to compute, making it more performant and accessible.
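As a hedged sketch, assuming a Spark deployment with the Alluxio client on its classpath and an Alluxio master at a hypothetical address, compute can read one shared copy of the data through the caching layer instead of each team making its own copy:

```python
# A sketch of reading shared data through a data orchestration/caching layer
# (Alluxio) instead of copying it per team. The master address and dataset
# path are hypothetical, and the Alluxio client jar is assumed to be on the
# Spark classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-through-alluxio").getOrCreate()

# The data physically lives once in the underlying store (e.g. S3) mounted
# into Alluxio; compute reads it through the alluxio:// namespace, and hot
# data is served from memory/SSD close to the query engine.
events = spark.read.parquet("alluxio://alluxio-master:19998/datasets/events")
events.groupBy("event_type").count().show()
```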

4. S3-compatible in the cloud, S3-compatible on-premises

Because of the popularity of AWS S3, object stores in general will be the next dominant storage system, at least for a few years (a 5–8 year cycle, typically). Think ahead and pick a storage tier that will last for some time; S3-compatible object stores should be your primary choice. While they are not great at every data-driven workload, many technologies help compensate for their deficiencies.
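One practical consequence is that code written against the S3 API can usually be pointed at an on-premises S3-compatible store just by changing the endpoint. The sketch below uses boto3 with a hypothetical endpoint and placeholder credentials.

```python
# A minimal sketch, assuming an S3-compatible object store (e.g. MinIO) is
# reachable at the hypothetical endpoint http://objectstore.internal:9000.
import boto3

def make_s3_client(endpoint_url=None):
    """Return an S3 client: with no endpoint it targets AWS S3 itself,
    with an endpoint it targets any S3-compatible on-premises store."""
    return boto3.client(
        "s3",
        endpoint_url=endpoint_url,              # None -> AWS S3
        aws_access_key_id="YOUR_ACCESS_KEY",     # placeholder credentials
        aws_secret_access_key="YOUR_SECRET_KEY",
    )

# The same application code runs against either backend.
cloud = make_s3_client()
on_prem = make_s3_client("http://objectstore.internal:9000")
for s3 in (cloud, on_prem):
    s3.put_object(
        Bucket="analytics",
        Key="events/2019/06/21.json",
        Body=b'{"event": "page_view"}',
    )
```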

5. SQL and structured data are still in!

While SQL has existed since the 1970s, it is still the easiest way for analysts to understand and do something with data. AI models will continue to evolve, but SQL has lasted close to 50 years. Pick two, at most three, frameworks to bet on and invest in, but build a platform that can over time support as many as needed. Currently, Presto is becoming a popular query engine pick for the disaggregated stack.
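For example, here is a minimal sketch using the open-source presto-python-client; the coordinator host, catalog, schema, and table names are hypothetical.

```python
# A sketch of running SQL on a disaggregated stack through Presto.
# Host, catalog, schema, and table are hypothetical placeholders.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coordinator.internal",
    port=8080,
    user="data_engineer",
    catalog="hive",       # catalog layer, e.g. a Hive Metastore over object storage
    schema="analytics",
)
cur = conn.cursor()
cur.execute("""
    SELECT event_date, count(*) AS page_views
    FROM events
    WHERE event_type = 'page_view'
    GROUP BY event_date
    ORDER BY event_date
""")
for row in cur.fetchall():
    print(row)
```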
