A light primer on incremental data storage pipelines

Posted by Venkatesh Subramanian on May 06, 2023 · 7 mins read

Data storage is at the heart of Data engineering, and it can be overwhelming to comprehend the multitude of options available today. Data architecture typically consists of several components such as Operational Data Store (ODS), Data Lake, Data Warehouse, and Data Marts. These components have different roles in storing and managing data, and together they form a robust data architecture. In this post I will attempt to demystify these storages and give you a beginner’s guide to understanding what each option entails and what scenarios they are best suited for.

Operational Data Store (ODS) is a real-time database that stores data from different sources. It is typically used for transactional data, which includes data from different business processes. ODS is optimized for high-speed read and write operations, which makes it ideal for applications that require real-time data processing. These are optimized for high speed data transactions with read-write ratios ranging from 1:1 to 1:4. However, ODS may not be suitable for analytical applications as it lacks the ability to store large volumes of data.

Data Lake is a large, centralized repository that stores data in its raw, unprocessed form. It is designed to store structured and unstructured data from various sources such as social media, sensors, and weblogs. Data Lake enables organizations to store large amounts of data without the need for predefined schemas, which makes it highly flexible. Data Lake is suitable for big data analytics, machine learning, and data exploration applications.

Data Warehouse is a centralized repository that stores data from different sources after processing it. The data in a data warehouse is typically structured and is organized into different subject areas. It is designed to support complex analytical queries and provide business intelligence. Data Warehouse typically employs Extract, Transform, and Load (ETL) processes to extract data from different sources, transform it into a standardized format, and load it into the warehouse. Data Warehouse is suitable for business intelligence applications, such as dashboards and reporting. It is optimized for efficient querying, aggregation and typically read-heavy workloads with a read-write ratio of 10:1 or higher.

Data Marts are subsets of a data warehouse that are designed to support specific business functions. They contain a subset of the data in the data warehouse and are optimized for query performance. Data marts are typically organized by business functions such as finance, sales, and marketing. They provide a more focused and efficient way to retrieve data than a data warehouse.

Let’s consider a hypothetical example of an e-commerce company that wants to analyze customer behavior to improve their sales and marketing strategies. The company collects data from various sources such as website clicks, mobile app usage, social media interactions, purchase history, and customer service interactions.

Here is a step-by-step guide on how you can architect an e-commerce platform incrementally and cost-effectively on the cloud:

  1. Start with a Cloud SQL instance: As a starting point, you can use a Cloud SQL instance to store and manage your data. You can store transactional data such as customer orders, shipping details, and payment information in Cloud SQL. This will help you maintain the integrity of your data and ensure that it is consistent and accurate.
  2. Add a Data Lake: As your data grows beyond what Cloud SQL can handle efficiently, you can add a Google Cloud Storage (GCS) bucket to store unstructured or semi-structured data. This can serve as your data lake, and you can use BigQuery to query the data in GCS. You can store clickstream data, mobile app logs, and social media interactions in GCS. This will enable you to perform analytics on large volumes of data and derive insights to improve your business.
  3. Migrate to BigQuery: As your data grows even further, you can migrate your data from Cloud SQL to BigQuery, a serverless data warehouse that can handle petabyte-scale data. You can store historical transactional data in BigQuery, along with data from your data lake. This will allow you to perform complex queries and analysis on large volumes of data, including machine learning and artificial intelligence applications.
  4. Implement Data Marts: As your business grows, you may need to build data marts to support specific business functions. You can use BigQuery to build data marts by creating views that extract and transform the data relevant to each business function. For example, you can create a data mart to analyze customer behavior across different channels, such as website, mobile app, and social media. This will help you gain a deeper understanding of your customers and improve your marketing strategies.
  5. Implement a Data Warehouse: Finally, as your data volumes continue to grow, you may need to build a data warehouse that can store and manage all your data in one place. You can use BigQuery to build your data warehouse, and you can use features such as partitioning and clustering to optimize query performance. This will enable you to perform complex analytics and derive insights from all your data, including customer transactional data, clickstream data, mobile app logs, and social media interactions.

By following these steps, you can build an e-commerce platform incrementally and cost-effectively on the cloud. You can start small with a Cloud SQL instance and add more services as your data grows and your business needs evolve.

So as we can see, the solution to use a combination of storage options is a spectrum that ranges from a single storage to mult-storage approach.

As an architect, you must evaluate the your data requirements and determine the optimal storage solutions based on the data characteristics and processing needs. The decision to use a particular storage solution or combination of solutions should also take into account the long-term data strategy, scalability, and flexibility requirements.


Subscribe

* indicates required

Intuit Mailchimp