Diverse Data Stacks Across Four Unique Companies
Chapter 1: Understanding Data Stacks
In the realm of data engineering, not every organization adopts the conventional "Modern Data Stack" or leans heavily on Spark technologies.
Many data engineers assume they must join companies that run the standard modern stack or a Spark-centric data environment. Your primary job, however, is to increase the company's data value and build a solid data foundation that serves your stakeholders' actual needs, not to replicate someone else's data stack that may not address your data challenges. Focus on selecting tools that fit both your current circumstances and the scenarios you realistically anticipate.
To illustrate this, I'll outline the data stacks used at the four companies I've worked with, each with distinct requirements ranging from data quality to budget constraints to avoiding vendor lock-in.
Chapter 2: The Role of BI and AI in Data Stacks
The data architecture for a Business Intelligence (BI) or Artificial Intelligence (AI) project can significantly differ, particularly when handling unstructured data.
BI projects often lean on large-scale data warehouses like BigQuery, while AI projects involving computer vision typically need a data lake to store the images and a database to manage their metadata. The solution doesn't always have to be a heavyweight warehouse like BigQuery or Snowflake. For our use case, which required frequent updates to individual records (e.g., tracking which images had been processed), a Postgres database was the efficient choice thanks to its fast row-based writes.
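As a rough sketch of why Postgres fits that pattern, here is the kind of single-row upsert a worker might issue each time it finishes an image; the table, columns, and connection string are hypothetical, not the project's actual schema:

```python
# Hypothetical table: image_metadata(image_id TEXT PRIMARY KEY, status TEXT)
import psycopg2

conn = psycopg2.connect("dbname=vision user=etl")  # placeholder connection details

def mark_processed(image_id: str) -> None:
    """Record that one image has been processed: a cheap, frequent row-level write."""
    with conn, conn.cursor() as cur:  # `with conn` commits the transaction on exit
        cur.execute(
            """
            INSERT INTO image_metadata (image_id, status)
            VALUES (%s, 'processed')
            ON CONFLICT (image_id) DO UPDATE SET status = EXCLUDED.status
            """,
            (image_id,),
        )
```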
Section 2.1: Consultancy Experience #1: Focus on BI
My first position as a data engineer was with a consultancy that partnered with prominent Dutch brands, ranging from car dealerships to amusement parks, grocery chains, and fast-food franchises.
The projects I worked on involved storing data in BigQuery, optimized for dashboard workloads. For clients new to the cloud with limited budgets, we used Google Cloud Functions to extract data from DynamoDB and their APIs and load it into BigQuery. For clients further along in their cloud journey, we used Google Cloud Composer to schedule jobs that pulled data from APIs or loaded files from Google Cloud Storage buckets into BigQuery. The role was an excellent foundation: it sharpened my SQL and Python skills while delivering real value to clients.
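For a sense of how small these budget-friendly pipelines can be, a minimal sketch of such a Cloud Function might look like the following; the API endpoint, table name, and response shape are made up for illustration:

```python
# Minimal HTTP-triggered Cloud Function: pull rows from a client API,
# append them to a BigQuery table. Endpoint and table id are illustrative.
import requests
from google.cloud import bigquery

def ingest(request):
    rows = requests.get("https://api.example.com/orders", timeout=30).json()
    client = bigquery.Client()
    errors = client.insert_rows_json("my-project.raw_layer.orders", rows)
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")
    return "ok"
```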
Section 2.2: Consultancy Experience #2: Merging AI and BI
Managing multiple projects at once broadened my knowledge, since I was never confined to a narrow skill set.
I worked on two projects from the ground up, both starting without any pre-existing code. The first, at 8 hours a week, was for a large agricultural firm: consolidating various systems to improve cost visibility across its plants and to monitor field workers' performance (tracking batch completion rates). Because the source data was inconsistent, the project required substantial backfills. Prefect's hybrid model let the company run backfills on its own simply by specifying a date range, freeing me up to refine the data in SQL.
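A minimal Prefect-style sketch of that idea, assuming a daily grain and placeholder extract/load logic: passing a historical start and end date re-runs exactly that window.

```python
from datetime import date, timedelta
from prefect import flow, task

@task
def extract(day: date) -> list[dict]:
    # placeholder: pull one day of records from the source system
    return [{"day": day.isoformat()}]

@task
def load(day: date, records: list[dict]) -> None:
    # placeholder: write the day's records to the warehouse
    print(f"loaded {len(records)} records for {day}")

@flow
def daily_load(start: date, end: date) -> None:
    day = start
    while day <= end:
        load(day, extract(day))
        day += timedelta(days=1)
```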
The second project, at 32 hours a week, marked my entry into AI. Working alongside computer vision engineers, I built a data architecture in Azure to manage terabytes of image data. We chose Virtual Machines over Azure App Services because they handled the API call volume better. For image processing we used RabbitMQ, letting workers process images one at a time while we tuned the worker size per queue for the different computer vision models.
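That one-at-a-time behaviour maps naturally onto RabbitMQ's prefetch setting. A sketch of such a worker, with a hypothetical queue name and the model call stubbed out:

```python
import pika

def handle(ch, method, properties, body):
    image_ref = body.decode()
    # ... run the computer vision model on image_ref ...
    ch.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="images", durable=True)
channel.basic_qos(prefetch_count=1)  # hand each worker exactly one unacked message
channel.basic_consume(queue="images", on_message_callback=handle)
channel.start_consuming()
```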
Chapter 3: Developing Digital Marketing Tools
At another organization, the focus was on developing an internal ETL tool to mitigate rising digital marketing costs, particularly for data acquisition from Facebook ads.
The company sought to create a SaaS application for its customers. While a third-party team handled the front-end development, my team, consisting of three junior data engineers, established the necessary API connections. We hosted FastAPI on Google Cloud Run to manage incoming requests and stored intermediate data in Google Cloud Storage before transferring it to other storage solutions like BigQuery and Azure Blob Storage.
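A minimal sketch of that request path, with a hypothetical bucket and endpoint, assuming payloads are staged as JSON before the downstream loads pick them up:

```python
import json
import uuid

from fastapi import FastAPI
from google.cloud import storage

app = FastAPI()
bucket = storage.Client().bucket("intermediate-data")  # hypothetical bucket name

@app.post("/ingest/{source}")
def ingest(source: str, payload: dict):
    # stage the raw payload in GCS; BigQuery / Blob Storage loads happen later
    blob = bucket.blob(f"{source}/{uuid.uuid4()}.json")
    blob.upload_from_string(json.dumps(payload), content_type="application/json")
    return {"staged_at": blob.name}
```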
I opted for a monorepo (a single git repository) so the different microservices could share helper functions, which simplified maintenance. Every connector followed the same pattern: configuration files defined the API options, and a shared data schema was enforced for data integrity.
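To make the schema-enforcement idea concrete, here is one way shared validation might look, using pydantic; the model and its fields are invented for illustration, not the actual contract:

```python
from datetime import date
from pydantic import BaseModel

class AdSpendRecord(BaseModel):
    account_id: str
    campaign: str
    day: date
    spend_eur: float

def validate_rows(raw_rows: list[dict]) -> list[AdSpendRecord]:
    # fail fast on a bad row instead of landing it downstream
    return [AdSpendRecord(**row) for row in raw_rows]
```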
Chapter 4: Health Tech AI Challenges
Currently, at a small startup, we analyze hospital data to predict the likelihood of post-surgical infections.
A significant hurdle is processing datasets that are 5-10x larger than our available resources; in some cases we must handle files of over 160GB on machines with less than 16GB of RAM. Rather than adding more RAM, we provision ample disk storage for spilling, and process the data with DuckDB, Spark, or pandas via Modin backed by Ray. Our extraction layer must integrate with varied sources, sometimes querying databases directly and sometimes receiving files with inconsistent structures, which then need specific transformations to align with our product's data contract.
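With DuckDB, for example, spilling comes down to two settings: cap memory below the machine's RAM and point the temp directory at a large disk. The file paths, column names, and query below are illustrative:

```python
import duckdb

con = duckdb.connect()
con.execute("SET memory_limit = '12GB'")            # stay under the 16GB box
con.execute("SET temp_directory = '/mnt/scratch'")  # large disk for spill files

con.execute(
    """
    COPY (
        SELECT patient_id, surgery_date, COUNT(*) AS n_events
        FROM read_parquet('/data/events_160gb.parquet')
        GROUP BY 1, 2
    ) TO '/data/aggregated.parquet' (FORMAT PARQUET)
    """
)
```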
To address this, we built a submodule of utility code for data conversions that works with both pandas and Spark, so we can switch between them based on data size. The submodule is reusable across different hospitals' datasets and runs under Mage.ai for orchestration.
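The switching logic itself can be as simple as a size check; the threshold and function below are assumptions for illustration, not our actual utility code:

```python
import os

SIZE_CUTOFF_BYTES = 2 * 1024**3  # illustrative 2GB threshold

def read_table(path: str):
    """Return a pandas DataFrame for small files, a Spark DataFrame otherwise."""
    if os.path.getsize(path) < SIZE_CUTOFF_BYTES:
        import pandas as pd
        return pd.read_parquet(path)
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()
    return spark.read.parquet(path)
```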
Conclusion: Choosing the Right Tools
When comparing AI and BI, it becomes evident that distinct toolkits are necessary. AI thrives with data lakes, while BI is more effectively supported by data warehouses or lakehouses.
This doesn’t imply that either solution cannot be utilized for both purposes, but if AI is the primary focus, constructing it from a BI perspective might not be worthwhile. Resource limitations, as faced by smaller companies, demand tailored solutions capable of managing data volumes far exceeding available resources.
Starting your data engineering journey at a consultancy is a valuable opportunity: you engage with diverse clients, each with a different data stack and unique business needs, and you build a wide range of skills along the way.
Always prioritize the most suitable tools for your current and anticipated challenges.