2020-07-16, Bay Area Meetup [Sessions start: Thursday 16.07 9am (Thursday 16.07 9am PDT)]
Data infrastructure looks different at small, mid-sized, and large companies. Yet most content out there covers large, sophisticated systems, and almost none of it covers migrating legacy on-prem databases to the cloud.
We'll begin with the fundamentals of building a modern data infrastructure from the ground up through a hierarchy of needs. The hierarchy has seven (subjective) levels, ranging from Automation to Data Streaming.
Different Companies have Different Data Infra Needs
- Small-Sized Companies
- Mid-Sized Companies
- Large Companies
The Hierarchy
- Automate - Move from scripts and manual processes to transparent ETL software, e.g. Airflow.
- Extract - Without extraction, there is no data.
- Load - Storage is cheap. Data loss is expensive. Load first.
- Transform - Only SQL allowed, for maintainability.
- Optimize - Spark.
- Machine Learning - Integrate ML for automation from ingestion to modeling. Use Airflow.
- Streaming - Near-real-time and real-time data and transactions.
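As a rough illustration of the Extract, Load, and Transform levels, here is a minimal sketch using only Python's standard library. It is not the talk's actual pipeline: the sample CSV, the table names (raw_orders, customer_totals), and the function names are all hypothetical, and an in-memory SQLite database stands in for a cloud warehouse. The point it tries to show is the ordering: raw data is loaded untouched first, and the only transformation logic lives in SQL.

```python
import csv
import io
import sqlite3

# Hypothetical raw export standing in for a legacy on-prem source.
RAW_CSV = """order_id,customer,amount
1,acme,10.50
2,acme,4.50
3,globex,20.00
"""

# Transform step: plain SQL only, run against the already-loaded raw table.
TRANSFORM_SQL = """
CREATE TABLE customer_totals AS
SELECT customer, SUM(CAST(amount AS REAL)) AS total_amount
FROM raw_orders
GROUP BY customer
"""


def extract(raw_text):
    """Extract: parse the raw export into rows without changing any values."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def load(conn, rows):
    """Load: store rows as-is (all text) before any transformation happens."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_orders "
        "(order_id TEXT, customer TEXT, amount TEXT)"
    )
    conn.executemany(
        "INSERT INTO raw_orders VALUES (:order_id, :customer, :amount)", rows
    )
    conn.commit()


def transform(conn):
    """Transform: rebuild the derived table from the raw table using SQL."""
    conn.execute("DROP TABLE IF EXISTS customer_totals")
    conn.execute(TRANSFORM_SQL)
    conn.commit()


def run_pipeline(raw_text=RAW_CSV):
    """Run extract, load, transform in order; an in-memory DB is assumed."""
    conn = sqlite3.connect(":memory:")
    load(conn, extract(raw_text))
    transform(conn)
    return conn


if __name__ == "__main__":
    conn = run_pipeline()
    query = "SELECT customer, total_amount FROM customer_totals ORDER BY customer"
    for row in conn.execute(query):
        print(row)
```

Because the raw table is kept intact, the derived table can be dropped and rebuilt whenever the SQL changes, without re-extracting from the source. In a setup like the one the Automate level describes, each of these functions could become a task in a scheduler such as Airflow.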
I am a first-generation white-collar worker and Salvadoran immigrant. I am currently a Data Engineer working on automating ETL pipelines for Data Analysts and Data Scientists. My focus is reproducible data infrastructure as code that is easy to stand up and troubleshoot.