Data Engineering Hierarchy of Needs
2020-07-16, Bay Area Meetup. Sessions start Thursday, July 16, 9am PDT.

Data Infrastructure looks different across small, mid-sized, and large companies. Yet most content out there covers large, sophisticated systems, and almost none of it covers migrating legacy, on-prem databases to the cloud.

We'll begin with the fundamentals of building a modern Data Infrastructure from the ground up through a hierarchy of needs. The hierarchy has seven (subjective) levels, ranging from Automation to Streaming.


Different Companies Have Different Data Infra Needs

  1. Small-Sized Companies
  2. Mid-Sized Companies
  3. Large Companies

The Hierarchy

[Figure: the Data Engineering Hierarchy of Needs]

  1. Automate - Move from scripts and manual processes to transparent ETL software, e.g. Airflow (see the DAG sketch after this list).

  2. Extract - Without Extraction, there is no data.

  3. Load - Storage is cheap. Data Loss is expensive. Load first.

  4. Transform - Only SQL allowed, for maintainability reasons.

  5. Optimize - Scale heavy transformations with Spark (see the Spark sketch after this list).

  6. Machine Learning - Integrate ML so automation runs from ingestion through modeling. Use Airflow.

  7. Streaming - Near/real-time data and transactions; see the streaming half of the Spark sketch below.
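
To make the first few levels concrete, here is a minimal sketch of an Airflow DAG (Airflow 1.10-style imports, current as of this talk) that extracts, loads first, and only then transforms. The dag_id, table names, task callables, and schedule are hypothetical placeholders, not code from the talk.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def extract_orders():
    # Level 2: pull raw rows out of the source system.
    # Connection details are left out; this is illustrative only.
    print("extracting raw orders")


def load_orders():
    # Level 3: land the raw extract in cheap storage *before* transforming,
    # so a bad transform never costs you the source data.
    print("loading raw orders into the lake")


def transform_orders():
    # Level 4: kick off a SQL-only transform in the warehouse.
    print("running the SQL transform")


default_args = {
    "owner": "data-eng",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="orders_elt",  # hypothetical pipeline name
    default_args=default_args,
    start_date=datetime(2020, 7, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_orders)
    load = PythonOperator(task_id="load", python_callable=load_orders)
    transform = PythonOperator(task_id="transform", python_callable=transform_orders)

    # Level 1: the scheduler, retries, and this explicit dependency chain
    # replace cron scripts and manual runs.
    extract >> load >> transform
```

Level 6 lands in the same place later: model-training tasks simply chain onto transform in the same DAG.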

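For levels 5 and 7, a rough PySpark sketch of one aggregation written twice: once as a batch job and once with Structured Streaming. The S3 paths, column names, and schema are invented for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders_optimize").getOrCreate()

# Level 5 (Optimize): push a heavy transform into Spark once the
# warehouse stops keeping up. Paths and columns are hypothetical.
orders = spark.read.parquet("s3://example-lake/raw/orders/")
daily_totals = (
    orders.groupBy(F.to_date("created_at").alias("order_date"))
    .agg(F.sum("amount").alias("total_amount"))
)
daily_totals.write.mode("overwrite").parquet("s3://example-lake/marts/daily_totals/")

# Level 7 (Streaming): the near-real-time version of the same job,
# watching the raw directory with Spark Structured Streaming.
live_totals = (
    spark.readStream.schema(orders.schema)
    .parquet("s3://example-lake/raw/orders/")
    .groupBy(F.to_date("created_at").alias("order_date"))
    .agg(F.sum("amount").alias("total_amount"))
)
query = (
    live_totals.writeStream.outputMode("complete")
    .format("memory")  # in-memory sink, for demo purposes only
    .queryName("daily_totals_live")
    .start()
)
# query.awaitTermination()  # block here in a real job
```
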
I am a first-generation white-collar worker and Salvadoran immigrant. I am currently a Data Engineer working on automating ETL pipelines for Data Analysts/Scientists. My focus is reproducible data infrastructure as code that is easy to stand up and troubleshoot.