Welcome to Data Develop & Engineer
Disclaimer: These docs reflect my own opinions, drawn from about 5 years of Data Engineering experience and experiments (since 2019).
Important
My English grammar is not perfect because I am still at an intermediate level and practicing my writing and reading. Please bear with this and keep an open mind before continuing with these documents.
This project collects the practices and knowledge of the Data Developer and Data Engineer field.
Getting Started
First, Data Engineering is a critical part of the Data Lifecycle that enables organizations to manage and process large volumes of data efficiently and reliably [3].
Following these concepts, a Data Engineer should design and implement data pipelines and a data management strategy that meet the requirements and KPIs of the organization, and ensure that its data is managed consistently and reliably.
What does a DE do?
A Data Engineer is someone who can develop, implement, operate, and maintain the tools on the data infrastructure that your organization currently uses, whether on-premises or on cloud providers, comprising databases, storage, compute engines, and pipelines [1].
Fundamentals of Data Engineering
Data Engineering is the development, implementation, and maintenance of systems and processes that take in raw data and produce high-quality, consistent information that supports downstream use cases, such as analysis and machine learning.
Data engineering is the intersection of security, data management, DataOps, data architecture, orchestration, and software engineering.
A Data Engineer manages the Data Engineering Lifecycle, beginning with getting data from source systems and ending with serving data for use cases, such as analysis or machine learning.
— Joe Reis and Matt Housley in Fundamentals of Data Engineering
You will see that the stages of the lifecycle include Data Ingestion, Data Transformation, Data Serving, and Data Storage components.
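As a rough illustration of those stages, the sketch below chains ingestion, transformation, storage, and serving into one tiny in-memory pipeline. The function names (`ingest`, `transform`, `store`, `serve`) and the sample records are hypothetical and exist only for this example; a real pipeline would read from source systems and write to a database or object store.

```python
from collections import defaultdict

# Hypothetical raw records standing in for a source system.
RAW_EVENTS = [
    {"user": "a", "amount": "10.5"},
    {"user": "b", "amount": "3"},
    {"user": "a", "amount": "4.5"},
]


def ingest(source):
    """Ingestion: pull raw records from a source system."""
    return list(source)


def transform(records):
    """Transformation: cast types and aggregate the amount per user."""
    totals = defaultdict(float)
    for record in records:
        totals[record["user"]] += float(record["amount"])
    return dict(totals)


def store(table, storage):
    """Storage: persist the transformed result (here, in a plain dict)."""
    storage["user_totals"] = table
    return storage


def serve(storage, user):
    """Serving: answer a downstream question from the stored data."""
    return storage["user_totals"].get(user, 0.0)


storage = store(transform(ingest(RAW_EVENTS)), {})
print(serve(storage, "a"))
```

Keeping each stage as its own function makes it easier to swap one stage (for example, the storage layer) without touching the others.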
| Best practice | Importance |
|---|---|
| Proactive data monitoring | Regularly checks datasets for anomalies to maintain data integrity. This includes identifying missing, duplicate, or inconsistent data entries. |
| Schema drift management | Detects and addresses changes in data structure, ensuring compatibility and reducing data pipeline breaks. |
| Continuous documentation | Manages descriptive information about data, aiding in discoverability and comprehension. |
| Data security measures | Controls and monitors access to data sources, enhancing security and compliance. |
| Version control and backups | Tracks changes to datasets over time, aiding in reproducibility and audit trails. |
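The first two practices in the table can be sketched with a few lines of plain Python. This is a minimal example, not a production monitoring tool: `EXPECTED_SCHEMA`, the helper names, and the sample rows are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical expected schema: column name -> Python type.
EXPECTED_SCHEMA = {"id": int, "email": str}


def find_missing(rows, columns):
    """Return indexes of rows where a required column is absent or None."""
    return [
        i for i, row in enumerate(rows)
        if any(row.get(col) is None for col in columns)
    ]


def find_duplicates(rows, key):
    """Return key values that appear in more than one row."""
    counts = Counter(row.get(key) for row in rows)
    return sorted(v for v, n in counts.items() if n > 1 and v is not None)


def detect_schema_drift(row, expected):
    """Compare one row against the expected schema."""
    added = sorted(set(row) - set(expected))
    removed = sorted(set(expected) - set(row))
    retyped = sorted(
        col for col in set(row) & set(expected)
        if row[col] is not None and not isinstance(row[col], expected[col])
    )
    return {"added": added, "removed": removed, "retyped": retyped}


rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},             # missing value
    {"id": 1, "email": "c@example.com"},  # duplicate id
]
missing = find_missing(rows, EXPECTED_SCHEMA)
duplicates = find_duplicates(rows, "id")
drift = detect_schema_drift({"id": "3", "signup_date": "2024-01-01"}, EXPECTED_SCHEMA)
print(missing, duplicates, drift)
```

In practice these checks would run on a schedule against each new batch, and a non-empty result would raise an alert before the data reaches downstream users.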
Since I started in this role, I have been thinking about the future of my responsibilities. I know that Data Engineering tools shift very fast: in the last three years I started with MapReduce processing on Hadoop HDFS, but nowadays the work has changed to in-memory processing with engines like Impala or Spark. The knowledge I gained from MapReduce will be wasted.
The picture on the right, the 2023 MAD (ML/AI/Data) Landscape [2], shows how many tools you could possibly use on your project. It covers many areas, and you should choose the tools that match your current architecture or fit your cost planning model.
Finally, the diagram below shows how the focus areas of Data Engineering shift as the analytics organization evolves. This means a Data Engineer does not only build the data ingestion or serving parts. As data engineering tools change very quickly, the focus of data engineers changes as well.
Based upon this illustration, we can observe three distinct focus areas for the role:
- Data Infrastructure: One example of a problem being solved in this instance might be setting up a Spark cluster for users to issue HQL queries against data on S3.
- Data Integration: An example task would be creating a dataset via SQL query, joining tens of other datasets, and then scheduling the query to run daily using the orchestration framework.
- Data Accessibility: An example could be enabling end-users to analyze significant metrics movements in a self-serve manner.
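The Data Integration task above can be sketched with the standard-library `sqlite3` module. This is only an illustration under assumed names: the tables (`customers`, `orders`, `daily_revenue`) and the function `build_daily_revenue` are hypothetical, and a real setup would join many more datasets in a warehouse and let an orchestration framework trigger the function once per day.

```python
import sqlite3


def build_daily_revenue(conn, run_date):
    """Rebuild one day's rows of the derived dataset (safe to rerun)."""
    # Delete first so a rerun for the same date does not duplicate rows.
    conn.execute("DELETE FROM daily_revenue WHERE order_date = ?", (run_date,))
    conn.execute(
        """
        INSERT INTO daily_revenue (order_date, country, revenue)
        SELECT o.order_date, c.country, SUM(o.amount)
        FROM orders AS o
        JOIN customers AS c ON c.id = o.customer_id
        WHERE o.order_date = ?
        GROUP BY o.order_date, c.country
        """,
        (run_date,),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE orders (customer_id INTEGER, order_date TEXT, amount REAL);
    CREATE TABLE daily_revenue (order_date TEXT, country TEXT, revenue REAL);
    INSERT INTO customers VALUES (1, 'TH'), (2, 'JP');
    INSERT INTO orders VALUES (1, '2024-01-01', 10.0), (1, '2024-01-01', 5.0),
                              (2, '2024-01-01', 7.0), (2, '2024-01-02', 1.0);
    """
)

# A scheduler would call this once per day; calling it twice for the same
# date shows that the rerun leaves the result unchanged.
build_daily_revenue(conn, "2024-01-01")
build_daily_revenue(conn, "2024-01-01")
result = conn.execute(
    "SELECT country, revenue FROM daily_revenue ORDER BY country"
).fetchall()
print(result)
```

Making the daily job idempotent (delete, then insert for the run date) is a design choice that lets the scheduler retry a failed run without corrupting the dataset.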
The Modern Data Stack trend will make data engineering processes so easy to implement and maintain that you will have time to focus on business problems instead of technical problems.
On the other side, business users can interact with the served data in their data contract platform with less technical knowledge. This greatly reduces the support they need from data engineers under an SLA!
You can follow the Modern Data Stack in the topics below:
Roles
In the future, if I do not fall in love with the communication and management skills needed to become a Lead Data Engineer, I will move to a specialized role such as:
- Data Platform Engineer
- DataOps Engineer
- MLOps Engineer: MLOps Engineers build and maintain a platform to enable the development and deployment of machine learning models. They typically do that through standardization, automation, and monitoring. MLOps Engineers iterate on the platform and processes to make machine learning model development and deployment quicker, more reliable, reproducible, and efficient.
- Analytics Engineer: An Analytics Engineer is someone who makes sure that companies can understand their data and use it to solve problems, answer questions, or make decisions.
The roles above are referenced from Types of Data Professionals [4].
Communities
Below is a list of communities that you should join to keep your knowledge of Developer and Data Engineer trends up to date.
- The Medium tag for Data Engineering knowledge and solutions
- A discussion blog for Data Engineers that feels like talking to a close friend at a cafe
- The Medium group that believes software development should be joyful and advocates deliberate practice
- Community-driven roadmaps, articles, and resources for developers in Thailand
- Learn to build high-quality web apps with best practices
- The Data Engineering documentation website that inspired me

- The information in this quote is referenced from What is Data Engineering?
- Unlocking the Power of Data: A Beginner’s Guide to Data Engineering
- Types of Data Professionals, credit to Kevin Rosamont Prombo for creating the infographic