Tools
Warning
In this section, I filter out Data Engineering tools that are not dynamic and flexible enough for most Data Architectures and the Modern Data Stack.
Note
This section groups open-source tools based on the Modern Data Stack concept. For some topics, I found the tools through ReStack.
For each tool, I will focus on the following:
- Setting Up Connections
- Implementing Its Features
- Tuning & Optimization
Tools Stacks
Tool stack choices for each Data Architecture that fit the budget and are easy to implement, from small to large scale.
Tools Comparison
- Five Apache projects you probably didn’t know about
- Airflow vs. Prefect vs. Kestra — Which is Best for Building Advanced Data Pipelines?
- Converting Huge CSV Files to Parquet with Dask, DuckDB, Polars, Pandas.
- Hot Take — Apache Hudi, Delta Lake, Apache Iceberg are Divergent
Open Table
File Format
- Comparing Performance of Big Data File Formats: A Practical Guide
- https://medium.com/@turkelturk/data-file-formats-in-data-engineering-5ba0db8c2c16
- Compressing Your Data: A Comparison of Popular Algorithms
Data Ingestion
Modern Data Stack: Reverse ETL
Computing
Dataframe API
DuckDB vs Polars
Read More: Benchmarking Python Processing Engines: Who’s the Fastest?
Data Quality
https://medium.com/@brunouy/a-guide-to-open-source-data-quality-tools-in-late-2023-f9dbadbc7948