Apache Iceberg

What is Apache Iceberg

Apache Iceberg is a high-performance, open-source table format that brings ACID transactions to petabyte-scale SQL tables. It serves as a table layer for data lakes and is an alternative to older approaches such as Hive-style directory tables.

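Iceberg is a table format rather than a file format: it stores data in files such as Parquet, ORC, or Avro and adds a metadata layer on top of them. Compared with managing such files directly, it offers distinct advantages: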

Schema Evolution: Allows modifications to table schemas, such as adding, dropping, or renaming columns, without rewriting the entire table.
Snapshot Isolation: Readers see a consistent table snapshot while writers commit new ones, so concurrent reads and writes do not interfere with each other.
Efficient Metadata Management: Tracks data files and their statistics in metadata files, so query planning does not have to list directories across very large tables.
Partition Pruning: Automatically skips partitions that cannot match a query's filters, reducing the amount of data scanned.
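
The partition pruning idea from the list above can be illustrated with a small pure-Python sketch. It is a toy model rather than Iceberg's actual planner: the `DataFile` class, the `day_transform` and `prune` helpers, and the file paths are all hypothetical names chosen for illustration. It assumes each data file carries a day-level partition value in its metadata, so a filter on the timestamp column can rule files out without opening them.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class DataFile:
    """Toy stand-in for one metadata entry: a file path plus its partition value."""
    path: str
    event_day: date  # partition value derived from the rows' timestamps


def day_transform(ts: datetime) -> date:
    """Map a timestamp to its day, mimicking a day-granularity partition transform."""
    return ts.date()


def prune(files: list[DataFile], start: datetime, end: datetime) -> list[DataFile]:
    """Keep only files whose partition day can contain rows in [start, end]."""
    first_day, last_day = day_transform(start), day_transform(end)
    return [f for f in files if first_day <= f.event_day <= last_day]


# Hypothetical table contents: one file per day.
files = [
    DataFile("data/a.parquet", date(2024, 1, 1)),
    DataFile("data/b.parquet", date(2024, 1, 2)),
    DataFile("data/c.parquet", date(2024, 1, 3)),
]

# The query filters on the timestamp itself; the sketch converts it to days.
selected = prune(files, datetime(2024, 1, 2, 8, 0), datetime(2024, 1, 2, 20, 0))
print([f.path for f in selected])  # ['data/b.parquet']
```

Only the metadata is consulted here, which is the point: the two files for other days are never read.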

Note

PyIceberg is a Python implementation for accessing Iceberg tables without the need for a JVM.

Getting Started

Basic Architecture

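Table Format: Each snapshot points to a manifest list, which references manifest files that track individual data files and their statistics, so large tables can be planned from metadata alone.
Snapshot Management: Each table maintains a history of snapshots, enabling time travel and rollback capabilities.
Partitioning: Uses hidden partitioning, in which partition values are derived from column values (for example, the day of a timestamp), so users filter on the original columns and do not manage partition columns themselves.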
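
To make the snapshot history described above more concrete, here is a minimal pure-Python sketch of time travel and rollback. It is a simplified model under stated assumptions, not Iceberg's implementation: `ToyTable`, `Snapshot`, `commit_append`, `scan`, and `rollback` are hypothetical names, snapshot IDs are small sequential integers for readability, and each snapshot holds a flat tuple of file names instead of a manifest list.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    """Immutable view of the table: the full set of files visible at one commit."""
    snapshot_id: int
    files: tuple[str, ...]


@dataclass
class ToyTable:
    snapshots: list[Snapshot] = field(default_factory=list)
    current_id: Optional[int] = None  # which snapshot readers see by default

    def scan(self, snapshot_id: Optional[int] = None) -> list[str]:
        """List files visible at a snapshot (the current one if none is given)."""
        target = self.current_id if snapshot_id is None else snapshot_id
        if target is None:
            return []  # empty table, nothing committed yet
        for snap in self.snapshots:
            if snap.snapshot_id == target:
                return list(snap.files)
        raise KeyError(f"unknown snapshot {target}")

    def commit_append(self, new_files: list[str]) -> int:
        """Create a new snapshot on top of the current one; old ones are kept."""
        snap = Snapshot(len(self.snapshots) + 1, tuple(self.scan()) + tuple(new_files))
        self.snapshots.append(snap)
        self.current_id = snap.snapshot_id
        return snap.snapshot_id

    def rollback(self, snapshot_id: int) -> None:
        """Point the table back at an earlier snapshot without deleting history."""
        self.scan(snapshot_id)  # raises KeyError if the snapshot does not exist
        self.current_id = snapshot_id


table = ToyTable()
s1 = table.commit_append(["a.parquet"])
s2 = table.commit_append(["b.parquet"])

print(table.scan())    # ['a.parquet', 'b.parquet']
print(table.scan(s1))  # time travel: ['a.parquet']

table.rollback(s1)
print(table.scan())    # ['a.parquet']
```

Because snapshots are never modified after they are written, a reader holding an older snapshot keeps seeing the same files while new commits land, which is the basis for snapshot isolation in this model.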

Use Cases

Financial Services: Suited to large-scale transactional data that requires ACID guarantees.
E-commerce Analytics: Manages large volumes of user data for analytics and personalized recommendations.

Examples

https://medium.com/@MarinAgli1/learning-apache-iceberg-storing-the-data-to-minio-s3-56670cef199d