With Iceberg
Getting Started
Import the required libraries
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
catalog_nm = "glue_catalog"
# The Glue database that contains the source table
in_database = "<glue-database-input>"
# The input Glue table that will be used as the source for the Iceberg data
in_table_name = "covid_19_data"
# The Glue database in which the output Iceberg table will be created
database_op = "database_ib"
# The Glue table that will be used as the destination Iceberg table
table_op = "covid_dataset_iceberg"
# The S3 path in which the output Iceberg files will be stored
s3_output_path = "s3://<your-destination-bucket-name>/iceberg-output/"
table = f"{catalog_nm}.`{database_op}`.{table_op}"
print("\nINPUT Database : " + in_database)
print("\nINPUT Table : " + in_table_name)
print("\nOUTPUT Iceberg Database : " + database_op)
print("\nOUTPUT Iceberg Table : " + table)
print("\nOUTPUT Iceberg S3 Path : " + s3_output_path)
Alongside the script, we need to define an important job parameter in Glue that tells the Glue job executor to use the Iceberg table format for the output data. For this, define a job parameter named --datalake-formats and set its value to iceberg.
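You can add this key/value pair under the job parameters in the Glue console, or pass it as a default argument when creating the job programmatically. The snippet below is a minimal boto3 sketch and is not part of the ETL script itself; the job name, role, script location, and Glue version are placeholder assumptions you would replace with your own values.
import boto3

glue_client = boto3.client("glue")

# Minimal sketch: create a Glue job whose default arguments enable the Iceberg
# data lake format. Name, Role, ScriptLocation and GlueVersion are placeholders.
glue_client.create_job(
    Name="iceberg-demo-job",
    Role="<your-glue-job-role>",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://<your-script-bucket>/scripts/iceberg_job.py",
        "PythonVersion": "3",
    },
    GlueVersion="4.0",
    DefaultArguments={
        "--datalake-formats": "iceberg",
    },
)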
Define the Spark and Glue contexts
def create_spark_iceberg(catalog_nm: str = "glue_catalog"):
    """
    Initialize a Spark session configured to use Iceberg by default
    :param catalog_nm: name of the Iceberg catalog to register with Spark
    :return spark: the configured SparkSession
    """
    # You can set this as a variable if required
    warehouse_path = s3_output_path
    spark = (
        SparkSession.builder
        .config(f"spark.sql.catalog.{catalog_nm}", "org.apache.iceberg.spark.SparkCatalog")
        .config(f"spark.sql.catalog.{catalog_nm}.warehouse", warehouse_path)
        .config(f"spark.sql.catalog.{catalog_nm}.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
        .config(f"spark.sql.catalog.{catalog_nm}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
        .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
        .getOrCreate()
    )
    return spark
ibspark = create_spark_iceberg(catalog_nm)
ibsc = ibspark.sparkContext
ibglueContext = GlueContext(ibsc)
ibjob = Job(ibglueContext)
ibjob.init(args["JOB_NAME"], args)
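Before reading and writing any data, you may optionally make sure that the destination database exists in the Glue Data Catalog. The statement below is a minimal sketch that assumes the job's IAM role is allowed to create Glue databases; with the Iceberg GlueCatalog, a Spark database corresponds to a Glue database.
# Optional: create the output database through the Iceberg catalog if it does
# not already exist (assumes the job role has permission to create Glue databases)
ibspark.sql(f"CREATE DATABASE IF NOT EXISTS {catalog_nm}.{database_op}")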
Read the source Glue table and write it into a destination Iceberg table
# Read the input Glue table from the Catalog using a Glue DynamicFrame
InputDynamicFrameTable = (
ibglueContext.create_dynamic_frame
.from_catalog(database=in_database, table_name=in_table_name)
)
# Convert the Glue DynamicFrame into a Spark DataFrame
InputDynamicFrameTable_DF = InputDynamicFrameTable.toDF()
# Register the Spark DataFrame as a temporary view
InputDynamicFrameTable_DF.createOrReplaceTempView("InputDataFrameTable")
ibspark.sql("SELECT * FROM InputDataFrameTable LIMIT 10").show()
# Filter the source table for records where country is 'Australia'
colname_df = ibspark.sql("SELECT * FROM InputDataFrameTable WHERE country='Australia'")
colname_df.createOrReplaceTempView("OutputDataFrameTable")
# Write the filtered data to the destination Glue table in Iceberg format
ib_Write_SQL = f"""
CREATE OR REPLACE TABLE {catalog_nm}.{database_op}.{table_op}
USING iceberg
TBLPROPERTIES ('format-version'='2', 'write.parquet.compression-codec'='gzip')
AS SELECT * FROM OutputDataFrameTable
"""
# Run the Spark SQL statement to create the Iceberg table
ibspark.sql(ib_Write_SQL)
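After the write completes, a quick way to validate the result is to read the table back through the same Iceberg catalog. The lines below are an optional sketch: they preview a few rows, list the table's snapshots via Iceberg's metadata tables, and commit the Glue job if this is the end of the script.
# Preview a few rows from the newly created Iceberg table
ibspark.sql(f"SELECT * FROM {catalog_nm}.{database_op}.{table_op} LIMIT 10").show()
# Inspect the table's history through Iceberg's snapshots metadata table
ibspark.sql(f"SELECT committed_at, snapshot_id, operation FROM {catalog_nm}.{database_op}.{table_op}.snapshots").show(truncate=False)
# Signal successful completion of the Glue job
ibjob.commit()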