Handling Media Files

PySpark provides several APIs for working with image, audio, and video files. In this article we discuss some ways to handle these files in PySpark.

Basic Features

These are just basic ways to handle these files. Depending on the specific use case, you may need to perform additional operations such as resizing images, extracting audio features, or processing video frames.

Image Files

from pyspark.ml.image import ImageSchema
from PIL import Image

# Read the image file locally with Pillow (e.g. for inspection or preprocessing)
image = Image.open("path/to/image.jpg")

# Read the image into a PySpark DataFrame
# (ImageSchema.readImages works on Spark 2.x; it was removed in Spark 3.0)
df = ImageSchema.readImages("path/to/image.jpg")
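
Note that ImageSchema.readImages was removed in Spark 3.0. On Spark 2.4 and later, the built-in image data source can be used instead; a minimal sketch:

# Read images with the built-in image data source (Spark 2.4+)
df = spark.read.format("image").load("path/to/image.jpg")

# Each row contains an "image" struct with origin, height, width, nChannels, mode, and data
df.select("image.origin", "image.height", "image.width").show(truncate=False)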

Audio Files

import io

from pyspark.sql.functions import udf
from pyspark.sql.types import BinaryType
from pydub import AudioSegment

# Define a UDF that decodes the raw file bytes and re-encodes them as WAV
@udf(returnType=BinaryType())
def read_audio_file(content):
    audio = AudioSegment.from_file(io.BytesIO(content), format="mp3")
    return audio.export(format="wav").read()

# Read the audio file as binary and convert it with the UDF
df = (
    spark.read.format("binaryFile")
        .load("path/to/audio.mp3")
        .select("path", read_audio_file("content").alias("audio_data"))
)

Video Files

import tempfile

import cv2
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, BinaryType

# Define a UDF that decodes the raw video bytes into a list of frames
@udf(returnType=ArrayType(BinaryType()))
def read_video_file(content):
    # OpenCV needs a file path, so write the bytes to a temporary file first
    with tempfile.NamedTemporaryFile(suffix=".mp4") as tmp:
        tmp.write(content)
        tmp.flush()
        cap = cv2.VideoCapture(tmp.name)
        frames = []
        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break
            frames.append(frame.tobytes())
        cap.release()
    return frames

# Read the video file as binary and extract its frames with the UDF
df = (
    spark.read.format("binaryFile")
        .load("path/to/video.mp4")
        .select("path", read_video_file("content").alias("video_data"))
)

Additional Features Provided by the PySpark API

In addition to reading and converting image, audio, and video files to PySpark DataFrames, there are several other operations that you can perform on these files in PySpark.

Image Files

  • Resize images
from pyspark.ml.image import ImageSchema
from PIL import Image

# Read image file
image = Image.open("path/to/image.jpg")

# Resize image and save it
resized_image = image.resize((224, 224))
resized_image.save("path/to/image_resized.jpg")

# Convert the resized image to a PySpark DataFrame
df = ImageSchema.readImages("path/to/image_resized.jpg")
  • Convert images to different formats
from pyspark.ml.image import ImageSchema
from PIL import Image

# Read image file
image = Image.open("path/to/image.jpg")

# Convert to PNG format
image.save("path/to/image.png")

# Convert to PySpark DataFrame
df = ImageSchema.readImages("path/to/image.png")
  • Extract image features
import numpy as np
from pyspark.ml.image import ImageSchema
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, FloatType
from keras.applications.vgg16 import VGG16, preprocess_input

# Read image file
df = ImageSchema.readImages("path/to/image.jpg")

# Load pre-trained VGG16 model
# (on a real cluster, consider loading the model inside the UDF or broadcasting its weights)
model = VGG16(weights="imagenet", include_top=False, pooling="avg")

# Define a UDF that preprocesses the raw image bytes and runs them through VGG16
@udf(returnType=ArrayType(FloatType()))
def extract_features(data, height, width, n_channels):
    img = np.frombuffer(data, dtype=np.uint8).reshape((height, width, n_channels))
    img = preprocess_input(img.astype("float32")[np.newaxis, ...])
    return model.predict(img).flatten().tolist()

# Extract image features
df = df.select(
    "image.origin",
    extract_features("image.data", "image.height", "image.width", "image.nChannels").alias("features"),
)

Audio Files

  • Extract audio features
import io

import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType
from pydub import AudioSegment
from pyAudioAnalysis import audioFeatureExtraction

# Define a UDF to extract MFCC features from the raw audio bytes
@udf(returnType=ArrayType(DoubleType()))
def extract_mfcc_features(audio_data):
    # Decode the WAV bytes, downmix to mono, and get the sample array
    audio = AudioSegment.from_file(io.BytesIO(audio_data), format="wav").set_channels(1)
    signal = np.array(audio.get_array_of_samples())
    fs = audio.frame_rate
    # Short-term feature extraction (50 ms window, 25 ms step)
    features, _ = audioFeatureExtraction.stFeatureExtraction(
        signal, fs, int(0.050 * fs), int(0.025 * fs)
    )
    # Average the 13 MFCC rows (indices 8-20) over all frames
    return features[8:21].mean(axis=1).tolist()

# Read audio file and convert to PySpark DataFrame
df = (
    spark.read.format("binaryFile")
        .load("path/to/audio.wav")
        .selectExpr("path", "content")
)

# Extract MFCC features
df = df.select("path", extract_mfcc_features("content").alias("features"))
  • Convert audio files to different formats
import io

from pyspark.sql.functions import udf
from pyspark.sql.types import BinaryType
from pydub import AudioSegment

# Define a UDF to convert raw WAV bytes to MP3 format
@udf(returnType=BinaryType())
def convert_to_mp3(audio_data):
    audio = AudioSegment.from_file(io.BytesIO(audio_data), format="wav")
    return audio.export(format="mp3").read()

# Read audio file and convert to PySpark DataFrame
df = spark.read.format("binaryFile").load("path/to/audio.wav").selectExpr("path", "content")

# Convert to MP3 format
df = df.select("path", convert_to_mp3("content").alias("audio_data"))
  • Remove noise from audio files

You can use techniques such as bandpass filtering, low-pass filtering, or high-pass filtering to remove noise from audio files.
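
For example, a minimal sketch (assuming WAV input and pydub's built-in high_pass_filter / low_pass_filter methods) that applies a simple band-pass style cleanup:

import io

from pyspark.sql.functions import udf
from pyspark.sql.types import BinaryType
from pydub import AudioSegment

# Define a UDF that applies a high-pass filter followed by a low-pass filter
@udf(returnType=BinaryType())
def remove_noise(audio_data):
    audio = AudioSegment.from_file(io.BytesIO(audio_data), format="wav")
    # Cut-off frequencies are illustrative; tune them to the noise profile
    filtered = audio.high_pass_filter(100).low_pass_filter(8000)
    return filtered.export(format="wav").read()

# Read audio file and apply the noise filter
df = spark.read.format("binaryFile").load("path/to/audio.wav").selectExpr("path", "content")
df = df.select("path", remove_noise("content").alias("audio_data"))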

Video Files

  • Extract video frames
import tempfile

import cv2
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, BinaryType

# Define a UDF to extract video frames from the raw video bytes
@udf(returnType=ArrayType(BinaryType()))
def extract_video_frames(video_data):
    # OpenCV needs a file path, so write the bytes to a temporary file first
    with tempfile.NamedTemporaryFile(suffix=".mp4") as tmp:
        tmp.write(video_data)
        tmp.flush()
        cap = cv2.VideoCapture(tmp.name)
        frames = []
        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frames.append(frame.tobytes())
        cap.release()
    return frames

# Read video file and convert to PySpark DataFrame
df = spark.read.format("binaryFile").load("path/to/video.mp4").selectExpr("path", "content")

# Extract video frames
df = df.select("path", extract_video_frames("content").alias("frames"))
  • Apply video filters
import numpy as np
from pyspark.sql.functions import explode, udf
from pyspark.sql.types import BinaryType
from PIL import Image, ImageFilter

# Define a UDF to apply a Gaussian blur filter to a single raw RGB frame
# (assumes 640x480 frames, i.e. the byte layout produced by extract_video_frames above)
@udf(returnType=BinaryType())
def apply_gaussian_blur(frame_data):
    # Convert bytes to NumPy array
    frame = np.frombuffer(frame_data, dtype=np.uint8).reshape((480, 640, 3))

    # Apply Gaussian blur filter
    img = Image.fromarray(frame)
    img = img.filter(ImageFilter.GaussianBlur(radius=5))
    frame = np.asarray(img)

    # Convert back to bytes
    return frame.tobytes()

# Read video file and extract its frames (one row per frame),
# reusing the extract_video_frames UDF defined above
df = spark.read.format("binaryFile").load("path/to/video.mp4").selectExpr("path", "content")
df = df.select("path", explode(extract_video_frames("content")).alias("frame"))

# Apply Gaussian blur filter to video frames
df = df.select("path", apply_gaussian_blur("frame").alias("frame_data"))
  • Perform object detection
import cv2
import numpy as np
from pyspark.sql.functions import explode, udf
from pyspark.sql.types import BinaryType
from tensorflow.keras.models import load_model

# Load pre-trained object detection model
model = load_model("path/to/object_detection_model.h5")

# Define a UDF to perform object detection on a single raw RGB frame
# (assumes 640x480 frames, as produced by extract_video_frames above)
@udf(returnType=BinaryType())
def perform_object_detection(frame_data):
    # Convert bytes to a writable NumPy array
    frame = np.frombuffer(frame_data, dtype=np.uint8).reshape((480, 640, 3)).copy()

    # Perform object detection (the exact inference call and result format depend
    # on the model; a detect() helper returning {"box": ...} dicts is assumed here)
    detections = model.detect(frame)

    # Draw bounding boxes on the frame
    for detection in detections:
        x, y, w, h = detection["box"]
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)

    # Convert back to bytes
    return frame.tobytes()

# Read video file and extract its frames (one row per frame),
# reusing the extract_video_frames UDF defined above
df = spark.read.format("binaryFile").load("path/to/video.mp4").selectExpr("path", "content")
df = df.select("path", explode(extract_video_frames("content")).alias("frame"))

# Perform object detection on video frames
df = df.select("path", perform_object_detection("frame").alias("frame_data"))

Read More

  • https://blog.devgenius.io/handling-media-files-in-pyspark-image-audio-video-files-8e3bcd7a5c4e