Cloud Dataflow

Fully managed stream and batch data processing, built on Apache Beam.

AWS equivalent: Kinesis Data Analytics / AWS Glue

Tags: ETL · Streaming · Batch · Apache Beam

Architecture Diagram

GCP Big Data Pipeline (streaming path):

🗄️ Data Sources (DBs, files, APIs) → 📨 Pub/Sub (real-time ingest) → ⚙️ Dataflow (ETL / transform) → 📊 BigQuery (data warehouse) → 📈 Looker Studio (dashboards)

💡 Batch alternative: GCS → Dataflow → BigQuery  | Hadoop migration path: On-prem HDFS → GCS → Dataproc → BigQuery
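
As a concrete sketch of the streaming path above: a minimal Apache Beam pipeline in Python that reads JSON events from Pub/Sub and appends them to BigQuery. All project, topic, bucket, table, and schema names here are placeholders, not real resources.

```python
# Minimal sketch of the classic streaming pipeline (Pub/Sub -> Dataflow -> BigQuery).
# Assumes the apache-beam[gcp] package; all resource names are placeholders.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    streaming=True,           # Pub/Sub is an unbounded source
    runner="DataflowRunner",  # use "DirectRunner" to test locally
    project="my-project",                # placeholder project ID
    region="us-central1",
    temp_location="gs://my-bucket/tmp",  # placeholder staging bucket
)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/sensor-events")
        | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:analytics.sensor_events",   # placeholder table
            schema="device_id:STRING,reading:FLOAT,ts:TIMESTAMP",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )
```

This is exactly the Pub/Sub → Dataflow → BigQuery shape in the diagram; swapping in DirectRunner lets you smoke-test the same code locally before deploying.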

🔄 AWS → GCP: Key Differences

  • Based on Apache Beam: write one pipeline that runs as both batch AND streaming, a unified programming model (see the sketch after this list).

  • Closer to Kinesis Data Analytics for streaming, and Glue for batch ETL — but Dataflow unifies both.

  • Fully serverless: no clusters to manage. Autoscales workers up and down based on load.
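
Here is a minimal sketch of what the unified model means in practice, assuming hypothetical topic, bucket, and field names: the enrich() transform chain is shared, and only the source binding plus the streaming flag differ between the two modes.

```python
# Sketch of Beam's unified model: the same transforms run in batch or
# streaming; only the source and the streaming flag change.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def enrich(events):
    # Shared business logic, identical for both modes.
    return (
        events
        | "DropNulls" >> beam.Filter(lambda e: e.get("reading") is not None)
        | "FlagHot" >> beam.Map(lambda e: {**e, "is_hot": e["reading"] > 100})
    )

def run(streaming: bool):
    opts = PipelineOptions(streaming=streaming)
    with beam.Pipeline(options=opts) as p:
        if streaming:  # unbounded source
            events = (p
                      | beam.io.ReadFromPubSub(
                          topic="projects/my-project/topics/sensor-events")
                      | "Decode" >> beam.Map(
                          lambda b: json.loads(b.decode("utf-8"))))
        else:          # bounded source, same downstream code
            events = (p
                      | beam.io.ReadFromText("gs://my-bucket/events/*.json")
                      | "Parse" >> beam.Map(json.loads))
        enrich(events) | "DebugPrint" >> beam.Map(print)
```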

📌 Key Concepts to Know

  1. Apache Beam: write pipelines in Python, Java, or Go. Run on Dataflow (or Spark, Flink, locally).

  2. Windowing: group streaming data by time windows (tumbling, sliding, session); see the sketch after this list.

  3. Templates: pre-built pipeline templates (Pub/Sub → BigQuery, GCS → BigQuery); no code needed.

  4. Cost: charged per vCPU-hour and GB-RAM-hour of worker execution.
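
To make windowing concrete, here is a small sketch runnable on the local DirectRunner: it assigns event timestamps and counts elements per key in 60-second tumbling (fixed) windows, with the sliding and session variants noted in comments. Sensor IDs and event times are made up.

```python
# Windowing sketch: count events per key in 60-second tumbling windows.
# Sensor IDs and event times are invented for illustration.
import apache_beam as beam
from apache_beam import window

with beam.Pipeline() as p:  # DirectRunner by default
    (
        p
        | "Create" >> beam.Create([
            ("sensor-1", 5),    # (key, event_time_in_seconds)
            ("sensor-1", 30),
            ("sensor-1", 65),   # falls into the next 60s window
            ("sensor-2", 10),
        ])
        | "AddTimestamps" >> beam.Map(
            lambda kv: window.TimestampedValue(kv[0], kv[1]))
        | "Tumbling60s" >> beam.WindowInto(window.FixedWindows(60))
        # Sliding:  beam.WindowInto(window.SlidingWindows(size=300, period=60))
        # Session:  beam.WindowInto(window.Sessions(gap_size=600))
        | "CountPerKey" >> beam.combiners.Count.PerElement()
        | "Print" >> beam.Map(print)
    )
```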

💡 DCE Interview Tips

  • Classic pipeline: Pub/Sub (ingest) → Dataflow (transform/enrich) → BigQuery (store/analyze).

  • 'For your IoT factory scenario, Dataflow processes millions of sensor events per second, applies business logic, and writes clean data to BigQuery for dashboards.'

  • Dataflow templates are a great selling point: 'You don't need to write code to set up a Pub/Sub to BigQuery pipeline — just use the pre-built template.'

⚠️ Common Gotchas

  • Dataflow has a startup time for batch jobs (a few minutes to provision workers).

  • Debugging streaming pipelines is harder than batch; use Cloud Monitoring and watermark metrics.