Cloud Dataflow

Fully managed stream and batch data processing, built on Apache Beam.

AWS equivalent: Kinesis Data Analytics / AWS Glue

Tags: ETL · Streaming · Batch · Apache Beam

Architecture Diagram

GCP Big Data Pipeline (streaming path):

🗄️ Data Sources (DBs, files, APIs) → 📨 Pub/Sub (real-time ingest) → ⚙️ Dataflow (ETL / transform) → 📊 BigQuery (data warehouse) → 📈 Looker Studio (dashboards)

💡 Batch alternative: GCS → Dataflow → BigQuery  | Hadoop migration path: On-prem HDFS → GCS → Dataproc → BigQuery
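
As a concrete sketch of the streaming path above: a minimal Apache Beam pipeline in Python that reads JSON events from Pub/Sub and appends them to BigQuery. All project, topic, bucket, table, and schema names here are placeholders, not real resources.

```python
# Minimal sketch of the classic streaming pipeline (Pub/Sub -> Dataflow -> BigQuery).
# Assumes the apache-beam[gcp] package; all resource names are placeholders.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    streaming=True,           # Pub/Sub is an unbounded source
    runner="DataflowRunner",  # use "DirectRunner" to test locally
    project="my-project",                # placeholder project ID
    region="us-central1",
    temp_location="gs://my-bucket/tmp",  # placeholder staging bucket
)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/sensor-events")
        | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:analytics.sensor_events",   # placeholder table
            schema="device_id:STRING,reading:FLOAT,ts:TIMESTAMP",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )
```

This is exactly the Pub/Sub → Dataflow → BigQuery shape in the diagram; swapping in DirectRunner lets you smoke-test the same code locally before deploying.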

🔄 AWS → GCP: Key Differences

  • Based on Apache Beam: write one pipeline that runs as both batch AND streaming, a unified programming model (see the sketch after this list).

  • Closer to Kinesis Data Analytics for streaming, and Glue for batch ETL — but Dataflow unifies both.

  • Fully serverless: no clusters to manage. Autoscales workers up and down based on load.
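
Here is a minimal sketch of what the unified model means in practice, assuming hypothetical topic, bucket, and field names: the enrich() transform chain is shared, and only the source binding plus the streaming flag differ between the two modes.

```python
# Sketch of Beam's unified model: the same transforms run in batch or
# streaming; only the source and the streaming flag change.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def enrich(events):
    # Shared business logic, identical for both modes.
    return (
        events
        | "DropNulls" >> beam.Filter(lambda e: e.get("reading") is not None)
        | "FlagHot" >> beam.Map(lambda e: {**e, "is_hot": e["reading"] > 100})
    )

def run(streaming: bool):
    opts = PipelineOptions(streaming=streaming)
    with beam.Pipeline(options=opts) as p:
        if streaming:  # unbounded source
            events = (p
                      | beam.io.ReadFromPubSub(
                          topic="projects/my-project/topics/sensor-events")
                      | "Decode" >> beam.Map(
                          lambda b: json.loads(b.decode("utf-8"))))
        else:          # bounded source, same downstream code
            events = (p
                      | beam.io.ReadFromText("gs://my-bucket/events/*.json")
                      | "Parse" >> beam.Map(json.loads))
        enrich(events) | "DebugPrint" >> beam.Map(print)
```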

📌 Key Concepts to Know

  1. Apache Beam: write pipelines in Python, Java, or Go. Run on Dataflow (or Spark, Flink, locally).

  2. Windowing: group streaming data by time windows (tumbling, sliding, session); see the sketch after this list.

  3. Templates: pre-built pipeline templates (Pub/Sub → BigQuery, GCS → BigQuery); no code needed.

  4. Cost: charged per vCPU-hour and GB-RAM-hour of worker execution.
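
To make windowing concrete, here is a small sketch runnable on the local DirectRunner: it assigns event timestamps and counts elements per key in 60-second tumbling (fixed) windows, with the sliding and session variants noted in comments. Sensor IDs and event times are made up.

```python
# Windowing sketch: count events per key in 60-second tumbling windows.
# Sensor IDs and event times are invented for illustration.
import apache_beam as beam
from apache_beam import window

with beam.Pipeline() as p:  # DirectRunner by default
    (
        p
        | "Create" >> beam.Create([
            ("sensor-1", 5),    # (key, event_time_in_seconds)
            ("sensor-1", 30),
            ("sensor-1", 65),   # falls into the next 60s window
            ("sensor-2", 10),
        ])
        | "AddTimestamps" >> beam.Map(
            lambda kv: window.TimestampedValue(kv[0], kv[1]))
        | "Tumbling60s" >> beam.WindowInto(window.FixedWindows(60))
        # Sliding:  beam.WindowInto(window.SlidingWindows(size=300, period=60))
        # Session:  beam.WindowInto(window.Sessions(gap_size=600))
        | "CountPerKey" >> beam.combiners.Count.PerElement()
        | "Print" >> beam.Map(print)
    )
```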

💡 DCE Interview Tips

  • Classic pipeline: Pub/Sub (ingest) → Dataflow (transform/enrich) → BigQuery (store/analyze).

  • 'For your IoT factory scenario, Dataflow processes millions of sensor events per second, applies business logic, and writes clean data to BigQuery for dashboards.'

  • Dataflow templates are a great selling point: 'You don't need to write code to set up a Pub/Sub to BigQuery pipeline — just use the pre-built template.'

⚠️ Common Gotchas

  • Dataflow has a startup time for batch jobs (a few minutes to provision workers).

  • Debugging streaming pipelines is harder than batch; use Cloud Monitoring and watermark metrics.