Big Data
Cloud Dataflow
Fully managed stream and batch data processing — Apache Beam
AWS equivalent
Kinesis Data Analytics / AWS Glue
Architecture Diagram: GCP Big Data Pipeline
💡 Batch alternative: GCS → Dataflow → BigQuery | Hadoop migration path: On-prem HDFS → GCS → Dataproc → BigQuery
AWS → GCP: Key Differences
- ▸ Based on Apache Beam: write one pipeline that runs as both batch AND streaming, a unified programming model (see the sketch after this list).
- ▸ Closest AWS analogs are Kinesis Data Analytics (now Amazon Managed Service for Apache Flink) for streaming and AWS Glue for batch ETL, but Dataflow unifies both.
- ▸ Fully serverless: no clusters to manage; workers autoscale up and down with load.
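A minimal sketch of the unified model, assuming the Python Beam SDK and a hypothetical bucket `my-bucket`: the transforms below run as a batch job, and swapping only the source for Pub/Sub would run the same logic as a streaming job.

```python
# Unified-model sketch (bucket and paths are hypothetical).
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run():
    # The default DirectRunner executes locally; pass --runner=DataflowRunner
    # (plus --project/--region/--temp_location) to run on Dataflow instead.
    with beam.Pipeline(options=PipelineOptions()) as p:
        (
            p
            # Batch source. Swapping this for beam.io.ReadFromPubSub(...)
            # runs the identical transforms below as a streaming job.
            | "Read" >> beam.io.ReadFromText("gs://my-bucket/events.csv")
            | "Parse" >> beam.Map(lambda line: line.split(","))
            | "KeyByUser" >> beam.Map(lambda fields: (fields[0], 1))
            | "CountPerUser" >> beam.CombinePerKey(sum)
            | "Format" >> beam.MapTuple(lambda user, n: f"{user},{n}")
            | "Write" >> beam.io.WriteToText("gs://my-bucket/out/user_counts")
        )

if __name__ == "__main__":
    run()
```

Run it locally with `python pipeline.py`; the same file, with the runner flags above, becomes a Dataflow job with no code changes.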
Key Concepts to Know
1. Apache Beam: write pipelines in Python, Java, or Go; run them on Dataflow (or on Spark, Flink, or locally).
2. Windowing: group streaming data into time-based windows — tumbling, sliding, or session (see the sketch after this list).
3. Templates: pre-built pipelines (Pub/Sub → BigQuery, GCS → BigQuery) you can launch with no code.
4. Cost: charged per vCPU-hour and per GB-hour of worker RAM while the job runs.
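A hedged sketch of tumbling 60-second windows over a streaming source; the topic and JSON field names are hypothetical, and the sliding/session variants are noted in comments.

```python
# Windowing sketch: tumbling 60s windows over a Pub/Sub stream
# (topic and field names are hypothetical).
import json
import apache_beam as beam
from apache_beam.transforms.window import FixedWindows
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

def run():
    options = PipelineOptions()
    options.view_as(StandardOptions).streaming = True  # Pub/Sub needs streaming mode
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadPubSub" >> beam.io.ReadFromPubSub(
                topic="projects/my-project/topics/sensor-events")
            | "Parse" >> beam.Map(json.loads)
            # Tumbling 60s windows. Other window types:
            #   SlidingWindows(size=300, period=60)  # 5-min windows, every minute
            #   Sessions(gap_size=600)               # close after 10 min of inactivity
            | "Window" >> beam.WindowInto(FixedWindows(60))
            | "KeyBySensor" >> beam.Map(lambda e: (e["sensor_id"], e["reading"]))
            | "MeanPerWindow" >> beam.combiners.Mean.PerKey()
            | "Print" >> beam.Map(print)
        )

if __name__ == "__main__":
    run()
```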
DCE Interview Tips
- ★ Classic pipeline: Pub/Sub (ingest) → Dataflow (transform/enrich) → BigQuery (store/analyze); a minimal sketch follows this list.
- ★ 'For your IoT factory scenario, Dataflow processes millions of sensor events per second, applies business logic, and writes clean data to BigQuery for dashboards.'
- ★ Dataflow templates are a strong selling point: 'You don't need to write code to set up a Pub/Sub to BigQuery pipeline; just use the pre-built template.'
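A minimal sketch of that classic pipeline in the Python SDK; the project, subscription, table, and schema below are hypothetical placeholders.

```python
# Classic Pub/Sub -> Dataflow -> BigQuery sketch (all names hypothetical).
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

def run():
    options = PipelineOptions()  # add --runner=DataflowRunner etc. for Dataflow
    options.view_as(StandardOptions).streaming = True
    with beam.Pipeline(options=options) as p:
        (
            p
            | "Ingest" >> beam.io.ReadFromPubSub(
                subscription="projects/my-project/subscriptions/events-sub")
            # Pub/Sub delivers bytes; decode and parse into dict rows.
            | "Decode" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
            # Transform/enrich: business logic would go here (pass-through in this sketch).
            | "Store" >> beam.io.WriteToBigQuery(
                "my-project:analytics.sensor_events",
                schema="sensor_id:STRING,reading:FLOAT,event_ts:TIMESTAMP",
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )

if __name__ == "__main__":
    run()
```

For the no-code path mentioned above, the pre-built Pub/Sub-to-BigQuery template covers this same pattern without writing any pipeline.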
Common Gotchas
- ! Dataflow batch jobs have a startup delay: provisioning workers takes a few minutes.
- ! Debugging streaming pipelines is harder than batch; use Cloud Monitoring and watermark metrics.