Big Data
Cloud Dataproc
Managed Hadoop and Spark, for existing big data workloads
AWS equivalent
EMR (Elastic MapReduce)
AWS → GCP: Key Differences
▸ Both are managed Hadoop/Spark services, but Dataproc clusters start in about 90 seconds versus 5-10 minutes for EMR.
▸ Dataproc is cheaper when used ephemerally: spin up a cluster, run the job, delete the cluster.
▸ Dataproc supports the standard Hadoop ecosystem: Hadoop, Spark, Hive, Pig, and HBase.
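The ephemeral "spin up, run job, delete" pattern above can be scripted with the gcloud CLI. A minimal sketch that builds the three commands as argument lists (the region, cluster name, and jar path are illustrative assumptions, not real resources):

```python
import subprocess

REGION = "asia-southeast1"   # illustrative region (assumption)
CLUSTER = "ephemeral-etl"    # illustrative cluster name (assumption)

def create_cmd(cluster: str, region: str) -> list[str]:
    """gcloud command to create a small Dataproc cluster."""
    return [
        "gcloud", "dataproc", "clusters", "create", cluster,
        "--region", region, "--num-workers", "2",
    ]

def submit_cmd(cluster: str, region: str, jar: str) -> list[str]:
    """gcloud command to submit a Spark job to that cluster."""
    return [
        "gcloud", "dataproc", "jobs", "submit", "spark",
        "--cluster", cluster, "--region", region, "--jar", jar,
    ]

def delete_cmd(cluster: str, region: str) -> list[str]:
    """gcloud command to delete the cluster once the job finishes."""
    return [
        "gcloud", "dataproc", "clusters", "delete", cluster,
        "--region", region, "--quiet",
    ]

def run_pipeline(cluster: str, region: str, jar: str) -> None:
    """Full ephemeral lifecycle: you pay only while the job runs."""
    for cmd in (create_cmd(cluster, region),
                submit_cmd(cluster, region, jar),
                delete_cmd(cluster, region)):
        subprocess.run(cmd, check=True)

# Requires gcloud auth and a real project, so not invoked here:
# run_pipeline(CLUSTER, REGION, "gs://my-bucket/jobs/etl.jar")  # bucket is hypothetical
```

In practice the same sequence is often driven by Cloud Composer or a CI job rather than a hand-run script, but the lifecycle is identical.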
Key Concepts to Know
1. Ephemeral clusters: create cluster → run job → delete cluster. You pay only for the job's duration.
2. Preemptible VMs for worker nodes: cut worker compute costs by 70-80% for fault-tolerant Spark jobs.
3. Dataproc Metastore: a managed Hive metastore shared across clusters and with BigQuery.
4. Autoscaling: automatically adds and removes worker nodes based on YARN memory metrics.
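As a back-of-envelope check on the preemptible savings claim in point 2, here is a sketch using illustrative hourly prices (assumptions, not current GCP list prices). Dataproc requires on-demand primary workers, so preemptibles typically run as secondary workers:

```python
def cluster_compute_cost(n_primary: int, n_secondary: int,
                         on_demand_hr: float, preemptible_hr: float) -> float:
    """Hourly worker compute cost: primary workers are on-demand,
    secondary workers are preemptible."""
    return n_primary * on_demand_hr + n_secondary * preemptible_hr

# Illustrative prices (assumptions):
ON_DEMAND = 0.19     # $/hr per standard worker VM
PREEMPTIBLE = 0.04   # $/hr per preemptible VM (~80% per-VM discount)

# 20-worker cluster: all on-demand vs. 2 primary + 18 preemptible secondary.
baseline = cluster_compute_cost(20, 0, ON_DEMAND, PREEMPTIBLE)   # 20 * 0.19 = 3.80
mixed = cluster_compute_cost(2, 18, ON_DEMAND, PREEMPTIBLE)      # 0.38 + 0.72 = 1.10
savings = 1 - mixed / baseline

print(f"worker compute savings: {savings:.0%}")  # → worker compute savings: 71%
```

The cluster-level figure lands in the 70-80% band because most, not all, workers can be preemptible; the exact number depends on the primary/secondary split and real prices.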
DCE Interview Tips
★ Key DCE talking point: "If you have existing Hadoop or Spark jobs, you can run them on Dataproc with zero or minimal code changes. Long term, migrate analytics to BigQuery."
★ The migration path for Thai enterprises: on-prem Hadoop → Dataproc (lift-and-shift) → BigQuery (modernize).
Common Gotchas
! Dataproc stores HDFS data on the cluster's local disks by default, so it is lost when the cluster is deleted. Use Cloud Storage (GCS) as the persistent data layer instead.
! Dataproc is not the long-term answer for analytics; it is a migration bridge toward BigQuery.
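To make the GCS-as-data-layer gotcha concrete: Dataproc clusters ship with the GCS connector preinstalled, so Spark jobs can read and write gs:// URIs directly, and moving a job off cluster-local HDFS is often just a path rewrite. A tiny helper sketch (the bucket name is a hypothetical placeholder):

```python
def hdfs_to_gcs(path: str, bucket: str) -> str:
    """Rewrite an hdfs:// path to the equivalent gs:// path.
    Data in GCS survives cluster deletion; data on cluster-local HDFS does not."""
    prefix = "hdfs://"
    if not path.startswith(prefix):
        return path  # already gs:// (or some other scheme): leave it alone
    # Drop the scheme and any namenode host:port, keep the file path.
    rest = path[len(prefix):]
    _, _, file_path = rest.partition("/")
    return f"gs://{bucket}/{file_path}"

print(hdfs_to_gcs("hdfs://namenode:8020/data/sales/2024.parquet", "my-data-lake"))
# → gs://my-data-lake/data/sales/2024.parquet
```

With paths rewritten this way, the cluster holds no state worth keeping, which is exactly what makes the ephemeral create → run → delete pattern safe.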