GCP Study Hub

Big Data

Cloud Dataproc

Managed Hadoop and Spark — for existing Big Data workloads

AWS equivalent

EMR (Elastic MapReduce)

🔄

AWS → GCP: Key Differences

  • Both are managed Hadoop/Spark. Dataproc clusters start in ~90 seconds vs EMR's 5-10 minutes.

  • Typically cheaper than EMR when used ephemerally: spin up a cluster, run the job, delete it.

  • Dataproc supports: Hadoop, Spark, Hive, Pig, HBase.

📌

Key Concepts to Know

  1. Ephemeral clusters: create cluster → run job → delete cluster. Pay only for job duration.

  2. Preemptible VMs for worker nodes: cut costs by 70-80% for fault-tolerant Spark jobs.

  3. Dataproc Metastore: shared Hive metastore across clusters and BigQuery.

  4. Autoscaling: automatically add/remove worker nodes based on YARN metrics.
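Concepts 1 and 2 above can be sketched as a single ephemeral-cluster workflow with gcloud. This is a minimal sketch, not a production script: the region, cluster name, bucket, jar path, and job class are placeholder assumptions, and secondary workers on Dataproc default to preemptible VMs.

```shell
#!/usr/bin/env bash
# Ephemeral Dataproc pattern: create → run job → delete.
# All names (bucket, jar, class, region) are hypothetical placeholders.
set -euo pipefail

REGION=us-central1
CLUSTER="ephemeral-spark-$(date +%s)"

# 1. Create a small cluster. --num-secondary-workers adds secondary
#    workers, which are preemptible by default -- the source of the
#    70-80% cost saving on worker capacity for fault-tolerant jobs.
gcloud dataproc clusters create "$CLUSTER" \
  --region="$REGION" \
  --num-workers=2 \
  --num-secondary-workers=4

# 2. Submit the Spark job. Read input from and write output to GCS
#    (gs:// paths), not HDFS, so results outlive the cluster.
gcloud dataproc jobs submit spark \
  --cluster="$CLUSTER" \
  --region="$REGION" \
  --class=com.example.WordCount \
  --jars=gs://my-bucket/jobs/wordcount.jar \
  -- gs://my-bucket/input/ gs://my-bucket/output/

# 3. Delete the cluster: billing stops, data stays in GCS.
gcloud dataproc clusters delete "$CLUSTER" --region="$REGION" --quiet
```

Because the job submit command blocks until the job finishes, the three steps can run as one script; total billable time is roughly cluster startup (~90 seconds) plus job duration.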

💡

DCE Interview Tips

  • Key DCE talking point: 'If you have existing Hadoop or Spark jobs, you can run them on Dataproc with zero or minimal code changes. Long-term, migrate analytics to BigQuery.'

  • The migration path for Thai enterprises: on-prem Hadoop → Dataproc (lift-and-shift) → BigQuery (modernize).

⚠️

Common Gotchas

  • Dataproc stores HDFS data on local disks by default, so the data is lost when the cluster is deleted. Use GCS (gs:// paths) as the persistent data layer.

  • Dataproc is not the long-term answer for analytics; it is a migration bridge to BigQuery.
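Key Concept 4 (autoscaling on YARN metrics) is configured through a reusable autoscaling policy that is imported once and attached at cluster-creation time. A minimal sketch, assuming hypothetical policy and cluster names; the YAML fields follow the Dataproc autoscaling policy schema:

```shell
# Write a basic YARN-metric autoscaling policy (file/id names are
# hypothetical; field names follow the Dataproc policy schema).
cat > autoscaling-policy.yaml <<'EOF'
workerConfig:
  minInstances: 2
  maxInstances: 10
secondaryWorkerConfig:        # preemptible workers scale 0..20
  minInstances: 0
  maxInstances: 20
basicAlgorithm:
  cooldownPeriod: 2m
  yarnConfig:
    # Fraction of pending/available YARN memory acted on per evaluation.
    scaleUpFactor: 0.5
    scaleDownFactor: 1.0
    gracefulDecommissionTimeout: 1h
EOF

# Register the policy, then attach it when creating a cluster.
gcloud dataproc autoscaling-policies import scale-on-yarn \
  --source=autoscaling-policy.yaml --region=us-central1

gcloud dataproc clusters create my-cluster \
  --region=us-central1 \
  --autoscaling-policy=scale-on-yarn
```

The graceful decommission timeout matters for Spark: it lets YARN drain running containers from a node before it is removed, instead of killing tasks mid-shuffle.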