GCP Study Hub

Big Data

Cloud Dataproc

Managed Hadoop and Spark — for existing Big Data workloads

AWS equivalent

EMR (Elastic MapReduce)

🔄

AWS → GCP: Key Differences

  • Both are managed Hadoop/Spark. Dataproc clusters start in ~90 seconds vs EMR's 5-10 minutes.

  • Typically cheaper than EMR when used ephemerally: spin up a cluster, run the job, delete it.

  • Dataproc supports: Hadoop, Spark, Hive, Pig, HBase.

📌

Key Concepts to Know

  1. Ephemeral clusters: create cluster → run job → delete cluster. Pay only for job duration.

  2. Preemptible VMs for worker nodes: cut costs by 70-80% for fault-tolerant Spark jobs.

  3. Dataproc Metastore: shared Hive metastore across clusters and BigQuery.

  4. Autoscaling: automatically add/remove worker nodes based on YARN metrics.
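Concepts 1 and 2 above can be sketched as a single ephemeral-cluster workflow with gcloud. This is a minimal sketch, not a production script: the region, cluster name, bucket, jar path, and job class are placeholder assumptions, and secondary workers on Dataproc default to preemptible VMs.

```shell
#!/usr/bin/env bash
# Ephemeral Dataproc pattern: create → run job → delete.
# All names (bucket, jar, class, region) are hypothetical placeholders.
set -euo pipefail

REGION=us-central1
CLUSTER="ephemeral-spark-$(date +%s)"

# 1. Create a small cluster. --num-secondary-workers adds secondary
#    workers, which are preemptible by default -- the source of the
#    70-80% cost saving on worker capacity for fault-tolerant jobs.
gcloud dataproc clusters create "$CLUSTER" \
  --region="$REGION" \
  --num-workers=2 \
  --num-secondary-workers=4

# 2. Submit the Spark job. Read input from and write output to GCS
#    (gs:// paths), not HDFS, so results outlive the cluster.
gcloud dataproc jobs submit spark \
  --cluster="$CLUSTER" \
  --region="$REGION" \
  --class=com.example.WordCount \
  --jars=gs://my-bucket/jobs/wordcount.jar \
  -- gs://my-bucket/input/ gs://my-bucket/output/

# 3. Delete the cluster: billing stops, data stays in GCS.
gcloud dataproc clusters delete "$CLUSTER" --region="$REGION" --quiet
```

Because the job submit command blocks until the job finishes, the three steps can run as one script; total billable time is roughly cluster startup (~90 seconds) plus job duration.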

💡

DCE Interview Tips

  • Key DCE talking point: 'If you have existing Hadoop or Spark jobs, you can run them on Dataproc with zero or minimal code changes. Long-term, migrate analytics to BigQuery.'

  • The migration path for Thai enterprises: on-prem Hadoop → Dataproc (lift-and-shift) → BigQuery (modernize).

⚠️

Common Gotchas

  • Dataproc stores HDFS data on local disks by default, so the data is lost when the cluster is deleted. Use GCS (gs:// paths) as the persistent data layer.

  • Dataproc is not the long-term answer for analytics; it is a migration bridge to BigQuery.
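Key Concept 4 (autoscaling on YARN metrics) is configured through a reusable autoscaling policy that is imported once and attached at cluster-creation time. A minimal sketch, assuming hypothetical policy and cluster names; the YAML fields follow the Dataproc autoscaling policy schema:

```shell
# Write a basic YARN-metric autoscaling policy (file/id names are
# hypothetical; field names follow the Dataproc policy schema).
cat > autoscaling-policy.yaml <<'EOF'
workerConfig:
  minInstances: 2
  maxInstances: 10
secondaryWorkerConfig:        # preemptible workers scale 0..20
  minInstances: 0
  maxInstances: 20
basicAlgorithm:
  cooldownPeriod: 2m
  yarnConfig:
    # Fraction of pending/available YARN memory acted on per evaluation.
    scaleUpFactor: 0.5
    scaleDownFactor: 1.0
    gracefulDecommissionTimeout: 1h
EOF

# Register the policy, then attach it when creating a cluster.
gcloud dataproc autoscaling-policies import scale-on-yarn \
  --source=autoscaling-policy.yaml --region=us-central1

gcloud dataproc clusters create my-cluster \
  --region=us-central1 \
  --autoscaling-policy=scale-on-yarn
```

The graceful decommission timeout matters for Spark: it lets YARN drain running containers from a node before it is removed, instead of killing tasks mid-shuffle.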