GCP Study Hub

Big Data

BigQuery

Serverless, highly scalable data warehouse — GCP's crown jewel

AWS equivalent: Redshift

Data Warehouse | Serverless | SQL | Core Service

Architecture Diagram

GCP Big Data Pipeline

Data Sources (DBs, files, APIs) → Pub/Sub (real-time ingest) → Dataflow (ETL / transform) → BigQuery (data warehouse) → Looker Studio (dashboards)

💡 Batch alternative: GCS → Dataflow → BigQuery  | Hadoop migration path: On-prem HDFS → GCS → Dataproc → BigQuery

🔄

AWS → GCP: Key Differences

  • Serverless: no cluster to provision, patch, resize, or manage. Provisioned Redshift requires cluster sizing and management (Redshift Serverless exists, but BigQuery is serverless by default).

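  • Pay-per-query (on-demand): billed by bytes scanned, historically $5 per TB (the US list price rose to $6.25 per TiB in 2023, so check the current pricing page). You pay for query compute only when you run queries; storage is billed separately.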

  • Scales automatically: BigQuery assigns slots (its units of compute) to each query, and a large query can use thousands of them without any configuration.

  • Separates compute and storage: you can query data in GCS via BigLake without loading it.
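
The pay-per-query arithmetic is easy to sanity-check by hand. The sketch below is plain Python, not a BigQuery API; the function name and the `free_bytes` parameter are hypothetical, and the $5 per TB rate is only an illustrative input, since list prices vary by region and over time.

```python
TB = 10**12  # decimal terabyte, in bytes


def on_demand_cost(bytes_scanned: int, price_per_tb: float, free_bytes: int = 0) -> float:
    """Estimate the on-demand charge for one query.

    price_per_tb is an assumed input, not a value read from any API.
    free_bytes models a free allowance (hypothetical parameter, default none).
    """
    billable = max(0, bytes_scanned - free_bytes)
    return billable / TB * price_per_tb


# A dashboard query scanning 250 GB at an assumed $5 per TB.
small_query = on_demand_cost(250 * 10**9, price_per_tb=5.0)  # 1.25

# A runaway query scanning 100 TB at the same assumed rate.
runaway_query = on_demand_cost(100 * TB, price_per_tb=5.0)  # 500.0
```

The same function makes the migration comparison concrete: multiply expected monthly TB scanned by the rate and compare it with the monthly bill for an always-on cluster.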

📌

Key Concepts to Know

  • 1

    Columnar storage: only reads the columns needed for a query — key reason for speed and cost efficiency.

  • 2

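    Partitioning: partitioning a table by date/timestamp dramatically reduces query cost, provided queries filter on the partition column so that BigQuery can skip the other partitions.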

  • 3

    Clustering: physically sorts data within each partition by up to four chosen columns, so filters on those columns skip blocks and further reduce scan cost.

  • 4

    BigQuery ML: train and run ML models directly in SQL (CREATE MODEL, ML.PREDICT), with no Python and no separate infrastructure.

  • 5

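    Streaming ingestion: ingest data in real time via Pub/Sub + Dataflow. For direct writes, the Storage Write API is the recommended successor to the legacy streaming inserts API (tabledata.insertAll).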

  • 6

    Row-level and column-level security: restrict access to specific rows (row access policies) or columns (policy tags) per user group.
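
To see why columnar storage and partitioning cut cost, here is a toy model in plain Python. It is a simplified sketch, not how BigQuery computes billing: the `DayPartition` class, the `bytes_scanned` function, and the table sizes are all hypothetical, chosen only to show that a query reads just the columns it names, in just the partitions its date filter keeps.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, Iterable, List, Optional

GB = 10**9


@dataclass(frozen=True)
class DayPartition:
    """One daily partition; column_bytes maps column name to bytes stored."""
    day: date
    column_bytes: Dict[str, int]


def bytes_scanned(
    partitions: Iterable[DayPartition],
    columns: List[str],
    first_day: Optional[date] = None,
    last_day: Optional[date] = None,
) -> int:
    """Sum the bytes of the requested columns in partitions inside the date range."""
    total = 0
    for part in partitions:
        # Partition pruning: skip whole partitions outside the filter.
        if first_day is not None and part.day < first_day:
            continue
        if last_day is not None and part.day > last_day:
            continue
        # Columnar read: only the named columns contribute.
        total += sum(part.column_bytes[name] for name in columns)
    return total


# Hypothetical table: 30 daily partitions, 10 GB per day across three columns.
start = date(2024, 1, 1)
table = [
    DayPartition(start + timedelta(days=i), {"txn_id": 1 * GB, "amount": 1 * GB, "payload": 8 * GB})
    for i in range(30)
]

full_scan = bytes_scanned(table, ["txn_id", "amount", "payload"])  # 300 GB, like SELECT *
narrow_scan = bytes_scanned(table, ["txn_id", "amount"])           # 60 GB, two columns only
one_day_scan = bytes_scanned(
    table, ["txn_id", "amount"], first_day=date(2024, 1, 15), last_day=date(2024, 1, 15)
)                                                                   # 2 GB, two columns, one partition
```

In this made-up table, naming two columns and filtering to one day reads 2 GB instead of 300 GB. Clustering would reduce the scan further inside each partition, which this sketch does not model.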

💡

DCE Interview Tips

  • The single most important GCP service for this role. Know it deeply.

  • Migration pitch: 'A provisioned Redshift cluster has to be sized and paid for 24/7. BigQuery scales automatically, and with on-demand pricing you pay for compute only when you run queries.'

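  • For Thai bank analytics: 'Replace your on-prem Hadoop/Hive with BigQuery. Same SQL skills, no cluster management, and far faster queries on cold data.' Back the speedup with a benchmark on the customer's own data rather than a fixed multiplier such as 100x.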

  • Know the pipeline: Pub/Sub → Dataflow → BigQuery → Looker Studio.

⚠️

Common Gotchas

  • !

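    SELECT * is your enemy: it scans ALL columns. Always SELECT only the columns you need. Adding LIMIT does not help, because on a non-clustered table the full columns are still scanned and billed.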

  • !

    On-demand pricing can surprise customers: a single bad query scanning 100 TB costs $500 at $5 per TB. Preview the cost with a dry run before executing, and cap each query with a maximum-bytes-billed limit.

  • !

    Capacity-based pricing (slot reservations, formerly called flat-rate and now sold as BigQuery editions) gives predictable costs for steady production workloads.
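
The first two gotchas can be turned into a pre-flight check. The sketch below is plain Python with hypothetical names (`guard_query`, `uses_select_star`, `QueryTooExpensive`); the byte estimate is injected as a function so the example runs without credentials. With the google-cloud-bigquery client, the estimate would typically come from a dry-run job (`QueryJobConfig(dry_run=True)`, then `total_bytes_processed`), and `maximum_bytes_billed` can enforce the cap on the server side; treat that wiring as an assumption to confirm against the client documentation.

```python
import re
from typing import Callable


class QueryTooExpensive(Exception):
    """Raised when a query's estimated scan exceeds the allowed budget."""


# Naive pattern: flags "SELECT *" but not "SELECT COUNT(*)".
_SELECT_STAR = re.compile(r"\bselect\s+\*", re.IGNORECASE)


def uses_select_star(sql: str) -> bool:
    """Return True if the SQL text contains a bare SELECT * (simple text check, not a parser)."""
    return _SELECT_STAR.search(sql) is not None


def guard_query(sql: str, estimate_bytes: Callable[[str], int], max_bytes: int) -> int:
    """Return the estimated bytes if within budget, otherwise raise.

    estimate_bytes is a stand-in for a dry run; it is injected so the
    guard itself has no dependency on any cloud client.
    """
    estimated = estimate_bytes(sql)
    if estimated > max_bytes:
        raise QueryTooExpensive(
            f"query would scan {estimated} bytes, over the limit of {max_bytes}"
        )
    return estimated


# Usage with a fake estimator that pretends every query scans 100 TB.
def fake_estimator(sql: str) -> int:
    return 100 * 10**12


try:
    guard_query("SELECT * FROM transactions", fake_estimator, max_bytes=10**12)
    blocked = False
except QueryTooExpensive:
    blocked = True  # the 100 TB scan is rejected before it runs
```

A guard like this belongs in whatever layer submits queries (a notebook helper, a scheduler, a CI check), so that an expensive query fails fast instead of appearing on the invoice.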