Big Data
BigQuery
Serverless, highly scalable data warehouse — GCP's crown jewel
AWS equivalent: Amazon Redshift
Architecture Diagram
[Diagram: GCP Big Data Pipeline (Pub/Sub → Dataflow → BigQuery → Looker Studio)]
💡 Batch alternative: GCS → Dataflow → BigQuery | Hadoop migration path: On-prem HDFS → GCS → Dataproc → BigQuery
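For the batch path, a plain load job often replaces Dataflow entirely when no in-flight transformation is needed. A minimal sketch in Python; the bucket, dataset, and Parquet layout are assumptions:

```python
# Sketch: batch load from GCS into BigQuery with a load job (no Dataflow).
# gs://my-bucket and the analytics.events table are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()
job = client.load_table_from_uri(
    "gs://my-bucket/exports/events-*.parquet",  # wildcard over exported files
    "analytics.events",
    job_config=bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET),
)
job.result()  # block until the load job finishes
print(f"Loaded {client.get_table('analytics.events').num_rows} rows")
```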
AWS → GCP: Key Differences
- ▸ Serverless: no cluster to provision, patch, resize, or manage. Redshift requires cluster management.
- ▸ Pay-per-query (on-demand): $5 per TB scanned; you only pay when you actually run queries (a dry-run cost sketch follows this list).
- ▸ Scales automatically: queries can use thousands of slots without any configuration.
- ▸ Separates compute and storage: you can query data in GCS via BigLake without loading it (an external-table sketch follows this list).
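To make pay-per-query concrete: a dry run reports how many bytes a query would scan without running it, which maps directly to on-demand cost. A minimal sketch, assuming a hypothetical analytics.transactions table:

```python
# Sketch: estimate on-demand cost (~$5/TB scanned) with a dry run.
from google.cloud import bigquery

client = bigquery.Client()
config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

job = client.query(
    "SELECT customer_id, amount FROM analytics.transactions WHERE dt = '2024-01-01'",
    job_config=config,
)
scanned = job.total_bytes_processed  # populated without executing the query
print(f"Would scan {scanned:,} bytes (~${scanned / 1e12 * 5:.4f} on-demand)")
```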
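The compute/storage split means data can stay in GCS: a plain external table queries files in place, and BigLake builds on the same idea with finer-grained, IAM-governed access (it additionally needs a connection resource, omitted here). Names are assumptions:

```python
# Sketch: query Parquet files in GCS without loading them into BigQuery storage.
from google.cloud import bigquery

client = bigquery.Client()
client.query("""
CREATE OR REPLACE EXTERNAL TABLE analytics.events_ext
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-bucket/events/*.parquet']  -- hypothetical bucket
)
""").result()

# The files never leave GCS; BigQuery scans them at query time.
rows = client.query("SELECT COUNT(*) AS n FROM analytics.events_ext").result()
print(next(iter(rows)).n)
```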
Key Concepts to Know
1. Columnar storage: only reads the columns a query needs, which is the key reason for its speed and cost efficiency.
2. Partitioning: partitioning by date/timestamp dramatically reduces query cost.
3. Clustering: physically sorts data within partitions, further reducing scan cost (partitioning and clustering appear together in the first sketch after this list).
4. BigQuery ML: train and run ML models directly in SQL, with no Python and no infrastructure to manage (see the CREATE MODEL sketch below).
5. Streaming inserts: ingest data in real time via Pub/Sub + Dataflow (a minimal direct-API sketch also follows this list).
6. Row-level and column-level security: restrict access to specific rows or columns per user group (see the row access policy sketch below).
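A minimal sketch of concepts 2 and 3 together; the table and schema are hypothetical. Filtering on the partitioning column is what actually triggers the cost reduction:

```python
# Sketch: a date-partitioned table, clustered by customer_id.
from google.cloud import bigquery

client = bigquery.Client()
client.query("""
CREATE TABLE IF NOT EXISTS analytics.transactions (
  customer_id STRING,
  amount      NUMERIC,
  dt          DATE
)
PARTITION BY dt          -- queries filtering on dt scan only matching partitions
CLUSTER BY customer_id   -- blocks inside each partition are sorted by customer_id
""").result()

# Only the 2024-01-01 partition is scanned (and billed):
rows = client.query(
    "SELECT SUM(amount) AS total FROM analytics.transactions WHERE dt = '2024-01-01'"
).result()
print(next(iter(rows)).total)
```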
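BigQuery ML keeps both training and inference in SQL: CREATE MODEL trains over a query result, and ML.PREDICT serves it. The model type, features, and table names below are illustrative assumptions:

```python
# Sketch: train and score a churn model without leaving SQL.
from google.cloud import bigquery

client = bigquery.Client()

# Train a logistic regression on a (hypothetical) customers table.
client.query("""
CREATE OR REPLACE MODEL analytics.churn_model
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
SELECT tenure_months, monthly_spend, churned
FROM analytics.customers
""").result()

# Score rows; BigQuery ML adds a predicted_<label> column to the output.
rows = client.query("""
SELECT customer_id, predicted_churned
FROM ML.PREDICT(MODEL analytics.churn_model,
                (SELECT customer_id, tenure_months, monthly_spend
                 FROM analytics.customers))
""").result()
```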
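For streaming, the production-grade path is Pub/Sub → Dataflow → BigQuery; the shortest way to see rows land in real time, though, is the streaming insert API (the newer Storage Write API is preferred for new pipelines). A minimal sketch against the hypothetical table above:

```python
# Sketch: direct streaming inserts (legacy tabledata.insertAll API).
from google.cloud import bigquery

client = bigquery.Client()
errors = client.insert_rows_json(
    "analytics.transactions",
    [
        {"customer_id": "c-001", "amount": 129.50, "dt": "2024-01-01"},
        {"customer_id": "c-002", "amount": 42.00, "dt": "2024-01-01"},
    ],
)
print(errors or "rows are queryable within seconds")  # empty list means success
```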
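Row-level security is a DDL-level filter attached to the table (column-level security works differently, via policy tags). The group, column, and predicate below are assumptions:

```python
# Sketch: a row access policy so one analyst group sees only Thai-branch rows.
from google.cloud import bigquery

client = bigquery.Client()
client.query("""
CREATE ROW ACCESS POLICY th_rows_only
ON analytics.transactions
GRANT TO ('group:th-analysts@example.com')  -- hypothetical group
FILTER USING (branch_country = 'TH')        -- hypothetical column
""").result()
```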
DCE Interview Tips
- ★ The single most important GCP service for this role. Know it deeply.
- ★ Migration pitch: 'Redshift requires you to size and pay for a cluster 24/7. BigQuery scales automatically and you only pay when you run queries.'
- ★ For Thai bank analytics: 'Replace your on-prem Hadoop/Hive with BigQuery. Same SQL skills, no cluster management, 100x faster on cold data.'
- ★ Know the pipeline: Pub/Sub → Dataflow → BigQuery → Looker Studio.
Common Gotchas
- ! SELECT * is your enemy: it scans ALL columns. Always SELECT only the columns you need.
- ! On-demand pricing can surprise customers: a single bad query scanning 100 TB costs $500. Use a dry-run cost preview and set per-query cost limits (see the sketch after this list).
- ! Slot reservations (flat-rate pricing) are needed for predictable costs in production.
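To enforce the per-query limit from the second gotcha, maximum_bytes_billed makes a query fail fast instead of scanning past a cap. A minimal sketch:

```python
# Sketch: hard cap on bytes scanned; the job errors out instead of over-billing.
from google.cloud import bigquery

client = bigquery.Client()
config = bigquery.QueryJobConfig(maximum_bytes_billed=10 * 1024**3)  # 10 GiB cap

job = client.query(
    "SELECT customer_id, amount FROM analytics.transactions WHERE dt = '2024-01-01'",
    job_config=config,
)
rows = job.result()  # raises if the query would bill more than the cap
```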