Big Data
BigQuery
Serverless, highly scalable data warehouse — GCP's crown jewel
AWS equivalent: Amazon Redshift
Architecture Diagram
[Diagram: GCP Big Data Pipeline (Pub/Sub → Dataflow → BigQuery → Looker Studio)]
💡 Batch alternative: GCS → Dataflow → BigQuery | Hadoop migration path: On-prem HDFS → GCS → Dataproc → BigQuery
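For the batch path, a plain load job often replaces Dataflow entirely when no in-flight transformation is needed. A minimal sketch in Python; the bucket, dataset, and Parquet layout are assumptions:

```python
# Sketch: batch load from GCS into BigQuery with a load job (no Dataflow).
# gs://my-bucket and the analytics.events table are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()
job = client.load_table_from_uri(
    "gs://my-bucket/exports/events-*.parquet",  # wildcard over exported files
    "analytics.events",
    job_config=bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET),
)
job.result()  # block until the load job finishes
print(f"Loaded {client.get_table('analytics.events').num_rows} rows")
```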
AWS → GCP: Key Differences
- ▸ Serverless: no cluster to provision, patch, resize, or manage. Redshift requires cluster management.
- ▸ Pay-per-query (on-demand): $5 per TB scanned; you only pay when you actually run queries (a dry-run cost sketch follows this list).
- ▸ Scales automatically: queries can use thousands of slots without any configuration.
- ▸ Separates compute and storage: you can query data in GCS via BigLake without loading it (an external-table sketch follows this list).
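To make pay-per-query concrete: a dry run reports how many bytes a query would scan without running it, which maps directly to on-demand cost. A minimal sketch, assuming a hypothetical analytics.transactions table:

```python
# Sketch: estimate on-demand cost (~$5/TB scanned) with a dry run.
from google.cloud import bigquery

client = bigquery.Client()
config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

job = client.query(
    "SELECT customer_id, amount FROM analytics.transactions WHERE dt = '2024-01-01'",
    job_config=config,
)
scanned = job.total_bytes_processed  # populated without executing the query
print(f"Would scan {scanned:,} bytes (~${scanned / 1e12 * 5:.4f} on-demand)")
```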
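The compute/storage split means data can stay in GCS: a plain external table queries files in place, and BigLake builds on the same idea with finer-grained, IAM-governed access (it additionally needs a connection resource, omitted here). Names are assumptions:

```python
# Sketch: query Parquet files in GCS without loading them into BigQuery storage.
from google.cloud import bigquery

client = bigquery.Client()
client.query("""
CREATE OR REPLACE EXTERNAL TABLE analytics.events_ext
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-bucket/events/*.parquet']  -- hypothetical bucket
)
""").result()

# The files never leave GCS; BigQuery scans them at query time.
rows = client.query("SELECT COUNT(*) AS n FROM analytics.events_ext").result()
print(next(iter(rows)).n)
```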
Key Concepts to Know
1. Columnar storage: only reads the columns a query needs, which is the key reason for its speed and cost efficiency.
2. Partitioning: partitioning by date/timestamp dramatically reduces query cost.
3. Clustering: physically sorts data within partitions, further reducing scan cost (partitioning and clustering appear together in the first sketch after this list).
4. BigQuery ML: train and run ML models directly in SQL, with no Python and no infrastructure to manage (see the CREATE MODEL sketch below).
5. Streaming inserts: ingest data in real time via Pub/Sub + Dataflow (a minimal direct-API sketch also follows this list).
6. Row-level and column-level security: restrict access to specific rows or columns per user group (see the row access policy sketch below).
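A minimal sketch of concepts 2 and 3 together; the table and schema are hypothetical. Filtering on the partitioning column is what actually triggers the cost reduction:

```python
# Sketch: a date-partitioned table, clustered by customer_id.
from google.cloud import bigquery

client = bigquery.Client()
client.query("""
CREATE TABLE IF NOT EXISTS analytics.transactions (
  customer_id STRING,
  amount      NUMERIC,
  dt          DATE
)
PARTITION BY dt          -- queries filtering on dt scan only matching partitions
CLUSTER BY customer_id   -- blocks inside each partition are sorted by customer_id
""").result()

# Only the 2024-01-01 partition is scanned (and billed):
rows = client.query(
    "SELECT SUM(amount) AS total FROM analytics.transactions WHERE dt = '2024-01-01'"
).result()
print(next(iter(rows)).total)
```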
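BigQuery ML keeps both training and inference in SQL: CREATE MODEL trains over a query result, and ML.PREDICT serves it. The model type, features, and table names below are illustrative assumptions:

```python
# Sketch: train and score a churn model without leaving SQL.
from google.cloud import bigquery

client = bigquery.Client()

# Train a logistic regression on a (hypothetical) customers table.
client.query("""
CREATE OR REPLACE MODEL analytics.churn_model
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
SELECT tenure_months, monthly_spend, churned
FROM analytics.customers
""").result()

# Score rows; BigQuery ML adds a predicted_<label> column to the output.
rows = client.query("""
SELECT customer_id, predicted_churned
FROM ML.PREDICT(MODEL analytics.churn_model,
                (SELECT customer_id, tenure_months, monthly_spend
                 FROM analytics.customers))
""").result()
```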
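For streaming, the production-grade path is Pub/Sub → Dataflow → BigQuery; the shortest way to see rows land in real time, though, is the streaming insert API (the newer Storage Write API is preferred for new pipelines). A minimal sketch against the hypothetical table above:

```python
# Sketch: direct streaming inserts (legacy tabledata.insertAll API).
from google.cloud import bigquery

client = bigquery.Client()
errors = client.insert_rows_json(
    "analytics.transactions",
    [
        {"customer_id": "c-001", "amount": 129.50, "dt": "2024-01-01"},
        {"customer_id": "c-002", "amount": 42.00, "dt": "2024-01-01"},
    ],
)
print(errors or "rows are queryable within seconds")  # empty list means success
```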
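Row-level security is a DDL-level filter attached to the table (column-level security works differently, via policy tags). The group, column, and predicate below are assumptions:

```python
# Sketch: a row access policy so one analyst group sees only Thai-branch rows.
from google.cloud import bigquery

client = bigquery.Client()
client.query("""
CREATE ROW ACCESS POLICY th_rows_only
ON analytics.transactions
GRANT TO ('group:th-analysts@example.com')  -- hypothetical group
FILTER USING (branch_country = 'TH')        -- hypothetical column
""").result()
```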
DCE Interview Tips
- ★ The single most important GCP service for this role. Know it deeply.
- ★ Migration pitch: 'Redshift requires you to size and pay for a cluster 24/7. BigQuery scales automatically and you only pay when you run queries.'
- ★ For Thai bank analytics: 'Replace your on-prem Hadoop/Hive with BigQuery. Same SQL skills, no cluster management, 100x faster on cold data.'
- ★ Know the pipeline: Pub/Sub → Dataflow → BigQuery → Looker Studio.
Common Gotchas
- ! SELECT * is your enemy: it scans ALL columns. Always SELECT only the columns you need.
- ! On-demand pricing can surprise customers: a single bad query scanning 100 TB costs $500. Use a dry-run cost preview and set per-query cost limits (see the sketch after this list).
- ! Slot reservations (flat-rate pricing) are needed for predictable costs in production.
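To enforce the per-query limit from the second gotcha, maximum_bytes_billed makes a query fail fast instead of scanning past a cap. A minimal sketch:

```python
# Sketch: hard cap on bytes scanned; the job errors out instead of over-billing.
from google.cloud import bigquery

client = bigquery.Client()
config = bigquery.QueryJobConfig(maximum_bytes_billed=10 * 1024**3)  # 10 GiB cap

job = client.query(
    "SELECT customer_id, amount FROM analytics.transactions WHERE dt = '2024-01-01'",
    job_config=config,
)
rows = job.result()  # raises if the query would bill more than the cap
```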