Interview Prep
Data Engineering & Analytics
The Data Interview
SQL · Pipelines · Data Modeling · Python · Cloud Platforms
Data interviews are unique: SQL fluency is table stakes, but the questions that separate candidates are about modeling decisions, pipeline design, and knowing how to reason about data at scale. This guide covers all of it.
What the interview process looks like
1. Take-home SQL or Python challenge — 1–3 hours
A dataset with 3–5 SQL questions or a Python data processing task. This is extremely common. They want to see clean, readable code — not just a working answer.
2. Live SQL coding — 45–60 min
SQL problems in a shared editor, often increasing in complexity. Window functions and CTEs are almost guaranteed at mid-senior level. They're watching how you think, not just whether you get the answer.
3. Data modeling or pipeline design — 60 min
Open-ended: "How would you model this business domain?" or "Design a pipeline to ingest this data source." Shows whether you think beyond queries to architecture.
4. Case study / scenario — 45 min
A business scenario: "Our pipeline is failing intermittently — how do you debug it?" or "Stakeholders are asking why the revenue numbers differ across reports." Real-world reasoning.
5. Technical discussion — 45 min
Past projects, tech stack depth, opinions on tooling. They want to know if you've actually used these tools in anger or just read about them.
SQL — The Core Test
SQL is non-negotiable. You will write it live, under pressure. If you hesitate on a GROUP BY or forget how PARTITION BY works, you've lost the room. Fluency here is the baseline, not the differentiator.
Foundations you must know cold
- JOINs — INNER, LEFT, RIGHT, FULL OUTER. Know when each returns NULLs and why. Be able to explain the result set of each without running the query.
- Aggregations — GROUP BY, HAVING vs WHERE (HAVING filters after aggregation, WHERE filters before), COUNT with NULLs.
- Subqueries vs JOINs — subqueries are often more readable, but JOINs are usually faster. Know when the planner treats them equivalently.
- NULL handling — NULLs are not zero. NULL = NULL evaluates to NULL (unknown), not TRUE, so use IS NULL / IS NOT NULL. Know how NULLs propagate in aggregations and comparisons.
- CTEs (WITH clauses) — write readable multi-step logic without nested subqueries. Know how to chain multiple CTEs.
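The WHERE/HAVING distinction and NULL-aware COUNT are easy to demonstrate with Python's built-in sqlite3 as a stand-in warehouse (a minimal sketch; the orders table and its values are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (customer TEXT, amount REAL);
INSERT INTO orders VALUES
  ('alice', 50), ('alice', 70), ('bob', 30), ('bob', NULL), ('cara', 200);
""")

# WHERE filters rows before aggregation; HAVING filters groups after it.
# COUNT(amount) skips NULLs, so bob's NULL order would not be counted anyway.
rows = conn.execute("""
    SELECT customer, SUM(amount) AS total, COUNT(amount) AS paid_orders
    FROM orders
    WHERE amount IS NOT NULL      -- row-level filter, runs first
    GROUP BY customer
    HAVING SUM(amount) > 100      -- group-level filter, runs after GROUP BY
    ORDER BY customer
""").fetchall()
print(rows)  # [('alice', 120.0, 2), ('cara', 200.0, 1)]
```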
Window functions — expected at every mid–senior interview
Window functions perform calculations across a set of rows related to the current row. Unlike GROUP BY, they don't collapse rows.
- ROW_NUMBER() — unique sequential number per partition. Use for deduplication (keep one row per user per event).
- RANK() / DENSE_RANK() — rank within a partition. RANK skips numbers on ties; DENSE_RANK doesn't.
- LAG() / LEAD() — access a value from the previous or next row. Use for calculating deltas, time-between-events.
- SUM() / AVG() OVER() — running totals, rolling averages. Combine with ROWS BETWEEN for windowed aggregations.
- PARTITION BY vs ORDER BY inside OVER() — PARTITION BY resets the window per group; ORDER BY controls the frame within the window.
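The deduplication use of ROW_NUMBER() can be run directly via sqlite3 (window functions need the bundled SQLite to be 3.25+, which any recent Python ships; the table and column names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE logins (user_id TEXT, login_date TEXT);
INSERT INTO logins VALUES
  ('a', '2024-01-01'), ('a', '2024-01-01'),  -- duplicate event
  ('a', '2024-01-03'), ('b', '2024-01-02');
""")

# ROW_NUMBER() numbers rows inside each (user_id, login_date) partition;
# keeping rn = 1 leaves exactly one row per user per day.
deduped = conn.execute("""
    SELECT user_id, login_date FROM (
        SELECT user_id, login_date,
               ROW_NUMBER() OVER (PARTITION BY user_id, login_date) AS rn
        FROM logins
    ) WHERE rn = 1
    ORDER BY user_id, login_date
""").fetchall()
print(deduped)  # [('a', '2024-01-01'), ('a', '2024-01-03'), ('b', '2024-01-02')]
```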
Query performance
- Indexes speed up reads at the cost of write performance. Know when adding an index helps vs hurts.
- Avoid functions on indexed columns in WHERE clauses — they prevent index usage.
- EXPLAIN / EXPLAIN ANALYZE — know how to read an execution plan and spot sequential scans where an index scan was expected.
- For large data warehouses: understand partitioning (query only the partitions you need) and clustering/sorting (collocate related data).
- Avoid SELECT * in production — it breaks on schema changes and pulls unnecessary data.
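One way to see the "function on an indexed column" problem is EXPLAIN QUERY PLAN in SQLite (a sketch; exact plan wording varies by SQLite version, and production warehouses have their own EXPLAIN output):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, created_at TEXT)")
conn.execute("CREATE INDEX idx_created ON events(created_at)")

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is the human-readable detail.
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

p1 = plan("SELECT * FROM events WHERE created_at >= '2024-01-01'")
p2 = plan("SELECT * FROM events WHERE date(created_at) >= '2024-01-01'")
print(p1)  # e.g. "SEARCH events USING INDEX idx_created (created_at>?)"
print(p2)  # e.g. "SCAN events" -- wrapping the column in date() defeats the index
```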
Classic SQL interview patterns to practise
- "Find the second highest salary" — MAX with subquery, or DENSE_RANK() = 2
- "Find users who logged in on 3 consecutive days" — LAG() to get previous login date, then filter where gap = 1 day
- "Calculate a 7-day rolling average of daily revenue" — AVG() OVER(ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
- "Find customers who purchased in both January and February" — two CTEs or subqueries, then INNER JOIN on customer ID
- "Find the first purchase per user" — ROW_NUMBER() OVER(PARTITION BY user_id ORDER BY purchase_date) = 1
- "Pivot rows to columns" — conditional aggregation: SUM(CASE WHEN month = 'Jan' THEN revenue END)
Data Modeling
Modeling questions are where junior and senior data engineers diverge. Anyone can write a query. Not everyone can design a warehouse that's still usable in two years.
Dimensional modeling
- Fact tables — records of events or transactions. Narrow, tall, lots of rows. Contains metrics (amounts, durations) and foreign keys to dimensions.
- Dimension tables — the context for facts. Wider, shorter. Customer, product, date, geography.
- Star schema — fact table at the centre, dimensions on the edges. Simple, fast for analytics queries. Denormalised dimensions are intentional.
- Snowflake schema — normalised dimensions (dimensions with sub-dimensions). More storage-efficient but more complex queries.
- Grain — the most important modeling decision. What does one row in this fact table represent? Get this wrong and the model is wrong forever.
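A toy star schema at the grain "one row per order line", sketched with sqlite3 (all table, column, and value names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimensions: wider tables holding context.
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE dim_product  (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
-- Fact: one row per order line -- metrics plus foreign keys, nothing else.
CREATE TABLE fact_order_line (
    customer_key INTEGER, product_key INTEGER, quantity INTEGER, amount REAL
);
INSERT INTO dim_customer VALUES (1, 'Alice', 'London');
INSERT INTO dim_product  VALUES (1, 'Widget', 'Hardware');
INSERT INTO fact_order_line VALUES (1, 1, 2, 40.0);
""")

# The typical analytics shape: aggregate the fact, slice by dimension attributes.
result = conn.execute("""
    SELECT c.city, p.category, SUM(f.amount) AS revenue
    FROM fact_order_line f
    JOIN dim_customer c USING (customer_key)
    JOIN dim_product  p USING (product_key)
    GROUP BY c.city, p.category
""").fetchall()
print(result)  # [('London', 'Hardware', 40.0)]
```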
Slowly Changing Dimensions (SCDs)
What happens when a dimension value changes? A customer moves city. A product changes category. You need a strategy.
- Type 1 — overwrite. No history kept. Simple. Use when history doesn't matter.
- Type 2 — add a new row with effective/expiry dates and a current flag. Full history preserved. Harder to query.
- Type 3 — add a "previous value" column. Limited history, simple queries. Rarely the right choice.
- In modern stacks (dbt on BigQuery, Snowflake, etc.), Type 2 SCDs are typically handled with snapshots. Know what a dbt snapshot does.
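A Type 2 change boils down to "close the current row, open a new one" — hand-rolled here with sqlite3 for illustration (in practice a dbt snapshot generates equivalent logic; the names and dates below are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_id TEXT, city TEXT,
    valid_from TEXT, valid_to TEXT, is_current INTEGER
);
INSERT INTO dim_customer VALUES ('c1', 'London', '2023-01-01', NULL, 1);
""")

def scd2_move(conn, customer_id, new_city, change_date):
    """Type 2 update: expire the current row, insert the new version."""
    conn.execute("""
        UPDATE dim_customer SET valid_to = ?, is_current = 0
        WHERE customer_id = ? AND is_current = 1
    """, (change_date, customer_id))
    conn.execute("INSERT INTO dim_customer VALUES (?, ?, ?, NULL, 1)",
                 (customer_id, new_city, change_date))

scd2_move(conn, 'c1', 'Leeds', '2024-06-01')
history = conn.execute("""
    SELECT city, valid_from, valid_to, is_current
    FROM dim_customer ORDER BY valid_from
""").fetchall()
print(history)
# [('London', '2023-01-01', '2024-06-01', 0), ('Leeds', '2024-06-01', None, 1)]
```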
Modern approaches
- Wide tables — one large denormalised table per entity. Popularised by the analytics engineering movement. Simpler for end users, more storage.
- dbt — transforms data inside the warehouse using SQL + Jinja. Handles modeling, testing, documentation, and lineage. Know the materialisation types: table, view, incremental, ephemeral.
- Incremental models — only process new/changed data. Critical for large tables. Understand how to handle late-arriving data.
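The core of an incremental model is a high-watermark filter plus an upsert. A sketch of that logic in plain SQL via sqlite3 (dbt generates something similar from an incremental materialisation; note a strict `>` watermark can drop late rows that share the max timestamp, which is where a lookback window comes in):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE source_events (id INTEGER, updated_at TEXT);
CREATE TABLE target_events (id INTEGER PRIMARY KEY, updated_at TEXT);
INSERT INTO source_events VALUES (1, '2024-01-01'), (2, '2024-01-02');
""")

def run_incremental(conn):
    # High watermark: only pull source rows newer than anything in the target.
    hwm = conn.execute(
        "SELECT COALESCE(MAX(updated_at), '') FROM target_events").fetchone()[0]
    # Upsert so overlapping reruns stay idempotent.
    conn.execute("""
        INSERT INTO target_events
        SELECT id, updated_at FROM source_events WHERE updated_at > ?
        ON CONFLICT(id) DO UPDATE SET updated_at = excluded.updated_at
    """, (hwm,))

run_incremental(conn)                                      # first full load
conn.execute("INSERT INTO source_events VALUES (3, '2024-01-03')")
run_incremental(conn)                                      # picks up only id 3
total = conn.execute("SELECT COUNT(*) FROM target_events").fetchone()[0]
print(total)  # 3
```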
Pipeline Architecture
ETL vs ELT
- ETL — Extract, Transform, Load. Transform before loading into the warehouse. The classical approach when compute was expensive in the warehouse.
- ELT — Extract, Load, Transform. Load raw data first, transform inside the warehouse. The modern approach — cloud warehouses (BigQuery, Snowflake, Redshift) have the compute to transform at scale.
- ELT is now dominant because raw data is preserved for reprocessing, transformations are version-controlled SQL, and warehouses are optimised for this pattern.
Core pipeline principles
- Idempotency — running the pipeline multiple times produces the same result. Essential for safe reruns after failures. Achieve with MERGE/UPSERT logic or partition overwrites.
- Handling late-arriving data — events arrive out of order. Decide on a cutoff (e.g., accept events up to 3 days late). Reprocess affected partitions.
- Data quality checks — row count assertions, null checks, range validations, referential integrity. Run these before publishing data downstream. In dbt: use tests.
- Batch vs streaming — batch for daily/hourly aggregations; streaming (Kafka, Kinesis, Pub/Sub) for real-time requirements. Don't default to streaming — it's more complex and usually unnecessary.
- Orchestration — Airflow, Dagster, Prefect. Know what a DAG is. Understand retries, sensors, and task dependencies. Know why Airflow's scheduler can be a bottleneck at scale.
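Idempotency via partition overwrite can be sketched as delete-then-insert in a single transaction (sqlite3 as a stand-in; warehouse equivalents are MERGE or INSERT OVERWRITE, and the table here is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (day TEXT, amount REAL)")

def load_partition(conn, day, amounts):
    # Delete-then-insert inside one transaction: rerunning the same day
    # replaces the partition instead of duplicating it, and a mid-load
    # failure rolls back rather than leaving a half-written day.
    with conn:
        conn.execute("DELETE FROM daily_sales WHERE day = ?", (day,))
        conn.executemany("INSERT INTO daily_sales VALUES (?, ?)",
                         [(day, a) for a in amounts])

load_partition(conn, '2024-01-01', [10, 20])
load_partition(conn, '2024-01-01', [10, 20])  # safe rerun: still 2 rows
total = conn.execute("SELECT COUNT(*) FROM daily_sales").fetchone()[0]
print(total)  # 2
```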
Python for Data
What comes up in interviews
- pandas — groupby, merge, apply, pivot_table. Know how to chain operations. Know that vectorised operations are faster than apply(). Know when to use pandas vs SQL.
- List comprehensions & generators — write clean, Pythonic code. Generators for memory-efficient processing of large datasets.
- File handling — reading and writing CSV, JSON, and Parquet. Know why Parquet is better for analytics (columnar, compressed, schema-enforced).
- Error handling in pipeline code — try/except with specific exception types, logging with context, raising meaningful errors upstream.
- API integration — requests library, pagination patterns, rate limiting, handling auth (API key, OAuth). Very common in data engineering roles.
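The generator point in practice: stream a large CSV row by row instead of materialising the whole file (stdlib-only sketch; the column names are invented):

```python
import csv
import io

def iter_large_orders(lines, min_amount):
    """Yield qualifying rows one at a time -- memory stays flat however big the file is."""
    for row in csv.DictReader(lines):
        amount = float(row["amount"])
        if amount >= min_amount:
            yield row["order_id"], amount

# io.StringIO stands in for open("orders.csv") here.
data = io.StringIO("order_id,amount\n1,50\n2,500\n3,120\n")
big_orders = list(iter_large_orders(data, 100))
print(big_orders)  # [('2', 500.0), ('3', 120.0)]
```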
Cloud Data Platforms
What you need to know about each
- BigQuery — serverless, columnar, billed per bytes scanned. Partitioning (reduces scan cost) and clustering (collocates related data) are the main optimisation levers. Know how to avoid accidentally scanning terabytes.
- Snowflake — virtual warehouses (compute) are separate from storage. Time travel (query data as it was at a point in time). Zero-copy cloning for dev environments. Auto-suspend to manage compute cost.
- Redshift — based on PostgreSQL, columnar. Sort keys (how data is physically stored) and dist keys (how data is distributed across nodes) are key for performance. COPY command for bulk loading.
- Databricks / Spark — for large-scale distributed processing. Know what a DataFrame is, how lazy evaluation works, and why reading from Delta Lake is better than raw Parquet in production.
Common questions
Write a query to find users who logged in on 3 or more consecutive days.
Use LAG() to get the previous login date and flag rows where the gap is 1 day, or subtract ROW_NUMBER() from the date so each consecutive streak shares a key; then group per streak and keep users with a count of 3 or more.
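A full solution using the gaps-and-islands trick, runnable against sqlite3 (sample data invented): subtracting ROW_NUMBER() from the date yields a value that stays constant within each run of consecutive days.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE logins (user_id TEXT, login_date TEXT);
INSERT INTO logins VALUES
  ('a', '2024-01-01'), ('a', '2024-01-02'), ('a', '2024-01-03'),
  ('b', '2024-01-01'), ('b', '2024-01-03');
""")

# julianday(date) - ROW_NUMBER() is constant inside a streak of consecutive
# days, so grouping on it isolates each streak; DISTINCT in the inner query
# guards against duplicate logins on the same day inflating the count.
rows = conn.execute("""
    WITH streaks AS (
        SELECT user_id,
               julianday(login_date) - ROW_NUMBER() OVER (
                   PARTITION BY user_id ORDER BY login_date) AS streak_key
        FROM (SELECT DISTINCT user_id, login_date FROM logins)
    )
    SELECT DISTINCT user_id
    FROM streaks
    GROUP BY user_id, streak_key
    HAVING COUNT(*) >= 3
""").fetchall()
print(rows)  # [('a',)] -- only user 'a' has a 3-day streak
```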
What's the difference between a fact table and a dimension table?
Facts are measurements of events (transactional). Dimensions are the context (who, what, where, when). The grain defines one row in a fact table.
How would you design a pipeline to process 10 million events per day?
Cover source system, ingestion (streaming vs batch), landing zone, transformation, quality checks, and serving layer. Justify each choice.
How do you handle schema changes in a data pipeline without breaking downstream consumers?
Schema registries, backwards-compatible changes, contract testing, versioned models in dbt.
What's the difference between RANK() and DENSE_RANK()?
RANK skips numbers after ties (1, 2, 2, 4). DENSE_RANK does not (1, 2, 2, 3). Usually DENSE_RANK is what you want.
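Seen side by side via sqlite3 (illustrative scores):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE scores (points INTEGER);
INSERT INTO scores VALUES (100), (90), (90), (80);
""")

# RANK leaves a gap after the tie (..., 2, 2, 4); DENSE_RANK does not.
rows = conn.execute("""
    SELECT points,
           RANK()       OVER (ORDER BY points DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY points DESC) AS dense
    FROM scores
    ORDER BY points DESC
""").fetchall()
print(rows)  # [(100, 1, 1), (90, 2, 2), (90, 2, 2), (80, 4, 3)]
```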
A stakeholder says the revenue number in your dashboard doesn't match the number in finance's spreadsheet. How do you debug this?
Classic case study. Trace the data lineage. Find where the divergence occurs. Check grain, joins, filter conditions, timezone handling, and whether they're measuring the same thing.
Tips from the hiring side
What strong data candidates do differently
- SQL fluency is the baseline. Practice daily for two weeks before interviews. Window functions must feel natural.
- Know how to explain modeling decisions and not just implement them. "Why star schema here?" is as important as drawing the schema.
- Prepare a story about a pipeline you designed and the real problems you solved — late data, quality issues, schema drift, stakeholder trust.
- Know at least one cloud platform deeply. "I've used BigQuery" is very different from "I reduced our query costs by 60% by switching to partitioned tables."
- Don't default to streaming. If asked to design a pipeline, batch is usually the right answer unless real-time is an explicit requirement. Justify it.
- The revenue discrepancy question reveals how you think about data quality and stakeholder communication — not just SQL. Treat it seriously.