L4 / IC3 · 3–5 years

Data Engineer interview prep — what to expect

5 rounds · 3–5 weeks · 9 sample questions · $145–175k base

Data Engineer interviews lean SQL-heavy and system-design-heavy, with less product-sense framing than DS and less algorithmic coding than SWE. The loops centre on three things: deep SQL fluency, pipeline / warehouse design, and understanding the modern data stack (Spark, dbt, Snowflake / BigQuery / Databricks, Airflow / Dagster).

A typical L4 DE loop is: a recruiter screen, an SQL deep-dive (60 minutes of joins, window functions, optimisation), a Python coding round usually involving data manipulation, a pipeline / warehouse system design round, and a behavioural round. Some companies merge SQL and Python into a single technical screen. Startup loops compress to 3–4 rounds; FAANG and major-tech run 5–6.

The L4 bar is owning a pipeline end-to-end: ingestion, transformation, monitoring, on-call.

Personalised version

This guide covers general expectations for Data Engineer interviews. For a free report tailored to your specific job description — with predicted questions, comp benchmark, and experience-gap analysis — paste the JD into the free scan.


What you'll be expected to do

Build and maintain data pipelines from source systems into the warehouse — typically in Python or Scala with Spark / Airflow / dbt
Own data models in the warehouse: dimensional design, slowly-changing dimensions, incremental loads
Partner with DS, analytics, and product engineering on data contracts and instrumentation
Set up monitoring and alerting for pipeline freshness and data quality
Participate in data on-call rotations; debug stalled pipelines or broken downstream metrics
Optimise query performance in the warehouse — partitioning, clustering, materialised views

Typical interview process

Most companies follow a similar shape for Data Engineer interviews. Total calendar time: 3–5 weeks from recruiter screen to offer.

Recruiter screen

30-min phone call

Background, role calibration, motivation, comp expectations

SQL deep-dive

60-min

Joins, window functions, CTEs, query optimisation, dimensional modelling, slowly-changing dimensions. Live coding with a real-ish schema

Python / coding screen

60-min

Data manipulation problems in Python (often pandas or PySpark), or algorithmic LeetCode-medium. Sometimes a dbt or SQL-modelling exercise

Pipeline / warehouse system design

60-min

Design an end-to-end data pipeline: ingestion, transformation, warehouse model, monitoring. Common prompts: clickstream pipeline, finance reporting, event-tracking infra

Behavioural / hiring manager

45-min

On-call incidents, cross-functional partnership with DS / analytics, ambiguity in data contracts, communication with non-technical stakeholders

Sample questions you should be ready for

Representative of what companies ask at this level — not a complete list. For predicted questions tied to a specific job posting, run the free scan above.

Technical / coding

“Given a `transactions` table with user_id, amount, and event_timestamp, write a query to compute the rolling 30-day revenue per user. Then explain how you'd optimise it for a billion rows.”
“Walk me through how you'd design an incremental load from a source system that doesn't have a reliable updated_at timestamp.”
“Design a customer dimension table that supports point-in-time queries (e.g. "what was the customer's address on 2024-03-15?") without bloating storage indefinitely. Walk through your schema and how you'd write the query for a typical analytics use case.”
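As a rough illustration of the first question above, here is a minimal pure-Python sketch of rolling 30-day revenue per user. It assumes the window is the trailing interval (event_timestamp - 30 days, event_timestamp], which is one reasonable reading of the prompt; the function name and sample rows are hypothetical. In a warehouse the same logic would usually be a window function partitioned by user_id and ordered by event_timestamp, with frame syntax that varies by dialect.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def rolling_30d_revenue(transactions, window_days=30):
    """Return (user_id, event_timestamp, rolling_revenue) per transaction.

    rolling_revenue sums the user's amounts whose timestamp falls in
    (event_timestamp - window_days, event_timestamp].
    """
    window = timedelta(days=window_days)

    # Group by user first: the equivalent of PARTITION BY user_id.
    by_user = defaultdict(list)
    for user_id, amount, ts in transactions:
        by_user[user_id].append((ts, amount))

    out = []
    for user_id, rows in by_user.items():
        rows.sort(key=lambda r: r[0])   # the equivalent of ORDER BY event_timestamp
        live = deque()                  # transactions still inside the window
        running = 0
        for ts, amount in rows:
            live.append((ts, amount))
            running += amount
            # Evict transactions that have fallen out of the trailing window.
            while live[0][0] <= ts - window:
                _, old_amount = live.popleft()
                running -= old_amount
            out.append((user_id, ts, running))
    return out


# Hypothetical sample rows: (user_id, amount in cents, event_timestamp).
sample = [
    ("u1", 100, datetime(2024, 1, 1)),
    ("u1", 50, datetime(2024, 1, 15)),
    ("u1", 25, datetime(2024, 2, 5)),   # the 1 Jan row is out of the window by now
    ("u2", 10, datetime(2024, 1, 10)),
]
result = rolling_30d_revenue(sample)
```

Each transaction is appended and evicted at most once, so the work per user is linear after the sort. That is the property worth naming when the interviewer asks about a billion rows.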

System design

“Design an end-to-end clickstream pipeline from the web SDK to the warehouse. Cover ingestion, schema, and how you'd handle late-arriving events.”
“Design a finance-reporting data pipeline with a strict freshness SLA (data must be in the warehouse within 30 minutes of source events). Walk through ingestion, transformation, and monitoring.”
“Design a feature pipeline for an ML team that needs both batch and real-time features. Cover storage, the consistency contract, and how you'd test it.”
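For the freshness-SLA prompt above, the monitoring part can be sketched as two small checks. This is a minimal sketch assuming each warehouse row carries both the source event time and the load time; the column order, function names, and sample rows are hypothetical.

```python
from datetime import datetime, timedelta

# From the prompt: data must be in the warehouse within 30 minutes of the event.
FRESHNESS_SLA = timedelta(minutes=30)


def load_lag_breaches(rows, sla=FRESHNESS_SLA):
    """Return (event_id, lag) for rows that landed later than the SLA allows.

    rows: iterable of (event_id, event_time, loaded_at).
    """
    breaches = []
    for event_id, event_time, loaded_at in rows:
        lag = loaded_at - event_time
        if lag > sla:
            breaches.append((event_id, lag))
    return breaches


def is_stale(latest_event_time, now, sla=FRESHNESS_SLA):
    """True when the newest event in the warehouse is older than the SLA."""
    return now - latest_event_time > sla


# Hypothetical sample rows for illustration only.
rows = [
    ("e1", datetime(2024, 5, 1, 9, 0), datetime(2024, 5, 1, 9, 12)),
    ("e2", datetime(2024, 5, 1, 9, 5), datetime(2024, 5, 1, 9, 41)),
    ("e3", datetime(2024, 5, 1, 9, 10), datetime(2024, 5, 1, 9, 40)),
]
breaches = load_lag_breaches(rows)
```

The staleness check cannot tell a stalled pipeline from a quiet source, so in a design discussion it is worth pairing it with a heartbeat event or a source-side row count.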

Behavioural (STAR method)

“Tell me about a pipeline that failed in production. Walk through the incident from detection to root cause to fix.”
“Describe a time a downstream stakeholder (DS, analytics, finance) was blocked by a data issue you owned. How did you handle it?”
“Tell me about a data-modelling decision you made that you'd reverse today. What changed your mind?”

Compensation benchmark

Median compensation for Data Engineers at major US tech companies; headline numbers are in USD. London / Berlin / Singapore typically pay 30–50% less in base terms; equity ratios vary by company stage.

Base salary: $145–175k (SF/NYC)

Equity (annual vest): $50–110k/yr

Bonus: 10–15%

FAANG L4 Data Engineer total comp at the 50th percentile is $230–300k. Comp tracks the L4 SWE band with a slight discount (5–10%) at most companies, and is equivalent or higher at data-infra companies (Snowflake, Databricks, Confluent).

How to prep — five tactical tips

Lead behavioural answers with the STAR method — Situation, Task, Action, Result. The tactical tips below build on that structure for this specific role.

Drill 60+ SQL questions with a focus on window functions, recursive CTEs, and query optimisation. SQL fluency is the most-tested skill at this level
Know one warehouse cold (Snowflake, BigQuery, or Databricks) — partitioning, clustering, materialised views, cost optimisation. Loops often probe the warehouse the team actually uses
Read 'Designing Data-Intensive Applications' (Kleppmann) chapters 1–6 — the foundational reference for the system design round
Practise dbt or SQL modelling for Type-1 and Type-2 dimensions, incremental models, snapshots
Have 5–6 STAR stories with specific numbers: pipeline volume (rows / GB / events per day), SLAs, on-call incidents you led
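To make the Type-2 dimension practice above concrete, here is a minimal in-memory sketch of a point-in-time lookup and a change that closes the current version. It assumes valid_from is inclusive, valid_to is exclusive, and None marks the current row; the table contents, column names, and function names are hypothetical.

```python
from datetime import date

# Hypothetical Type-2 customer dimension: one row per version of the address.
dim_customer = [
    {"customer_id": 1, "address": "12 Oak St",
     "valid_from": date(2023, 1, 1), "valid_to": date(2024, 3, 1)},
    {"customer_id": 1, "address": "7 Elm Ave",
     "valid_from": date(2024, 3, 1), "valid_to": date(2024, 6, 10)},
    {"customer_id": 1, "address": "3 Pine Rd",
     "valid_from": date(2024, 6, 10), "valid_to": None},
]


def address_as_of(rows, customer_id, as_of):
    """Return the address that was valid on as_of, or None if no version matches."""
    for row in rows:
        if row["customer_id"] != customer_id:
            continue
        ends = row["valid_to"]
        if row["valid_from"] <= as_of and (ends is None or as_of < ends):
            return row["address"]
    return None


def apply_change(rows, customer_id, new_address, changed_on):
    """Close the current version and open a new one, only if the value changed."""
    current = next(
        (r for r in rows
         if r["customer_id"] == customer_id and r["valid_to"] is None),
        None,
    )
    if current is not None:
        if current["address"] == new_address:
            return  # no new version: unchanged values must not add rows
        current["valid_to"] = changed_on
    rows.append({"customer_id": customer_id, "address": new_address,
                 "valid_from": changed_on, "valid_to": None})


address_on_march_15 = address_as_of(dim_customer, 1, date(2024, 3, 15))  # "7 Elm Ave"
```

Writing a new version only when a tracked attribute actually changes is the simplest lever against unbounded storage growth. The half-open validity interval keeps the point-in-time predicate free of overlaps at version boundaries.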

Where Data Engineer candidates fail

A few common mistakes that get Data Engineer candidates rejected even when they're otherwise strong. Worth spotting in a mock interview before they show up in a real one.

Designing the data pipeline without thinking about late-arriving or out-of-order events.

Why it fails

DE system design at L4 grades on whether you understand real-world data messiness. Pipelines never get clean, in-order events from the source — there are retries, delays, clock skew, schema drift. Candidates who design assuming pristine data signal "I've worked with the warehouse but not the ingestion side." The interviewer is waiting for you to ask about late arrivals or mention windowing.

Fix

In the first 5 minutes of any pipeline question, ask: how late can events arrive, are they ordered, what's the duplication contract, what happens if the source schema changes. Then design around the answer. Even one question about late arrivals tells the interviewer you've operated in production.
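Once those questions are answered, the design usually needs a deduplication rule and a lateness policy. The sketch below shows one way to express both for an hourly event count. It assumes events carry a unique event_id, that the watermark is simply the largest event time seen so far, and that one hour of lateness is allowed. The class name and constant are hypothetical, and real stream processors manage watermarks and state far more carefully.

```python
from collections import defaultdict
from datetime import datetime, timedelta

ALLOWED_LATENESS = timedelta(hours=1)  # assumption: agreed with the source team


def hour_bucket(ts):
    """Truncate a timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


class HourlyCounter:
    """Counts events per event-time hour, tolerating retries and disorder."""

    def __init__(self, allowed_lateness=ALLOWED_LATENESS):
        self.allowed_lateness = allowed_lateness
        self.seen_ids = set()            # dedup contract: event_id is unique
        self.counts = defaultdict(int)   # hour bucket -> event count
        self.watermark = None            # largest event time seen so far
        self.too_late = []               # events routed to a backfill path

    def ingest(self, event_id, event_time):
        if event_id in self.seen_ids:
            return "duplicate"           # a retry from the source; ignore it
        self.seen_ids.add(event_id)

        if self.watermark is None or event_time > self.watermark:
            self.watermark = event_time

        bucket = hour_bucket(event_time)
        window_end = bucket + timedelta(hours=1)
        # The window is closed once the watermark passes its end plus lateness.
        if self.watermark > window_end + self.allowed_lateness:
            self.too_late.append((event_id, event_time))
            return "too_late"

        self.counts[bucket] += 1
        return "accepted"
```

Out-of-order events inside the lateness bound still land in the right hour, while anything older goes to a side list rather than being silently dropped. Saying where the too-late events go (a backfill job, a corrections table) is usually the part interviewers probe.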

Writing SQL that works but doesn't acknowledge cost or performance at scale.

Why it fails

L4 DE interviewers grade SQL on two axes: correctness and efficiency. A query that joins three tables without thinking about index / partition / clustering keys signals you'd write expensive queries in production. The senior DE answer thinks about distribution, partition pruning, broadcast joins, materialisation — even when the question is just about correctness.

Fix

After any SQL answer, narrate the optimisation lens: which columns you'd partition on, which indices would help, when you'd materialise versus query at read time, and what changes if the data grows to a billion rows. Even one sentence about scale tells the interviewer you think about cost.
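A toy model can make the partition-pruning part of that narration concrete. The sketch below assumes a table partitioned by event_date and compares rows scanned with and without pruning for a date-range filter. The function names and generated rows are hypothetical, and a real warehouse prunes at the file or micro-partition level rather than through a Python dict.

```python
from collections import defaultdict
from datetime import date, timedelta


def build_partitions(rows):
    """Group rows by event_date, mimicking a date-partitioned table."""
    parts = defaultdict(list)
    for row in rows:
        parts[row["event_date"]].append(row)
    return parts


def revenue_full_scan(rows, start, end):
    """Unpartitioned table: every row is read, then filtered."""
    scanned = total = 0
    for row in rows:
        scanned += 1
        if start <= row["event_date"] <= end:
            total += row["amount"]
    return total, scanned


def revenue_pruned(partitions, start, end):
    """Partitioned table: only partitions inside the date range are read."""
    scanned = total = 0
    day = start
    while day <= end:
        for row in partitions.get(day, []):
            scanned += 1
            total += row["amount"]
        day += timedelta(days=1)
    return total, scanned


# Hypothetical data: 90 days, 10 rows per day, amount 1 each.
rows = [
    {"event_date": date(2024, 1, 1) + timedelta(days=d), "amount": 1}
    for d in range(90)
    for _ in range(10)
]
partitions = build_partitions(rows)

full = revenue_full_scan(rows, date(2024, 3, 1), date(2024, 3, 7))
pruned = revenue_pruned(partitions, date(2024, 3, 1), date(2024, 3, 7))
```

Both calls return the same total, but the pruned version reads 70 rows instead of 900. Pruning only helps when the filter is on the partition column, which is why naming the partition key belongs in the answer.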

Discussing past pipelines without naming the SLA, volume, or what broke in production.

Why it fails

DE interviewers calibrate against ownership of real pipelines. "I built a pipeline that loaded data into the warehouse" tells the interviewer nothing. "I own the clickstream pipeline at ~200M events/day with a 15-minute freshness SLA, and we had three SLA breaches last quarter — here's what caused them and what we changed" lets them peg you immediately.

Fix

For each pipeline you've owned, attach three numbers: volume (events / rows / GB per day), SLA (freshness in minutes / hours), and reliability (incidents per quarter, time to detect, time to recover). Even rough numbers ground the story in real operations.
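If the reliability numbers are not already at hand, they can be derived from an incident log. This is a minimal sketch assuming each incident records when the failure started, when it was detected, and when it was resolved, with time to recover measured from start to resolution. The log entries are made-up placeholders, not real data.

```python
from datetime import datetime, timedelta


def reliability_summary(incidents):
    """Summarise a quarter of incidents.

    incidents: list of (started_at, detected_at, resolved_at).
    """
    n = len(incidents)
    if n == 0:
        return {"incidents": 0,
                "mean_time_to_detect": None,
                "mean_time_to_recover": None}
    detect = sum((d - s for s, d, _ in incidents), timedelta())
    recover = sum((r - s for s, _, r in incidents), timedelta())
    return {"incidents": n,
            "mean_time_to_detect": detect / n,
            "mean_time_to_recover": recover / n}


# Placeholder incident log for illustration only.
incident_log = [
    (datetime(2024, 4, 2, 2, 0), datetime(2024, 4, 2, 2, 20), datetime(2024, 4, 2, 3, 30)),
    (datetime(2024, 5, 9, 14, 0), datetime(2024, 5, 9, 14, 10), datetime(2024, 5, 9, 14, 40)),
    (datetime(2024, 6, 21, 9, 0), datetime(2024, 6, 21, 9, 30), datetime(2024, 6, 21, 10, 50)),
]
summary = reliability_summary(incident_log)
```

With these placeholder entries the summary is three incidents, a 20-minute mean time to detect, and an 80-minute mean time to recover, which is the level of precision a STAR story needs.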

Recommended resources

Books, courses, and tools that come up most often in Data Engineer prep. No affiliate links.

01. Designing Data-Intensive Applications (Kleppmann): The canonical reference for the pipeline / warehouse system design round. Read chapters 1–6 cover-to-cover.
02. DataLemur: SQL practice with real DE / DS interview problems. Focus on window functions and CTEs.
03. The Analytics Engineering Guide (dbt Labs): Modern data-stack patterns. Helpful for the warehouse-modelling questions.
04. Snowflake — Performance Optimisation Docs: Even if you don't use Snowflake at work, the cost-and-performance patterns transfer. Worth a skim before the system design round.
05. Kimball Group — Dimensional Modelling Techniques: The reference for SCD types, fact / dimension design, and warehouse architecture.

Frequently asked questions

Is this guide useful if I'm a SWE moving into Data Engineering, or a BI / analyst transitioning?

Yes — the L4 / IC3 bar described here applies whether you came from backend engineering, analytics, or directly from a DE role. SWE-to-DE transitions usually have a strong coding base but need to drill SQL window functions and warehouse-specific concepts (partitioning, slowly-changing dimensions). Analyst-to-DE transitions have strong SQL but need to build the pipeline / system-design intuition. Prep the gap that's actually your weak side.

How long should I prep before my Data Engineer onsite?

The process itself takes 3–5 weeks. Add 4–6 weeks of prep beforehand; SQL drilling and one canonical pipeline design problem are the highest-leverage areas. Don't skip the warehouse-specific docs for the company's stack (Snowflake / BigQuery / Databricks).

What's the most common mistake candidates make at the Data Engineer bar?

Under-investing in system design. Many candidates with strong SQL get filtered because they treat the pipeline design round as a casual chat instead of a 60-minute structured discussion. Practise the design framework: ingestion → schema → transformation → storage → monitoring, with explicit trade-offs at each step.

What if my interview process is different from what's listed?

Most variation is at the edges. Major tech companies (FAANG, scale-ups, mid-size SaaS) follow processes within 1–2 rounds of what's described. Smaller startups often run fewer rounds (3–4) but the bar at each round is similar; less-tech-mature companies sometimes skip system design or behavioural rounds entirely. Read the JD and ask the recruiter at the screen — they'll tell you what's coming.

How does this guide compare to running a free scan?

This guide covers the general bar at L4 / IC3. The free scan reads your specific job description and returns predicted questions for that exact role + company, a calibrated comp benchmark, and (with your CV) experience-gap analysis and an ATS resume check. PDF emailed.

Ready to prep for a real role?

Paste any Data Engineer JD or job URL, get a personalised report.

Drop a LinkedIn, Greenhouse, Lever, or Levels.fyi link — or paste the JD text directly. Predicted questions for that company, your specific experience gaps, and a compensation benchmark calibrated to the role and location. PDF emailed to you.
