L4 / IC3 · 3–5 years

Data Engineer interview prep: what to expect

5 rounds · 3–5 weeks · 9 sample questions · $145–175k base

If you're heading into a Data Engineer loop, expect it to be SQL-heavy and system-design-heavy, with less product-sense framing than DS and less algorithmic coding than SWE. The loops centre on three things: deep SQL fluency, pipeline / warehouse design, and understanding the modern data stack (Spark, dbt, Snowflake / BigQuery / Databricks, Airflow / Dagster).

A typical L4 DE loop is: a recruiter screen, an SQL deep-dive (60 minutes of joins, window functions, optimisation), a Python coding round usually involving data manipulation, a pipeline / warehouse system design round, and a behavioural round. Some companies merge SQL and Python into a single technical screen. Startup loops compress to 3–4 rounds; FAANG and other major tech companies run 5–6.

The L4 bar is owning a pipeline end-to-end: ingestion, transformation, monitoring, on-call.

Personalised version

This guide covers the general bar for Data Engineer roles. The Chrome extension runs the same prep on every JD you open: predicted questions for that company, voice practice with your AI coach on each answer, a comp benchmark, gap analysis, plus cover-letter and intro drafts. Free to install with a preview on every posting; unlock the full report from $3.99. Or run a one-off scan on a single JD without installing.

2026 update

This guide covers the general bar for Data Engineer roles. A few things have changed in 2026: AI is now allowed in coding rounds at Canva and Meta, detection has improved at companies that still ban it, comp has split at staff+, and the post-onsite wait has got longer.

What you'll be expected to do

Typical interview process

Most companies follow a similar shape for Data Engineer interviews. Total calendar time: 3–5 weeks from recruiter screen to offer.

  1. Recruiter screen (30-min phone call): background, role calibration, motivation, comp expectations
  2. SQL deep-dive (60 min): joins, window functions, CTEs, query optimisation, dimensional modelling, slowly-changing dimensions. Live coding against a realistic schema
  3. Python / coding screen (60 min): data manipulation problems in Python (often pandas or PySpark), or an algorithmic LeetCode-medium problem. Sometimes a dbt or SQL-modelling exercise
  4. Pipeline / warehouse system design (60 min): design an end-to-end data pipeline covering ingestion, transformation, warehouse model, and monitoring. Common prompts: clickstream pipeline, finance reporting, event-tracking infra
  5. Behavioural / hiring manager (45 min): on-call incidents, cross-functional partnership with DS / analytics, ambiguity in data contracts, communication with non-technical stakeholders
[Figure: bar chart of interview rounds by tech role for 2026, showing where Data Engineer sits among comparable roles]
Data Engineer runs 5 rounds. See where every role lands in the 2026 Tech Interview Report.

Sample questions you should be ready for

Representative of what companies ask at this level, not a complete list. Run the free scan above for predicted questions tied to a specific job posting. The Chrome extension adds voice practice with AI coaching on every answer (technical, system design, behavioural, motivation).

Technical / coding
  • Given a `transactions` table with user_id, amount, and event_timestamp, write a query to compute the rolling 30-day revenue per user. Then explain how you'd optimise it for a billion rows.
  • Walk me through how you'd design an incremental load from a source system that doesn't have a reliable updated_at timestamp.
  • Design a customer dimension table that supports point-in-time queries (e.g. "what was the customer's address on 2024-03-15?") without bloating storage indefinitely. Walk through your schema and how you'd write the query for a typical analytics use case.
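For the point-in-time question, one common shape is a Type-2 dimension with a validity interval on each row. The sketch below is a minimal illustration using Python's built-in sqlite3, not a recommended production schema. The table name `dim_customer`, the columns `valid_from` / `valid_to`, the sample addresses, and the helper `address_as_of` are all assumptions made for the example. It treats `valid_from` as inclusive, `valid_to` as exclusive, and a NULL `valid_to` as the current row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_id INTEGER,
    address     TEXT,
    valid_from  TEXT,  -- inclusive, ISO-8601 date
    valid_to    TEXT   -- exclusive; NULL marks the current row
);
INSERT INTO dim_customer VALUES
    (1, '12 Old Street', '2023-01-01', '2024-03-01'),
    (1, '99 New Avenue', '2024-03-01', NULL);
""")

def address_as_of(conn, customer_id, as_of):
    """Return the address that was valid on `as_of` (ISO date), or None."""
    row = conn.execute(
        """
        SELECT address
        FROM dim_customer
        WHERE customer_id = ?
          AND valid_from <= ?
          AND (valid_to IS NULL OR valid_to > ?)
        """,
        (customer_id, as_of, as_of),
    ).fetchone()
    return row[0] if row else None

print(address_as_of(conn, 1, "2024-03-15"))
```

A new row is only written when a tracked attribute changes, which is what keeps storage growth tied to the change rate rather than to the number of snapshot days. In the interview, it is worth saying which attributes you would track this way and which you would simply overwrite.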
System design
  • Design an end-to-end clickstream pipeline from the web SDK to the warehouse. Cover ingestion, schema, and how you'd handle late-arriving events.
  • Design a finance-reporting data pipeline with a strict freshness SLA (data must be in the warehouse within 30 minutes of source events). Walk through ingestion, transformation, and monitoring.
  • Design a feature pipeline for an ML team that needs both batch and real-time features. Cover storage, the consistency contract, and how you'd test it.
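For the freshness-SLA prompt, the monitoring piece can start as a comparison between the newest event loaded and the current time. The sketch below assumes the warehouse can report the maximum event timestamp for a table. The helper name `check_freshness` is hypothetical, and the 30-minute threshold is taken from the prompt.

```python
from datetime import datetime, timedelta, timezone

SLA = timedelta(minutes=30)  # from the prompt: in the warehouse within 30 minutes

def check_freshness(latest_event_ts, now=None, sla=SLA):
    """Return (is_fresh, lag) for the newest event loaded into the warehouse.

    Assumes `latest_event_ts` is a timezone-aware datetime, for example the
    result of a MAX(event_timestamp) query against the target table.
    """
    now = now or datetime.now(timezone.utc)
    lag = now - latest_event_ts
    return lag <= sla, lag

now = datetime(2024, 3, 15, 12, 0, tzinfo=timezone.utc)
is_fresh, lag = check_freshness(now - timedelta(minutes=45), now=now)
print(is_fresh, lag)
```

One limitation worth raising yourself: a check on event time alone cannot tell a stalled pipeline from a quiet source, so a real design would probably pair it with a heartbeat or a row-count check.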
Behavioural (STAR method)
  • Tell me about a pipeline that failed in production. Walk through the incident from detection to root cause to fix.
  • Describe a time a downstream stakeholder (DS, analytics, finance) was blocked by a data issue you owned. How did you handle it?
  • Tell me about a data-modelling decision you made that you'd reverse today. What changed your mind?

Compensation benchmark

Median compensation for Data Engineers at major US tech companies, with headline numbers in USD. Pay in markets like London, Berlin and Singapore tends to be meaningfully lower in base terms, and equity ratios vary by company stage.

Base salary: $145–175k (SF/NYC)
Equity (annual vest): $50–110k/yr
Bonus: 10–15%

FAANG L4 Data Engineer total comp at the 50th percentile is $230–300k. Comp tracks the L4 SWE band with a slight discount (5–10%) at most companies; it is equivalent or higher at data-infra companies (Snowflake, Databricks, Confluent).

How to prep: five tactical tips

Lead behavioural answers with the STAR method: Situation, Task, Action, Result. The tactical tips below build on that structure for this specific role.

  1. Drill 60+ SQL questions with a focus on window functions, recursive CTEs, and query optimisation. SQL fluency is the most-tested skill at this level
  2. Know one warehouse cold (Snowflake, BigQuery, or Databricks): partitioning, clustering, materialised views, cost optimisation. Loops often probe the warehouse the team actually uses
  3. Read 'Designing Data-Intensive Applications' (Kleppmann) chapters 1–6, the foundational reference for the system design round
  4. Practise dbt or SQL modelling for Type-1 and Type-2 dimensions, incremental models, and snapshots
  5. Have 5–6 STAR stories with specific numbers: pipeline volume (rows / GB / events per day), SLAs, on-call incidents you led
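As a drill for tip 1, it helps to know what a rolling window is doing under the hood. The sketch below works through the rolling 30-day revenue question from the sample list in Python rather than SQL. It assumes "rolling 30-day" means the window (t − 30 days, t] for each transaction. The function name `rolling_30d_revenue` and the `(user_id, amount, event_timestamp)` tuple layout are illustrative. In SQL the same logic usually maps to a window function partitioned by user_id and ordered by event_timestamp.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

def rolling_30d_revenue(transactions, window=timedelta(days=30)):
    """Return (user_id, event_timestamp, rolling_revenue) for each transaction.

    `transactions` is an iterable of (user_id, amount, event_timestamp).
    The window is (ts - window, ts], an assumed definition.
    """
    in_window = defaultdict(deque)  # user_id -> (ts, amount) pairs still in window
    totals = defaultdict(float)     # user_id -> running sum of the window
    results = []
    # Sorting by user, then time, lets each row enter and leave its deque once.
    for user_id, amount, ts in sorted(transactions, key=lambda r: (r[0], r[2])):
        queue = in_window[user_id]
        queue.append((ts, amount))
        totals[user_id] += amount
        # Evict rows that have fallen out of the window.
        while queue[0][0] <= ts - window:
            _, old_amount = queue.popleft()
            totals[user_id] -= old_amount
        results.append((user_id, ts, totals[user_id]))
    return results

rows = [
    ("u1", 10.0, datetime(2024, 1, 1)),
    ("u1", 5.0, datetime(2024, 1, 15)),
    ("u1", 7.0, datetime(2024, 2, 5)),
]
for result in rolling_30d_revenue(rows):
    print(result)
```

The billion-row follow-up tends to come back to the same two ideas shown here: group the work by user so it can be distributed, and keep a running total rather than re-scanning the window for every row.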

Where Data Engineer candidates fail

A few common mistakes that get Data Engineer candidates rejected even when they're otherwise strong. Worth spotting in a mock interview before they show up in a real one.

01

Designing the data pipeline without thinking about late-arriving or out-of-order events.

Why it fails

DE system design at L4 grades on whether you understand real-world data messiness. Pipelines never get clean, in-order events from the source: there are retries, delays, clock skew, and schema drift. Candidates who design assuming pristine data signal "I've worked with the warehouse but not the ingestion side." The interviewer is waiting for you to ask about late arrivals or mention windowing.

Fix

In the first 5 minutes of any pipeline question, ask: how late can events arrive, are they ordered, what's the duplication contract, what happens if the source schema changes. Then design around the answer. Even one question about late arrivals tells the interviewer you've operated in production.
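Once you have answers to those questions, the design usually needs two mechanisms: deduplication on a stable key and a rule for events that arrive too late. The sketch below is a deliberately small illustration, assuming each event carries an `event_id` and an `event_time`, and that the contract allows one hour of lateness behind a watermark. The `triage` function, the field names, and the one-hour figure are all hypothetical.

```python
from datetime import datetime, timedelta

ALLOWED_LATENESS = timedelta(hours=1)  # assumed contract, agreed with the source team

def triage(events, watermark, seen_ids):
    """Split events into (accepted, late), dropping duplicates by event_id.

    `seen_ids` is a set that persists between batches; in a real pipeline it
    would live in a state store rather than in memory.
    """
    accepted, late = [], []
    for event in events:
        if event["event_id"] in seen_ids:
            continue  # duplicate delivery, for example a client retry
        seen_ids.add(event["event_id"])
        if event["event_time"] < watermark - ALLOWED_LATENESS:
            late.append(event)      # route to a backfill or correction path
        else:
            accepted.append(event)  # still inside the open window
    return accepted, late

watermark = datetime(2024, 3, 15, 12, 0)
batch = [
    {"event_id": "a", "event_time": datetime(2024, 3, 15, 11, 50)},
    {"event_id": "a", "event_time": datetime(2024, 3, 15, 11, 50)},  # retry
    {"event_id": "b", "event_time": datetime(2024, 3, 15, 10, 30)},  # too late
]
accepted, late = triage(batch, watermark, set())
print(len(accepted), len(late))
```

What happens to the `late` list is the design decision interviewers tend to probe: dropping, reprocessing the affected partition, or writing a correction record are all defensible if you state the trade-off.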

02

Writing SQL that works but doesn't acknowledge cost or performance at scale.

Why it fails

L4 DE interviewers grade SQL on two axes: correctness and efficiency. A query that joins three tables without thinking about index / partition / clustering keys signals you'd write expensive queries in production. The senior DE answer thinks about distribution, partition pruning, broadcast joins, materialisation, even when the question is just about correctness.

Fix

After any SQL answer, narrate the optimisation lens: which columns you'd partition on, which indices would help, when you'd materialise versus query at read time, and what changes if the data grows to a billion rows. Even one sentence about scale tells the interviewer you think about cost.

03

Discussing past pipelines without naming the SLA, volume, or what broke in production.

Why it fails

DE interviewers calibrate against ownership of real pipelines. "I built a pipeline that loaded data into the warehouse" tells the interviewer nothing. "I own the clickstream pipeline at ~200M events/day with a 15-minute freshness SLA, and we had three SLA breaches last quarter; here's what caused them and what we changed" lets them peg you immediately.

Fix

For each pipeline you've owned, attach three numbers: volume (events / rows / GB per day), SLA (freshness in minutes / hours), and reliability (incidents per quarter, time to detect, time to recover). Even rough numbers ground the story in real operations.

Common scenarios

I'm a DBA / data warehouse engineer with 6 years of experience, mostly on SQL Server and SSIS. How do I retool for modern DE roles at companies running Snowflake, dbt and Airflow?

The hardest part isn't the tools; it's the mental shift from "the warehouse is the system" to "the warehouse is one component of a pipeline". Old-stack DBAs often nail the SQL round and then bomb the system-design round because they think in terms of stored procedures and ETL jobs rather than DAGs and event streams. Six weeks of focused work usually closes most of the gap. Build a small end-to-end project on the modern stack: ingest a public dataset (NYC taxi, GitHub events) into Snowflake or BigQuery via Airflow or Dagster, transform it with dbt, and document the lineage. The point isn't portfolio polish; it's having vocabulary and intuition for the patterns interviewers will ask about. Read the dbt docs end-to-end: "sources", "models", "tests" and "materialisations" are the lingua franca of modern DE. Your SQL depth is still the strength; lean on it in the coding round, and frame your stored-procedure / SSIS work as "the precursor patterns to modern transformations" rather than apologising for it.

I'm a backend engineer with 5 years of experience considering a move to Data Engineering. Is this a downlevel risk, and how do I prep when I've never owned a pipeline?

Title and level usually translate one-to-one (Senior SWE → Senior DE, L5 → L5) at companies with mature data orgs; Stripe, Airbnb, Shopify and Netflix all hire backend engineers into senior DE roles regularly. The risk isn't level; it's the system-design round, which tests a different shape of design from backend services. Backend engineers designing pipelines from scratch tend to forget about late-arriving events, schema evolution, backfill strategy, and how to make a pipeline idempotent, all the things experienced DEs have been burned by. In the three weeks before the interview, build intuition: read Designing Data-Intensive Applications chapters 5, 11 and 12, and spec out one pipeline end-to-end (e.g., clickstream → real-time aggregations → daily warehouse loads) with explicit decisions on watermarking, replay, and schema registry. The SQL round is usually less painful than backend engineers expect; drill window functions and CTEs for two weeks and you'll be fine. The interview signal you most need to broadcast: "I think about data quality and lineage, not just data movement".
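Idempotency is the item on that list that is easiest to show in a few lines. One common pattern is to replace a whole partition inside a single transaction, so that a rerun or backfill of the same day leaves the table unchanged. The sketch below uses Python's built-in sqlite3 as a stand-in for a warehouse. The table `daily_events`, its columns, and the `load_partition` helper are assumptions made for the example.

```python
import sqlite3

def load_partition(conn, day, rows):
    """Replace one day's partition atomically so reruns do not duplicate rows."""
    with conn:  # commits on success, rolls back if anything raises
        conn.execute("DELETE FROM daily_events WHERE event_date = ?", (day,))
        conn.executemany(
            "INSERT INTO daily_events (event_date, user_id, clicks) VALUES (?, ?, ?)",
            [(day, user_id, clicks) for user_id, clicks in rows],
        )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_events (event_date TEXT, user_id TEXT, clicks INTEGER)"
)

rows = [("u1", 3), ("u2", 5)]
load_partition(conn, "2024-03-15", rows)
load_partition(conn, "2024-03-15", rows)  # rerun or backfill of the same day

count = conn.execute("SELECT COUNT(*) FROM daily_events").fetchone()[0]
print(count)
```

A plain append would have doubled the rows on the second call. Warehouses offer their own ways to get the same effect (partition overwrite, merge statements), so treat this as the idea rather than the syntax you would use on a given platform.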

I'm a data scientist who's tired of modelling and wants to move into Data Engineering. Is it a step backwards, and how do I prep?

Not a step backwards. It's a sideways move into a more durable career path, and a lot of senior DS folks make exactly this choice once they've watched their fifth model get killed by upstream data quality. Levelling usually translates straight across (Senior DS → Senior DE). Where DS-to-DE candidates get filtered is the system-design round, because DS interviews don't test pipeline design and the gap shows. Plan two months of focused prep: build one production-grade pipeline end-to-end on the modern stack (Airflow / dbt / Snowflake or whatever your target company uses), with monitoring, alerting, and a documented backfill strategy. Read the dbt docs, the Airflow best practices guide, and one warehouse vendor's optimisation docs cover-to-cover. Your SQL is already strong, which buys you the coding round. In behavioural, your story is that you've operated on the consumer side of bad data for years and now you want to fix it upstream. That's a better narrative than "I want to do less ML", which signals burnout to interviewers.

I've worked exclusively on Apache Spark for 4 years at a Hadoop-era company. How do I interview at a Snowflake-native company without the warehouse experience they expect?

Same field, different dialect, and the gap is genuinely smaller than recruiters often make it sound. Snowflake's mental model is closer to a traditional warehouse than Spark's is, so if you came up on Hive / Spark you may actually have a *deeper* understanding of distributed compute than Snowflake-native candidates. Where the friction shows is the system-design round, where the company expects you to reason in their idiom: warehouse-first transformations (dbt), micro-partitions and clustering keys, COPY INTO vs streams, and cost rather than CPU. Two weeks of Snowflake docs (focus on "performance considerations", "clustering", "materialised views", "streams and tasks") plus a small hands-on project covers the vocab gap. In the interview, frame your Spark background as "production-scale distributed compute" and acknowledge once that the warehouse-native idioms are newer to you. Don't oversell familiarity with Snowflake specifics if you don't have it; interviewers can tell, and it's better to demonstrate fast learning than fake fluency.

I'm a DE at a 50-person Series B startup, owning the whole data platform solo. How do I prep for a FAANG DE interview where I'll be one of 50 DEs on a specialised team?

Generalist breadth is the strength; the gap is depth-of-specialisation language. At a startup, your pipeline volume might be 10M events a day; FAANG DE interviews assume you reason about petabyte-scale storage, hundreds of billions of events a day, multi-region replication, and query cost in dollars, not seconds. Before the interview, drill napkin math: "this pipeline processes 5B events / day, at ~2KB each, so 10TB / day raw; with 3x replication and 30-day retention, that's 900TB hot storage". The system-design round is the hardest delta: practise designing pipelines where you size out every component (Kafka throughput, S3 storage, Spark / Snowflake compute cost, latency budget at p99). Drop the "I built the whole platform" framing; at FAANG that reads as "shallow at every layer". Reframe as "I made the architecture choice between [X] and [Y] for [specific reason], and here's what the trade-off bought us". The behavioural round is where startup DEs often shine: speed of shipping, ownership, cross-functional fluency. Lean into it.
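The napkin math in that answer is worth being able to reproduce on demand. The sketch below simply re-derives the quoted figures, assuming decimal units (1 KB = 1,000 bytes, 1 TB = 10^12 bytes) and a uniform event rate across the day. The function names are illustrative.

```python
SECONDS_PER_DAY = 86_400

def daily_raw_tb(events_per_day, bytes_per_event):
    """Raw daily volume in decimal terabytes."""
    return events_per_day * bytes_per_event / 1e12

def hot_storage_tb(daily_tb, replication, retention_days):
    """Hot storage needed to hold `retention_days` of replicated data."""
    return daily_tb * replication * retention_days

def avg_events_per_second(events_per_day):
    """Average rate, assuming a flat traffic profile (real peaks run higher)."""
    return events_per_day / SECONDS_PER_DAY

daily = daily_raw_tb(5e9, 2_000)          # 10.0 TB / day raw
hot = hot_storage_tb(daily, 3, 30)        # 900.0 TB hot storage
rate = avg_events_per_second(5e9)         # roughly 57,870 events / second
print(daily, hot, round(rate))
```

The per-second rate is the number that feeds the Kafka throughput sizing; in an interview, state the peak-to-average multiplier you are assuming rather than sizing for the flat average.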

Frequently asked questions

Is this guide useful if I'm a SWE moving into Data Engineering, or a BI / analyst transitioning?

Yes. The L4 / IC3 bar described here applies whether you came from backend engineering, analytics, or directly from a DE role. SWE-to-DE transitions usually have a strong coding base but need to drill SQL window functions and warehouse-specific concepts (partitioning, slowly-changing dimensions). Analyst-to-DE transitions have strong SQL but need to build the pipeline / system-design intuition. Prep the gap that's actually your weak side.

How long should I prep before my Data Engineer onsite?

The process takes 3–5 weeks. Add 4–6 weeks of prep; SQL drilling and one canonical pipeline design problem are the highest-leverage activities. Don't skip the warehouse-specific docs for the company's stack (Snowflake / BigQuery / Databricks).

What's the most common mistake candidates make at the Data Engineer bar?

Under-investing in system design. Many candidates with strong SQL get filtered because they treat the pipeline design round as a casual chat instead of a 60-minute structured discussion. Practise the design framework: ingestion → schema → transformation → storage → monitoring, with explicit trade-offs at each step.

What if my interview process is different from what's listed?

Most variation is at the edges. Major tech companies (FAANG, scale-ups, mid-size SaaS) follow processes within 1–2 rounds of what's described. Smaller startups often run fewer rounds (3–4) but the bar at each round is similar; less-tech-mature companies sometimes skip system design or behavioural rounds entirely. Read the JD and ask the recruiter during the screen; they'll tell you what's coming.

How does this guide compare to running a free scan?

This guide covers the general bar at L4 / IC3. The free scan reads your specific job description and returns predicted questions for that exact role and company, a calibrated comp benchmark, and (with your CV) an experience-gap analysis and an ATS resume check. The report is emailed to you as a PDF.

Ready to prep for a real role?

Paste any Data Engineer JD, meet your coach in under 30 seconds.

Drop a LinkedIn, Greenhouse, Lever, or Levels.fyi link, or paste the JD text. Your coach predicts the questions for that company, surfaces your specific experience gaps, and calibrates a compensation benchmark to the role and location. PDF emailed to you. Voice practice with AI feedback on each answer lives in the Chrome extension.

Free to start · Free reports + first mock free · Paid plans from $3.99

Data Engineer Interview Prep — Calibrd