L4 / IC3 · 3–6 years

Site Reliability Engineer interview prep, what to expect

5 rounds4–6 weeks9 sample questions$160–195k base

If you're prepping for an SRE loop, expect it systems-heavy, less algorithmic coding than pure SWE and more focus on debugging, observability, and production operations. The bar is whether you can keep a complex distributed system running, not just whether you can build one.

A typical L4 SRE loop is: recruiter screen, a coding round (often combining algorithms with systems / scripting), a live debugging round, a system-design round centred on reliability (multi-region, observability, failure modes), and behavioural with hiring manager. FAANG SRE loops typically run 5–7 rounds and stay close to the SWE coding bar; smaller companies compress to 3–4 rounds with more open-ended discussion.

The L4 bar is owning the operational health of a service or sub-system: on-call rotation, incident response, observability, automation.

Personalised version

This guide covers the general bar at SRE. The Chrome extension runs the same prep on every JD you open, predicted questions for that company, voice practice with your AI coach on each answer, comp benchmark, gap analysis, plus cover-letter and intro drafts. Free to install with a preview on every posting; unlock the full report from $3.99. Or run a one-off scan on a single JD without installing.

Start free in your browser →Add to ChromeOr scan one JD →

2026 update

This guide covers the general bar at SRE. A few things have changed in 2026, AI is now allowed in coding rounds at Canva and Meta, detection has improved at companies that still ban it, comp has split at staff+, and the post-onsite wait got longer. Read what changed in 2026 →

What you'll be expected to do

Typical interview process

Most companies follow a similar shape for SRE interviews. Total calendar time: 4–6 weeks from recruiter screen to offer.

01
Recruiter screen
30-min phone call
Background, role calibration, on-call experience, motivation
02
Coding screen
60-min
Algorithmic problem at LeetCode-medium level plus a systems / scripting question (e.g. "parse this log format and emit X"). Some loops do bash + Python combined
03
Debugging round
60-min
Live debugging on a real-ish broken system: read logs, trace, hypothesise, narrow down root cause. Some companies do this as a take-home with a Docker-compose stack
04
Reliability system design
60-min
Design an observability stack, a multi-region failover strategy, or a high-availability service. Probing on SLOs, error budgets, failure modes, blast radius
05
Behavioural / hiring manager
45-min
On-call war stories, incident response, partnership with product engineering, handling toil vs feature pressure
Bar chart of interview rounds by tech role for 2026, showing where SRE sits among comparable roles.
SRE runs 5 rounds. See where every role lands in the 2026 Tech Interview Report.

Sample questions you should be ready for

Representative of what companies ask at this level, not a complete list. Run the free scan above for predicted questions tied to a specific job posting. The Chrome extension adds voice practice with AI coaching on every answer (technical, system design, behavioural, motivation).

Technical / coding
  • Walk me through how you'd debug a service whose p99 latency just doubled, no obvious deploy correlation. What's your first 15 minutes?
  • Given a stream of structured log lines, walk me through how you'd compute the per-endpoint error rate over a 5-minute sliding window. Choose your language; cover what changes at 100k events/second.
  • Implement a rate limiter as a class in Python. Walk through how you'd extend it to handle distributed rate-limiting across 50 service instances.
System design
  • Design the observability stack for a 100-service microservices architecture. Cover metrics, logs, traces, and the SLO framework.
  • Design a deployment system that lets engineers ship to prod 50 times a day safely. Cover canaries, rollbacks, feature flags, and the blast-radius story.
  • Design a multi-region active-active service with sub-100ms p99 latency. Walk through replication, failover, and what breaks when a region goes down.
Behavioural (STAR method)
  • Tell me about an incident you led the response on. Walk through detection → diagnosis → mitigation → postmortem.
  • Describe a time you pushed back on a product team that wanted to ship something you thought was unsafe. How did the conversation go?
  • Tell me about a high-toil situation in your previous role. What did you do to reduce it?

Compensation benchmark

Median compensation for SRE at major US tech companies, headline numbers in USD. Pay in markets like London, Berlin and Singapore tends to be meaningfully lower in base terms, and equity ratios vary by company stage.

Base salary$160–195k (SF/NYC)
Equity (annual vest)$80–160k/yr
Bonus10–15%

FAANG L4 SRE total comp at 50th percentile is $260–340k. SRE typically tracks the SWE band at the same level, sometimes a small premium at companies where SRE is a separate ladder (Google, where SRE is a co-equal track).

How to prep, five tactical tips

Lead behavioural answers with the STAR method, Situation, Task, Action, Result. The tactical tips below build on that structure for this specific role.

  1. Drill 50+ LeetCode mediums focused on hash maps, graphs, and streams, the SRE coding bar is closer to SWE than DE
  2. Read the Google SRE Book (free online) cover-to-cover, it's the lingua franca of SRE interviews and gets referenced directly
  3. Practise reading logs and metrics fast. Spin up a local Prometheus + Grafana stack and break things deliberately to build debugging intuition
  4. Know one IaC tool deeply (Terraform or Pulumi) and one container orchestrator (Kubernetes) at the level of "I could write the manifests from scratch"
  5. Have 5–6 STAR stories with concrete incident details: detection time, time to mitigation, scope of impact, what changed in postmortem

Where SRE candidates fail

A few common mistakes that get SRE candidates rejected even when they're otherwise strong. Worth spotting in a mock interview before they show up in a real one.

01

Designing a reliability solution without naming SLOs, error budgets, or what "good enough" means.

Why it fails

SRE system design grades explicitly on whether you can reason about reliability as a budget, not an absolute. "Five nines" isn't an answer; the answer is "three nines availability, 200ms p99 latency, error budget of X minutes per quarter, here's how we'd spend it." Candidates who design for 100% uptime signal "haven't thought in SLO terms."

Fix

Open every reliability design by naming the SLOs you'd target and why. Then frame trade-offs against the error budget: "if we spend half the budget on this feature launch, that means we need to lock down everything else this quarter." The SLO framing is the SRE signal.

02

Doing the debugging round by guessing causes instead of forming hypotheses from the data.

Why it fails

Debugging rounds at SRE bar grade on whether you read the signal before you reach for the cause. Jumping to "it's probably the database" without checking metrics, logs, or recent changes signals "I'd waste an hour during a real incident chasing the wrong thing." The senior pattern is: check the obvious signal first (deploys, dashboards, error rate by endpoint), then narrow.

Fix

Structure debugging answers as a checklist: what changed recently (deploys, config), where's the signal narrowest (which endpoint, which region, which version), what would falsify each hypothesis. Even one or two minutes of structured triage at the start tells the interviewer you've done this in real incidents.

03

Describing past on-call work as "I respond to pages and fix issues" without specific incident detail.

Why it fails

SRE interviewers calibrate against incident ownership. Generic on-call descriptions tell them nothing. "We had a 40-minute outage last month: the alert fired at 02:14, I diagnosed it by 02:28 using the request-rate dashboard, mitigated by failing over to region us-east-2, RCA the next day pinned it on a config push" lets them peg you immediately.

Fix

For your top 3–4 incident stories, attach four numbers: detection time (when alert fired), diagnosis time (when you identified the issue), mitigation time, and scope of impact (users affected, dollars lost, SLA hit). Rough numbers beat no numbers.

Recommended resources

Books, courses, and tools that come up most often in SRE prep. No affiliate links.

Frequently asked questions

Is this guide useful if I'm a SWE moving into SRE, or coming from DevOps / Ops?

Yes, the L4 / IC3 bar described here applies whether you came from backend engineering, DevOps, or sysadmin / ops. SWE-to-SRE transitions usually have a strong coding base but need to build on-call and reliability intuition (SLOs, error budgets, blast radius). DevOps-to-SRE transitions have strong infrastructure skills but need to drill the coding round, which sits closer to the SWE bar than people expect.

How long should I prep before my SRE onsite?

The process takes 4–6 weeks. Add 6–8 weeks of prep, LeetCode mediums, the Google SRE Book, and one observability deep-dive are the highest-leverage. Don't underestimate the coding round.

What's the most common mistake candidates make at the SRE bar?

Under-investing in the coding round. Many candidates from DevOps backgrounds have great infra skills but get filtered at the coding screen because they assumed SRE = bash + Terraform. The coding bar is closer to SWE than DevOps; drill LeetCode mediums for 4+ weeks.

What if my interview process is different from what's listed?

Most variation is at the edges. Major tech companies (FAANG, scale-ups, mid-size SaaS) follow processes within 1–2 rounds of what's described. Smaller startups often run fewer rounds (3–4) but the bar at each round is similar; less-tech-mature companies sometimes skip system design or behavioural rounds entirely. Read the JD and ask the recruiter at the screen, they'll tell you what's coming.

How does this guide compare to running a free scan?

This guide covers the general bar at L4 / IC3. The free scan reads your specific job description and returns predicted questions for that exact role + company, a calibrated comp benchmark, and (with your CV) experience-gap analysis and an ATS resume check. PDF emailed.

Ready to prep for a real role?

Paste any SRE JD, meet your coach in under 30 seconds.

Drop a LinkedIn, Greenhouse, Lever, or Levels.fyi link, or paste the JD text. Your coach predicts the questions for that company, surfaces your specific experience gaps, and calibrates a compensation benchmark to the role and location. PDF emailed to you. Voice practice with AI feedback on each answer lives in the Chrome extension.

Free to start · Free reports + first mock free · Paid plans from $3.99

Site Reliability Engineer Interview Prep — Calibrd