Last reviewed 2026-05-30

The first hour of an incident

The first hour after an alert is the one most often re-litigated in the post-incident review. It is also the hour most often handled badly — not because the operator was incompetent, but because the muscle memory people most need in that hour is the muscle memory most rarely trained for.

What follows is the version we wish we had read before we needed it. It applies to small-team environments — companies of fewer than fifty people with on-call rotations of one to four engineers. Larger organisations will have an incident commander, a communications lead, a technical lead, and a published runbook. They should follow that runbook. This piece is for everyone else.

The first ten minutes — establish what kind of incident this is

The most expensive mistake in the first ten minutes is to start fixing before you know what you are fixing.

Three questions, in order, before anything else.

Is this real? Not every alert is an incident. Many are noise from a flaky monitor, a stale threshold, or a deployment side-effect. Spend two minutes confirming the signal independently. Look at a second source: a dashboard, a customer-facing endpoint, a separate log query. If you can confirm the symptom from a second source, you know it is real and the case for acting is settled. If you cannot confirm it within two minutes, do not dismiss the alert; lower your bar for declaring an incident, because uncertainty itself is a risk factor.

Is this an outage or a compromise? An outage means something stopped working; a compromise means something is working but should not be: an account that should be locked is accessing data, a process that should not be running is running, an outbound connection that should not exist is exfiltrating. The first response to an outage is to restore the service. The first response to a compromise is the opposite: preserve evidence and isolate, then restore. Treating a compromise like an outage is the most common first-hour mistake we see; it destroys the evidence trail and widens the attacker's window to vanish.

Who needs to know in the next ten minutes? Not who needs to know eventually — who needs to know now. For most small teams the answer is one person: the other on-call engineer, or the founder if it is solo on-call. Do not start a Slack channel of twelve people yet. The first ten minutes are for the responder and one backstop, no more.

Minutes ten to thirty — open the log

Before you do anything to the affected system, open a log. The log is the single most important artefact the post-incident review will read. It is also the document most often started thirty minutes late, when the responder realises they will not remember the sequence of actions they have already taken.

The log can be a plain text file. It can be a markdown file on the responder's laptop. It can be a private Slack channel that gets exported later. The format does not matter. What matters is:

Every entry has a timestamp. Local time is fine if the team is in a single time zone; UTC is better if several time zones are involved. Consistency matters more than precision: "14:23" is fine; "14:23:17" is fine; what is not fine is timestamps that drift relative to each other because they were added retrospectively.
Every entry says what was done, not what was concluded. "Ran kubectl get pods -n prod; observed pod web-7 in CrashLoopBackOff" is a log entry. "Web pod crashed" is not. The difference matters because the post-incident review needs to retrace the actual observations, not the conclusions drawn from them. Conclusions get refined as the incident unfolds; observations do not change.
Every decision has a brief rationale. "Decided to restart pod web-7 (CrashLoopBackOff for 12 min, business hours, expected fast)" is enough. The rationale matters because the post-incident review will ask "why did we not isolate first instead?" and "we were 12 minutes in, business hours, expected fast" is a defensible answer; silence is not.

Start the log first. Add to it as you go. Resist the temptation to say "I'll write it up afterwards"; afterwards is when the sequencing falls apart.
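The three rules above can be sketched as a small append-only logger. This is a minimal sketch, not a prescribed tool: the names format_entry and log_entry, the file name incident-log.txt, and the line layout are all assumptions made for illustration. It stamps each entry at the moment of writing, in UTC, and refuses a decision that has no rationale.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

# Hypothetical file name; any plain text file works.
LOG_PATH = Path("incident-log.txt")
ENTRY_KINDS = {"ACTION", "OBSERVATION", "DECISION"}


def format_entry(kind: str, text: str, rationale: Optional[str] = None,
                 now: Optional[datetime] = None) -> str:
    """Build one log line: UTC timestamp, entry kind, what was done or seen."""
    if kind not in ENTRY_KINDS:
        raise ValueError(f"unknown entry kind: {kind}")
    if kind == "DECISION" and not rationale:
        # Every decision carries a brief rationale.
        raise ValueError("a DECISION entry needs a rationale")
    # Stamp at write time; a timezone-aware `now` is converted to UTC.
    moment = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    line = f"{moment:%Y-%m-%d %H:%M:%S}Z {kind}: {text}"
    if rationale:
        line += f" (rationale: {rationale})"
    return line


def log_entry(kind: str, text: str, rationale: Optional[str] = None,
              path: Path = LOG_PATH) -> str:
    """Append the entry immediately; earlier lines are never edited."""
    line = format_entry(kind, text, rationale)
    with path.open("a", encoding="utf-8") as handle:
        handle.write(line + "\n")
    return line
```

A call such as log_entry("ACTION", "Ran kubectl get pods -n prod; observed pod web-7 in CrashLoopBackOff") records the observation rather than the conclusion. Appending rather than rewriting is what keeps the timestamps from drifting.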

Minutes thirty to forty-five — the containment decision

By thirty minutes in, you should have enough signal to make the containment decision. The containment decision is the most consequential decision in the first hour, and it is the decision most often deferred because every option is uncomfortable.

The three options:

Do nothing yet, observe more. Defensible when the impact is small, the symptom is contained to a non-customer-facing system, and you do not yet understand the cause well enough to act. Not defensible when the impact is customer-facing and getting worse, or when you suspect a compromise — observation while an attacker exfiltrates is not a strategy.

Isolate the affected component. For an outage: take the failing service out of rotation. For a compromise: cut the affected machine off the network, but do not power it off (powering off destroys volatile evidence in memory). Isolation is the right answer more often than people pick it, because a mistaken isolation is reversible (you can put the service back in rotation) while a mistaken "let it run and see" is not (lost data and prolonged compromise are not reversible).

Fail over to the redundant system. Defensible if you actually have a tested failover. Be honest with yourself about this. A failover that has not been tested in the last six months is a hypothesis, not a plan. If your only failover is theoretical, isolate instead and accept the downtime.

The decision should be made by the responder. If there is genuine uncertainty between two options, escalate to the backstop person — but make the call within forty-five minutes of the alert. Deferring the containment decision past forty-five minutes is itself a decision, and almost always the wrong one.
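The three options can be written down as a small decision function, which is a useful thing to argue over in a tabletop. This is one possible reading of the guidance above, not a rule the text prescribes: the function name, the parameter names, the ordering of the checks, and the 183-day approximation of "six months" are all assumptions.

```python
from enum import Enum
from typing import Optional

SIX_MONTHS_DAYS = 183  # assumption: "six months" approximated in days


class Containment(Enum):
    OBSERVE = "do nothing yet, observe more"
    ISOLATE = "isolate the affected component"
    FAIL_OVER = "fail over to the redundant system"


def choose_containment(suspected_compromise: bool,
                       customer_facing: bool,
                       impact_small: bool,
                       getting_worse: bool,
                       cause_understood: bool,
                       days_since_failover_test: Optional[int]) -> Containment:
    """Return a suggested containment option; the responder still decides."""
    if suspected_compromise:
        # Observation is not a strategy here: isolate, and do not power off.
        return Containment.ISOLATE
    if (impact_small and not customer_facing
            and not getting_worse and not cause_understood):
        return Containment.OBSERVE
    if (days_since_failover_test is not None
            and days_since_failover_test <= SIX_MONTHS_DAYS):
        # Only a recently tested failover counts as a plan.
        return Containment.FAIL_OVER
    # An untested failover is a hypothesis: isolate and accept the downtime.
    return Containment.ISOLATE
```

Passing None for days_since_failover_test means the failover has never been tested, which falls through to isolation.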

Minutes forty-five to sixty — communications and the clock check

By forty-five minutes, the containment action is in flight or has completed. Now: who needs to know?

The mistake here is to over-communicate too early and then under-communicate later. The correct pattern is the inverse: under-communicate in the first hour (one short factual update to the backstop and one short factual update to any customers who are clearly affected) and over-communicate in the second hour and beyond (regular updates as the picture clarifies).

A first-hour customer communication, if needed, is two sentences. "We are aware of an issue affecting [system]. We will post an update by [time, 30 minutes hence]." That is the entire message. No speculation. No promises of root cause. No apologies that pre-judge fault. The discipline to keep the first communication short is what makes the second communication credible.
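The two-sentence message is simple enough to template in advance, so that nobody drafts it under pressure. A minimal sketch, assuming timestamps are kept in UTC; the function name first_hour_message and its parameters are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def first_hour_message(system: str, now: datetime,
                       minutes_until_update: int = 30) -> str:
    """Return the two-sentence notice: what is affected, when the next update is due."""
    next_update = now.astimezone(timezone.utc) + timedelta(minutes=minutes_until_update)
    return (f"We are aware of an issue affecting {system}. "
            f"We will post an update by {next_update:%H:%M} UTC.")


if __name__ == "__main__":
    print(first_hour_message("the customer dashboard", datetime.now(timezone.utc)))
```

The template has no slot for a cause or an apology, which is the point: the only things it can say are the affected system and the next update time.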

Then check the regulatory clocks. Three clocks that may have started without you noticing:

The General Data Protection Regulation 72-hour clock (Article 33). Starts at "awareness" — the moment of reasonable certainty that a breach affecting personal data has occurred. Note: awareness, not discovery, not incident time. If the incident touches personal data of European Union data subjects, the clock has likely already started; write down the time you became aware.

The European Union Cyber Resilience Act 24-hour clock (Article 14). Starts at awareness of an actively exploited vulnerability in a product placed on the European Union market. Notification goes to the Computer Security Incident Response Team of record. Followed by a 72-hour update. If you ship software to the European Union and this is a product-security incident, note the date: this duty applies from 11 September 2026 — before then it is one to prepare for, not yet a running legal clock.

Your cyber-insurance carrier's notification window. Read your policy. Many require notification within a small number of hours (some 24, some 72) and may invalidate coverage if notification is late. The hour to find out what your policy requires is not now; it is during a calm afternoon. If you have not done that work in advance, do it as soon as the immediate incident is contained.

For most small teams, most incidents start none of these clocks. But some incidents start them for any team, however small. The discipline is to ask the question every time rather than only when the answer feels obvious.
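Once the awareness time is written down, the deadlines are plain arithmetic. The sketch below covers only that arithmetic, using the windows and the 11 September 2026 start date given above; whether a clock has actually started is a judgement the code cannot make. The function name, the parameter names, and the dictionary keys are assumptions, and the insurer window must come from your own policy.

```python
from datetime import date, datetime, timedelta, timezone
from typing import Dict, Optional

# Date from which the CRA reporting duty applies, as stated in the text.
CRA_REPORTING_START = date(2026, 9, 11)


def regulatory_deadlines(aware_at: datetime,
                         personal_data_breach: bool,
                         exploited_product_vulnerability: bool,
                         insurer_window_hours: Optional[int] = None
                         ) -> Dict[str, datetime]:
    """Map each running clock to its deadline, counted from the awareness time."""
    deadlines: Dict[str, datetime] = {}
    if personal_data_breach:
        deadlines["GDPR Art. 33 notification"] = aware_at + timedelta(hours=72)
    if exploited_product_vulnerability and aware_at.date() >= CRA_REPORTING_START:
        deadlines["CRA Art. 14 24-hour notification"] = aware_at + timedelta(hours=24)
        deadlines["CRA Art. 14 72-hour update"] = aware_at + timedelta(hours=72)
    if insurer_window_hours is not None:
        # Policy-specific: read the policy, do not guess this number.
        deadlines["Insurer notification"] = aware_at + timedelta(hours=insurer_window_hours)
    return deadlines
```

An empty result is a valid answer: it means the question was asked and, on the inputs given, no clock is running.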

What not to do in the first hour

A short list, drawn from incident reviews that did not go well.

Do not "just try a few things to see if it fixes itself." Trial-and-error in production is rarely diagnostic and frequently destructive. Each undocumented action makes the post-incident review harder and may make the symptom harder to reproduce.

Do not call the all-hands meeting. A twelve-person Slack channel staring at the responder makes the responder slower, not faster. Two people on the operational thread is the right size for the first hour. Bring more people in once the picture is clearer and there is parallelisable work to hand them.

Do not promise a root cause to anyone in the first hour. You do not know it. Any promise made in the first hour will be revisited and held against you. Hedge in writing: "current hypothesis is X; we will update by [time]."

Do not power off the machine you suspect is compromised. Isolate by removing it from the network instead. Volatile evidence in memory disappears when the machine powers down; the attacker's process tree, network connections, and recently-cleared shell history are all in memory.

Do not start the incident-review document in the first hour. Start the log; that is enough. The incident review is a separate artefact written calmly afterwards from the log. Mixing them produces a document that is neither a good operational log nor a good review.

What good looks like at the sixty-minute mark

At the sixty-minute mark, a well-handled incident has:

A timestamped log of every observation, action, and decision so far.
A containment action either in flight or completed.
One short factual communication to the backstop person, and one to any clearly affected customers.
A documented current hypothesis about what is going on, written as a hypothesis not a conclusion.
A confirmed answer to the question "has any regulatory clock started?" — yes or no, with the awareness timestamp recorded if yes.
A plan for the next sixty minutes, written down, agreed with the backstop, time-boxed.

That is the bar. Most incidents in a well-run small team will clear that bar with the responder still operating calmly. Most incidents in a poorly-prepared team will not, not because the team is bad but because nobody trained for the first hour.
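The sixty-minute bar can be kept as a checklist the responder ticks off, so that gaps are visible before the second hour begins. A minimal sketch; the class name, the field names, and the helper missing_items are hypothetical, chosen to mirror the six items above (with the two communications split into separate fields).

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class SixtyMinuteCheck:
    timestamped_log: bool = False
    containment_started: bool = False
    backstop_updated: bool = False
    # Set to True when no customers are clearly affected.
    affected_customers_updated: bool = False
    hypothesis_written: bool = False
    regulatory_clocks_answered: bool = False
    next_hour_plan_agreed: bool = False


def missing_items(check: SixtyMinuteCheck) -> List[str]:
    """Return the names of the items not yet done, in checklist order."""
    return [f.name for f in fields(check) if not getattr(check, f.name)]
```

An empty list from missing_items means the incident has cleared the bar; anything else is the start of the plan for the next sixty minutes.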

The training problem

The reason the first hour is so often handled badly is that nobody trains for it. The way people get better at the first hour of an incident is to practise the first hour of an incident — in a controlled tabletop exercise, on a quiet afternoon, with no real production system at risk.

A useful tabletop format: pick a recent post-incident report from a similar-sized company (many are public). Read it together as a team. Stop at the point of the initial alert. Ask: "what would we do in the next five minutes? Next fifteen? Next forty-five?" Have each person write their answer privately. Then compare. The variance in the room will surprise you. The variance is the muscle memory gap. The tabletop is what closes it.

A team that runs one tabletop a quarter will be markedly better at the first hour than a team that runs none.

When you want this ready to use

Sylvan Assurance's First 4 Hours Incident Response toolkit ships an opinionated written first-hour playbook for small teams, a more detailed commander-tier playbook for teams with designated incident commanders, and a PSIRT CRA-Ready edition that adds the EU Cyber Resilience Act reporting clocks and notification templates. All editions include the step-by-step first-hour runbook, the escalation tree, the printable log template, the pocket reference card, and the Ten Commandments — ten things to do and ten things to never do — printable artefacts for the muscle memory described above. Editions from $49.

The free First 4 Hours triage at sylvanassurance.com/free/first-4-hours asks five quick questions about a current or hypothetical incident and returns a tailored two-page Battle-Card. It runs entirely in your browser. Your answers are never transmitted.

Prefer the long form? The companion Sylvan Press title, The First 4 Hours, covers the same ground in depth.
