Scoring methodology

What the score feels like

| Range | Label | Means |
| --- | --- | --- |
| 90–100 | Excellent | Near-perfect reliability. Brief, infrequent blips at most. |
| 75–90 | Good | Normal for a busy production service. Occasional minor incidents. |
| 50–75 | Noticeable | Regular disruption. Real user impact on a weekly cadence. |
| 25–50 | Rough | Multiple serious incidents or a sustained major outage. |
| 0–25 | Broken | Effectively unreliable. Day-long criticals or a flood of failures. |

The four things we measure

Reliability comes down to a handful of distinct questions: was the service up, how often did it break, how bad was each break, and how fast was it fixed? Each answer is scored on a 0–100 scale, and the four scores are weighted into one composite number.

1. Uptime (50% weight) — was the service available?

The base rate. Uptime % = 100 × (1 − (critical_min + 0.3 × major_min) ÷ window_min). That's the same formula Atlassian Statuspage uses on Claude's, OpenAI's, and DeepSeek's own status pages: full weight for outages a user sees as “down”, 30% weight for partial outages a user sees as “degraded”. Minors and scheduled maintenance don't count. Overlapping incidents are merged so they don't double-bill the same minute.
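The merge-then-weight step can be sketched as below. This is a minimal illustration, not the site's actual code: the `(start_min, end_min, impact)` tuple shape and the function name are hypothetical, and the rule that the worst tier wins when a critical and a major overlap is an assumption (the text only says overlapping incidents are merged).

```python
# Downtime weight per impact tier: full weight for critical, 0.3 for major.
# Minor incidents and maintenance are excluded, so they get weight 0.
WEIGHTS = {"critical": 1.0, "major": 0.3}


def uptime_percent(incidents, window_min):
    """incidents: (start_min, end_min, impact) tuples, minutes from window start."""
    # Clip each incident to the window and drop the ones that carry no weight.
    spans = []
    for start, end, impact in incidents:
        start, end = max(0.0, start), min(float(window_min), end)
        weight = WEIGHTS.get(impact, 0.0)
        if end > start and weight > 0:
            spans.append((start, end, weight))

    # Cut the timeline at every incident boundary. Within each slice the set
    # of active incidents is constant, so each minute is billed exactly once.
    points = sorted({p for start, end, _ in spans for p in (start, end)})
    weighted_down = 0.0
    for lo, hi in zip(points, points[1:]):
        active = [w for start, end, w in spans if start <= lo and end >= hi]
        if active:
            # Assumption: the worst active tier wins for an overlapping minute.
            weighted_down += (hi - lo) * max(active)

    return 100.0 * (1.0 - weighted_down / window_min)
```

For a 30-day window (43,200 minutes), a 60-minute critical that overlaps the first half of a 60-minute major bills 60 full-weight minutes plus 30 minutes at 0.3, so 69 weighted minutes in total rather than 78.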

The factor itself isn't the raw percent, though: it's the percent graded on a “nines of reliability” curve. 99.9% (three nines) and 99.99% (four nines) sound nearly identical in plain numbers, but the second means 10× less downtime per year (~8.8 hours vs. ~53 minutes). A linear percent scale would squash that gap; the nines curve separates providers the way users actually feel them:

| Uptime | Nines | Factor |
| --- | --- | --- |
| 99.99%+ | four nines | 100 |
| 99.9% | three nines | |
| 99.5% | ~2.3 nines | |
| 99.0% | two nines | |
| 95.0% | ~1.3 nines | |
| < 90.0% | < 1 nine | 5 (floor) |

2. Frequency (20% weight) — how often does it break?

The same uptime can come from very different incident shapes. One four-hour outage and twenty 12-minute blips both burn ~0.55% of a month, but the second feels much worse. The frequency factor counts incidents and decays exponentially: 100 × exp(−count ÷ 100) at the 30-day window. 100 incidents lands around 37/100; 200 lands near 14. The factor never reaches zero, so even very noisy windows stay distinguishable from one another.
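The decay is a one-liner. The sketch below uses the per-window e-fold constants listed under “The formula”; the function and dictionary names are hypothetical.

```python
import math

# Incident count at which the factor drops to 1/e (about 37), per window.
E_FOLD = {"30d": 100, "7d": 40, "1h": 6}


def frequency_factor(incident_count, window="30d"):
    """Exponential decay in incident count; approaches but never reaches zero."""
    return 100.0 * math.exp(-incident_count / E_FOLD[window])
```

On the 30-day window, the single four-hour outage scores about 99.0 here, while the twenty 12-minute blips score about 81.9, even though both burn the same downtime.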

3. Severity (20% weight) — how bad was each one?

Frequency treats every incident as one. Severity weights each by duration × impact and decays the sum exponentially. We use the provider's own classification verbatim:

Minor ×0.5 — “elevated errors”, most users unaffected.
Major ×2 — partial outage, real user impact.
Critical ×6 — service effectively unavailable.

Each incident's contribution is duration-capped per impact tier so a stuck-open minor tag can't torpedo the score, but a multi-day critical actually feels catastrophic in the math:

| Impact | Cap | Why |
| --- | --- | --- |
| Minor | 6h | Often left open after the problem is actually gone. |
| Major | 12h | Usually fixed within a business day. |
| Critical | 48h | A 2-day critical should look catastrophic in the score, and now does. |
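Putting the multipliers and caps together, a rough sketch of the severity factor looks like this. The constants come from this page (including the `scale` values under “The formula”); the `(duration_min, impact)` tuple shape and the names are hypothetical.

```python
import math

CAP_MIN = {"minor": 6 * 60, "major": 12 * 60, "critical": 48 * 60}
MULT = {"minor": 0.5, "major": 2.0, "critical": 6.0}
SCALE = {"30d": 3000, "7d": 1500}


def severity_minutes(incidents):
    """incidents: (duration_min, impact) pairs. Cap each duration, then weight it."""
    return sum(min(duration, CAP_MIN[impact]) * MULT[impact]
               for duration, impact in incidents)


def severity_factor(incidents, window="30d"):
    """Exponential decay in total severity-minutes for the given window."""
    return 100.0 * math.exp(-severity_minutes(incidents) / SCALE[window])
```

A minor left open for five days is capped at 6h and contributes only 180 severity-minutes (factor about 94), while a three-day critical is capped at 48h and contributes 17,280 (factor below 1).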

4. Time to recover (10% weight) — how fast was the fix?

Same uptime, same incident count, same severity mix — but one provider resolves in 12 minutes and the other limps for 90. MTTR captures the gap. We use the median incident duration (not the mean) so one anomalous all-day outage doesn't dominate — that's already what severity is measuring. A 120-minute median zeroes the factor; 30 minutes lands at 75. Ongoing incidents count toward the median using their elapsed time, so an unresolved 7-hour outage doesn't get a free pass while we wait for it to close.
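As a sketch, the MTTR factor is a median followed by a linear ramp. The function name and argument split are hypothetical, and returning 100 when there are no incidents at all is an assumption the text doesn't cover.

```python
from statistics import median


def mttr_factor(resolved_min, ongoing_elapsed_min=()):
    """Linear ramp on the median incident duration; 120 minutes or more gives 0."""
    # Ongoing incidents join the sample using their elapsed time so far.
    durations = list(resolved_min) + list(ongoing_elapsed_min)
    if not durations:
        return 100.0  # assumption: nothing to recover from means a perfect factor
    return max(0.0, 100.0 - median(durations) / 1.2)
```

With durations of 10, 12, 15, and 600 minutes the median is 13.5, so the factor is 88.75; the mean (about 159 minutes) would have zeroed it.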

Worked example

A provider spent 30 days at 99.9% uptime, had 15 incidents (mostly minor, one 3-hour major), and resolved the median incident in 45 minutes.

| Factor | Calc | Score | × weight |
| --- | --- | --- | --- |
| Uptime (nines) | 99.9% = 3 nines | | = 33.5 |
| Frequency | exp(−15/100) | | = 17.2 |
| Severity | one 3h major ≈ 360 sev-min | | = 17.8 |
| Time to recover | 45-min median | | = 6.3 |
| Composite | | | 74.8 / 100 |

The formula

score = 0.5 × uptime_factor
      + 0.2 × frequency_factor
      + 0.2 × severity_factor
      + 0.1 × mttr_factor

uptime_factor    = clamp(5, 100, sqrt((nines − 1) ÷ 3) × 100)
                   nines = −log10(1 − uptime% / 100)
                   (below 90% → floor 5; 4 nines → 100)

uptime%          = 100 − (critical_min + 0.3 × major_min) / window_min × 100
                   (Statuspage formula: minors + maintenance excluded)

frequency_factor = 100 × exp(−incident_count ÷ e_fold)
                   e_fold[30d] = 100, e_fold[7d] = 40, e_fold[1h] = 6

severity_minutes = Σ min(duration, cap[impact]) × impact_mult
                   cap: minor 6h · major 12h · critical 48h
                   mult: minor ×0.5 · major ×2 · critical ×6

severity_factor  = 100 × exp(−severity_minutes ÷ scale)
                   scale[30d] = 3000, scale[7d] = 1500

mttr_factor      = max(0, 100 − median_resolve_minutes ÷ 1.2)
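The top-level blend can be sketched as a plain weighted sum. The names below are hypothetical. The factor values in the usage check are back-computed from the worked example's weighted column (for instance 33.5 ÷ 0.5 = 67), not produced by the factor formulas, so treat them as illustrative.

```python
WEIGHTS = {"uptime": 0.5, "frequency": 0.2, "severity": 0.2, "mttr": 0.1}


def composite_score(factors):
    """factors: dict with one 0-100 value per key in WEIGHTS."""
    missing = WEIGHTS.keys() - factors.keys()
    if missing:
        raise ValueError(f"missing factors: {sorted(missing)}")
    return sum(WEIGHTS[name] * factors[name] for name in WEIGHTS)


# Factor values back-computed from the worked example's weighted contributions.
worked = {"uptime": 67.0, "frequency": 86.0, "severity": 89.0, "mttr": 63.0}
print(round(composite_score(worked), 1))  # 74.8
```

Because the weights sum to 1, the composite stays on the same 0–100 scale as the individual factors.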

Severity is per-provider

We read each provider's severity classification verbatim and never escalate or de-escalate it.

| Provider | Source severity tier | Mapped to |
| --- | --- | --- |
| Claude | Statuspage minor | minor |
| Claude | Statuspage major | major |
| Claude | Statuspage critical | critical |
| OpenAI | Statuspage minor | minor |
| OpenAI | Statuspage major | major |
| OpenAI | Statuspage critical | critical |
| Mistral | MINOR | minor |
| Mistral | MEDIUM (degraded) | minor |
| Mistral | MAJOR | major |
| Mistral | CRITICAL | critical |
| Grok (xAI) | (no field; title keywords) | inferred |
| DeepSeek | Statuspage minor | minor |
| DeepSeek | Statuspage major | major |
| DeepSeek | Statuspage critical | critical |
| Kimi (Moonshot) | Statuspage critical (only tier the operator uses) | critical |
| Cohere | incident.io minor | minor |
| Cohere | incident.io major | major |
| Cohere | incident.io critical | critical |
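In code, the table amounts to a small lookup. The sketch below is illustrative only: the provider keys and function name are hypothetical, and Grok is deliberately left unimplemented because the title keywords used for inference aren't listed on this page.

```python
# Providers whose feeds already use the minor / major / critical vocabulary.
PASS_THROUGH = {"claude", "openai", "deepseek", "kimi", "cohere"}
TIERS = ("minor", "major", "critical")

# Mistral uses its own four-level scale; MEDIUM (degraded) folds into minor.
MISTRAL_TIERS = {
    "MINOR": "minor",
    "MEDIUM": "minor",
    "MAJOR": "major",
    "CRITICAL": "critical",
}


def map_severity(provider, source_tier):
    """Return the internal tier for a provider's own severity label."""
    if provider in PASS_THROUGH and source_tier in TIERS:
        return source_tier  # read verbatim, no escalation or de-escalation
    if provider == "mistral" and source_tier in MISTRAL_TIERS:
        return MISTRAL_TIERS[source_tier]
    # Grok has no severity field; keyword inference is out of scope for this sketch.
    raise ValueError(f"no mapping for {provider!r} tier {source_tier!r}")
```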

Data sources

One feed per provider, fetched on page visit (cached for 60s). No background crawler, no independent probing.

| Provider | Endpoint | Format |
| --- | --- | --- |
| Claude (Anthropic) | https://status.claude.com/api/v2/summary.json | Statuspage v2 JSON |
| OpenAI | https://status.openai.com/api/v2/summary.json | Statuspage v2 JSON |
| Mistral (Mistral AI) | https://status.mistral.ai/_payload.json | Checkly SSR JSON |
| Grok (xAI) | https://status.x.ai/feed.xml | Instatus RSS |
| DeepSeek | https://status.deepseek.com/api/v2/summary.json | Statuspage v2 JSON |
| Kimi (Moonshot AI) | https://status.moonshot.cn/api/v2/summary.json | Statuspage v2 JSON |
| Cohere | https://status.cohere.com/api/v2/summary.json | incident.io (Statuspage-compatible) |
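The fetch-on-visit behaviour with a 60-second cache could look roughly like this. It is a sketch under assumptions: the class name is hypothetical, the `fetch` callable (URL in, parsed payload out) is injected rather than shown, and the real caching layer may sit elsewhere entirely.

```python
import time


class FeedCache:
    """Tiny in-memory TTL cache keyed by feed URL (illustrative, not thread-safe)."""

    def __init__(self, fetch, ttl_s=60.0, clock=time.monotonic):
        self._fetch = fetch    # hypothetical callable: url -> parsed payload
        self._ttl_s = ttl_s
        self._clock = clock    # injectable so the expiry logic can be tested
        self._store = {}       # url -> (fetched_at, payload)

    def get(self, url):
        now = self._clock()
        hit = self._store.get(url)
        if hit is not None and now - hit[0] < self._ttl_s:
            return hit[1]      # still fresh: no upstream request
        payload = self._fetch(url)
        self._store[url] = (now, payload)
        return payload
```

With this shape, any number of page visits inside one minute cost each provider at most one request, which fits the no-background-crawler design.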

What we don't do

Ping provider APIs ourselves — if a provider's status page is wrong, so are we.
User accounts or alerts.
Latency or answer-quality benchmarks. Reliability only.
