The industry conversation about grid-scale battery storage reliability often happens at a comfortable level of abstraction: availability percentages, Mean Time Between Failures in marketing decks, vendor-supplied uptime guarantees. The actual incident record is more specific and more instructive. Published NERC reports, CPUC filings, and FERC incident dockets document what BESS outages actually cost — and the picture that emerges is different from the headline numbers in most operator risk models.
What "Outage Cost" Actually Comprises
Before looking at the incident data, it's worth being precise about what counts as an outage cost, because operators and vendors often use the term to mean different things.
A full accounting includes at least five cost categories that show up in different financial buckets and are rarely aggregated into a single incident figure:
- Direct revenue loss — capacity payments and energy arbitrage revenue foregone during the outage window, including any grid services contract penalties for non-performance during curtailment events.
- Emergency repair costs — labor, replacement components, and logistics for unplanned repair, typically running 2–4x the cost of equivalent planned maintenance due to response urgency and parts availability constraints.
- Asset damage — hardware replacement cost when a degradation event reaches a threshold requiring replacement of a cell module, a rack, or, in thermal events, a full enclosure. For thermal incidents, this category alone can reach $2M per affected enclosure.
- Regulatory and warranty compliance costs — NERC CIP reporting obligations, manufacturer warranty dispute preparation (which requires cell-level data that many operators don't have), and potential FERC/ISO penalties for market commitment failures.
- Opportunity cost on contracted capacity — for storage assets under long-term PPAs or ISO market commitments, the value of committed capacity that wasn't available differs from simple revenue loss calculations because it shapes the position available in future contract renegotiations.
The NERC Electricity Storage Integration Working Group's incident tracking shows that most operators report only the first two categories consistently. The others require more complete accounting that most operational teams don't have the systems to capture in real time.
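One way to make the gap concrete is to record all five categories against a single incident and compare the full figure with what a two-category report would show. The sketch below is illustrative only: the `IncidentCost` class, its field names, and the dollar amounts are hypothetical and are not taken from any filing.

```python
from dataclasses import dataclass, fields


@dataclass
class IncidentCost:
    """All five cost categories for one outage, in USD (hypothetical schema)."""
    direct_revenue_loss: float = 0.0
    emergency_repair: float = 0.0
    asset_damage: float = 0.0
    regulatory_and_warranty: float = 0.0
    contracted_capacity_opportunity: float = 0.0

    def total(self) -> float:
        # Sum every category so nothing stays in a separate financial bucket.
        return sum(getattr(self, f.name) for f in fields(self))

    def commonly_reported(self) -> float:
        # Only the first two categories, the ones operators report consistently.
        return self.direct_revenue_loss + self.emergency_repair

    def unreported_share(self) -> float:
        # Fraction of the full cost that a two-category report leaves out.
        total = self.total()
        return 0.0 if total == 0 else 1 - self.commonly_reported() / total


# Illustrative numbers only, not drawn from the incident record.
incident = IncidentCost(
    direct_revenue_loss=60_000,
    emergency_repair=40_000,
    asset_damage=50_000,
    regulatory_and_warranty=30_000,
    contracted_capacity_opportunity=20_000,
)
print(f"Full cost: ${incident.total():,.0f}")
print(f"Unreported share: {incident.unreported_share():.0%}")
```

The point of the structure is less the arithmetic than the forcing function: a record with five named fields makes an empty field visible, where a free-text incident note does not.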
The Incident Cost Distribution: Where the Money Accumulates
Analysis of publicly available BESS incident reports from CPUC, NERC, and state utility commission filings from 2019 through early 2025 reveals a cost distribution that contradicts the intuition that thermal runaway events dominate the cost picture.
Thermal events, the incidents that generate news coverage and regulatory attention, carry a disproportionately high per-incident cost. Average documented asset damage in the thermal runaway incidents with available cost data exceeds $1.5M per event. But thermal events are relatively infrequent. Looking at incidents by volume is more informative:
| Incident Category | Share of Incidents (NERC 2019–2024) | Typical Cost Range per Incident | Primary Cost Driver |
|---|---|---|---|
| Capacity fade / unplanned derate | ~38% | $80K–$400K | Revenue loss during derate window + emergency labor |
| BMS fault / protection trip | ~27% | $20K–$150K | Lost grid services revenue + restart time + investigation |
| Thermal management failure (non-runaway) | ~18% | $100K–$600K | Accelerated cell degradation + HVAC repair + preventive replacement |
| Thermal runaway / fire | ~7% | $500K–$3M+ | Asset replacement + regulatory penalties + investigation costs |
| Integration / communications failure | ~10% | $15K–$80K | Revenue loss + engineering investigation time |
Capacity fade and BMS protection events — together comprising roughly 65% of incidents by volume — tend to generate the most consistent drain on operations precisely because they're chronic rather than acute. An asset that derates by 8% over 18 months while operators assume it's performing to spec is accumulating revenue losses that don't show up as incidents in anyone's fault log.
The Invisible Cost: Undetected Degradation
The incident data captures events that crossed some threshold requiring a report or corrective action. It systematically misses the continuous performance shortfall from undetected degradation — which in our experience is often larger than the incident-driven cost category over any multi-year horizon.
A BESS asset contracted to deliver 50 MW of frequency regulation capacity that's actually delivering 46 MW due to cumulative capacity fade isn't experiencing incidents. It's experiencing persistent underperformance that may not be reported to ISO markets or reflected in capacity payments until a formal audit triggers a recalculation. In this example the shortfall is 8% of contracted capacity; depending on the contract structure, a gap like this can represent 8–14% of annual contracted revenue evaporating quietly.
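The arithmetic behind that example is simple enough to sketch. The function names and the $5M annual contract value below are hypothetical, and the revenue calculation assumes payments scale linearly with delivered capacity, which real contracts may not do.

```python
def capacity_shortfall_fraction(contracted_mw: float, delivered_mw: float) -> float:
    """Fraction of contracted capacity that is not actually being delivered."""
    if contracted_mw <= 0:
        raise ValueError("contracted_mw must be positive")
    return (contracted_mw - delivered_mw) / contracted_mw


def revenue_at_risk(annual_contract_revenue: float, shortfall: float) -> float:
    """Revenue exposed, assuming payment scales linearly with delivered capacity."""
    return annual_contract_revenue * shortfall


# The 50 MW contracted / 46 MW delivered case from the text.
shortfall = capacity_shortfall_fraction(contracted_mw=50, delivered_mw=46)
print(f"Shortfall: {shortfall:.1%}")

# Hypothetical $5M annual contract value, for illustration only.
print(f"Revenue at risk: ${revenue_at_risk(5_000_000, shortfall):,.0f}")
```

Contracts with step penalties or performance multipliers would push the exposure above the linear figure, which is one reason the revenue range quoted above extends past the raw capacity shortfall.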
The mechanism is almost always the same: the BMS reports state-of-charge and available capacity using parameters calibrated at commissioning, not updated to reflect actual cell health. As cells degrade, the gap between reported and actual capacity widens. Operators run dispatch decisions against the reported number. The actual delivered energy is less.
We've traced this pattern in data from three separate utility-scale BESS sites in the northeastern US grid region, where the gap between BMS-reported capacity and measured discharge capacity exceeded 9% after 18–24 months of operation without any incidents flagged. None of those sites had cell-level health monitoring. All three were under capacity payment contracts with ISO New England.
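A minimal sketch of how that comparison can be made, assuming a logged full-discharge test with timestamped power readings. The function names, the sample data, and the 9% flag threshold are illustrative assumptions, not a description of any particular site's tooling.

```python
from typing import Sequence


def measured_energy_mwh(times_h: Sequence[float], power_mw: Sequence[float]) -> float:
    """Integrate discharge power over time with the trapezoidal rule."""
    if len(times_h) != len(power_mw) or len(times_h) < 2:
        raise ValueError("need at least two matching time/power samples")
    energy = 0.0
    for i in range(1, len(times_h)):
        dt = times_h[i] - times_h[i - 1]
        energy += 0.5 * (power_mw[i] + power_mw[i - 1]) * dt
    return energy


def capacity_gap(reported_mwh: float, measured_mwh: float) -> float:
    """Fraction by which the BMS-reported capacity overstates the measured value."""
    return (reported_mwh - measured_mwh) / reported_mwh


# Hypothetical discharge test: 50 MW held for 3 hours, then a ramp to zero.
times_h = [0.0, 1.0, 2.0, 3.0, 3.5]
power_mw = [50.0, 50.0, 50.0, 50.0, 0.0]

measured = measured_energy_mwh(times_h, power_mw)
gap = capacity_gap(reported_mwh=180.0, measured_mwh=measured)  # 180 MWh is made up
print(f"Measured: {measured:.1f} MWh, gap vs BMS: {gap:.1%}")
if gap > 0.09:  # illustrative threshold, echoing the 9% figure above
    print("BMS-reported capacity needs recalibration")
```

Real comparisons also have to control for temperature, discharge rate, and depth of discharge before attributing the gap to degradation; the sketch ignores all three.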
Restart Time Economics: The Incident Log Doesn't Capture This
When a BESS is taken offline by a protection trip — which can be triggered by a single cell crossing a voltage or temperature threshold — the restart process takes time. That time doesn't usually appear in incident reports because no damage occurred, but the economic impact is real.
Restarting a grid-scale BESS after a protection trip requires: identifying the cell or module that triggered the trip, confirming it's safe to restart, clearing the fault in the BMS, verifying thermal conditions, and re-commissioning the dispatch relationship with the ISO or EMS. On sites with full cell-level monitoring, this process can take 45 minutes to 2 hours. On sites where operators can't identify the triggering cell quickly — which requires cell-level resolution telemetry — the process can take 4–8 hours, particularly if the fault isn't reproducible on demand and the investigation requires reviewing historical telemetry manually.
During that restart window, the asset is unavailable. For a site with a frequency regulation market commitment, a 4-hour unavailability window during peak grid stress can trigger non-performance charges that dwarf the cost of whatever maintenance action would have prevented the trip in the first place.
The calculation operators rarely run: a $15,000 preventive maintenance visit that catches a drifting cell before it causes a protection trip prevents a potential $80,000–$150,000 non-performance event on a high-value grid services day. The preventive visit is a cost center. The non-performance event is buried in contract penalties that get disputed over months. The two never appear in the same spreadsheet.
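Putting the two figures in the same place takes only a few lines. This is a simple expected-value sketch that assumes the visit fully prevents the trip; the function names and the 25% trip probability in the last lines are hypothetical.

```python
def break_even_trip_probability(visit_cost: float, event_cost: float) -> float:
    """Trip probability above which the preventive visit pays for itself."""
    if event_cost <= 0:
        raise ValueError("event_cost must be positive")
    return visit_cost / event_cost


def expected_net_benefit(trip_probability: float, visit_cost: float, event_cost: float) -> float:
    """Expected avoided loss minus visit cost, assuming the visit prevents the trip."""
    return trip_probability * event_cost - visit_cost


VISIT_COST = 15_000  # preventive visit cost from the text

for event_cost in (80_000, 150_000):  # non-performance range from the text
    p = break_even_trip_probability(VISIT_COST, event_cost)
    print(f"${event_cost:,} event: visit breaks even at {p:.1%} trip probability")

# Hypothetical 25% chance that the drifting cell trips protection on a high-value day.
benefit = expected_net_benefit(0.25, VISIT_COST, 80_000)
print(f"Expected net benefit at 25%: ${benefit:,.0f}")
```

At the low end of the penalty range the visit pays for itself if the chance of a trip is a little under one in five; at the high end, one in ten.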
Where Predictive Health Monitoring Changes the Cost Curve
The incident data points to three cost categories where early cell-health visibility has a measurable impact on outcomes.
First, unplanned derate events. Capacity fade events that are detectable 2–6 weeks before the BMS triggers an alert — through internal resistance rise, incremental capacity changes, or cell-to-cell divergence — can be converted from emergency responses to planned maintenance interventions. The cost difference is significant: planned cell module replacement runs $8K–$25K in labor per site visit; emergency responses with partial unavailability run $60K–$200K factoring in lost revenue.
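Of the three signals, cell-to-cell divergence is the easiest to illustrate. The sketch below uses a generic robust outlier test (a modified z-score built on the median absolute deviation) over a single snapshot of cell voltages. It is not any vendor's detection method, and the cell IDs, voltages, and 3.5 cutoff are illustrative assumptions; production monitoring would track trends over weeks rather than one snapshot.

```python
from statistics import median
from typing import Dict, List


def divergent_cells(cell_voltages: Dict[str, float], threshold: float = 3.5) -> List[str]:
    """Return IDs of cells whose voltage deviates strongly from the group median."""
    values = list(cell_voltages.values())
    med = median(values)
    # Median absolute deviation: a spread estimate that one bad cell cannot inflate.
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # All typical cells are identical, so any deviation at all stands out.
        return [cid for cid, v in cell_voltages.items() if v != med]
    scale = 1.4826 * mad  # makes MAD comparable to a standard deviation
    return [
        cid for cid, v in cell_voltages.items()
        if abs(v - med) / scale > threshold
    ]


# Hypothetical resting voltages (V) for six cells in one module.
snapshot = {
    "r1c01": 3.301, "r1c02": 3.299, "r1c03": 3.300,
    "r1c04": 3.302, "r1c05": 3.298, "r1c06": 3.262,
}
print(divergent_cells(snapshot))
```

The median-based spread matters here: a mean and standard deviation would be pulled toward the drifting cell and make it look less unusual than it is.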
Second, protection trip restart time. Cell-level fault identification that immediately surfaces the triggering cell address and probable cause can cut restart time from the 4–8 hours typical of sites without it to the 45-minute-to-2-hour range seen on well-instrumented sites. At the grid services revenue rates that matter for these assets ($50–$150/MWh for frequency regulation in northeastern US markets), one avoided 4-hour trip during a constrained grid period can justify a year of monitoring costs.
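As a rough sizing of what restart time is worth in foregone revenue alone, the sketch below multiplies capacity, hours offline, and the quoted rate. The 50 MW site size is a hypothetical carried over from the earlier example, and non-performance charges, which the text above suggests can be much larger, are deliberately excluded.

```python
def downtime_revenue_loss(capacity_mw: float, hours_offline: float, rate_per_mwh: float) -> float:
    """Revenue foregone while offline; excludes any non-performance charges."""
    return capacity_mw * hours_offline * rate_per_mwh


CAPACITY_MW = 50    # hypothetical site size
RATE_PER_MWH = 150  # upper end of the $50-$150/MWh range in the text

slow_restart = downtime_revenue_loss(CAPACITY_MW, 4, RATE_PER_MWH)
fast_restart = downtime_revenue_loss(CAPACITY_MW, 1, RATE_PER_MWH)

print(f"4-hour restart: ${slow_restart:,.0f}")
print(f"1-hour restart: ${fast_restart:,.0f}")
print(f"Difference:     ${slow_restart - fast_restart:,.0f}")
```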
Third, warranty recovery. Of the thermal runaway and severe degradation incidents with documented warranty claims, fewer than 40% result in successful warranty payment to the operator, primarily because operators cannot produce the cell-level evidence required to demonstrate that operation remained within manufacturer-specified parameters. Continuous cell telemetry with tamper-evident logging converts that evidence-assembly problem from a six-week manual project to a one-click export.
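"Tamper-evident" can be implemented in several ways; one common approach is a hash chain, where each record's hash covers the previous record's hash so that any later edit breaks every subsequent link. The sketch below is a minimal illustration using Python's standard `hashlib` and `json` modules. The record fields and function names are hypothetical, and this is not a claim about what any manufacturer accepts as evidence.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # placeholder "previous hash" for the first record


def _digest(prev_hash: str, payload: Dict) -> str:
    # Serialize with sorted keys so the same payload always hashes the same way.
    body = json.dumps(payload, sort_keys=True)
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_record(log: List[Dict], payload: Dict) -> Dict:
    """Append a telemetry record whose hash also covers the previous record's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {"payload": payload, "prev_hash": prev_hash, "hash": _digest(prev_hash, payload)}
    log.append(entry)
    return entry


def verify_chain(log: List[Dict]) -> bool:
    """Recompute every link; any edited, removed, or reordered record fails."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev_hash or entry["hash"] != _digest(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True


# Hypothetical cell telemetry records.
log: List[Dict] = []
append_record(log, {"cell": "r1c06", "ts": "2024-05-01T12:00:00Z", "temp_c": 41.5})
append_record(log, {"cell": "r1c06", "ts": "2024-05-01T12:01:00Z", "temp_c": 41.9})
print(verify_chain(log))

log[0]["payload"]["temp_c"] = 25.0  # simulate an after-the-fact edit
print(verify_chain(log))
```

A chain like this only shows that records are internally consistent. To rule out wholesale regeneration of the log, the latest hash would also need to be anchored somewhere the operator cannot rewrite, such as a third-party timestamping service.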
The data shows that BESS outage costs are not primarily a thermal safety problem. They're a chronic performance monitoring problem — and the largest cost category is the silent one that doesn't appear in any incident report.
Takeaways
The incident record is clear on a few points: the high-drama events (thermal runaway) are not the primary cost driver by volume. Chronic degradation and BMS protection trips together account for more total economic loss across the installed US fleet than thermal incidents do. The costs are underreported because they accumulate in different financial buckets (revenue, penalties, contract underperformance) that don't get aggregated into "outage cost" figures. And the business case for cell-level health monitoring rests not on preventing fires, but on converting the routine chronic losses into manageable planned maintenance costs.