
Thermal Runaway Precursor Detection: Moving from Alarm to Prediction in Utility BESS

The 2021 Arizona PSE facility fire accelerated BESS safety conversations. Three years on, predictive thermal precursor scoring — not just threshold alarms — is becoming the grid-scale safety standard.

Thermal imaging scan of battery cell array showing temperature gradient distribution

The Limitation of Threshold Alarms

Every BESS installation today has threshold-based alarms. Temperature exceeds 55°C — alarm triggers. Cell voltage drops below 2.5 V — alarm triggers. These thresholds are real and necessary; they provide a last-resort safety boundary. The problem is that by the time a threshold alarm fires, the thermal runaway precursor process is already well advanced. In the more severe NMC thermal runaway sequences studied under UL 9540A test protocols, the time from first threshold breach to full cell thermal runaway can be measured in seconds to a few minutes — too short for meaningful operator intervention in an automated system, too short for safe evacuation decisions in a manned facility.

Predictive precursor detection shifts the timeline. Instead of detecting the emergency when it starts, the goal is to detect the electrochemical and thermal conditions that make an emergency more probable — hours or days in advance. This requires not threshold monitoring but anomaly detection against a continuous background model of normal cell behavior. The three primary observable signals are voltage divergence across cells, temperature delta within a module, and impedance rise over time.

Voltage Divergence as an Early Warning Signal

In a healthy battery pack, cells within a string hold similar voltages during charge and discharge. Small differences in cell-to-cell capacity and internal resistance create minor voltage spread, typically 10–30 mV within a well-balanced string at rest. Larger voltage divergence points to one of three causes: cell-level degradation that is outpacing neighbors (accelerated capacity loss in specific cells), a developing internal short circuit, or a BMS balancing failure.

The pattern that matters for thermal runaway precursor detection is progressive, accelerating divergence in a specific cell or cell group. A cell developing an internal short circuit — from lithium dendrite penetration of the separator, from particle contamination, or from separator mechanical failure — will exhibit a characteristic voltage signature: the affected cell's voltage will drop more rapidly during discharge and recover more slowly during rest, creating a growing divergence from its neighbors. Early-stage internal shorts (sometimes called "soft shorts" — partial separator penetration with high but not zero internal resistance) may produce divergence of only 50–150 mV before progressing. At that level, BMS threshold alarms are not triggered. A statistical process control model — monitoring the rate of change of inter-cell voltage spread — can flag the anomaly weeks before a threshold alarm would fire.
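As a rough illustration of monitoring the rate of change of inter-cell voltage spread, the sketch below fits a least-squares slope to a string's daily spread and flags it when the growth rate exceeds a limit. The function names, the 7-day window, and the 2 mV/day limit are hypothetical placeholders for illustration, not recommended settings; a production statistical process control model would derive its limits from the pack's own history.

```python
from statistics import mean


def voltage_spread(cell_voltages):
    """Spread (max minus min) across one string's cell voltages, in volts."""
    return max(cell_voltages) - min(cell_voltages)


def slope(ys):
    """Least-squares slope of ys against sample index 0..n-1."""
    n = len(ys)
    x_bar = (n - 1) / 2
    y_bar = mean(ys)
    num = sum((i - x_bar) * (y - y_bar) for i, y in enumerate(ys))
    den = sum((i - x_bar) ** 2 for i in range(n))
    return num / den


def spread_trend_alert(daily_spreads_mv, window=7, limit_mv_per_day=2.0):
    """True when spread grew faster than the limit over the last `window` days.

    Both defaults are illustrative assumptions, not field-validated values.
    """
    if len(daily_spreads_mv) < window:
        return False  # not enough history to estimate a trend
    return slope(daily_spreads_mv[-window:]) > limit_mv_per_day


stable = [20, 21, 19, 20, 22, 20, 21]    # mV, ordinary day-to-day scatter
growing = [20, 24, 29, 35, 42, 50, 59]   # mV, progressive divergence
print(spread_trend_alert(stable), spread_trend_alert(growing))
```

Note that the second series ends at 59 mV, inside the 50–150 mV soft-short range where a fixed threshold would typically stay silent; the flag comes from the trend, not the level.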

The operational threshold for alerting is not fixed — it depends on baseline variance at that cell's historical operating point, the trend rate, and the chemistry. NMC cells at 95% SoC naturally show higher voltage spread than at 50% SoC; an anomaly detector that doesn't condition on SoC will generate false alarms at high SoC that reduce operator attention to the genuine anomalies. Good precursor models are SoC-conditioned, temperature-conditioned, and trained on the specific pack's baseline behavior rather than population averages.
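A minimal sketch of SoC conditioning, assuming a simple binned baseline: spread history is grouped into SoC bins, and a new reading is scored against the mean and standard deviation of its own bin. The 10-point bin width, the function names, and the sample numbers are assumptions made for illustration; temperature conditioning would add a second binning dimension in the same way.

```python
from collections import defaultdict
from statistics import mean, stdev


def soc_bin(soc_pct, width=10):
    """Map a state of charge (0..100 %) to a bin index; 100 % joins the top bin."""
    return min(int(soc_pct // width), 100 // width - 1)


def fit_baseline(history, width=10):
    """history: iterable of (soc_pct, spread_mv) pairs from this pack.

    Returns {bin_index: (mean_mv, std_mv)} for bins with at least 2 samples.
    """
    buckets = defaultdict(list)
    for soc, spread in history:
        buckets[soc_bin(soc, width)].append(spread)
    return {b: (mean(v), stdev(v)) for b, v in buckets.items() if len(v) >= 2}


def z_score(baseline, soc_pct, spread_mv, width=10):
    """Standardized deviation from the baseline for this SoC bin.

    Returns None when the bin has no usable baseline.
    """
    b = soc_bin(soc_pct, width)
    if b not in baseline:
        return None
    mu, sigma = baseline[b]
    if sigma == 0:
        return None
    return (spread_mv - mu) / sigma


# Hypothetical history: spread is naturally wider near 95 % SoC than near 50 %.
history = [(50, 14), (52, 16), (55, 15), (95, 38), (96, 42), (97, 40)]
baseline = fit_baseline(history)
print(z_score(baseline, 95, 42))  # ordinary for the high-SoC bin
print(z_score(baseline, 50, 42))  # the same spread is far outside the mid-SoC bin
```

The same 42 mV reading scores as unremarkable at 95% SoC and as a strong outlier at 50% SoC, which is the false-alarm distinction an unconditioned detector cannot make.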

Temperature Delta: Module-Level Thermal Gradient Monitoring

The second primary precursor signal is thermal gradient — specifically, the delta between cell/module temperatures within a pack compared to the expected temperature profile for the current operating conditions. Under normal operation, cell temperatures within a module follow a predictable pattern: cells in the module center run 2–5°C warmer than edge cells due to reduced convective cooling access. This spatial gradient is consistent and predictable.

A cell developing elevated internal resistance — from SEI growth, lithium plating, or early-stage internal shorting — generates more heat per unit current than its neighbors. The anomaly manifests as an unexplained temperature elevation in that cell relative to the module's expected spatial profile. Early-stage thermal precursors may present as 3–8°C above expected temperature at normal operating C-rates. This is well below any threshold alarm level (typical overheat alarms trigger at 15–25°C above ambient), but it's statistically distinguishable from normal if a per-cell thermal model tracks expected vs. observed temperature at each measurement point.
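One way to sketch the expected-versus-observed comparison, under simplifying assumptions: learn each sensor position's typical offset from the module median, then report how far each cell in a new snapshot sits above or below that expected profile. The median reference, the function names, and the temperatures are illustrative choices; a fuller thermal model would also condition on current, ambient temperature, and cooling state.

```python
from statistics import mean, median


def positional_offsets(history):
    """Learn each position's typical offset from the module median.

    history: list of snapshots, each a list of cell temperatures in °C,
    all in the same sensor order.
    """
    n = len(history[0])
    per_position = [[] for _ in range(n)]
    for snapshot in history:
        ref = median(snapshot)
        for i, temp in enumerate(snapshot):
            per_position[i].append(temp - ref)
    return [mean(p) for p in per_position]


def thermal_residuals(snapshot, offsets):
    """Observed minus expected temperature for each cell position.

    The median is used as the reference so that a single hot cell does not
    drag the expected profile upward.
    """
    ref = median(snapshot)
    return [temp - (ref + off) for temp, off in zip(snapshot, offsets)]


# Hypothetical 5-cell module: the center cell runs warmest, edges coolest.
history = [[30, 33, 34, 33, 30], [31, 34, 35, 34, 31]]
offsets = positional_offsets(history)

# The fourth cell now reads 40 °C where the learned profile expects 35 °C.
residuals = thermal_residuals([32, 35, 36, 40, 32], offsets)
print(residuals)
```

Here the center cell is still the second-hottest reading in the module, yet its residual is zero because that elevation is part of the learned gradient; only the fourth cell stands out.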

The measurement requirement is granular. Module-level temperature sensing (a single thermocouple per module) doesn't provide the resolution to detect single-cell thermal elevation. Cell-level or sub-module-level temperature sensing, combined with a spatial thermal model, is the prerequisite. This is an infrastructure cost, and many BESS installations haven't fully instrumented for it. IEC 62619 (safety requirements for secondary lithium cells and batteries used in industrial applications) recommends cell-level temperature monitoring; practical implementation varies across hardware vendors.

Impedance Rise as a Degradation Trajectory Signal

Increasing internal impedance, particularly the ohmic resistance R0, is a reliable indicator of advancing cell degradation and elevated thermal runaway risk under high current loads. Joule heating follows P = I²R, so a cell with doubled internal resistance dissipates twice the heat at the same current, and every doubling of current multiplies that heat by four. At the high discharge rates common in FCAS (frequency control ancillary services) events, a severely degraded cell can reach thermal runaway temperatures even if it's within normal threshold bounds at low current.

Tracking impedance rise over time provides a trajectory signal: not "this cell is in danger right now" but "this cell is on a degradation trajectory that, at current rate, will bring it into elevated-risk territory within X weeks." This is the forward-looking component of precursor detection. Electrochemical impedance spectroscopy (EIS) provides the highest-resolution impedance characterization, but as discussed in our SoH measurement piece, field-deployable EIS requires careful implementation. A simpler but useful proxy is DC internal resistance (DCIR), the ratio of voltage change to current change across a current step. It can be derived from normal operational data at every charge or discharge event, providing a continuous impedance trend without any additional test procedure.
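The DCIR proxy and the "within X weeks" projection can be sketched as follows. This is a minimal example under stated assumptions: the function names are hypothetical, the risk limit is a placeholder the operator would have to define, and the projection is a straight-line extrapolation, which real degradation does not necessarily follow.

```python
def dcir_ohms(v_before, v_after, i_before, i_after):
    """DC internal resistance from a current step: |delta V / delta I|."""
    delta_i = i_after - i_before
    if delta_i == 0:
        raise ValueError("no current step between the two samples")
    return abs((v_after - v_before) / delta_i)


def weeks_until(limit, weekly_values):
    """Weeks from the latest sample until a linear trend reaches `limit`.

    Fits a least-squares line to one value per week. Returns None if the
    trend is flat or falling, and 0.0 if the fitted line is already past
    the limit. Linear extrapolation is an assumption of this sketch.
    """
    n = len(weekly_values)
    x_bar = (n - 1) / 2
    y_bar = sum(weekly_values) / n
    num = sum((i - x_bar) * (y - y_bar) for i, y in enumerate(weekly_values))
    den = sum((i - x_bar) ** 2 for i in range(n))
    trend = num / den
    if trend <= 0:
        return None
    fitted_now = y_bar + trend * ((n - 1) - x_bar)
    return max(0.0, (limit - fitted_now) / trend)


# A 100 A discharge step pulls the cell from 3.70 V to 3.62 V: about 0.8 mΩ.
print(dcir_ohms(3.70, 3.62, 0.0, 100.0))

# Hypothetical weekly DCIR in mΩ, projected against a placeholder 1.5 mΩ limit.
print(weeks_until(1.5, [1.0, 1.1, 1.2, 1.3]))
```

In practice the two voltage samples need to bracket the step closely in time, so that the ratio reflects the ohmic response rather than slower diffusion effects.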

UL 9540A and What Test Data Actually Tells You

UL 9540A is the test method for evaluating thermal runaway propagation in battery energy storage systems. It defines a standardized abuse test procedure: a target cell within a module is driven into thermal runaway (typically by overcharge or external heating), and the test measures whether the thermal runaway propagates to adjacent cells, to the module level, to the enclosure level, and potentially to adjacent enclosures.

UL 9540A test data, which BESS vendors provide as part of the system safety documentation, tells you how a fresh, fully charged pack behaves when a worst-case initiating event occurs. It does not tell you how an aged pack behaves. The thermal runaway propagation behavior of a 3-year-old NMC pack at 85% SoH, with elevated cell-to-cell impedance variance and localized capacity fade, is different from the original test configuration. An aged pack with high internal resistance cells will generate more heat from an initiating event; a pack with cell-to-cell SoH spread may see faster propagation if the highest-impedance cells are also at the highest SoC.

This is the limitation of treating UL 9540A compliance as a static safety certification. The test establishes that a specific design, at a specific state, has bounded propagation behavior. Ongoing monitoring — SoH tracking, impedance monitoring, cell-level temperature surveillance — is the operational complement that maintains safety confidence as the pack ages beyond the tested condition.

NFPA 855 Alarm Tiers and Where Predictive Monitoring Fits

NFPA 855 (Standard for the Installation of Stationary Energy Storage Systems), 2023 edition, distinguishes between a Level 1 alarm (pre-alarm, operator notification) and a Level 2 alarm (evacuation trigger). Predictive precursor scoring (voltage divergence trends, thermal anomalies, impedance elevation) maps to Level 1 alarm conditions. The standard explicitly permits proactive early-warning systems that notify operators before any mandatory evacuation or suppression threshold is crossed. This is the regulatory space predictive monitoring occupies: it supplements, rather than replaces, the mandatory threshold-based Level 2 alarms.

NFPA 855 also specifies gas detection requirements — hydrogen (H₂) and VOCs are off-gassing indicators that precede visible thermal events. The three modalities form a time-layered safety stack: electrical precursor signals (voltage divergence, impedance rise) provide days-to-weeks advance notice; gas detection provides hours; temperature threshold alarms provide minutes. An adequate safety architecture uses all three.

Moving from Alarm to Prediction: What the Architecture Looks Like

Shifting from threshold alarm monitoring to predictive precursor scoring requires three components that most BMS architectures don't natively provide.

First, high-resolution baseline data. Precursor anomaly detection requires a model of normal behavior to detect departures from it. Building that model requires weeks of normal operational data at sub-second or 1-second resolution for voltage and temperature, across all cells. A BMS that samples temperature at 5-minute intervals and doesn't expose cell-level voltage to external systems can't support predictive precursor detection regardless of what analytics is applied downstream.

Second, per-cell statistical modeling. The anomaly detector needs to track the distribution of voltage divergence, temperature delta, and impedance proxy for each cell individually — not just the fleet or pack average. A cell that consistently runs 4°C warmer than its neighbors is normal if that's its characteristic position in the module's thermal gradient. The same cell running 9°C above its own baseline is an anomaly. This distinction requires cell-level history, not just pack-level statistics.
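To make the per-cell distinction concrete, here is a small sketch that keeps running statistics for each cell's temperature delta and scores new readings against that cell's own history. It uses Welford's online algorithm so no raw history needs to be stored. The class names, the `min_samples` default, and the cell ID are hypothetical.

```python
import math


class RunningStats:
    """Online mean and sample standard deviation (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def std(self):
        return math.sqrt(self._m2 / (self.n - 1)) if self.n > 1 else 0.0


class CellBaseline:
    """Tracks each cell's own temperature delta relative to its neighbors."""

    def __init__(self, min_samples=30):
        self.min_samples = min_samples  # illustrative warm-up length
        self._stats = {}

    def observe(self, cell_id, delta_c):
        self._stats.setdefault(cell_id, RunningStats()).update(delta_c)

    def deviation(self, cell_id, delta_c):
        """Standard deviations from this cell's own mean, or None if the
        cell has too little history to judge."""
        s = self._stats.get(cell_id)
        if s is None or s.n < self.min_samples or s.std == 0:
            return None
        return (delta_c - s.mean) / s.std


baseline = CellBaseline()
for k in range(40):
    # This cell habitually runs about 4 °C warmer than its neighbors.
    baseline.observe("cell-07", 3.5 if k % 2 == 0 else 4.5)

print(baseline.deviation("cell-07", 4.0))   # its normal operating point
print(baseline.deviation("cell-07", 13.0))  # 9 °C above its own baseline
```

A pack-level statistic would treat the habitual 4°C offset as mildly suspicious forever; the per-cell baseline absorbs it and reserves the alert for the departure.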

Third, trend rate monitoring with adaptive thresholds. A single anomalous reading has different significance than a consistent trend over 72 hours. Precursor scoring should weight persistent, accelerating trends more heavily than transient spikes, which are often caused by measurement noise, asymmetric thermal exposure, or temporary BMS balancing activity rather than genuine degradation onset.
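A deliberately simple sketch of weighting persistence over transient spikes: the score is the fraction of a 72-hour window in which a cell's anomaly z-score exceeded a limit, halved if the signal is not rising between the first and second half of the window. The function name, the z-limit of 3, and the 0.5 down-weighting are all hypothetical choices for illustration; a real scoring model would be tuned against labeled events.

```python
def precursor_score(hourly_z, z_limit=3.0):
    """Score in 0..1 that rewards sustained, rising anomalies.

    hourly_z: one anomaly z-score per hour, oldest first.
    """
    n = len(hourly_z)
    if n < 4:
        return 0.0  # too little data to talk about a trend

    # Persistence: share of the window spent above the limit.
    persistence = sum(z > z_limit for z in hourly_z) / n

    # Direction: compare the mean of the second half with the first half.
    half = n // 2
    first = sum(hourly_z[:half]) / half
    second = sum(hourly_z[half:]) / (n - half)
    rising = second > first

    return persistence * (1.0 if rising else 0.5)


# One noisy reading in an otherwise quiet 72-hour window.
spike = [0.2] * 71 + [6.0]

# A steady climb over the same 72 hours.
trend = [h / 10 for h in range(72)]

print(precursor_score(spike), precursor_score(trend))
```

The single spike has a higher peak z-score than most of the ramp, but it scores near zero because it occupies only one hour of the window.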

What We're Not Saying

We're not saying that predictive precursor detection eliminates thermal runaway risk. No monitoring system can prevent a catastrophic initiating event such as a manufacturing defect, a severe external short, or an extreme mechanical impact. What precursor detection does is narrow the window of undetected degradation between the first signs of abnormal electrochemical behavior and the point at which human intervention becomes necessary. For assets that are inspected monthly by a maintenance contractor, continuous software monitoring can be the difference between catching a developing problem in week 2 and discovering it at the point of threshold alarm in month 3.

We're also not saying that NFPA 855 compliance alone constitutes an adequate safety posture for high-utilization utility-scale BESS. The standard sets minimum requirements. Operators who run assets at high SoC, in warm climates, with aggressive FCAS dispatch profiles, are operating in conditions that stress cells more than the median installation that informed the standard's prescriptive thresholds. For those operators, predictive monitoring isn't a supplement to compliance — it's the mechanism that keeps the compliance case current as the asset ages into territory that the initial certification didn't characterize.

