Thermal runaway does not announce itself. By the time a temperature sensor trips a BMS protection threshold, the electrochemical cascade that drives the event has been in progress for hours — sometimes days. The heat spike that takes a cell from operational to uncontrolled exothermic reaction in under a minute is the end of a chain, not the beginning.
This article focuses on what happens in the 72 hours before that final escalation: the measurable, detectable signals that precede thermal runaway events in grid-scale lithium-ion systems. After several years of field diagnostics on large-format LFP systems, our consistent finding from post-incident analysis is that the data needed to detect these events earlier was there. It just wasn't being read with the right model.
The Electrochemical Pathway to Thermal Runaway
Thermal runaway in lithium-ion cells follows a relatively well-defined pathway, even though the initiating event can vary. The standard thermal abuse sequence looks like this:
1. Initiating trigger: Internal short circuit (from a lithium dendrite penetrating the separator, a manufacturing defect, or particle contamination), external short circuit, mechanical deformation, or extreme overcharge. This generates local heat.
2. SEI decomposition (~90–120°C): The solid electrolyte interphase layer begins breaking down exothermically, generating heat and CO₂ gas. Internal gas pressure begins to rise.
3. Separator melt (~130°C in PE separators, ~150°C in PP): The separator begins to melt or shrink, accelerating contact between anode and cathode.
4. Electrolyte decomposition (~150–200°C): Organic electrolyte solvents vaporize and react, generating flammable gases including ethylene, methane, and hydrogen.
5. Cathode decomposition (~200–250°C for NMC, higher for LFP): The cathode releases oxygen, driving the exothermic reaction that produces flame and explosion risk in severe events.
Stages 1 and 2 — trigger through early SEI decomposition — are often electrically detectable before the thermal cascade becomes self-sustaining. The challenge is knowing what to look for.
Voltage Signature Precursors: What 12–72 Hours Out Looks Like
Internal micro-short circuits are the most common precursor to thermal runaway in field-deployed grid storage systems. A micro-short creates a small, continuous discharge path inside the cell — a parasitic drain that the BMS is not accounting for in its SoC calculation.
In cell voltage time series, micro-short progression shows up as three characteristic patterns:
- Accelerated self-discharge at rest. A healthy cell in a stack that has been at rest for several hours should hold voltage closely. A cell with a micro-short will drift lower than its neighbors by a consistent margin — initially 5–10 mV, progressively wider as the short develops. This drift is often detectable 48–72 hours before any temperature anomaly.
- Anomalous recovery spike after discharge. When a cell with an internal short comes back from a discharge cycle, it shows an atypical voltage recovery profile: the rebound shape and rate differ from what the OCV-SoC curve predicts. The difference is subtle but measurable in high-frequency telemetry data.
- Resistance signature shift. DC internal resistance measured at the cell level increases as a short develops. In EIS measurements, the charge transfer resistance component rises while the Warburg diffusion element changes in shape. Resistance shifts in pulse-test data can serve as a proxy when EIS is unavailable.
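The pulse-test proxy in the last item can be sketched roughly as follows. This is a minimal illustration, not a production method: the `PulseSample` type, the function names, and the sign convention (positive current means discharge) are all assumptions introduced here, and a real pulse test would also need to control for temperature, SoC, and pulse duration.

```python
from dataclasses import dataclass


@dataclass
class PulseSample:
    """One telemetry sample around a current step (hypothetical structure)."""
    voltage_v: float  # cell terminal voltage in volts
    current_a: float  # cell current in amps; positive = discharge (assumed convention)


def dc_internal_resistance(before: PulseSample, during: PulseSample) -> float:
    """Estimate DC internal resistance in ohms from a single current step."""
    delta_i = during.current_a - before.current_a
    if abs(delta_i) < 1e-6:
        raise ValueError("current step too small to estimate resistance")
    # Terminal voltage sags as discharge current rises, so R = -dV / dI.
    return -(during.voltage_v - before.voltage_v) / delta_i


def resistance_shift_pct(baseline_ohm: float, current_ohm: float) -> float:
    """Percent change of the latest estimate relative to a stored baseline."""
    return 100.0 * (current_ohm - baseline_ohm) / baseline_ohm
```

Tracking `resistance_shift_pct` per cell over repeated pulses at comparable SoC and temperature gives a trend line; what shift counts as anomalous is a site-specific calibration question that this sketch does not answer.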
The 72-hour window is not a universal constant — it's an approximation based on observed field cases across NMC and LFP chemistries in the 100 Ah to 280 Ah cell format range. In practice we've seen precursor signals appear anywhere from 20 hours to over a week before the thermal event, depending on short circuit severity and ambient temperature. What the 72-hour framing captures is the typical detection window available under normal 10-minute polling intervals if someone is running the right analysis on the raw data.
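As one way to picture the rest-drift check described above, the sketch below compares each cell's rest voltage against the stack median and flags cells that sit below it by a margin and are still falling. The function names, the input layout (one list of rest samples per cell, aligned in time), and the use of the median as the reference are assumptions for illustration; the 5 mV default simply mirrors the lower end of the initial drift range mentioned earlier.

```python
from statistics import median


def rest_drift_mv(rest_voltages: dict[str, list[float]]) -> dict[str, list[float]]:
    """Per-cell deviation from the stack median at each rest sample, in mV."""
    n = min(len(series) for series in rest_voltages.values())
    drift: dict[str, list[float]] = {cell: [] for cell in rest_voltages}
    for i in range(n):
        reference = median([series[i] for series in rest_voltages.values()])
        for cell, series in rest_voltages.items():
            drift[cell].append((series[i] - reference) * 1000.0)
    return drift


def flag_drifting_cells(rest_voltages: dict[str, list[float]],
                        threshold_mv: float = 5.0) -> list[str]:
    """Cells whose latest deviation is at least threshold_mv below the median
    and lower than their deviation at the start of the rest period."""
    flagged = []
    for cell, series in rest_drift_mv(rest_voltages).items():
        if series[-1] <= -threshold_mv and series[-1] < series[0]:
            flagged.append(cell)
    return flagged
```

For example, with three cells where one falls from 3.329 V to 3.322 V over the rest period while the others hold near 3.330 V, `flag_drifting_cells` returns only that cell. A fielded version would also need to confirm the stack is genuinely at rest and to handle missing samples.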
Gas Generation Indicators
Gas generation is one of the most reliable thermal runaway precursors, but also one of the most inconsistently monitored. When SEI decomposition begins and electrolyte starts to decompose, cells generate gas internally. Prismatic and pouch cells will bulge visibly before they vent; cylindrical cells have built-in pressure relief vents that open at defined pressures.
Some BESS enclosures include gas detection sensors — typically monitoring for hydrogen (H₂), carbon monoxide (CO), and VOCs as a general indicator. When these trip, the cell has already progressed well into stage 2 or beyond. They are a last-resort detection layer, not an early warning system.
The mechanical consequence of internal pressure rise is sometimes detectable through structural impedance measurements in instrumented racks, and the voltage behavior of a gassing cell is subtly different from a non-gassing cell at equivalent SoC. Practically speaking, most operators work with voltage and temperature telemetry only. That narrows useful precursor monitoring to voltage signatures and temperature differential analysis.
Temperature Differential Analysis: The 2°C Signal
Individual cells in a rack are not all at the same temperature. Thermal gradients across a module are normal — cells at the edges run slightly cooler than cells in the center, and cells adjacent to cooling channels run cooler than those farther away. These gradients are predictable and site-specific, and a well-characterized BESS will have a thermal map showing expected steady-state gradients under different loading conditions.
What is not normal is a cell whose temperature is rising faster than its neighbors under the same electrical load. A cell with an active internal short generates heat in addition to the Joule heating from charge-discharge current. With reasonably good thermal monitoring (cell-level temperature sensors rather than module-level averages), this shows up as a cell consistently 1–3°C hotter than surrounding cells during discharge, even after accounting for gradient predictions.
From our field diagnostics work: a consistent 2°C or greater temperature differential at a single cell relative to its geometric neighbors, not explained by the module's thermal map, is a signal worth investigating immediately. It's not conclusive evidence of impending failure, but it should trigger a priority inspection, not a watch-and-wait.
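A rough sketch of that rule, assuming cell-level temperatures are available: subtract the neighbor mean and the offset the thermal map predicts for that position, then require the remainder to stay at or above 2°C for several samples in a row. The function names, the `expected_offset_c` parameter, and the choice of three consecutive samples as the meaning of "consistent" are illustrative assumptions, not values from field data.

```python
def unexplained_delta_c(cell_temp_c: float,
                        neighbor_temps_c: list[float],
                        expected_offset_c: float = 0.0) -> float:
    """Cell temperature minus the neighbor mean, less the offset the
    module's thermal map predicts for this cell position."""
    neighbor_mean = sum(neighbor_temps_c) / len(neighbor_temps_c)
    return (cell_temp_c - neighbor_mean) - expected_offset_c


def needs_priority_inspection(deltas_c: list[float],
                              threshold_c: float = 2.0,
                              min_consecutive: int = 3) -> bool:
    """True if the unexplained delta stays at or above threshold_c for
    min_consecutive samples in a row (assumed definition of 'consistent')."""
    run = 0
    for delta in deltas_c:
        run = run + 1 if delta >= threshold_c else 0
        if run >= min_consecutive:
            return True
    return False
```

Requiring a run of samples rather than a single reading keeps a one-off sensor glitch from triggering an inspection; the right run length depends on the sampling interval.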
The practical constraint is that many deployed systems have temperature sensors at the module level, not the cell level. A module with 16 cells reporting a single average temperature cannot resolve a 2°C anomaly at one cell out of 16 — the anomaly is washed out. This is an instrumentation gap that matters for safety monitoring, not just SoH diagnostics.
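The washout is easy to quantify. The snippet below assumes the module reports a plain arithmetic mean of its cell temperatures and that the other cells are unchanged; both are simplifying assumptions, and the function name is made up for this example.

```python
def module_average_shift_c(cell_anomaly_c: float, cells_per_module: int) -> float:
    """Shift in a module's mean temperature when exactly one cell
    runs cell_anomaly_c hotter and all other cells are unchanged."""
    return cell_anomaly_c / cells_per_module


print(module_average_shift_c(2.0, 16))  # 0.125
```

Under those assumptions, a 2°C anomaly at one cell of 16 moves the reported module temperature by only 0.125°C, which is hard to distinguish from ordinary variation in a module-level reading.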
Chemistry-Specific Patterns: LFP vs. NMC
LFP and NMC cells have different thermal runaway onset temperatures and different precursor profiles. NMC cathodes begin decomposing and releasing oxygen at lower temperatures (~200°C vs. ~270°C for LFP), which means NMC events escalate faster once they start. But LFP's much flatter OCV-SoC curve makes voltage-based anomaly detection harder — the signal-to-noise ratio for a micro-short voltage drift against LFP's flat plateau is lower than in NMC chemistry where the voltage curve is more differentiated.
For LFP systems, precursor monitoring should weight more toward temperature differentials and resistance signature changes, since voltage drift is a weaker signal. For NMC, voltage drift at rest and anomalous recovery profiles are more reliably diagnostic and detectable earlier.
Mixed-chemistry sites — increasingly common as operators consolidate storage portfolios — need chemistry-specific thresholds and detection logic. Running NMC thresholds on LFP data generates false positives from LFP's natural voltage plateau behavior; running LFP thresholds on NMC data will miss early voltage drift that is genuinely anomalous for NMC chemistry.
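One way to keep thresholds from leaking across chemistries is an explicit per-chemistry lookup that fails loudly rather than falling back to a default. The sketch below is hypothetical throughout: the `PrecursorThresholds` type, the `thresholds_for` function, and especially the numbers in `SITE_REGISTRY` are placeholders for illustration, not calibrated or recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrecursorThresholds:
    rest_drift_mv: float  # rest-voltage deviation treated as anomalous
    temp_delta_c: float   # unexplained cell-to-neighbor temperature delta


def thresholds_for(chemistry: str,
                   registry: dict[str, PrecursorThresholds]) -> PrecursorThresholds:
    """Return the thresholds calibrated for one chemistry, with no default fallback."""
    key = chemistry.strip().upper()
    if key not in registry:
        raise KeyError(f"no precursor thresholds calibrated for chemistry {chemistry!r}")
    return registry[key]


# Placeholder numbers for illustration only; real values come from site calibration.
SITE_REGISTRY = {
    "NMC": PrecursorThresholds(rest_drift_mv=5.0, temp_delta_c=2.0),
    "LFP": PrecursorThresholds(rest_drift_mv=10.0, temp_delta_c=2.0),
}
```

Raising on an unknown chemistry is a deliberate choice in this sketch: silently applying another chemistry's thresholds is exactly the failure mode described above.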
What a 72-Hour Detection Window Enables Operationally
Seventy-two hours of lead time is enough to do five things that matter:
- Dispatch a field technician for a priority inspection during business hours rather than an emergency callout at 2 AM
- Pre-position a replacement module at the site before the affected one needs to come offline
- Execute a controlled isolation of the suspect rack from the active string, reducing cascade risk to adjacent racks
- Prepare a warranty documentation package capturing the clean pre-event telemetry record rather than scrambling to reconstruct it post-incident
- Notify the grid operator that one rack may be coming offline in a planned maintenance window, avoiding an unplanned outage notification
Each of those actions has a measurable cost difference from the equivalent reactive response. The difference between a planned rack isolation and an emergency thermal event is not just safety — it's the difference between a roughly $15,000 maintenance action and an incident that can cost $2M or more with regulatory consequences and insurance implications.
The precursor signals are in the telemetry. Most deployed BMS platforms log cell voltage at 1–10 Hz, cell temperature at 1 Hz or better, and in some cases resistance measurements at intervals. The 72-hour detection window exists for systems that have an analytical layer reading those streams against calibrated electrochemical models. Without that layer, the signals are invisible — the BMS protection logic was not designed to catch them, and raw historian data sitting in a database does nothing on its own.