Description
(Note: This issue was drafted with the assistance of an AI assistant to summarize debugging steps taken by a system administrator.)
I am reporting an issue observed on production nodes running a Substrate-based chain. We encountered a state where the Peerset Manager channel became permanently saturated, effectively halting block import, despite ample hardware resources.
While I cannot verify the exact code version of the binary, the symptoms strongly suggest an architectural bottleneck in how sc-network handles a "corrupted" or "toxic" peer store.
The Symptoms
Two specific nodes (out of a fleet of identical hardware) triggered the following Prometheus alert and remained in this state for days:
UnboundedChannelPersistentlyLarge (substrate_unbounded_channel_size >= 1000, entity="mpsc-peerset-protocol")
System State (vmstat) during failure:
- CPU: Mostly idle (id ~90%).
- Disk I/O: Zero (bo ~0), meaning no blocks were being imported/finalized.
- Context Switches (cs): Extremely high (~300,000/s).
- Run Queue (r): Consistently high (9–13), indicating thread contention.
This data suggests the node was suffering from an internal event storm—waking up constantly to handle network events (likely connection failures/handshakes) but failing to process them fast enough to drain the mpsc-peerset-protocol channel.
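To illustrate the mechanism I suspect (outside of Substrate entirely), here is a minimal Rust sketch of a producer flooding an unbounded futures mpsc channel faster than the consumer can drain it: the backlog only ever grows while the process stays busy, which matches the channel-size metric and the idle-CPU/high-context-switch picture above. This is not sc-network code; it just assumes the tokio and futures crates as dependencies.

```rust
use std::sync::{
    atomic::{AtomicU64, Ordering},
    Arc,
};
use std::time::Duration;

use futures::{channel::mpsc, StreamExt};

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::unbounded::<u64>();
    let produced = Arc::new(AtomicU64::new(0));

    // Producer: bursts of "peer events" (dial failures, handshakes, disconnects)
    // arriving much faster than the consumer can handle them.
    let produced_by_tx = Arc::clone(&produced);
    tokio::spawn(async move {
        loop {
            for _ in 0..100 {
                let id = produced_by_tx.fetch_add(1, Ordering::Relaxed);
                if tx.unbounded_send(id).is_err() {
                    return;
                }
            }
            tokio::time::sleep(Duration::from_millis(1)).await;
        }
    });

    // Consumer: handling one event takes longer than producing one hundred,
    // so the unbounded channel's internal queue only ever grows.
    let mut processed: u64 = 0;
    while let Some(_event) = rx.next().await {
        tokio::time::sleep(Duration::from_millis(1)).await;
        processed += 1;
        if processed % 500 == 0 {
            let backlog = produced.load(Ordering::Relaxed) - processed;
            println!("processed {processed} events, channel backlog ~{backlog}");
        }
    }
}
```

Nothing in the unbounded channel ever pushes back on the producer, so as long as the event source keeps firing, the queue length climbs by roughly the difference between the production and consumption rates.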
The Fix
The issue persisted across standard restarts. The only action that resolved it was:
- Stopping the node.
- Deleting the network database (e.g., rm -rf chains/<chain>/network).
- Restarting the node.
System State immediately after the fix:
- Disk I/O: Spiked to >1MB/s (catch-up syncing started immediately).
- Run Queue (r): Dropped to normal levels (2–3).
- Context Switches: Normalized.
- Channel Alert: Cleared.
Hypothesis & Suggestion
It appears the Peerset Manager (or the underlying networking logic) entered a pathological loop where it repeatedly attempted to connect to a set of "bad" peers persisted in the local database. The volume of resulting connection/disconnection events overwhelmed the unbounded channel.
The core issue seems to be a lack of self-healing: The node trusted its local peer store more than the live network reality, and could not "ban" or prune these toxic peers fast enough to recover, requiring manual intervention (DB wipe).
Question for the Team:
Is there existing logic in sc-network to detect and flush a toxic peer store if the mpsc-peerset-protocol channel remains saturated for an extended period? If not, would it be possible to implement a backoff or "emergency prune" mechanism to prevent this zombie state?
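To make the suggestion concrete, below is a rough, purely hypothetical Rust sketch of the kind of watchdog I have in mind. None of these names (PeerStore, prune_unreachable, SaturationWatchdog) exist in sc-network, and the real peer store/peerset wiring would differ; the point is only that if the channel backlog stays above a threshold for a grace period, the node could prune repeatedly-failing peers itself instead of waiting for an operator to delete the network directory.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the persisted peer store.
struct PeerStore {
    /// Peer id -> consecutive failed dial attempts.
    failures: HashMap<String, u32>,
}

impl PeerStore {
    /// Drop peers that have failed repeatedly, so the dialer stops
    /// regenerating events for them.
    fn prune_unreachable(&mut self, max_failures: u32) -> usize {
        let before = self.failures.len();
        self.failures.retain(|_, fails| *fails < max_failures);
        before - self.failures.len()
    }
}

/// Watchdog: if the event channel stays saturated for `grace`, trigger an
/// emergency prune instead of relying on a manual DB wipe.
struct SaturationWatchdog {
    threshold: usize,
    grace: Duration,
    saturated_since: Option<Instant>,
}

impl SaturationWatchdog {
    fn new(threshold: usize, grace: Duration) -> Self {
        Self { threshold, grace, saturated_since: None }
    }

    /// Called periodically with the current channel length; returns true once
    /// the backlog has stayed above the threshold for the whole grace period.
    fn on_sample(&mut self, channel_len: usize, now: Instant) -> bool {
        if channel_len < self.threshold {
            self.saturated_since = None;
            return false;
        }
        let since = *self.saturated_since.get_or_insert(now);
        now.duration_since(since) >= self.grace
    }
}

fn main() {
    let mut store = PeerStore {
        failures: HashMap::from([("peer-a".into(), 50), ("peer-b".into(), 2)]),
    };
    let mut watchdog = SaturationWatchdog::new(1_000, Duration::from_secs(600));

    // Pretend we sample the channel length (the same value the
    // substrate_unbounded_channel_size metric exposes) every so often.
    let t0 = Instant::now();
    let mut should_prune = watchdog.on_sample(5_000, t0);
    should_prune = watchdog.on_sample(5_000, t0 + Duration::from_secs(601)) || should_prune;

    if should_prune {
        let pruned = store.prune_unreachable(10);
        println!("emergency prune removed {pruned} peer(s)");
    }
}
```

Even a crude mechanism along these lines (with a generous threshold and grace period so it never fires on healthy nodes) would have let the two affected machines recover without manual intervention.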