Description
(Note: This issue was drafted with the assistance of an AI assistant to summarize debugging steps taken by a system administrator.)
I am reporting an issue observed on production nodes running a Substrate-based chain. We encountered a state where the Peerset Manager channel became permanently saturated, effectively halting block import, despite ample hardware resources.
While I cannot verify the exact code version of the binary, the symptoms strongly suggest an architectural bottleneck in how sc-network handles a "corrupted" or "toxic" peer store.
The Symptoms
Two specific nodes (out of a fleet of identical hardware) triggered the following Prometheus alert and remained in this state for days:
UnboundedChannelPersistentlyLarge (substrate_unbounded_channel_size >= 1000, entity="mpsc-peerset-protocol")
System State (vmstat) during failure:
- CPU: Mostly idle (id ~90%).
- Disk I/O: Zero (bo ~0), meaning no blocks were being imported/finalized.
- Context Switches (cs): Extremely high (~300,000/s).
- Run Queue (r): Consistently high (9–13), indicating thread contention.
This data suggests the node was suffering from an internal event storm—waking up constantly to handle network events (likely connection failures/handshakes) but failing to process them fast enough to drain the mpsc-peerset-protocol channel.
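To illustrate the mechanism I suspect (outside of Substrate entirely), here is a minimal Rust sketch of a producer flooding an unbounded futures mpsc channel faster than the consumer can drain it: the backlog only ever grows while the process stays busy, which matches the channel-size metric and the idle-CPU/high-context-switch picture above. This is not sc-network code; it just assumes the tokio and futures crates as dependencies.

```rust
use std::sync::{
    atomic::{AtomicU64, Ordering},
    Arc,
};
use std::time::Duration;

use futures::{channel::mpsc, StreamExt};

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::unbounded::<u64>();
    let produced = Arc::new(AtomicU64::new(0));

    // Producer: bursts of "peer events" (dial failures, handshakes, disconnects)
    // arriving much faster than the consumer can handle them.
    let produced_by_tx = Arc::clone(&produced);
    tokio::spawn(async move {
        loop {
            for _ in 0..100 {
                let id = produced_by_tx.fetch_add(1, Ordering::Relaxed);
                if tx.unbounded_send(id).is_err() {
                    return;
                }
            }
            tokio::time::sleep(Duration::from_millis(1)).await;
        }
    });

    // Consumer: handling one event takes longer than producing one hundred,
    // so the unbounded channel's internal queue only ever grows.
    let mut processed: u64 = 0;
    while let Some(_event) = rx.next().await {
        tokio::time::sleep(Duration::from_millis(1)).await;
        processed += 1;
        if processed % 500 == 0 {
            let backlog = produced.load(Ordering::Relaxed) - processed;
            println!("processed {processed} events, channel backlog ~{backlog}");
        }
    }
}
```

Nothing in the unbounded channel ever pushes back on the producer, so as long as the event source keeps firing, the queue length climbs by roughly the difference between the production and consumption rates.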
The Fix
The issue persisted across standard restarts. The only action that resolved it was:
- Stopping the node.
- Deleting the network database (e.g., rm -rf chains/<chain>/network).
- Restarting the node.
System State immediately after the fix:
- Disk I/O: Spiked to >1MB/s (catch-up syncing started immediately).
- Run Queue (r): Dropped to normal levels (2–3).
- Context Switches: Normalized.
- Channel Alert: Cleared.
Hypothesis & Suggestion
It appears the Peerset Manager (or the underlying networking logic) entered a pathological loop where it repeatedly attempted to connect to a set of "bad" peers persisted in the local database. The volume of resulting connection/disconnection events overwhelmed the unbounded channel.
The core issue seems to be a lack of self-healing: The node trusted its local peer store more than the live network reality, and could not "ban" or prune these toxic peers fast enough to recover, requiring manual intervention (DB wipe).
Question for the Team:
Is there existing logic in sc-network to detect and flush a toxic peer store if the mpsc-peerset-protocol channel remains saturated for an extended period? If not, would it be possible to implement a backoff or "emergency prune" mechanism to prevent this zombie state?
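To make the suggestion concrete, below is a rough, purely hypothetical Rust sketch of the kind of watchdog I have in mind. None of these names (PeerStore, prune_unreachable, SaturationWatchdog) exist in sc-network, and the real peer store/peerset wiring would differ; the point is only that if the channel backlog stays above a threshold for a grace period, the node could prune repeatedly-failing peers itself instead of waiting for an operator to delete the network directory.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the persisted peer store.
struct PeerStore {
    /// Peer id -> consecutive failed dial attempts.
    failures: HashMap<String, u32>,
}

impl PeerStore {
    /// Drop peers that have failed repeatedly, so the dialer stops
    /// regenerating events for them.
    fn prune_unreachable(&mut self, max_failures: u32) -> usize {
        let before = self.failures.len();
        self.failures.retain(|_, fails| *fails < max_failures);
        before - self.failures.len()
    }
}

/// Watchdog: if the event channel stays saturated for `grace`, trigger an
/// emergency prune instead of relying on a manual DB wipe.
struct SaturationWatchdog {
    threshold: usize,
    grace: Duration,
    saturated_since: Option<Instant>,
}

impl SaturationWatchdog {
    fn new(threshold: usize, grace: Duration) -> Self {
        Self { threshold, grace, saturated_since: None }
    }

    /// Called periodically with the current channel length; returns true once
    /// the backlog has stayed above the threshold for the whole grace period.
    fn on_sample(&mut self, channel_len: usize, now: Instant) -> bool {
        if channel_len < self.threshold {
            self.saturated_since = None;
            return false;
        }
        let since = *self.saturated_since.get_or_insert(now);
        now.duration_since(since) >= self.grace
    }
}

fn main() {
    let mut store = PeerStore {
        failures: HashMap::from([("peer-a".into(), 50), ("peer-b".into(), 2)]),
    };
    let mut watchdog = SaturationWatchdog::new(1_000, Duration::from_secs(600));

    // Pretend we sample the channel length (the same value the
    // substrate_unbounded_channel_size metric exposes) every so often.
    let t0 = Instant::now();
    let mut should_prune = watchdog.on_sample(5_000, t0);
    should_prune = watchdog.on_sample(5_000, t0 + Duration::from_secs(601)) || should_prune;

    if should_prune {
        let pruned = store.prune_unreachable(10);
        println!("emergency prune removed {pruned} peer(s)");
    }
}
```

Even a crude mechanism along these lines (with a generous threshold and grace period so it never fires on healthy nodes) would have let the two affected machines recover without manual intervention.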