"The Sufficient Neuron"

Attribution methods tell you which neurons contributed to an output. They do not tell you which neurons were sufficient.

WASD reverses the question. Instead of asking which neurons had the largest gradient or activation for a given output, it asks: what is the minimal set of neuron-activation predicates that guarantees a specific token will be generated, regardless of what the input says? The framework searches iteratively for these sufficient conditions — small sets of neurons whose activation states, once fixed, determine the output with certainty.
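The search described above can be sketched in miniature. The snippet below is an assumption-laden illustration, not WASD's actual algorithm: it uses a hand-wired toy "model" with four hidden neurons, and a simple greedy shrink (start from clamping everything, then drop each neuron whose clamp is not needed) as a stand-in for the framework's iterative search. A candidate set counts as sufficient if clamping it to reference activations forces the target output on every probe input.

```python
# Toy sufficiency search: greedy shrink from the full (trivially sufficient)
# clamp set down to a minimal one. All names and the model are illustrative.

def model(x, clamp=None):
    """Toy 2-layer net: 4 ReLU hidden neurons, binary output token.
    `clamp` maps neuron index -> fixed activation, overriding the input."""
    h = [max(0.0, x[0] + x[1]),   # h0
         max(0.0, x[0] - x[1]),   # h1
         max(0.0, -x[0]),         # h2
         max(0.0, x[1])]          # h3
    if clamp:
        for i, v in clamp.items():
            h[i] = v
    # emit token 1 iff the "positive" neurons outweigh the "negative" ones
    return 1 if h[0] + h[3] > h[1] + h[2] else 0

def is_sufficient(neurons, ref_acts, probe_inputs, target):
    """Does clamping `neurons` to reference activations force `target`
    on every probe input?"""
    clamp = {i: ref_acts[i] for i in neurons}
    return all(model(x, clamp) == target for x in probe_inputs)

def greedy_minimal_sufficient_set(ref_input, probe_inputs):
    """Record reference activations, then greedily drop clamps that are
    not needed to preserve sufficiency."""
    target = model(ref_input)
    x = ref_input
    ref_acts = {0: max(0.0, x[0] + x[1]), 1: max(0.0, x[0] - x[1]),
                2: max(0.0, -x[0]),       3: max(0.0, x[1])}
    keep = set(ref_acts)
    for i in sorted(ref_acts):
        trial = keep - {i}
        if is_sufficient(trial, ref_acts, probe_inputs, target):
            keep = trial
    return keep, target
```

Greedy shrinking only guarantees a locally minimal set (no single clamp can be dropped), which is the usual compromise when exact minimality would require an exponential search over subsets.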

The distinction matters because attribution is correlational while sufficiency is causal. A neuron might have high attribution for a sentiment classification because it co-activates with the true decision-making neurons. But a sufficient set, by definition, controls the output even when the input is adversarially modified. If you clamp these neurons, the model produces the target token no matter what you feed it.
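The clamping experiment implied here can be demonstrated on a toy one-layer model. This is a sketch under stated assumptions: the `with_clamp` wrapper stands in for a framework forward hook, and the model and neuron roles are invented for illustration, not taken from WASD.

```python
# Clamping demo: fixing two hidden activations forces the same output
# token for arbitrary (including adversarial) inputs. Illustrative only.

def hidden(x):
    """Three toy 'neurons' computed from a 2-d input."""
    return [max(0.0, x[0]), max(0.0, x[1]), max(0.0, x[0] * x[1])]

def readout(h):
    """Emit 'pos' iff the weighted positive evidence outweighs h1."""
    return "pos" if 2.0 * h[2] + h[0] > h[1] else "neg"

def with_clamp(layer, clamp):
    """Wrap `layer` so selected activations are overwritten after the
    forward pass -- the same pattern a forward hook implements."""
    def run(x):
        h = layer(x)
        for i, v in clamp.items():
            h[i] = v
        return h
    return run

# Clamp h2 high and h1 to zero: the readout can no longer flip.
clamped_hidden = with_clamp(hidden, {2: 5.0, 1: 0.0})
adversarial = [(-9.0, 9.0), (0.0, 0.0), (3.0, -7.0)]
outputs = {readout(clamped_hidden(x)) for x in adversarial}
```

Unclamped, the input `(-9.0, 9.0)` yields "neg"; with the clamp in place, every probe collapses to the same token, which is exactly the sufficiency guarantee the paragraph describes.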

On SST-2 sentiment and CounterFact knowledge editing with Gemma-2-2B, WASD's sufficient sets are more stable, more accurate, and more concise than conventional attribution graphs. The explanations are also actionable — by identifying which neuron activations suffice for cross-lingual output generation, the framework enables behavioral control without retraining.

The deeper point is architectural. Attribution maps are post-hoc summaries of what happened. Sufficient conditions are pre-hoc guarantees of what will happen. The shift from "which neurons were important?" to "which neurons are enough?" converts interpretation from description into mechanism.