A fully connected network and a convolutional network can represent the same functions. Given enough parameters, the fully connected network can implement any convolution, any pooling, any feature hierarchy that the CNN can express. The CNN's architecture imposes constraints — locality, weight sharing — that a fully connected network is free to learn or ignore.
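The containment can be made concrete for a one-dimensional signal: a convolution is exactly a fully connected layer whose weight matrix is banded and weight-tied (Toeplitz). A minimal numpy sketch, with illustrative sizes d = 16 and r = 3:

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 16, 3                       # input length, kernel (receptive field) size
x = rng.standard_normal(d)
k = rng.standard_normal(r)

# The convolution itself: slide the shared kernel over every valid window.
conv_out = np.array([x[i:i + r] @ k for i in range(d - r + 1)])

# The same map as a dense weight matrix: each row holds the kernel at one
# offset and zeros everywhere else -- a Toeplitz matrix the FC net *could* learn.
W = np.zeros((d - r + 1, d))
for i in range(d - r + 1):
    W[i, i:i + r] = k

dense_out = W @ x
assert np.allclose(conv_out, dense_out)
```

The dense network can represent this matrix; nothing forces it to learn the zeros or the tied rows.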
On high-dimensional spherical data, the fully connected network fails to generalize. The CNN succeeds.
The generalization gap is not about expressiveness. Both architectures can represent the target function. The gap is about what the architecture forces the learning algorithm to see.
A convolutional layer processes patches. Each patch is a local window of the input. If the input lives in d dimensions, a patch with receptive field size r lives in r dimensions. When r is small relative to d, each patch is a projection onto a low-dimensional slice of the ambient space. The convolution, by operating on patches, never encounters the full d-dimensional space. It learns in r dimensions, regardless of d.
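A sketch of this for one-dimensional signals (sizes and names are illustrative): grow the ambient dimension d and the set of vectors the kernel ever touches stays r-dimensional.

```python
import numpy as np

rng = np.random.default_rng(1)

def patches(x, r):
    """All local windows of length r from a 1-D signal."""
    return np.stack([x[i:i + r] for i in range(len(x) - r + 1)])

for d in (64, 1024, 16384):        # ambient dimension grows...
    x = rng.standard_normal(d)
    P = patches(x, r=5)
    # ...but every vector the convolution actually sees stays 5-dimensional.
    assert P.shape == (d - 5 + 1, 5)
```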
This is not a heuristic. The paper proves that CNNs achieve an n^{-1/6} generalization rate on spherical data when the receptive field is small, while fully connected networks fail entirely. The curse of dimensionality applies to the fully connected network because it must learn in the ambient dimension d. It does not apply to the CNN because the CNN learns in the patch dimension r, which is small by construction.
Weight sharing compounds the effect. A shared kernel applied across all positions means the network sees many samples from the patch distribution per training example. The effective sample size for the local learner is amplified by the number of patch positions: more positions mean more data at no additional cost.
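Back-of-the-envelope arithmetic with illustrative numbers (n examples of length d, kernel size r) makes the amplification visible:

```python
n, d, r = 100, 256, 5
positions = d - r + 1            # windows per example that the shared kernel sees

# A fully connected first layer sees n points in d dimensions.
fc_samples, fc_dim = n, d

# The shared kernel sees every window of every example:
# n * positions points, each in only r dimensions.
patch_samples, patch_dim = n * positions, r

print(fc_samples, fc_dim)        # 100 256
print(patch_samples, patch_dim)  # 25200 5
```

Two hundred fifty-two windows per example turn 100 training points into 25,200 samples for the local learner, each living in 5 dimensions instead of 256.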
The CNN does not overcome the curse of dimensionality by being a more powerful function approximator. It sidesteps it by never seeing the high-dimensional space. The architecture is a dimensionality reduction — a hard constraint that restricts the learning problem to a tractable submanifold — not a flexible tool that happens to work well on structured data.
The inductive bias is not a preference. It is a projection.