Gradient descent with exact gradients is well-understood. Gradient descent with noisy gradients is also well-understood, at least for common noise models: additive noise (stochastic gradient descent), multiplicative noise (relative error), or bounded error. Each model has its own convergence rates, its own lower bounds, its own algorithmic tricks.
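The three classical noise models can be sketched as gradient oracles. This is an illustrative NumPy sketch of the standard textbook definitions (the function names and parameter values are mine, not from the source):

```python
import numpy as np

def additive_noise(grad, sigma=0.1, rng=None):
    # SGD-style model: zero-mean noise with fixed scale sigma,
    # independent of the gradient magnitude.
    rng = np.random.default_rng() if rng is None else rng
    return grad + sigma * rng.standard_normal(grad.shape)

def multiplicative_noise(grad, delta=0.05, rng=None):
    # Relative-error model: perturbation magnitude scales with ||grad||.
    rng = np.random.default_rng() if rng is None else rng
    e = rng.standard_normal(grad.shape)
    return grad + delta * np.linalg.norm(grad) * e / np.linalg.norm(e)

def bounded_noise(grad, Delta=1e-3, rng=None):
    # Bounded-error model: perturbation magnitude capped by a fixed Delta,
    # no matter how large or small the true gradient is.
    rng = np.random.default_rng() if rng is None else rng
    e = rng.standard_normal(grad.shape)
    return grad + Delta * e / np.linalg.norm(e)
```

Note how the error floor differs: additive and bounded noise persist even as the gradient vanishes near a stationary point, while multiplicative noise shrinks with it.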
Real computations produce both kinds of noise simultaneously.
Floating-point arithmetic introduces relative error proportional to the gradient magnitude. Communication compression introduces absolute error bounded by a fixed tolerance. Biased gradient estimators combine both. The convergence theory for either noise model alone does not extend straightforwardly to their combination, because the two noise types interact with the optimization landscape differently.
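The combined setting above amounts to a composite oracle whose error is bounded by a relative term plus an absolute term, i.e. ||g̃ − ∇f(x)|| ≤ δ||∇f(x)|| + Δ. A minimal sketch of such an oracle, assuming the two components are drawn in random directions (the parameter names δ and Δ are illustrative):

```python
import numpy as np

def composite_noise_oracle(grad, delta=0.05, Delta=1e-3, rng=None):
    """Noisy gradient with both error types: a relative component of
    magnitude delta*||grad|| (e.g. floating-point round-off) plus an
    absolute component of magnitude Delta (e.g. a fixed compression
    tolerance)."""
    rng = np.random.default_rng() if rng is None else rng
    # Relative component: random direction, magnitude delta * ||grad||.
    u = rng.standard_normal(grad.shape)
    u *= delta * np.linalg.norm(grad) / np.linalg.norm(u)
    # Absolute component: random direction, magnitude Delta.
    v = rng.standard_normal(grad.shape)
    v *= Delta / np.linalg.norm(v)
    return grad + u + v
```

By the triangle inequality the total perturbation never exceeds δ||∇f(x)|| + Δ, which is the composite bound the combined analysis has to handle.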
The key finding is an asymmetry in how the noise components affect convergence. The algorithm can optimally accumulate absolute error — treating it as a budget to be spent over the course of optimization. But the convergence rate in the intermediate regime depends on the relative noise component, and that dependence scales with the condition number of the problem. Well-conditioned problems tolerate more relative noise; ill-conditioned problems amplify it.
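A back-of-envelope one-step bound illustrates why relative noise couples to conditioning (this is a standard quadratic sketch, not the source's exact analysis). Assume $f(x) = \tfrac{1}{2}(x - x^*)^\top A (x - x^*)$ with $\mu I \preceq A \preceq L I$ and condition number $\kappa = L/\mu$, and a noisy gradient $\tilde g_t$ with $\|\tilde g_t - \nabla f(x_t)\| \le \delta \|\nabla f(x_t)\|$. For gradient descent with step size $1/L$:

```latex
\|x_{t+1} - x^*\|
  = \left\| \left(I - \tfrac{1}{L}A\right)(x_t - x^*)
      - \tfrac{1}{L}\bigl(\tilde g_t - \nabla f(x_t)\bigr) \right\|
  \le \left(1 - \tfrac{1}{\kappa}\right)\|x_t - x^*\|
      + \tfrac{\delta}{L}\,\|A(x_t - x^*)\|
  \le \left(1 - \tfrac{1}{\kappa} + \delta\right)\|x_t - x^*\|.
```

The iterates contract only when $\delta < 1/\kappa$: a well-conditioned problem ($\kappa$ small) tolerates a sizable relative error, while an ill-conditioned one demands $\delta$ smaller than $1/\kappa$, which is exactly the amplification described above.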
The algorithmic response is a combination of restarts, regularization transformations, and adaptive stopping criteria that adjust to the noise composition. The algorithm becomes adaptive to the relative error parameter without knowing it in advance — it discovers the noise level from the optimization trajectory.
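One ingredient of such a scheme can be sketched concretely. The loop below is a hypothetical illustration of restarts with an adaptive stopping criterion, not the source's actual algorithm: each epoch runs plain gradient descent and stops early once the observed gradient norm stops shrinking, a trajectory-based proxy for having hit the current noise floor, without knowing the noise parameters in advance.

```python
import numpy as np

def restarted_gd(grad_oracle, x0, L, epochs=5, steps_per_epoch=200,
                 stall_tol=1e-12, patience=10):
    """Hypothetical restart loop: run gradient descent in epochs, and
    end an epoch early when the noisy gradient norm has not improved
    by more than stall_tol for `patience` consecutive steps."""
    x = x0.copy()
    for _ in range(epochs):
        best, stall = np.inf, 0
        for _ in range(steps_per_epoch):
            g = grad_oracle(x)
            gnorm = np.linalg.norm(g)
            if gnorm < best - stall_tol:
                best, stall = gnorm, 0     # still making progress
            else:
                stall += 1
                if stall >= patience:      # stalled: restart next epoch
                    break
            x -= g / L                     # standard 1/L step size
    return x
```

The stopping rule only inspects the observed trajectory, which is the sense in which such schemes can adapt to an unknown noise level.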
Matching lower bounds confirm that these rates are tight: no algorithm can do better under composite noise. The interaction between the two noise types is real, not an artifact of loose analysis. When your gradient oracle produces both relative and absolute error, the convergence rate is not simply the worse of the two individual rates; it is a new rate determined by their interaction with problem conditioning.