Manifold-Constrained Hyper-Connections: Stabilizing Deep Networks Beyond ResNets
A deep dive into why residual connections work, how Hyper-Connections generalize them, and why constraining learned skip paths to doubly stochastic matrices solves the instability problem
In this article, we’ll trace the evolution from basic neural networks to residual networks, then to Hyper-Connections, and finally to manifold-constrained Hyper-Connections (mHC). The goal is to understand not just what these techniques do, but why each step was necessary and what problems it solves.
1. The Signal Propagation Problem in Deep Networks
Before diving into residual connections, let’s understand the fundamental problem they were designed to solve.
1.1 A Simple Deep Network
Consider the simplest possible neural network where each layer multiplies the input by a weight:
$$x_{l+1} = w \cdot x_l$$
Let’s trace what happens with an input $x_0 = 10$ and the same weight $w < 1$ at each of three layers.
Forward pass (layer by layer):
- Layer 1: $x_1 = w x_0$
- Layer 2: $x_2 = w x_1 = w^2 x_0$
- Layer 3: $x_3 = w x_2 = w^3 x_0$
The signal shrinks at every layer. After many layers, it approaches zero—this is the vanishing signal problem.
1.2 The Opposite Problem: Exploding Signals
What if the weight is larger instead? With $w > 1$:
- Layer 1: $x_1 = w x_0 > x_0$
- Layer 2: $x_2 = w^2 x_0 > x_1$
- Layer 3: $x_3 = w^3 x_0 > x_2$
The signal explodes—this is the exploding signal problem.
The key insight here is that any weight not exactly equal to 1 will cause problems when applied repeatedly across many layers. This makes training very deep networks extremely difficult—you’re essentially trying to balance on a knife’s edge where weights need to be exactly right to maintain signal magnitude.
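The shrinking and growing traces can be reproduced in a few lines of Python. This is a minimal sketch: the helper name `propagate` and the weights 0.5 and 1.5 are assumptions chosen for illustration; only the input of 10 comes from the discussion.

```python
def propagate(x0, w, depth):
    """Return all activations of a chain of scalar layers x_{l+1} = w * x_l."""
    activations = [x0]
    for _ in range(depth):
        activations.append(w * activations[-1])
    return activations


# Assumed illustrative weights: one below 1, one above 1.
shrinking = propagate(10.0, 0.5, 3)
growing = propagate(10.0, 1.5, 3)
print(shrinking)  # [10.0, 5.0, 2.5, 1.25]
print(growing)    # [10.0, 15.0, 22.5, 33.75]

# The same weights over 50 layers: one signal is effectively zero,
# the other has grown by many orders of magnitude.
deep_shrunk = propagate(10.0, 0.5, 50)[-1]
deep_grown = propagate(10.0, 1.5, 50)[-1]
print(deep_shrunk, deep_grown)
```

Three layers look harmless in both cases; the problem only becomes obvious once the depth grows, because the factor compounds as $w^{\text{depth}}$.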
2. Residual Networks: The Identity Path Solution
ResNets introduced an elegantly simple fix: add a skip connection that preserves the original input.
2.1 The Residual Connection Formula
Instead of $x_{l+1} = F(x_l)$, ResNets use:
$$x_{l+1} = x_l + F(x_l)$$
Where $F$ represents any transformation (convolution, MLP, etc.). The critical addition is the $x_l$ term: the identity path.
2.2 Why the Identity Path Works
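Let’s trace through with the same setup: input $x_0 = 10$, the same weight $w$, using the residual form $x_{l+1} = x_l + w x_l$.
Forward pass (layer by layer):
- Layer 1: $x_1 = x_0 + w x_0 = (1 + w)\,x_0$
- Layer 2: $x_2 = x_1 + w x_1 = (1 + w)^2 x_0$
- Layer 3: $x_3 = x_2 + w x_2 = (1 + w)^3 x_0$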
This looks like growth, but it’s controlled growth, not collapse. The identity path ensures the original signal is always present. Even if the learned transformation produces small or noisy outputs, the core information flows through unchanged.
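To see why the identity path matters when the learned transformation contributes very little, here is a small sketch comparing a plain chain with a residual chain. The function names and the weight 0.01 are assumptions for illustration, not values from the article.

```python
def propagate_plain(x0, w, depth):
    """Plain chain: x_{l+1} = w * x_l."""
    x = x0
    for _ in range(depth):
        x = w * x
    return x


def propagate_residual(x0, w, depth):
    """Residual chain: x_{l+1} = x_l + w * x_l."""
    x = x0
    for _ in range(depth):
        x = x + w * x
    return x


# Assumed setup: a weak transformation (w = 0.01) stacked 50 times.
x0, w, depth = 10.0, 0.01, 50
plain_out = propagate_plain(x0, w, depth)        # collapses toward zero
residual_out = propagate_residual(x0, w, depth)  # stays on the order of x0

# With w = 0 the residual chain passes the input through untouched.
identity_out = propagate_residual(x0, 0.0, depth)
print(plain_out, residual_out, identity_out)
```

In the plain chain the input is wiped out after a few layers, while the residual chain still carries a signal of the same order as the original 10.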
2.3 The Two-Path Architecture
A residual block has two parallel paths:
        ┌─────────────┐
x_l ───►│ identity (x)│──┐
        └─────────────┘  │
                         ├──► x_{l+1}
        ┌─────────────┐  │
x_l ───►│    F(x)     │──┘
        └─────────────┘
Path 1 (Identity) sends $x_l$ through with no change. Path 2 (Residual) applies the learned transformation $F(x_l)$. The outputs are summed. That straight line at the top is the identity path; it’s what makes deep training stable.
2.4 Why Information is Preserved
In a network without skip connections, when $x_0 = 10$ becomes $x_1 = w x_0$, the original 10 is gone forever. The signal is replaced by a transformed version. With skip connections, the original information flows through unchanged while the residual path learns what to add. The network learns corrections rather than replacements. This is a subtle but profound shift in how we think about what each layer does.
3. Hyper-Connections: Learning the Skip Path
ResNets made a fixed choice: always add the full identity. But what if different features should be passed with different strengths? What if the optimal “skip behavior” varies across the network?
3.1 The Motivation
In very deep transformers, each layer has multiple residual streams:
- Attention streams
- MLP streams
- Skip connections across blocks
- Sometimes dozens of interacting paths
ResNet’s fixed identity treats all features equally. Hyper-Connections ask a natural question: What if we could learn how much each previous stream contributes?
3.2 The Hyper-Connection Formula
Instead of a fixed identity, Hyper-Connections introduce a learned mixing matrix $H$:
$$x_{l+1} = H x_l + F(x_l)$$
Here $H x_l$ replaces the identity path (which was previously just $x_l$), and $F(x_l)$ is the usual learned transformation.
3.3 What is $H$?
$H$ is a learned matrix that replaces the identity path. In ResNet, identity equals $I$ and is fixed. In Hyper-Connections, identity becomes $H$ and is learned. When $H = I$, you get exact ResNet behavior. When $H \neq I$, you get learned routing.
For a 2-neuron example, $H$ might look like:
$$H = \begin{pmatrix} 0.8 & 0.2 \\ 0.1 & 0.9 \end{pmatrix}$$
This means neuron 1 mostly keeps its value (0.8) with some mixing from neuron 2 (0.2), and neuron 2 mostly keeps its value (0.9) with some mixing from neuron 1 (0.1). The network can now learn cross-feature interactions in the skip path itself.
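The 2-neuron mixing described here is easy to check numerically. In this sketch the mixing matrix is called `H`, and the helper `hyper_connection`, the zero residual branch, and the input vector are all assumptions made for illustration.

```python
import numpy as np

# Mixing weights from the 2-neuron example: rows are outputs, columns are inputs.
H = np.array([[0.8, 0.2],
              [0.1, 0.9]])


def hyper_connection(x, residual_fn, H):
    """One layer with a learned skip path: x_{l+1} = H @ x_l + F(x_l)."""
    return H @ x + residual_fn(x)


def zero_residual(x):
    """Stand-in for F that contributes nothing, to isolate the skip path."""
    return np.zeros_like(x)


x = np.array([1.0, 2.0])
mixed = hyper_connection(x, zero_residual, H)              # [1.2, 1.9]
resnet_like = hyper_connection(x, zero_residual, np.eye(2))  # [1.0, 2.0]
print(mixed, resnet_like)
```

Setting `H` to the identity recovers the ordinary residual skip path, which is the special case the text describes.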
3.4 Why $H$ is Learnable
In a neural network, anything multiplied with the input can be learned through backpropagation. For two neurons, we have $H x_l$ where:
$$H = \begin{pmatrix} h_{11} & h_{12} \\ h_{21} & h_{22} \end{pmatrix}$$
Each $h_{ij}$ is just a scalar parameter, stored like any other weight and updated by gradient descent. There’s nothing special about it; it’s simply another weight matrix that happens to sit in the skip path.
3.5 How $H$ is Initialized
At the start of training, we want $H \approx I$. Why?
- Identity is stable
- Training starts safely
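- The model behaves like ResNet initially
A common initialization strategy is:
$$H = I + \epsilon$$
Where $\epsilon$ is a matrix of small random noise, so the diagonal entries of $H$ start near 1 and the off-diagonal entries start near 0.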
The matrix starts very close to identity, then learning adjusts it based on what the task requires.
Why not initialize randomly? If you start with an arbitrary random matrix, then even before training begins, repeatedly applying $H$ will cause explosion or collapse. The eigenvalues of such a matrix generally do not have magnitude 1, so repeated multiplication across 20 layers amplifies small deviations into catastrophic instability. This is why HC initialization must be identity-biased.
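A short numerical sketch makes the contrast concrete. Everything named here is an assumption for illustration: the helper `init_near_identity`, the noise scale 0.01, the seed, and the hand-picked "arbitrary" matrix; the depth of 20 follows the text.

```python
import numpy as np


def init_near_identity(n, scale=0.01, seed=0):
    """Identity plus small Gaussian noise (assumed initialization scheme)."""
    rng = np.random.default_rng(seed)
    return np.eye(n) + scale * rng.standard_normal((n, n))


def norm_after_depth(H, depth):
    """Spectral norm of H applied `depth` times in a row."""
    return float(np.linalg.norm(np.linalg.matrix_power(H, depth), 2))


H_safe = init_near_identity(2)

# A hand-picked matrix far from identity, standing in for a careless init.
H_arbitrary = np.array([[1.5, 0.7],
                        [-0.3, 2.0]])

safe_norm = norm_after_depth(H_safe, 20)
arbitrary_norm = norm_after_depth(H_arbitrary, 20)
print(safe_norm, arbitrary_norm)
```

The near-identity matrix stays close to norm 1 after 20 applications, while the arbitrary one has already grown by several orders of magnitude before any training has happened.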
3.6 The Appeal of Hyper-Connections
Hyper-Connections allow:
- Adaptive routing: Different features can take different paths
- Dynamic information flow: The network decides what to preserve
- Richer expressiveness: More flexibility than fixed skip connections
The idea is compelling: let the network decide how identity should behave rather than hard-coding it.
4. The Hidden Problem: Why Hyper-Connections Are Unstable
Here’s where things get interesting. Despite the appealing flexibility, Hyper-Connections have a fundamental mathematical problem.
4.1 The Core Issue
Residual connections work because $I^n = I$. The identity matrix raised to any power is still the identity. Applying it across 100 layers changes nothing.
But with Hyper-Connections, $H^n \neq I$ in general. After many layers, even if $H$ starts close to identity, repeated multiplication causes one dimension to dominate while another vanishes, and gradients explode or die. This happens even with just 2 neurons.
4.2 Why ResNet Doesn’t Have This Problem
ResNet never learns the identity. It’s fixed: $H = I$. Hyper-Connections learn it: $H$ is a free parameter. Learning identity is numerically dangerous when repeated across many layers because any deviation from perfect identity gets amplified exponentially.
4.3 What Goes Wrong During Training
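Even if $H = I$ at initialization, training updates the matrix:
$$H \leftarrow H - \eta \frac{\partial \mathcal{L}}{\partial H}$$
Let me break down what this gradient notation means. The term $\frac{\partial \mathcal{L}}{\partial H}$ represents the derivative of the loss $\mathcal{L}$ with respect to the matrix $H$, and $\eta$ is the learning rate. Since $H$ is a $2 \times 2$ matrix of numbers:
$$H = \begin{pmatrix} h_{11} & h_{12} \\ h_{21} & h_{22} \end{pmatrix}$$
We have four derivatives: $\frac{\partial \mathcal{L}}{\partial h_{11}}$, $\frac{\partial \mathcal{L}}{\partial h_{12}}$, $\frac{\partial \mathcal{L}}{\partial h_{21}}$, $\frac{\partial \mathcal{L}}{\partial h_{22}}$
Each derivative answers a simple question: If I slightly change this number, does the loss go up or down? For example, if $\frac{\partial \mathcal{L}}{\partial h_{11}} < 0$, it means increasing $h_{11}$ decreases the loss, which is a good direction to move.
We write this as a matrix to match the shape of $H$:
$$\frac{\partial \mathcal{L}}{\partial H} = \begin{pmatrix} \frac{\partial \mathcal{L}}{\partial h_{11}} & \frac{\partial \mathcal{L}}{\partial h_{12}} \\ \frac{\partial \mathcal{L}}{\partial h_{21}} & \frac{\partial \mathcal{L}}{\partial h_{22}} \end{pmatrix}$$
We subtract because the gradient points uphill, and we want to go downhill to minimize loss.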
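As a concrete instance of this update, the sketch below takes one gradient step on a toy loss. The loss $\frac{1}{2}\lVert Hx - y \rVert^2$, the vectors `x` and `y`, and the learning rate are all assumptions picked so the arithmetic is easy to follow; a real network would obtain the gradient through backpropagation.

```python
import numpy as np


def loss(H, x, y):
    """Toy loss: half the squared error between H @ x and a target y."""
    r = H @ x - y
    return 0.5 * float(r @ r)


def grad_loss(H, x, y):
    """Gradient of the toy loss with respect to H: outer(H @ x - y, x)."""
    return np.outer(H @ x - y, x)


H = np.eye(2)               # start exactly at identity
x = np.array([1.0, 2.0])    # assumed input
y = np.array([2.0, 1.0])    # assumed target
lr = 0.1                    # assumed learning rate

G = grad_loss(H, x, y)      # [[-1, -2], [1, 2]], same shape as H
H_new = H - lr * G          # [[1.1, 0.2], [-0.1, 0.8]]

print(loss(H, x, y), loss(H_new, x, y))  # the loss drops from 1.0 to 0.25
```

A single step already moves the matrix away from identity and even produces a negative entry, which is exactly the kind of drift the next section worries about.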
4.4 The Eigenvalue Problem
Small updates from gradient descent can nudge eigenvalues slightly above 1 or slightly below 1. After $n$ layers, $H^n$ either explodes or vanishes along the affected directions, because a component with eigenvalue $\lambda$ is scaled by $\lambda^n$. This isn’t a training bug; it’s math. Any matrix with an eigenvalue whose magnitude differs from 1 will cause problems when raised to large powers. The instability is baked into the structure of the problem.
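The effect of eigenvalues that sit only a few percent away from 1 can be checked directly. The two matrices below are assumptions constructed for illustration so that their eigenvalues are known in closed form; the depth of 100 is likewise an assumed value.

```python
import numpy as np


def spectral_radius(M):
    """Largest eigenvalue magnitude of M."""
    return float(np.max(np.abs(np.linalg.eigvals(M))))


def power_norm(M, n):
    """Spectral norm of M raised to the n-th power."""
    return float(np.linalg.norm(np.linalg.matrix_power(M, n), 2))


# Symmetric matrices [[a, b], [b, a]] have eigenvalues a + b and a - b.
H_up = np.array([[1.03, 0.02], [0.02, 1.03]])      # eigenvalues 1.05 and 1.01
H_down = np.array([[0.97, -0.02], [-0.02, 0.97]])  # eigenvalues 0.99 and 0.95

depth = 100
up_norm = power_norm(H_up, depth)          # about 1.05 ** 100
down_norm = power_norm(H_down, depth)      # about 0.99 ** 100
identity_norm = power_norm(np.eye(2), depth)
print(spectral_radius(H_up), up_norm, down_norm, identity_norm)
```

Both matrices look almost like the identity entry by entry, yet after 100 applications one has amplified the signal more than a hundredfold and the other has shrunk it to a fraction of its size.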
5. Manifold-Constrained Hyper-Connections (mHC)
The solution isn’t to abandon learned skip paths, but to constrain them to a “safe space” of matrices that remain stable under repeated multiplication.
5.1 The Core Idea
mHC doesn’t change backpropagation. Instead, it changes where $H$ is allowed to live. The principle is simple: Learn freely, then project back to a safe space.
Mathematically, this means (1) taking an unconstrained gradient step, then (2) projecting onto the set of safe matrices.
5.2 What Matrices Are “Safe” to Repeat?
We want a matrix $H$ such that:
- It doesn’t amplify values
- It doesn’t shrink values
- Repeating it many times stays stable
The safest operation on numbers is averaging. Consider $\frac{a + b}{2}$: the result always lies between $a$ and $b$. Averaging never explodes, never vanishes; it just redistributes.
5.3 Averaging with Matrices
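A matrix performs averaging if all entries are non-negative and each output is a weighted average of inputs. This happens when rows sum to 1. For example, with $0 \le a, b \le 1$:
$$H = \begin{pmatrix} a & 1 - a \\ b & 1 - b \end{pmatrix}$$
Apply this to $x = (x_1, x_2)^\top$:
$$H x = \begin{pmatrix} a x_1 + (1 - a) x_2 \\ b x_1 + (1 - b) x_2 \end{pmatrix}$$
Each output is a weighted average of $x_1$ and $x_2$, so it always lies between the smallest and largest input: no scaling, no explosion.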
5.4 Why Rows Alone Aren’t Enough
The forward pass uses $H$, but the backward pass (for computing gradients) uses $H^\top$. So row sums equaling 1 gives us a stable forward pass, but we also need column sums equaling 1 for a stable backward pass. To protect both directions, we need both constraints.
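The asymmetry between the two directions shows up even in a single step. In this sketch the matrices `H_row` and `H_ds`, the input `x`, and the upstream gradient `g` are assumed values chosen to make the effect visible; it illustrates one forward and one backward multiplication, not a full training run.

```python
import numpy as np

# Rows sum to 1, but columns sum to 1.8 and 0.2.
H_row = np.array([[0.9, 0.1],
                  [0.9, 0.1]])

# Rows and columns both sum to 1.
H_ds = np.array([[0.7, 0.3],
                 [0.3, 0.7]])

x = np.array([10.0, 20.0])   # assumed forward signal
g = np.array([1.0, 1.0])     # assumed gradient arriving from the next layer

forward_row = H_row @ x      # [11, 11]: stays between 10 and 20
backward_row = H_row.T @ g   # [1.8, 0.2]: one entry inflated, one starved
backward_ds = H_ds.T @ g     # [1.0, 1.0]: gradient passes through evenly

print(forward_row, backward_row, backward_ds)
```

With only the row constraint, the forward outputs are well behaved but the gradient is redistributed unevenly on the way back; adding the column constraint removes that imbalance.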
6. Doubly Stochastic Matrices
6.1 Definition
A matrix $H$ is doubly stochastic if:
- All entries are greater than or equal to zero
- Each row sums to 1
- Each column sums to 1
A simple example:
$$H = \begin{pmatrix} 0.5 & 0.5 \\ 0.5 & 0.5 \end{pmatrix}$$
This is perfect averaging: forward stable and backward stable.
6.2 Why Doubly Stochastic Matrices Behave Like Identity
These matrices have remarkable properties:
- The largest eigenvalue equals 1
- All other eigenvalues have magnitude at most 1
- The set is closed under multiplication (the product of two doubly stochastic matrices is itself doubly stochastic)
This means $H^n$ does not explode or vanish regardless of how large $n$ becomes. Repeated application behaves like identity plus smoothing, which is exactly what we want for skip paths.
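These properties can be verified numerically. The sketch below uses two assumed 3×3 doubly stochastic matrices `A` and `B` and a hypothetical helper `is_doubly_stochastic`; none of these come from the article.

```python
import numpy as np


def is_doubly_stochastic(M, tol=1e-9):
    """Non-negative entries, and every row and column sums to 1."""
    return bool(
        (M >= -tol).all()
        and np.allclose(M.sum(axis=1), 1.0, atol=tol)
        and np.allclose(M.sum(axis=0), 1.0, atol=tol)
    )


A = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.5, 0.3],
              [0.3, 0.2, 0.5]])
B = np.array([[0.1, 0.6, 0.3],
              [0.6, 0.3, 0.1],
              [0.3, 0.1, 0.6]])

radius = float(np.max(np.abs(np.linalg.eigvals(A))))   # largest |eigenvalue|
product_ok = is_doubly_stochastic(A @ B)                # closure under products
A_deep = np.linalg.matrix_power(A, 50)                  # 50 stacked applications
deep_ok = is_doubly_stochastic(A_deep)

x = np.array([3.0, -1.0, 4.0])
total_before, total_after = x.sum(), (A_deep @ x).sum()
print(radius, product_ok, deep_ok, total_before, total_after)
```

For this particular `A`, the 50th power is essentially the uniform averaging matrix: the entries of `x` are smoothed toward their mean, while their total is unchanged.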
6.3 The Birkhoff Polytope
The Birkhoff polytope is simply the set of all doubly stochastic matrices. For 2×2 matrices, it is the line segment between the identity matrix and the matrix that swaps the two coordinates, since a single number (the diagonal entry) determines the whole matrix.
A beautiful fact from linear algebra, the Birkhoff-von Neumann theorem, is that every doubly stochastic matrix is a weighted average of permutation matrices. This gives us deep intuition about what these matrices actually do.
6.4 What Are Permutation Matrices?
A permutation matrix just reorders neurons without changing magnitudes. For 2 neurons, there are only two possibilities.
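The identity permutation does nothing:
$$P_{\text{id}} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$$
This sends Neuron 1 to Neuron 1 and Neuron 2 to Neuron 2.
The swap permutation exchanges them:
$$P_{\text{swap}} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$$
This sends Neuron 1 to Neuron 2 and Neuron 2 to Neuron 1.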
Permutation matrices never change magnitude—they only move information around. This is why they’re perfectly stable.
6.5 Doubly Stochastic = Soft Permutation
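Take the two permutation matrices and compute a weighted average with weights $0.7$ and $0.3$:
$$H = 0.7 \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} + 0.3 \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} = \begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix}$$
This is doubly stochastic. The beautiful intuition is that every safe $H$ is a soft permutation of features. It’s not doing hard routing (this neuron goes there), but soft routing (70% of this neuron stays here, 30% goes there).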
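The 70/30 soft routing can be written out directly as a blend of the two permutation matrices. The helper name `soft_permutation` and the example input are assumptions for illustration.

```python
import numpy as np

IDENTITY = np.array([[1.0, 0.0],
                     [0.0, 1.0]])
SWAP = np.array([[0.0, 1.0],
                 [1.0, 0.0]])


def soft_permutation(keep):
    """Blend of 'stay in place' and 'swap', with weight `keep` on staying."""
    return keep * IDENTITY + (1.0 - keep) * SWAP


H = soft_permutation(0.7)     # [[0.7, 0.3], [0.3, 0.7]]
x = np.array([10.0, 20.0])    # assumed input
routed = H @ x                # [13, 17]
print(H, routed)
```

Each output is a mix of both inputs, and the total (30) is the same before and after routing.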
7. The Projection Algorithm: Sinkhorn-Knopp
7.1 The Problem
Gradient descent gives us $\tilde{H} = H - \eta \frac{\partial \mathcal{L}}{\partial H}$, but $\tilde{H}$ is generally not doubly stochastic. Entries may be negative, and row/column sums are wrong. We need to push it back into the Birkhoff polytope. That push is accomplished by the Sinkhorn-Knopp algorithm.
7.2 The Modified Update Rule
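Instead of the standard update $H \leftarrow H - \eta \frac{\partial \mathcal{L}}{\partial H}$, mHC uses:
$$H \leftarrow \Pi\left(H - \eta \frac{\partial \mathcal{L}}{\partial H}\right)$$
Where $\Pi$ represents projection onto doubly stochastic matrices. This is the entire mathematical change; everything else remains standard.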
7.3 Sinkhorn-Knopp: Step by Step
The algorithm turns any positive matrix into a doubly stochastic one through iterative normalization.
Step 0 makes entries positive using elementwise exponentiation: $M = \exp(H)$, where $H$ is the matrix after the gradient step. Now all entries are greater than 0.
Step 1 normalizes rows by dividing each row by its sum.
Step 2 normalizes columns by dividing each column by its sum.
Step 3 repeats steps 1 and 2, alternating between row and column normalization until convergence.
After a few iterations, rows sum to approximately 1 and columns sum to approximately 1. The algorithm:
- Converges fast
- Is differentiable (so we can backpropagate through it)
- Is computationally efficient
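The steps above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function name `sinkhorn_knopp`, the fixed count of 20 iterations, and the example input are assumptions. Subtracting the maximum before exponentiating is an added numerical safeguard; it only rescales the matrix, which the first row normalization cancels.

```python
import numpy as np


def sinkhorn_knopp(logits, n_iters=20):
    """Map an arbitrary real matrix to an (approximately) doubly stochastic one."""
    # Step 0: elementwise exponentiation makes every entry strictly positive.
    M = np.exp(logits - logits.max())
    for _ in range(n_iters):
        M = M / M.sum(axis=1, keepdims=True)  # Step 1: each row sums to 1
        M = M / M.sum(axis=0, keepdims=True)  # Step 2: each column sums to 1
    return M


# Assumed input: a matrix with a negative entry and uneven row/column sums.
raw = np.array([[1.1, 0.2],
                [-0.1, 0.8]])
P = sinkhorn_knopp(raw)
print(P)
print(P.sum(axis=1), P.sum(axis=0))
```

Because the loop ends on a column normalization, the column sums are exact up to rounding and the row sums are approximate; the gap shrinks with each extra iteration.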
7.4 Numeric Example
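Start with a symmetric positive matrix, with $a, b > 0$:
$$M = \begin{pmatrix} a & b \\ b & a \end{pmatrix}$$
Row normalize (each row sums to $a + b$) to get:
$$\frac{1}{a + b} \begin{pmatrix} a & b \\ b & a \end{pmatrix}$$
Now check column sums: Column 1 sums to $\frac{a + b}{a + b} = 1$, and Column 2 sums to $\frac{b + a}{a + b} = 1$. We’re already doubly stochastic after one iteration. For matrices without this symmetry the row and column steps must alternate several times, but in practice a modest number of iterations is usually enough for the row and column sums to be close to 1.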
8. The Complete mHC Pipeline
Putting it all together, the full stabilization pipeline works as follows:
- Learn $H$ via standard gradient descent
- Exponentiate to ensure positivity
- Apply Sinkhorn normalization to project onto doubly stochastic matrices
- Use the projected $H$ in the network’s forward pass
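Mathematically:
$$H \leftarrow \text{Sinkhorn}\left(\exp\left(H - \eta \frac{\partial \mathcal{L}}{\partial H}\right)\right)$$
This single line captures the essence of the approach: take a gradient step in unconstrained space, then project back to the manifold of safe matrices.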
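Put together, one update of the pipeline might look like the sketch below. It follows the description in this section rather than any reference implementation: the names `sinkhorn` and `mhc_step`, the gradient values, the learning rate, and the iteration count are all assumptions.

```python
import numpy as np


def sinkhorn(logits, n_iters=20):
    """Exponentiate, then alternate row and column normalization."""
    M = np.exp(logits - logits.max())  # shift is for numerical safety only
    for _ in range(n_iters):
        M = M / M.sum(axis=1, keepdims=True)
        M = M / M.sum(axis=0, keepdims=True)
    return M


def mhc_step(H, grad, lr, n_iters=20):
    """One update: unconstrained gradient step, then Sinkhorn projection."""
    return sinkhorn(H - lr * grad, n_iters)


H = np.eye(2)
grad = np.array([[-1.0, -2.0],
                 [1.0, 2.0]])          # hypothetical gradient of the loss
H = mhc_step(H, grad, lr=0.1)

# Stack the projected matrix 100 times: the result neither explodes nor vanishes.
deep = np.linalg.matrix_power(H, 100)
deep_norm = float(np.linalg.norm(deep, 2))
print(H)
print(deep_norm)
```

The unconstrained step alone would have produced a matrix with a negative entry; after projection the skip path is a soft permutation whose repeated application stays bounded.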
9. Why This Works
The key insight is that we’ve separated two concerns.
- Expressiveness: $H$ can still learn; it’s not fixed like in ResNets.
- Stability: The doubly stochastic constraint ensures safe repeated multiplication.
Doubly stochastic matrices:
- Have bounded eigenvalues
- Preserve signal magnitude on average
- Are stable in both forward and backward passes
As a result, the network gains the flexibility of Hyper-Connections without the instability.
Skip paths become learned weighted averages. Forward signals are preserved, backward gradients are preserved, and repeated depth produces redistribution rather than scaling. The identity is no longer fixed—but its stability properties are preserved through geometric constraints on the parameter space.
10. Summary
The progression tells a clear story. Plain networks suffer from vanishing/exploding signals—the fundamental problem. ResNets fix this with identity skip paths, but sacrifice flexibility by hard-coding the skip behavior. Hyper-Connections learn the skip path, gaining flexibility but reintroducing instability. Finally, mHC constrains learned skip paths to doubly stochastic matrices, achieving both stability and flexibility.
The mathematical elegance lies in recognizing that by constraining $H$ to lie on the manifold of doubly stochastic matrices (the Birkhoff polytope), we get the best of both worlds: learned routing that remains numerically stable across arbitrary depth.
References
- He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep Residual Learning for Image Recognition. arXiv:1512.03385
- He, K., Zhang, X., Ren, S., & Sun, J. (2016). Identity Mappings in Deep Residual Networks. arXiv:1603.05027
- Zhu, D., Huang, H., Huang, Z., Zeng, Y., Mao, Y., Wu, B., Min, Q., & Zhou, X. (2024). Hyper-Connections. arXiv:2409.19606
- Xie, Z., Wei, Y., Cao, H., Zhao, C., Deng, C., Li, J., Dai, D., Gao, H., et al. (2025). mHC: Manifold-Constrained Hyper-Connections. arXiv:2512.24880
- Sinkhorn, R., & Knopp, P. (1967). Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics
- LeetArxiv. (2024). Sinkhorn-Knopp Algorithm. Substack