Alright, let’s start simple. Everyone who’s dabbled a bit in machine learning knows one thing: neural networks are nonlinear. That’s what makes them powerful — they can model weird, curvy, complex relationships, not just straight lines.
But the authors of the paper “Who Said Neural Networks Aren’t Linear?” (Nimrod Berman, Assaf Hallak, Assaf Shocher) asked a cheeky question: what if that’s not entirely true? What if nonlinearity is just… a matter of perspective?
Their idea is wild but elegant: maybe, under a different definition of what “adding” and “scaling” vectors means, even a neural network can become linear.
Why is this cool? Because if something’s linear, you can use all the nice, powerful tools from linear algebra — SVDs, projections, pseudoinverses, etc. — in places we usually can’t. It’s like discovering you can fix a car engine with Lego tools because someone secretly made the parts compatible.
1. For beginners – the simple version, with examples
Imagine you have a map — a 2D surface with points on it. You can transform points from one map to another using some function $f$. Normally, if $f(x + x') \neq f(x) + f(x')$, we say it’s nonlinear.
Now, what if we said: “Hey, what if ‘+’ on this map doesn’t mean the usual plus?”
What if we define our own special addition, let’s call it “$\oplus$”, and our own scaling “$\otimes$”? Then maybe, under these new rules, $f$ is linear!
That’s basically what this paper does.
The authors define two invertible neural networks, $g_x$ and $g_y$, and a linear operator $A$.
Then they build a function like this:
$$ f(x) = g_y^{-1}\bigl(A \, g_x(x)\bigr). $$
To us, that looks nonlinear — because of the $g_x$ and $g_y^{-1}$ parts.
But inside their custom-defined space, it’s actually linear.
Think of it like wearing “geometry glasses” that bend the world. When you put them on, what used to look curved now looks perfectly straight.
Why is this useful? Because if you can find such a transformation, you can use all your linear algebra tricks in a world that used to look messy and nonlinear.
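To make the “geometry glasses” idea concrete, here is a tiny one-dimensional sketch (my own toy example, not from the paper): let both networks be the logarithm and let the linear operator be multiplication by a number $a$. Then $f(x) = \exp(a \log x) = x^a$ looks nonlinear, yet it becomes linear the moment “addition” means ordinary multiplication.

```python
import math

a = 3.0                     # the "linear operator" A is just multiplication by a
g = math.log                # g_x = g_y = log (invertible for positive numbers)
g_inv = math.exp

def f(x):
    # f(x) = g_y^{-1}(A g_x(x)) = exp(a * log(x)) = x**a  -- looks nonlinear
    return g_inv(a * g(x))

def oplus(u, v):
    # custom "addition": g^{-1}(g(u) + g(v)), which here is just u * v
    return g_inv(g(u) + g(v))

x, xp = 2.0, 5.0
print(f(oplus(x, xp)))      # ≈ 1000.0
print(oplus(f(x), f(xp)))   # ≈ 1000.0 -- f(x ⊕ x') equals f(x) ⊕ f(x')
```

The Linearizer construction below does exactly this, with invertible networks in place of $\log$ and a matrix in place of $a$.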
2. For the more advanced – the math behind it
Let’s roll up our sleeves.
2.1 The “Linearizer” construction
The authors define a Linearizer as:
$$ f(x) = g_y^{-1}\bigl(A \, g_x(x)\bigr) $$
where:
- $g_x: X \to U$ is an invertible (bijective) neural network,
- $g_y: Y \to V$ is another one,
- $A: U \to V$ is a linear operator.
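To make these ingredients tangible, here is a minimal sketch (my own illustration, not the authors' implementation) of what they could look like in NumPy: $g_x$ and $g_y$ are single affine coupling layers, a standard invertible building block, and $A$ is a plain matrix. All names and sizes are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                                # toy dimension, split into two halves

class Coupling:
    """A single affine coupling layer: a simple invertible 'network' (illustration only)."""
    def __init__(self, rng):
        h = d // 2
        self.Ws = 0.1 * rng.standard_normal((h, h))  # weights of the scale branch
        self.Wt = 0.1 * rng.standard_normal((h, h))  # weights of the shift branch

    def forward(self, x):
        x1, x2 = x[:d // 2], x[d // 2:]
        s, t = np.tanh(self.Ws @ x1), self.Wt @ x1   # scale/shift depend only on x1
        return np.concatenate([x1, x2 * np.exp(s) + t])

    def inverse(self, u):
        u1, u2 = u[:d // 2], u[d // 2:]
        s, t = np.tanh(self.Ws @ u1), self.Wt @ u1
        return np.concatenate([u1, (u2 - t) * np.exp(-s)])

g_x, g_y = Coupling(rng), Coupling(rng)              # two invertible "networks"
A = rng.standard_normal((d, d))                      # an ordinary linear operator

def f(x):
    # f(x) = g_y^{-1}(A g_x(x)): nonlinear in standard coordinates
    return g_y.inverse(A @ g_x.forward(x))

x = rng.standard_normal(d)
print(np.allclose(g_x.inverse(g_x.forward(x)), x))   # invertibility sanity check: True
```

Real invertible networks stack many coupling layers (with permutations between them); one layer is enough here to make the structure visible.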
Now the idea is: we can define new vector space operations in $X$ and $Y$ like this:
$$ x \oplus_X x' = g_x^{-1}\bigl( g_x(x) + g_x(x') \bigr), \quad c \otimes_X x = g_x^{-1}\bigl( c \cdot g_x(x) \bigr), $$
and similarly in $Y$ using $g_y$.
With these definitions, we can show:
$$ f(x \oplus_X x') = g_y^{-1}\bigl(A \, (g_x(x) + g_x(x'))\bigr) = g_y^{-1}\bigl(A \, g_x(x) + A \, g_x(x')\bigr) = f(x) \oplus_Y f(x'). $$
Homogeneity works the same way: $f(c \otimes_X x) = g_y^{-1}\bigl(c \, A \, g_x(x)\bigr) = c \otimes_Y f(x)$. Voilà — $f$ is now linear in those new spaces.
So even though it looks nonlinear in the usual coordinates, it’s actually linear when you “change your definition of addition and scaling.”
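Here is a quick numerical sanity check of both linearity properties (a sketch of mine, using simple elementwise invertible maps in place of trained networks):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3

# Elementwise invertible maps standing in for the networks g_x and g_y.
g_x, g_x_inv = (lambda x: x ** 3),     (lambda u: np.cbrt(u))
g_y, g_y_inv = (lambda y: np.sinh(y)), (lambda v: np.arcsinh(v))

A = rng.standard_normal((d, d))
f = lambda x: g_y_inv(A @ g_x(x))

# The vector-space operations induced by g_x (on X) and g_y (on Y).
oplus_X  = lambda x, xp: g_x_inv(g_x(x) + g_x(xp))
otimes_X = lambda c, x:  g_x_inv(c * g_x(x))
oplus_Y  = lambda y, yp: g_y_inv(g_y(y) + g_y(yp))
otimes_Y = lambda c, y:  g_y_inv(c * g_y(y))

x, xp, c = rng.standard_normal(d), rng.standard_normal(d), 2.5
print(np.allclose(f(oplus_X(x, xp)), oplus_Y(f(x), f(xp))))   # additivity: True
print(np.allclose(f(otimes_X(c, x)), otimes_Y(c, f(x))))      # homogeneity: True
```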
2.2 Composition and properties
Some neat results from the paper:
- If you chain two Linearizers whose mappings match at the interface (the output network of the first is the input network of the second), the composition is itself a Linearizer: the two linear operators simply multiply.
- You can apply this to diffusion models, turning multi-step generation into a single-step process, because the per-step maps compose into one linear operator in the transformed space.
- You can enforce idempotence ($f(f(x)) = f(x)$), making the function a projection — useful for generative models and style transfer.
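Here is a small numerical illustration of the composition and idempotence points (my own sketch with toy elementwise maps; for idempotence I assume the same invertible map on both sides and an idempotent $A$):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4

# Toy elementwise invertible maps standing in for the networks.
g1, g1_inv = (lambda x: x ** 3),     (lambda u: np.cbrt(u))
g2, g2_inv = (lambda x: np.sinh(x)), (lambda u: np.arcsinh(u))
g3, g3_inv = (lambda x: x ** 3),     (lambda u: np.cbrt(u))

A1, A2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))

# Two Linearizers that share the middle mapping g2.
f1 = lambda x: g2_inv(A1 @ g1(x))
f2 = lambda y: g3_inv(A2 @ g2(y))

x = rng.standard_normal(d)

# Composition: f2(f1(x)) is again a Linearizer whose linear part is A2 @ A1.
print(np.allclose(f2(f1(x)), g3_inv(A2 @ A1 @ g1(x))))    # True

# Idempotence: one shared network g1 and an idempotent A (a projection matrix).
B = rng.standard_normal((d, 2))
P = B @ np.linalg.inv(B.T @ B) @ B.T                      # P @ P == P
f_proj = lambda x: g1_inv(P @ g1(x))
print(np.allclose(f_proj(f_proj(x)), f_proj(x)))          # True: f is a projection
```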
2.3 Step-by-step summary
- Choose invertible networks $g_x$ and $g_y$.
- Define a simple linear operator $A$ in between.
- Build $f(x) = g_y^{-1}(A g_x(x))$.
- Redefine your “addition” and “scaling” so that $f$ behaves linearly.
- Combine or constrain as needed for your task.
3. How this can be used – real-world ideas
Here’s where it gets fun.
Faster diffusion models
Diffusion models (like those behind Stable Diffusion) often take dozens or hundreds of steps to generate an image. Using Linearizers, you can collapse that into one step. Huge speed boost.

Generative projections / style transfer

If you make $f$ idempotent ($f(f(x)) = f(x)$), you get a projection — a stable transformation that doesn’t “overdo” its effect. That’s great for style transfer or image editing.

Better interpretability

When your function is linear (even in a weird space), you can apply all your linear analysis tools — singular value decomposition, rank checks, eigen-analysis. It’s a window into what the network is really doing.

Modular design

Linearizers can be stacked and recombined. You can build complex systems from simpler blocks that still have predictable, linear-like behavior in their transformed spaces.
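As a tiny illustration of the interpretability point above (again a sketch with made-up toy pieces, not the paper's code): once $f$ is linear in the transformed spaces, standard tools such as the SVD apply directly to $A$, and its singular directions can be pulled back through $g_y^{-1}$ to see which output directions the map emphasizes.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 3

g_y_inv = lambda v: np.cbrt(v)        # toy stand-in for the inverse network g_y^{-1}
A = rng.standard_normal((d, d))       # the linear core of some Linearizer

U, S, Vt = np.linalg.svd(A)           # ordinary SVD of the linear operator
print("singular values:", S)          # rank / conditioning of f in the linearized space

# Pull the dominant output singular direction back through g_y^{-1}
# to see which direction in Y-space the map amplifies most.
print("dominant output direction in Y:", g_y_inv(U[:, 0]))
```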
4. Conclusion – why this matters
So what’s the takeaway?
- This paper challenges one of ML’s sacred cows — that neural nets are inherently nonlinear.
- It shows that nonlinearity depends on your definition of space. Change that, and the math changes too.
- It’s not just a fun math trick: it could make models faster, more interpretable, and more modular.
- It bridges two worlds: the flexibility of neural networks and the clean, powerful math of linear algebra.
In short: this paper doesn’t just tweak how we train networks — it tweaks how we think about them.