Originally published on Tumblr.

Deep neural networks are like Goldilocks’ porridge: they shouldn’t work, but they do. According to classical theory, models with billions of parameters should demand far more training samples than anyone actually has in order to avoid overfitting, since the required sample count grows with model capacity. Yet they defy this expectation, often generalizing well even when parameters vastly outnumber training examples. This paradox is at the heart of the sample complexity and VC dimension crisis.
To understand this, we dive into the realm of Rademacher complexity and PAC (Probably Approximately Correct) bounds. These tools traditionally suggest that a model’s capacity, roughly its ability to fit random noise, should be tightly controlled to ensure generalization. The VC (Vapnik–Chervonenkis) dimension, one measure of that capacity, grows with the number of parameters, so the resulting bounds imply that deep networks, with their vast parameter spaces, should require an impractical number of samples to learn effectively. But they don’t. One leading explanation is implicit regularization, a phenomenon where the optimizer itself, stochastic gradient descent (SGD) and its noise, appears to guide the model toward simpler, more generalizable solutions among the many that fit the training data.
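To make implicit regularization concrete, here is a minimal sketch in a deliberately simplified setting: full-batch gradient descent on an overparameterized linear model, not SGD on a deep network. The sizes, seed, and the helper name `gradient_descent` are assumptions chosen for illustration. With more parameters than samples there are infinitely many weight vectors that fit the labels exactly, yet gradient descent started from zero lands on the one with the smallest norm, with no explicit penalty anywhere in the code.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 200                      # 20 samples, 200 parameters (overparameterized)
X = rng.normal(size=(n, p))
y = rng.normal(size=n)              # pure-noise labels: still fit exactly

def gradient_descent(X, y, w0, steps=500):
    """Full-batch gradient descent on the loss ||Xw - y||^2 / (2n)."""
    n = X.shape[0]
    # Step size 1/L, where L = sigma_max(X)^2 / n is the largest Hessian eigenvalue.
    lr = n / np.linalg.norm(X, 2) ** 2
    w = w0.copy()
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y) / n
    return w

# Started from zero, the iterates never leave the row space of X.
w_gd = gradient_descent(X, y, np.zeros(p))

# Closed-form minimum-norm interpolant, for comparison.
w_min_norm = np.linalg.pinv(X) @ y

# Started from a random point, GD also interpolates, but it keeps the part
# of the starting point that lies in the null space of X.
w_other = gradient_descent(X, y, rng.normal(size=p))

print("max train residual (zero init):  ", np.abs(X @ w_gd - y).max())
print("max train residual (random init):", np.abs(X @ w_other - y).max())
print("distance to min-norm solution:   ", np.linalg.norm(w_gd - w_min_norm))
print("norms (zero init, random init):  ", np.linalg.norm(w_gd), np.linalg.norm(w_other))
```

Both runs reach zero training error, so the training loss alone cannot tell them apart. The preference for the small-norm solution comes entirely from the optimizer and its starting point, which is the flavor of effect implicit regularization refers to. How far this carries over to SGD on deep nonlinear networks is a separate, harder question.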
This brings us to the double descent phenomenon. In the classical picture, test error first falls as model complexity increases, then rises again as the model starts to overfit, peaking near the interpolation threshold, the point where the model fits the training data perfectly. Classical wisdom predicts that test error stays high beyond this point. In practice, however, it often decreases again as models grow further, defying traditional learning theory. This second descent is tied to the fact that overparameterization creates a landscape of null space solutions: many different parameter configurations yield the same outputs on the training data, and training tends to settle on ones that, surprisingly, don’t overfit as expected.
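Double descent can be reproduced on a laptop. The sketch below is a toy setup, not an experiment from this post: random ReLU features followed by a minimum-norm least-squares fit, with the feature count (`widths`) swept past the number of training samples. The data model, noise level, widths, and function names are all assumptions chosen for illustration.

```python
import numpy as np

def relu_features(X, W):
    """Random-feature map: one fixed random hidden layer with ReLU."""
    return np.maximum(X @ W, 0.0)

def double_descent_curve(widths, n_train=40, n_test=500, d=10,
                         noise=0.5, trials=20, seed=0):
    """Average train/test MSE of least-squares fits on random ReLU features."""
    rng = np.random.default_rng(seed)
    train_err = {p: 0.0 for p in widths}
    test_err = {p: 0.0 for p in widths}
    for _ in range(trials):
        beta = rng.normal(size=d) / np.sqrt(d)          # hypothetical true signal
        X_tr = rng.normal(size=(n_train, d))
        X_te = rng.normal(size=(n_test, d))
        y_tr = X_tr @ beta + noise * rng.normal(size=n_train)  # noisy labels
        y_te = X_te @ beta                                     # clean targets
        for p in widths:
            W = rng.normal(size=(d, p)) / np.sqrt(d)
            Phi_tr = relu_features(X_tr, W)
            Phi_te = relu_features(X_te, W)
            # pinv gives the least-squares fit when p < n_train and the
            # minimum-norm interpolant when p >= n_train.
            w = np.linalg.pinv(Phi_tr) @ y_tr
            train_err[p] += np.mean((Phi_tr @ w - y_tr) ** 2) / trials
            test_err[p] += np.mean((Phi_te @ w - y_te) ** 2) / trials
    return train_err, test_err

widths = [5, 20, 40, 80, 400]       # 40 equals n_train: the interpolation threshold
train_err, test_err = double_descent_curve(widths)
for p in widths:
    print(f"width={p:4d}  train MSE={train_err[p]:.3e}  test MSE={test_err[p]:.3f}")
```

In this setup the test error should be worst around width 40, where the model has just enough parameters to interpolate the noisy labels and exactly one way to do it, and it should come back down at width 400, where the minimum-norm fit can pick a gentler interpolant out of many. Exact numbers depend on the seed and the noise level.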
The recent AI funding bubble, with its overpromised capabilities, highlights the tension between theory and practice. While investors pour billions into AI, expecting miraculous results, the underlying mechanics remain elusive. Overparameterization, while effective, appears to conflict with Occam’s razor and the minimum description length principle, both of which favor simpler models, at least when simplicity is measured by parameter count. Yet deep networks thrive in complexity, finding elegant solutions in a sea of possibilities.
This contradiction is not just academic. It challenges our understanding of learning itself. Episodes like the recent collapse of a high-profile AI project (which promised revolutionary results but failed to deliver) serve as cautionary tales. They remind us that while deep networks can generalize, they do so in ways we don’t fully comprehend.
In conclusion, the Goldilocks enigma of deep learning, where hugely overparameterized models somehow end up neither too simple nor too complex, forces us to rethink foundational principles. It suggests that the noise of SGD and the peculiarities of overparameterization might hold the key to unlocking the mysteries of generalization. And while the journey is fraught with theoretical contradictions, it is precisely these challenges that drive the field forward.