The Illusion of Linearity in High-Dimensional Embeddings
TL;DR: High-dimensional embeddings fail to form linear subspaces for semantic concepts, revealing the limitations of probing classifiers.
High-dimensional embeddings are not the panacea for semantic representation that many claim. Probing classifiers make it tempting to conclude that semantic concepts align neatly into linear subspaces, but the reality is far more complex and less flattering.
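The linear representation hypothesis posits that semantic concepts can be captured as linear subspaces within high-dimensional embeddings. However, high probe accuracy alone does not establish this, and the assumption crumbles under scrutiny. The rank and spectral properties of the weight matrices involved point to the problem: when the embedding dimension is comparable to or larger than the number of labeled examples, a linear read-out has enough capacity to fit almost any labeling of the training set, including a random one. Linear read-outs can therefore achieve spurious accuracy, not through genuine semantic understanding, but by exploiting dataset artifacts. This is a critical flaw, as it suggests that what we perceive as semantic alignment may reflect probe capacity and quirks of the data rather than structure in the representation.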
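The capacity argument can be checked numerically. The sketch below is illustrative only: it assumes i.i.d. Gaussian vectors as a stand-in for embeddings and coin-flip labels, and `probe_accuracy` is a hypothetical helper, not part of any probing library. A least-squares probe with more dimensions than training examples fits the random labels, while held-out accuracy stays near chance.

```python
import numpy as np

def probe_accuracy(n_train=100, n_test=1000, dim=512, seed=0):
    """Fit a least-squares linear probe to random labels on random features."""
    rng = np.random.default_rng(seed)
    # Assumption: structureless Gaussian vectors stand in for real embeddings.
    X_train = rng.standard_normal((n_train, dim))
    X_test = rng.standard_normal((n_test, dim))
    # Labels are coin flips, so there is nothing semantic to learn.
    y_train = rng.choice([-1.0, 1.0], size=n_train)
    y_test = rng.choice([-1.0, 1.0], size=n_test)
    # Minimum-norm least-squares solution; with dim > n_train it
    # interpolates the training labels.
    w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    train_acc = float(np.mean(np.sign(X_train @ w) == y_train))
    test_acc = float(np.mean(np.sign(X_test @ w) == y_test))
    return train_acc, test_acc

train_acc, test_acc = probe_accuracy()
print(f"train accuracy: {train_acc:.2f}, held-out accuracy: {test_acc:.2f}")
```

The gap between the two numbers is the point: training accuracy says little about the representation unless it is compared against a baseline such as random labels or held-out data.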
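Moreover, the curse of dimensionality distorts cosine similarity in these embedding spaces. As dimensionality increases, independently drawn random points tend to become nearly equidistant: for random directions in d dimensions, cosine similarities cluster around zero with a spread on the order of 1/sqrt(d). Learned embeddings are not random, but to the extent that their variance is spread across many directions, the same concentration compresses the usable range of cosine similarity and makes it a less reliable measure of semantic closeness. This phenomenon weakens a core assumption behind using high-dimensional spaces for semantic tasks.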
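A small simulation illustrates the concentration effect. It assumes random Gaussian directions rather than real embeddings, and `cosine_spread` is a hypothetical helper introduced here for illustration. It measures the standard deviation of pairwise cosine similarities as the dimension grows.

```python
import numpy as np

def cosine_spread(dim, n_points=200, seed=0):
    """Standard deviation of pairwise cosine similarity for random directions."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_points, dim))
    # Normalize rows so that dot products are cosine similarities.
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    sims = X @ X.T
    # Keep each unordered pair once and drop the diagonal of self-similarities.
    upper = np.triu_indices(n_points, k=1)
    return float(np.std(sims[upper]))

for dim in (2, 32, 512, 4096):
    print(f"dim={dim:5d}  std of cosine similarity={cosine_spread(dim):.4f}")
```

For random directions the spread shrinks roughly like 1/sqrt(dim). Real embeddings usually occupy a much lower-dimensional region of the space, so the effect on them is typically milder than this worst case.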
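To further complicate matters, dimensionality reduction, often employed to manage these high-dimensional spaces, can destroy semantic relationships. The Johnson-Lindenstrauss lemma marks where the damage begins: a random projection of n points preserves all pairwise distances to within a factor of 1 ± ε as long as the target dimension is on the order of log(n)/ε², and the guarantee does not extend below that. By projecting into fewer dimensions than this, or by using a method that offers no such guarantee, we risk losing the very nuances that are crucial for maintaining semantic integrity.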
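How much a projection distorts pairwise distances depends heavily on the target dimension. The sketch below is a minimal illustration, assuming Gaussian random data and a Gaussian random projection; `max_distance_distortion` is a hypothetical helper, and the sizes are arbitrary. It reports the worst relative change in any pairwise distance after projecting from 1000 dimensions down to k.

```python
import numpy as np

def max_distance_distortion(k, n_points=50, dim=1000, seed=0):
    """Worst relative change in pairwise distance after random projection to k dims."""
    rng = np.random.default_rng(seed)
    # Assumption: Gaussian points stand in for real embeddings.
    X = rng.standard_normal((n_points, dim))
    # Gaussian projection, scaled so squared distances are preserved in expectation.
    R = rng.standard_normal((dim, k)) / np.sqrt(k)
    Y = X @ R
    upper = np.triu_indices(n_points, k=1)

    def pairwise(Z):
        # Euclidean distance for every unordered pair of rows.
        diff = Z[:, None, :] - Z[None, :, :]
        return np.linalg.norm(diff, axis=-1)[upper]

    ratio = pairwise(Y) / pairwise(X)
    return float(np.max(np.abs(ratio - 1.0)))

for k in (4, 32, 512):
    print(f"k={k:4d}  max relative distortion={max_distance_distortion(k):.3f}")
```

Distortion falls as k grows, which is the trade-off to check before reducing dimension: the question is whether k is large enough for the number of points and the tolerance required.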
- Key Point One: High-dimensional embeddings fail to form linear subspaces for semantic concepts.
- Key Point Two: Linear read-outs achieve spurious accuracy through dataset artifacts, not genuine semantic understanding.
- Key Point Three: Dimensionality reduction below the target dimension required by the Johnson-Lindenstrauss lemma can destroy semantic relationships.
In light of these findings, we must question the reliance on high-dimensional embeddings for semantic tasks. Are we truly capturing meaning, or are we merely fitting noise? It’s time to rethink our approach and prioritize genuine semantic understanding over superficial statistical tricks.