There is an important subtlety when one tries to interpret what a Neural Network actually does – each neuron, it seems, gets activated on a different set of inputs which corresponds to very different sets of features. This is most prominent in computer vision settings, where a selected neuron reacts to completely unrelated parts of inputs, say of cats and of cars.
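This is easy to reproduce in a toy setting. A minimal numpy sketch (the vectors and the threshold are made up for illustration): a single neuron whose weight vector is the sum of two unrelated “feature” directions fires above the threshold for either feature alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two unrelated "feature" directions in input space (hypothetical
# stand-ins for a "cat-like" and a "car-like" input).
cat = rng.normal(size=64)
cat /= np.linalg.norm(cat)
car = rng.normal(size=64)
car -= (car @ cat) * cat            # make the two directions orthogonal
car /= np.linalg.norm(car)

# One neuron whose weights encode *both* directions at once.
w = cat + car
threshold = 0.5

for name, x in (("cat input", cat), ("car input", car)):
    act = max(0.0, w @ x)           # ReLU-style activation
    print(f"{name}: activation = {act:.2f}, fires = {act > threshold}")
```

The same weight vector clears the threshold for both unrelated inputs, which is exactly what probing a single neuron shows.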
Let’s see what is going on there. The problem comes from the fact that the same “early” layers of neurons have been given completely unrelated sets of inputs. In other words, there is no specialization yet. The “wiring” of a biological visual system, on the other hand, is deliberate and specialized, which is the opposite of a “fully connected” network.
Back-propagation on inconsistent inputs (noisy “examples” or verbalized “bullshit”) necessarily ruins (distorts) the weights, so every forward pass that follows a backward pass from a noisy example is more “blurred”.
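A minimal sketch of this effect (toy logistic regression in numpy; the whole setup is hypothetical): the same backward passes that align the weights with the “true” feature direction actively destroy that alignment when the labels are inconsistent.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 32, 200
w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)          # the "true" feature direction

X = rng.normal(size=(n, d))
y_clean = np.sign(X @ w_star)             # consistent examples
y_noise = -y_clean                        # the same inputs, mislabeled

def train(y, lr=0.3, steps=50):
    w = 0.1 * w_star                      # partially-trained weights
    for _ in range(steps):
        # one backward pass of logistic regression, labels in {-1, +1}
        p = 1.0 / (1.0 + np.exp(-y * (X @ w)))
        w = w + lr * ((1.0 - p) * y) @ X / n
    return w

def align(w):                             # cosine with the true direction
    return w @ w_star / np.linalg.norm(w)

print("clean examples:", round(align(train(y_clean)), 2))  # close to +1
print("noisy examples:", round(align(train(y_noise)), 2))  # close to -1
```

(The exact numbers vary with the seed; the sign flip does not.)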
The same neurons respond to completely different sets of inputs (assuming the activation threshold has been exceeded), so the weights learned so far are being “reused” to encode more than one “feature”.
The weights are being constantly distorted by each backward pass on “other features”, and “rebalanced” by each backward pass on “the same features”. The analogy with blurring and sharpening of an image is not accidental.
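Here is a minimal numpy sketch of that blur/sharpen cycle (the two “features”, their correlation, and the targets are all mine, chosen only to make the effect visible): two correlated features share one weight vector; every backward pass on the second feature blurs the response to the first, and every pass on the first sharpens it back.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 16
f1 = rng.normal(size=d); f1 /= np.linalg.norm(f1)
g  = rng.normal(size=d); g -= (g @ f1) * f1; g /= np.linalg.norm(g)
f2 = 0.6 * f1 + 0.8 * g        # a second feature, correlated with the first

w = np.zeros(d)                # one shared weight vector

def backward_pass(w, f, target, lr=0.5):
    # one gradient step on the squared error (w @ f - target)**2 / 2
    return w - lr * (w @ f - target) * f

# alternate "examples" of the two features sharing the same weights
for i in range(10):
    f, t, name = (f1, +1.0, "f1") if i % 2 == 0 else (f2, -1.0, "f2")
    w = backward_pass(w, f, t)
    print(f"pass on {name}: response to f1 = {w @ f1:+.2f}")
```

The response to the first feature drops after every pass on the second one and recovers after every pass on the first; the oscillation slowly damps out because both constraints are jointly satisfiable, which is the “stabilization” discussed below.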
“Reuse” of neurons and of the weights learned so far (the internal state) for partially encoding different “features” is just an observed fact. This implies that all the naive interpretations, especially those based on over-simplified but seemingly sophisticated-enough meme models (hyper-spheres, and what not), are just disconnected from What Is.
The difference is in how “correct” or “noisy” the next training example is and, most importantly, how far the training process has already progressed. In a stable environment the level of noise, by definition, has to be tiny. Otherwise what we would learn is the noise itself (no actual learning would be possible at all).
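The same toy logistic-regression setup as above (hypothetical numbers throughout) makes this quantitative: a small fraction of flipped labels barely matters, but as the noise fraction approaches one half, what is “learned” is essentially the noise.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 32, 500
w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)          # the "true" feature direction
X = rng.normal(size=(n, d))
y_true = np.sign(X @ w_star)

def fit(y, lr=0.5, steps=300):
    w = np.zeros(d)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-y * (X @ w)))
        w = w + lr * ((1.0 - p) * y) @ X / n
    return w

for noise in (0.0, 0.1, 0.5):
    flip = rng.random(n) < noise          # flip this fraction of labels
    y = np.where(flip, -y_true, y_true)
    w = fit(y)
    print(f"noise {noise:.1f}: alignment with truth = "
          f"{w @ w_star / np.linalg.norm(w):+.2f}")
```

At zero noise the alignment is near one; at fifty percent noise it is essentially random, because there is nothing left to learn.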
Consequently, a “mature, well-developed brain” discards the noise very efficiently. (Contradictions do not exist; check your premises.)
This is how a “multiple features with reuse” encoding stabilizes, and why it has been selected for by evolution. We have evolved in stable environments with very consistent sensory inputs, which correspond to the physical constraints of the shared environment and its recurring processes (days, weather, seasons).
Notice that, unlike spoken human languages, the environment does not lie to you and does not talk bullshit, so the inputs are indeed very consistent.
This is how we (biological systems) almost never suffer from over-fitting, and this, it seems, is an evolved “optimization” against being “too specific” or “too narrow”.
There is also no other way for a fully connected network to behave. The brain prunes connections (most of them), while such a network, at best, only learns some near-zero weights.
Pruning implies there is no way to “re-learn it back” using the same brain structures (although it has been shown that different parts of the brain can take over), which is fine for a stable environment. Once the brain has matured it is basically “fixed” – the structures that remain after pruning are presumably the best match to the environment.
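A toy numpy sketch of the difference (my framing, not a claim about how brains implement it): a “learned zero” is still a live, trainable connection, while pruning is a hard structural mask that no later gradient can revive.

```python
import numpy as np

rng = np.random.default_rng(4)
w = rng.normal(size=8)

# "Learning a zero": the weight is driven toward 0 but stays trainable.
soft = w.copy()
soft[3] *= 0.001                  # tiny, but still a live connection

# Pruning: a hard structural mask; the connection no longer exists.
mask = (np.abs(w) > np.quantile(np.abs(w), 0.5)).astype(float)
hard = w * mask

# The environment changes: a gradient now wants every weight to grow.
grad = np.ones_like(w)
soft = soft + 0.1 * grad          # the near-zero weight comes right back
hard = hard + 0.1 * grad * mask   # pruned weights get no gradient at all

print("soft (re-learnable):", np.round(soft, 2))
print("hard (pruned)      :", np.round(hard, 2))
```

In a stable environment the masked-out connections are never missed, which is presumably why evolution could afford such an irreversible step.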
The conclusion is that almost all explanations, especially those based on imaginary topological “objects”, are utterly wrong and serve no purpose but to display one’s “intelligence” and being “up to date”. The quest for the Truth (which is always out there, you know) is to find the one “less wrong” explanation, the one with the “shortest distance to What Is” (a local optimum for a limited human intellect).
So, the weights that have emerged (been learned) so far are being reused to partially represent multiple “features”. This simple fact invalidates all the “papers” by Chuddies.
There is another hint: it is extremely important that the social environment in which a child spends its first years and is taught to speak be “stable” and “consistent”, yet rich in sensory input. It has been observed innumerable times that “village” children (from normal families) are in general “smarter” and more clever.
In modern terms this is called “the quality of the data set”, which, unsurprisingly, is by far the most important metric for any Machine Learning project, and the one on which the most money is spent (not unlike how smart and rich people spend on their kids).
Being able to zoom in and out through commonalities and universals is the key.