Overview

A valid (less wrong) intuitive metaphor is that we “learn” a “surface” which matches (covers, up to the last wrinkle) the whole actual Himalayas.

This notion generalizes to any number of dimensions (differences, distances and derivatives do not care about the Mind’s abstract bullshit).

The Himalayas (the Truth) have to be “out there”.

A good generalization is bucket sort, which can be thought of as a specific instance of a classification problem; simple as it is, it clearly captures the essence: a function outputs the correct label for a given input. The handwritten digit recognition task is the canonical example.
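A minimal sketch in Python (the labeling rule and the data are made up for illustration): classification is just a function from an input to a label, and bucket sort applies exactly such a labeling function to route each item.

    # A classifier is just a function: input -> label.
    # Bucket sort is the degenerate case where the "label" is the bucket index.

    def label(x: int) -> int:
        """Classify a value in 0..99 into one of ten buckets (its tens digit)."""
        return x // 10

    def bucket_sort(xs):
        buckets = [[] for _ in range(10)]
        for x in xs:
            buckets[label(x)].append(x)  # "classify", then route
        return [x for b in buckets for x in sorted(b)]

    print(bucket_sort([42, 7, 99, 13, 40]))  # [7, 13, 40, 42, 99]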

Pattern-recognition is another valid generalization, in which the inputs are matched against a “learned” common pattern. Image classification tasks are in this category.

The most successful and straightforward applications of Deep Learning are, indeed, some classification problems in a supervised learning setting.

The underlying fundamental principle, however, is that the “patterns” have to be real and stable. Digits form a small, stable set, and every human has a face with distinct common patterns (eyes-nose-ears-mouth) which even birds (crows) are able to easily “track”.

Another (wrong) formulation is a “search for the ‘best’ function (a mapping) in the Set of all possible functions”.

The first thing to realize is that such a function usually has tens or even hundreds of “arguments” (inputs). The abstract technique of “currying” is especially handy there.
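A minimal illustration in Python (the function and its “features” are hypothetical): currying turns a function of many arguments into a chain of one-argument functions, so arguments can be fixed one at a time.

    from functools import partial

    # An uncurried function of three "features".
    def price(area: float, rooms: int, age: float) -> float:
        return 1000.0 * area + 5000.0 * rooms - 200.0 * age

    # The curried form: fix one argument at a time, getting back
    # a new function of the remaining arguments.
    def curried_price(area):
        return lambda rooms: lambda age: price(area, rooms, age)

    f = curried_price(50.0)(3)            # a function of `age` only
    print(f(10.0))                        # 63000.0
    print(partial(price, 50.0, 3)(10.0))  # the same, via partial application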

Think of a very large table (“features” as columns, actual examples as rows), and we have to “learn” all the complex relations which make this table valid, assuming it is valid to begin with.

It is necessary to reiterate again and again - TRUTH MUST BE OUT THERE.

There is an implication:

When the underlying relations among inputs change after training is completed, the model will “infer” or “predict” bullshit. Immutability of the relationships among inputs is required.

This principle is, of course, related to the principle of immutability (and persistence) of data bindings in functional programming (and in math), which is necessary for a computation to always be correct.

This, in turn, implies that applying machine learning algorithms, and deep learning in particular, to a partially observable, stochastic environment will always yield bullshit - “approximations” which are guaranteed to be wrong at some point (when the change in the underlying environment becomes significant).

Ignorance of this principle is what has ruined so many algorithmic traders, and some day it will ruin them all.

The fundamental problem

In logic a “process” that infers a general rule from a set of specific examples is known as inductive reasoning.

The concept of inductive reasoning is seemingly related to machine learning because a machine learning algorithm induces (or extracts) a general rule (a function) from a set of specific examples (the dataset).

This is fundamentally wrong in principle. Merely collecting observations, and even forming some “inner representation” based on them, is not enough to establish any actual causality, which is the only basis of valid reasoning.

Even seemingly abstract mathematical and algebraic rules can easily be traced back to the underlying reality, from which they have been captured and properly generalized.

Addition, and the notions of a Semigroup and a Monoid, are generalized from reality; so is the notion of a Set, which is a valid generalization of how the mind of an external observer categorizes its “sensory inputs”, and thus of “how things really are” in the Universe.
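A minimal sketch (borrowing Haskell’s mappend/mempty names for flavor): a Monoid is just an associative operation with an identity element, and integer addition, generalized straight from the everyday act of putting piles of things together, is the canonical instance.

    # A monoid: a set, an associative binary operation, an identity element.
    def mappend(a: int, b: int) -> int:  # the operation (here: addition)
        return a + b

    mempty = 0  # the identity element

    # The monoid laws, checked on sample values:
    assert mappend(mappend(1, 2), 3) == mappend(1, mappend(2, 3))  # associativity
    assert mappend(mempty, 42) == 42 == mappend(42, mempty)        # identity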

The Upanishadic seers knew how to trace everything in the mind back to “reality” and why this is absolutely necessary to distinguish what is “real”. This, by the way, is the only way.

“Generalizing”

The main theme in the applications of Machine Learning and Deep Learning is to “generalize to new data” (which was not in the training set).

Of course, we want to “recognize” all human faces, after “seeing” just a few. This is the whole point - we want to be able to correctly deal with the inputs we have never seen before.

Nature and Evolution came up with “training from experience” of evolved, pre-defined neural structures within the brain. (The neural tissues of specialized brain centers are not arbitrary – they have evolved to “match” particular kinds of “inputs”.)

In short, everything is shaped by the constraints of the environment in which it has evolved (by the decentralized macro process of trial and error).

Feedback loops

The universal “generalized pattern” (from biological systems) for staying up-to-date by actually learning from each new experience (“example”) is to have a feedback loop, in which a “successful experience” is used as a valid source for learning - it becomes another “example”.

Lots of complex neural structures, and even the simplest muscle tissue, get physically altered (updated) through complex feedback mechanisms.

“Neuro-myelination” is one such actual biological mechanism - the basis of all learning from experience within the brain.

A feedback loop is a properly generalized universal pattern: it appears in algorithms (the “accumulator pattern” of recursion), where each loop carries some state within it, and it is the basic building block of digital circuits.
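A minimal sketch of the accumulator pattern (a running mean chosen arbitrarily as the carried state): each pass of the loop feeds the current estimate back in, and each new “example” updates it.

    def running_mean(xs):
        acc, n = 0.0, 0            # the state carried through the loop
        for x in xs:
            n += 1
            acc += (x - acc) / n   # feed the "error" back into the estimate
        return acc

    print(running_mean([1.0, 2.0, 3.0, 4.0]))  # 2.5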

The Universal principles

There are some universal principles behind Deep Learning.

  • a breadth-first search process which terminates on a good-enough approximation
  • the same universal notion that has been captured by Newton’s method (see the sketch after this list)
  • biology “uses” something very similar with its “biochemical” feedback loops
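A sketch of Newton’s method (the tolerance, iteration cap and starting point are arbitrary): an iterative search which uses the derivative as feedback and terminates on a good-enough approximation.

    def newton(f, df, x0, tol=1e-10, max_iter=50):
        x = x0
        for _ in range(max_iter):
            step = f(x) / df(x)
            x -= step             # follow the local slope toward a root
            if abs(step) < tol:   # "good enough" -- stop searching
                break
        return x

    # Approximate sqrt(2) as the positive root of x^2 - 2.
    print(newton(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0))  # ~1.4142135623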

General principles

  • based on implicit feedback loops from the environment
  • implicit pruning, just like the brain (zero weights)
  • conceptually, a “curve/surface” fitting
  • least-squares errors (reducing a “distance”; see the sketch after this list)
  • convergence due to an “error minimization”
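A minimal curve-fitting sketch (the data, learning rate and iteration count are made up): fit y = w*x + b to points by repeatedly reducing the mean squared error, converging by error minimization.

    xs = [0.0, 1.0, 2.0, 3.0]
    ys = [1.0, 3.0, 5.0, 7.0]  # generated by y = 2x + 1

    w, b, lr = 0.0, 0.0, 0.05
    for _ in range(2000):
        # gradients of the mean squared error with respect to w and b
        dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
        db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
        w -= lr * dw  # step "downhill", reducing the "distance"
        b -= lr * db

    print(round(w, 3), round(b, 3))  # ~2.0 ~1.0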

Just like pure math and FP

  • each “neuron” is a simple mathematical function (expression)
  • the whole “network” is a function composition (deep nesting; see the sketch after this list)
  • pure math and FP under the hood - well-defined semantics
  • properly abstracted, this is “just a bunch of arrows” (like FP)
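A minimal sketch (the weights are chosen arbitrarily): each “neuron” is a pure function from inputs to one output, and the “network” is nothing but their composition.

    import math

    # A "neuron": a pure function from inputs to one output.
    def neuron(weights, bias):
        return lambda xs: math.tanh(sum(w * x for w, x in zip(weights, xs)) + bias)

    # Two "hidden" neurons and an output neuron...
    h1 = neuron([0.5, -0.2], 0.1)
    h2 = neuron([0.3, 0.8], -0.4)
    out = neuron([1.0, -1.0], 0.0)

    # ...composed into a "network" (deep nesting of pure functions).
    def network(xs):
        return out([h1(xs), h2(xs)])

    print(network([1.0, 2.0]))  # ~ -0.61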

Supervised

  • Learning a representation by trial-and-error (literally)
  • Supervised learning from the given “right answers”
  • Having an implicit feedback loop (“the right answers”)
  • No explicit programming (backpropagation)
  • Representation can be examined and used (updated)
  • Optimizations of the structure of a network (“topology”)

Reinforcement

  • Reinforcement is learning by doing (games, sports)
  • Representation gets refined with each “experience”
  • Implicit feedback loop by evaluating performance
  • Recurrent for linear structures (induction)
  • Related to the Bayesian “beliefs”.

Representations

  • We “learn” updated terms of pure mathematical expressions
  • Mathematical expressions are represented as an AST
  • an AST can be manipulated (a well-understood problem)
  • Symbolic differentiation since early LISPs (see the sketch after this list)
  • Computing derivatives packaged as libraries (autograds)
  • Mutable structures (no persistence).
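A minimal sketch in the spirit of early LISPs (the tuple encoding of the AST is ad hoc): symbolic differentiation is a straightforward recursive walk over the expression tree.

    # Expressions: a number, a variable name, or (op, left, right).
    def diff(e, x):
        if isinstance(e, (int, float)):
            return 0
        if isinstance(e, str):  # a variable
            return 1 if e == x else 0
        op, a, b = e
        if op == '+':           # sum rule
            return ('+', diff(a, x), diff(b, x))
        if op == '*':           # product rule
            return ('+', ('*', diff(a, x), b), ('*', a, diff(b, x)))
        raise ValueError(op)

    # d/dx of 2*x + 1
    print(diff(('+', ('*', 2, 'x'), 1), 'x'))
    # ('+', ('+', ('*', 0, 'x'), ('*', 2, 1)), 0)  -- simplifies to 2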

Architecture

  • The “architecture” is “fully connected layers”
  • Theoretically it is the right thing to do.
  • Some weights will become zeroes (“pruning”; see the sketch after this list)
  • Pruning is a fundamental notion (children’s brains do it)
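A minimal sketch (weights and threshold invented for illustration): a fully connected layer is a matrix of weights, and “pruning” zeroes out the small ones, leaving a sparser network.

    # One fully connected layer: every input connects to every output.
    def dense(W, b, xs):
        return [sum(w * x for w, x in zip(row, xs)) + bi
                for row, bi in zip(W, b)]

    # Pruning: weights close to zero are dropped entirely.
    def prune(W, threshold=0.1):
        return [[w if abs(w) > threshold else 0.0 for w in row] for row in W]

    W = [[0.9, 0.03, -0.5],
         [0.02, 0.7, 0.08]]
    print(dense(prune(W), [0.0, 0.0], [1.0, 1.0, 1.0]))  # [0.4, 0.7]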

Deep learning enables data-driven decisions by identifying and extracting patterns from large datasets that accurately map from sets of complex inputs to good decision outcomes.

This implies that the actual, non-imaginary signal (the set of complex inputs) must be “out there”.

A data-set is a table (or a set of relations), each row is a single “example”, each column is a distinct “feature” or a relation (like a ratio) between features.

The principle is that the data (in a table) must be consistent.

An algorithm defines a process (as a declarative description or a “template”).

A function is a deterministic mapping (it can be thought of as a “table”) which corresponds to some relation between the values in its domain and its range.

\[x \mapsto 2x + 1\]

is a mapping (\(\forall x\)) and a particular relation (multiply by 2, then add 1).
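A one-line illustration in Python: for a finite domain, the same deterministic mapping literally is a table.

    f = lambda x: 2 * x + 1              # the rule
    table = {x: f(x) for x in range(5)}  # the same mapping, as a table
    print(table)                         # {0: 1, 1: 3, 2: 5, 3: 7, 4: 9}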

Most of the time we want to approximate (to learn a representation of) functions of multiple (many) variables – big tables with many related features.

This is the whole point - what is difficult to be explicitly programmed can be approximated (“learned” as a “structure”) by a generalized set of algorithms.

The goal is a particular set of “parameters” and “weights” in a “learned” representation of a “network” of a particular shape (“architecture”).

The shape of the network has the same number of inputs, and an output, as the function (the set of mappings or “arrows”) we want to approximate (learn a representation of).

It is worth noting this abstract correspondence - a “set of arrows” being approximated by a “network of arrows” (a directed graph).

As the inputs “flow through the network” and the outputs come out of the “black box” – always the same for the same inputs – this arrangement, just like a function, can be thought of as a “mechanical machine”.

The fundamental difference is that we do not define the “function’s body” expression ahead of time, but gradually improve (“learn”) its representation inside a “black box” by feeding the data and the “right answers” to a learning algorithm.

There are lots of analogies to this kind of process, ranging from a bunch of people doing something by trial-and-error (trying to come up with a robust and efficient aircraft engine) to the whole process of evolution, which, in a sense, “learns” stable molecular arrangements (of enzymes, say, and everything else).

This is the right understanding. Notice that the “inputs” (the molecules and ions) have to be “stable”, as do the physical (electro-chemical) properties of their combinations (“relations among them”).

This is the universal principle – the “building blocks” have to be stable and actually “out there” (non-imaginary).

The “mechanics”

Each “neuron” is a function (a mapping from a set of inputs to an output) – a mathematical expression. It is represented inside a computer as an abstract syntax tree.

An expression is a syntactic closure, which captures all its bound variables and constants (if any) as the “leaves” of an AST.

The process of back-propagation (defined by a particular algorithm) traverses the whole network and modifies (updates) the values of these bound variables by computing the partial derivatives for each variable recursively.

All the “operators” have to be differentiable (so a partial derivative can be taken).

So it is a learning algorithm that updates the “terms” of a vastly complex, deeply nested pure mathematical expression.

And, viewed as a graph, it is a “bunch of arrows between simple functions (“neurons”), which are lexical closures that capture their bound variables”. There are no “free variables” in this context.

So what we actually “learn” is this set of “weights” inside these “closures”, which represent the terms of pure mathematical expressions – mappings from inputs to an output.
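A micrograd-style sketch (a hypothetical, minimal Value class, not any real library): each node of the expression tree stores a value and a gradient, and backward() walks the tree, accumulating partial derivatives into the captured “terms”.

    class Value:
        def __init__(self, data, parents=()):
            self.data, self.grad = data, 0.0
            self._parents, self._backward = parents, lambda: None

        def __add__(self, other):
            out = Value(self.data + other.data, (self, other))
            def backward():                         # d(a+b)/da = d(a+b)/db = 1
                self.grad += out.grad
                other.grad += out.grad
            out._backward = backward
            return out

        def __mul__(self, other):
            out = Value(self.data * other.data, (self, other))
            def backward():
                self.grad += other.data * out.grad  # d(ab)/da = b
                other.grad += self.data * out.grad  # d(ab)/db = a
            out._backward = backward
            return out

        def backward(self):
            order, seen = [], set()                 # topological order of the tree
            def visit(v):
                if v not in seen:
                    seen.add(v)
                    for p in v._parents:
                        visit(p)
                    order.append(v)
            visit(self)
            self.grad = 1.0
            for v in reversed(order):               # propagate gradients leaf-ward
                v._backward()

    w, x, b = Value(2.0), Value(3.0), Value(1.0)
    y = w * x + b          # y = 2*3 + 1 = 7
    y.backward()
    print(w.grad, x.grad)  # 3.0 2.0  (dy/dw = x, dy/dx = w)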

Differentiable

Now, what is differentiable? Two or more “arrows come together” (this is a universal shape or “pattern” of multiple causality, of a weighted sum, etc.), and the “contribution” (a “weight”, or the “steepness of a slope”) of each “arrow” can be determined.
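For a weighted sum, for instance, the “contribution” of each incoming arrow is exactly its partial derivative:

\[y = w_1 x_1 + w_2 x_2, \qquad \frac{\partial y}{\partial w_1} = x_1, \qquad \frac{\partial y}{\partial w_2} = x_2\]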

Programming

The ability to zoom through the layers (of a hierarchy) of abstractions (from general to specific and bottom-up) and to switch between an abstraction and its representation is what makes a good programmer. It is not “knowing” some PHP or JavaScript.