These are just assorted notes for now, which shall become something ready to be formalized.
Non-bullshit
The objective is to train a NN which captures subtle recurrent patterns among many well-chosen (and well-defined) features.
The proper set of features that, in turn, captures the most relevant aspects of reality is what determines the distinction between a modest success and a total failure of this ML approach.
All the features should be actual “measurements” of something real, like “Open Interest” or the “Long/Short ratio” and other obvious measurements like “Volume”.
These measurements have to be taken “by the same instruments” and have to be available for the “unseen before” inputs (for generalizing, or “inference”).
The crucial measurements include the day of the week (to capture recurring patterns of the US market), the month (to potentially capture seasonal changes), and even the hour (to potentially capture intra-day fluctuations in the US time-zone). Just daily candles would not be enough.
The more features we have, the more noise and nonsense we “learn”. The hope is that “reality” will reduce the weights of bullshit, as it always happens.
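A minimal sketch of extracting these calendar features, assuming hourly candles in a pandas DataFrame with a timestamp column (the column name and the US/Eastern choice are assumptions for illustration only):

import pandas as pd

# Assumed layout: one row per hourly candle, with a UTC "timestamp" column.
candles = pd.DataFrame({"timestamp": pd.date_range("2024-01-01", periods=6, freq="h")})

# Convert to the US/Eastern time-zone so the hour reflects the US trading session.
ts = candles["timestamp"].dt.tz_localize("UTC").dt.tz_convert("US/Eastern")

candles["day_of_week"] = ts.dt.dayofweek  # 0 = Monday ... 6 = Sunday
candles["month"] = ts.dt.month            # 1..12, for seasonal patterns
candles["hour"] = ts.dt.hour              # 0..23, for intra-day patterns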
Candles
On the other hand, for candle-level recurrent patterns all this would be “noise”, because what the “candle patterns” capture are recurring behavioral patterns of the market participants and, potentially, how they react when observing the same chart formations.
At an even deeper level, the observed pattern on the chart (the candles) has been caused by the sum-total of the current “market sentiments” - what the majority of participants feel and think about the current market conditions (the current set of memes in their heads).
This is what candle-level patterns of the various time-frames actually capture or even “measure” – the most recent actions (at the right edge) of the market participants. A chart is a “dashboard” of the market. This is where our “inputs” shall come from.
The training set
The quality and adequacy (connection to reality) of a training set is by far the most important “metric”. We absolutely shall not try to find patterns in abstract data noise.
Data has to be at an appropriate (just right) level. Not that of “individual pixels” or “raw sound”.
Ideally, the conceptual level must match the level at which the market professionals and other participants tend to think and reason. The level of the current memes.
Current readings of the most used meme “indicators” (RSI, MACD, BOLL) have to be added as features.
All the descriptive statistics which “Binance” offers have to be used. Our network should “see” what other people (and other algos) are observing.
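A minimal sketch of how one of these indicator readings could be added as a feature; this is a simple rolling-mean variant of the 14-period RSI (the DataFrame and its close column are assumptions), not any exchange's official calculation:

import numpy as np
import pandas as pd

def rsi(close: pd.Series, period: int = 14) -> pd.Series:
    # Rolling-mean variant of RSI; Wilder's original uses a smoothed (exponential) average.
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(period).mean()
    loss = (-delta.clip(upper=0)).rolling(period).mean()
    return 100 - 100 / (1 + gain / loss)

# Assumed layout: a "close" column of candle closing prices.
candles = pd.DataFrame({"close": 100 + np.cumsum(np.random.randn(200))})
candles["rsi_14"] = rsi(candles["close"])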
Wishful thinking
We wish that the NN will “learn its own features” (capture the overlooked subtle patterns). There are always some “hidden” (non-obvious) relations between “measurements” (they have to be “reliable instrument readings”).
We expect that these relations (connections or “pathways”) will “learn” significant weights (a set of parameters).
In short, we hope not for just the “right structure” (that matches the observed patterns) but also the “myelination” of the crucial connections (of the corresponding relations).
Another good metaphor is beaten paths (frequently used trails) in the mountains. We definitely have these inside our mature brains.
Math
Scaling
\(\frac{42}{1}\) means, literally, “forty-two ones”, or 42 “scratches” (in a unary system), which, in turn, means 1 “added to itself” 42 times, or just \(42*1\).
When we scale by a scalar (a Real number) we just “replace” that \(1\) (the unit) with the given scalar and then add these together.
26.2 miles times \(1.61\) is, well, \(1.61\) (instead of \(1\)) taken 26.2 times.
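Worked out, the marathon example gives:
\[ 26.2 \times 1.61 = \underbrace{1.61 + 1.61 + \dots + 1.61}_{26.2\ \text{“times”}} \approx 42.2 \text{ km} \]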
So, again, this is just a repeated addition (putting together) but of different (scaled) “units”.
This simplified (but obviously correct) view is very useful when we think of what ML algorithms really do.
Division is a repeated subtraction.
Weighted sums everywhere
Just adding together and scaling by “weights”.
Polynomials are also a special case of a weighted sum.
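In symbols, both have the same shape; a polynomial is a weighted sum where the values being weighted are just the powers of \(x\):
\[ w_{1}x_{1} + w_{2}x_{2} + \dots + w_{n}x_{n} = \sum_{i} w_{i}x_{i} \qquad\qquad a_{0}x^{0} + a_{1}x^{1} + \dots + a_{n}x^{n} = \sum_{i} a_{i}x^{i} \]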
A line
\[ \forall x,\quad y = wx^{1} + bx^{0} \] where \(w\) and \(b\) are parameters which determine the position of the line.
And \(x^{0} = 1\) and \(x^{1} = x\)
Packaging
The notion of a “vector”. \[ x = \begin{bmatrix} x_{0} \\ x_{1} \end{bmatrix} \]
And a scalar is “just” a \[ \begin{bmatrix} y \end{bmatrix} \] So we can “package” a line as two vectors - one of the parameters \[ p = \begin{bmatrix} b \\ w \end{bmatrix} \] and one of the xs, \[ x = \begin{bmatrix} x^{0} \\ x^{1} \end{bmatrix} \]
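With this packaging the line is just the dot product of the two vectors: \( p \cdot x = bx^{0} + wx^{1} \). A minimal NumPy sketch (the concrete numbers are only for illustration):

import numpy as np

w, b = 2.0, 0.5                    # parameters of the line
x = 3.0                            # some input value

p = np.array([b, w])               # parameter vector [b, w]
xs = np.array([x**0, x**1])        # "packaged" input [1, x]

assert np.isclose(p @ xs, w * x + b)   # the dot product is the line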
An abstract data type with a corresponding notation
It is indeed an ADT, defined in terms of its possible operations.
A particular notation has evolved to denote such “objects”.
Denotational semantics means using notation to convey the meaning.
Tensor is a generalized ADT
The concept of a Tensor is even more clearly a proper ADT.
It has the notion of a rank and a set of constraints.
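A small PyTorch illustration of ranks, just to make the rank/shape part concrete (the values are arbitrary):

import torch

scalar = torch.tensor(3.14)          # rank 0, shape ()
vector = torch.tensor([1.0, 2.0])    # rank 1, shape (2,)
matrix = torch.eye(3)                # rank 2, shape (3, 3)

print(scalar.dim(), vector.dim(), matrix.dim())   # 0 1 2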
Matrix multiplication
Dimensions have to match up, just like types in a function composition.
Just like function composition – associative but not commutative.
Once composed (chained) correctly, it can be calculated (evaluated) in any order.
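A quick NumPy check of both properties (the shapes are arbitrary):

import numpy as np

A = np.random.randn(2, 3)
B = np.random.randn(3, 4)
C = np.random.randn(4, 5)

# Associative: once the chain type-checks, the evaluation order does not matter.
assert np.allclose((A @ B) @ C, A @ (B @ C))

# Not commutative: B @ A is not even defined here, the dimensions (3, 4) x (2, 3) do not match.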
Gilbert Strang
The column-oriented perspective on Linear Algebra.
An “n-dimensional vector” (from the Origin) is a single column.
It can be thought of as denoting a point in an “n-dimensional space”.
Thus the order of “coordinates” is fixed.
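In NumPy terms such a column is an \(n \times 1\) array:

import numpy as np

v = np.array([[1.0], [2.0], [3.0]])   # a 3-dimensional vector as a single column
print(v.shape)                        # (3, 1): the order of the coordinates is fixed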
The Normal Equations method
\[ \Theta = (X^{T}X)^{-1} X^{T}y \]
Matrix multiplication is associative.
theta = pinv(X'*X)*X'*y
in Julia (pinv comes from the LinearAlgebra standard library):
using LinearAlgebra
Θ = pinv(transpose(X)*X)*transpose(X)*y
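And an equivalent sketch with NumPy, on made-up data (assuming \(X\) already contains a column of ones for the bias term):

import numpy as np

# Made-up data: y = 1 + 2*x plus a little noise.
x = np.linspace(0, 1, 20)
X = np.column_stack([np.ones_like(x), x])   # bias column + feature column
y = 1.0 + 2.0 * x + 0.01 * np.random.randn(20)

theta = np.linalg.pinv(X.T @ X) @ X.T @ y   # the Normal Equations
print(theta)                                # close to [1.0, 2.0]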
Learning a representation
Learning the “parameters” of a function.
Generalizing, or “inference” (calling it on “unseen before” values, which is what it is all about), is just a straightforward computation with the weights (parameters) we have “learned”.
Notice that “prediction” is a wrong word. We are not predicting or forecasting the future. We just calculate the output from a given input.
The “learning” is just a “curve fitting” (squared error minimization) or a generalization of it - no magic there.
Think of an “elastic cloth” (as a “space”) transformed by the weights “hanging from it on strings”.
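A minimal sketch of such curve fitting by gradient descent on the squared error (toy data, purely illustrative):

import numpy as np

# Toy "reality": y = 2*x + 1.
x = np.linspace(0, 1, 50)
y = 1.0 + 2.0 * x

w, b = 0.0, 0.0          # parameters to be "learned"
lr = 0.1                 # learning rate

for _ in range(2000):
    y_hat = w * x + b                        # plain computation with current weights
    grad_w = 2 * np.mean((y_hat - y) * x)    # gradient of the mean squared error
    grad_b = 2 * np.mean(y_hat - y)
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)   # close to 2 and 1; "inference" is then just w*x_new + b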
Non-bullshit ML for trading
The most difficult (and crucial) part is to decide what the xs and ys are.
The “AI will learn the hidden patterns in the data by itself” meme is utter bullshit. It is like finding patterns in the clouds or waves. They are out there, but they almost never repeat (never emerge again the same).
So this will be finding patterns in noise, or fitting a “space” onto all the clouds (thus taking a snapshot).
- It has to be a well-defined “table” first
- “features” have to be selected by hand
- including “extra” (explicit) ratios and rates of change
- 3-candle “mini-patterns”
- every 4th candle is a \(y\) for the previous \(3\) (see the sketch after this list)
- every scale is OK, just use relative values
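A minimal sketch of building such a table from a series of closes, here with a sliding window of 3 relative changes as features and the change of the 4th candle as \(y\) (non-overlapping groups of 4 would also fit the bullet above; all numbers are illustrative):

import numpy as np

def make_dataset(close: np.ndarray):
    # Relative (scale-free) change of each close vs. the previous one.
    rel = np.diff(close) / close[:-1]
    X, y = [], []
    for i in range(len(rel) - 3):
        X.append(rel[i:i + 3])   # features: a 3-candle "mini-pattern"
        y.append(rel[i + 3])     # target: the relative change of the 4th candle
    return np.array(X), np.array(y)

closes = np.array([100.0, 101.0, 100.5, 102.0, 103.0, 102.5, 104.0])
X, y = make_dataset(closes)
print(X.shape, y.shape)   # (3, 3) and (3,)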
High-level
There are high-level libraries to be used (after all the logical reasoning and math have been done “on paper”).
A typical training procedure for a neural network is as follows:
- Define the neural network that has some learnable parameters (or weights)
- Iterate over a dataset of inputs
- Process input through the network
- Compute the loss (how far is the output from being correct)
- Propagate gradients back into the network’s parameters
- Update the weights of the network, typically using a simple update rule:
weight = weight - learning_rate * gradient
- You just have to define the forward method; the backward method (where gradients are computed) is automatically derived for you using autograd.
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # learnable layers (the parameters) would be declared here

    def forward(self, x):
        # only the forward pass is written by hand; the backward pass comes from autograd
        return x
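Putting the listed procedure together: a minimal, self-contained sketch with a tiny network and made-up data (the layer sizes, learning rate and fake target are assumptions for illustration, not a proposed trading model):

import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(3, 8)   # e.g. a 3-candle "mini-pattern" as input
        self.fc2 = nn.Linear(8, 1)   # a single y as output

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

net = TinyNet()
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)

X = torch.randn(64, 3)               # made-up inputs
y = X.sum(dim=1, keepdim=True)       # made-up target (just a weighted sum)

for epoch in range(200):
    optimizer.zero_grad()            # reset accumulated gradients
    loss = criterion(net(X), y)      # forward pass + loss
    loss.backward()                  # propagate gradients back
    optimizer.step()                 # weight = weight - learning_rate * gradient

print(loss.item())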