Little learnings
A collection of notes as I work through The Little Learner.
Chapter 1
1.5: graph a line with w and b parameters
const f = (w, b) => x => w * x + b;
render(graph(f(100, 10)));
render(graph(f(10, 0)));
1.14: reverse the order to make x an argument and w and b parameters
// A version where w and b become parameters _after_ `x`
const f = x => (w, b) => w * x + b;
return graph(x => f(x)(2, 20));
x is the argument of the line, while w and b, which come after, are parameters
1.19 plot the xs, ys dataset
const xs = [2, 1, 4, 3];
const ys = [1.8, 1.2, 4.2, 3.3];
return graph({
marks: Plot.dot(zip(xs, ys), { x: n => n[0], y: n => n[1], fill: "black" }),
domainX: [0, 5],
domainY: [0, 5]
});
Rule of parameters: every parameter is a number
Given x and y arguments to a function, we can work backwards to figure out the parameters, and then use those to predict the y value for any given x
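As a quick sanity check of that idea (my own example, not from the book): given two (x, y) points on a line, we can recover w and b exactly, and then predict new y values.

```javascript
// My own worked example: recover w and b from two known points, then predict.
const fit = ([x1, y1], [x2, y2]) => {
  const w = (y2 - y1) / (x2 - x1); // slope
  const b = y1 - w * x1;           // intercept
  return [w, b];
};

const [w, b] = fit([1, 3], [2, 5]); // w = 2, b = 1
const predict = x => w * x + b;
predict(4); // → 9
```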
θ is the parameter set (lowercase theta)
Given θ, the parameters in it are referred to as θ_1, θ_2, etc
const f = x => (θ_1, θ_2) => θ_1 * x + θ_2;
Chapter 2
"Scalars" are real numbers
A "tensor" is a vector of scalars: [2.0, 1.0, 4.3, 4.2]
The book writes tensor with a superscript, e.g. tensor¹
Tensors can be nested, and the superscript indicates the level of nesting
"elements" are the individual values in the tensor
I think tensor¹ specifically means a vector of scalars, and higher-rank tensors have tensors as elements
All elements of a tensor must have the same number of elements (every nested tensor has the same shape)
A scalar is a tensor⁰: 9 is a tensor⁰, [9, 9, 9] is a tensor¹, and [[9, 9, 9], [9, 9, 9]] is a tensor²
2.25: define a function that finds the rank of a tensor
window.scalarp = v => typeof v === "number";
window.rank = t => (scalarp(t) ? 0 : 1 + rank(t[0]));
return log(output => {
output(rank(4), 0);
output(rank([4, 1]), 1);
output(rank([[4, 1], [3, 6]]), 2);
output(rank([[[[[3]]]]]), 5);
});
2.37 define a function that finds the shape of a tensor t
window.shape = (t, acc = []) => {
if (!scalarp(t)) {
acc.push(t.length);
shape(t[0], acc);
}
return acc;
};
return log(output => {
output(shape(5), []);
output(shape([[4, 1], [3, 6], [2, 9]]), [3, 2]);
output(shape([[[[[3]]]]]), [1, 1, 1, 1, 1]);
});
Lists have members while non-scalar tensors have elements
2.42: how are rank and shape related?
// `rank` could also be defined as the length of `shape`
window.rank2 = t => shape(t).length;
return log(output => {
output(rank2(4), 0);
output(rank2([4, 1]), 1);
output(rank2([[4, 1], [3, 6]]), 2);
output(rank2([[[[[3]]]]]), 5);
});
Interlude I
I.8: addition of tensors
const tadd = (t1, t2) =>
t1.map((v, idx) => (scalarp(v) ? v + t2[idx] : tadd(v, t2[idx])));
return log(output => {
output(tadd([1, 2, 3], [4, 5, 6]), [5, 7, 9]);
output(tadd([[1, 2], [1, 2]], [[3, 4], [3, 4]]), [[4, 6], [4, 6]]);
output(tadd([[[[[5]]]]], [[[[[1]]]]]), [[[[[6]]]]]);
});
Currently, tensors must be of the same shape to add them with +
Making + work on tensors of arbitrary rank is called an extension of +
Functions built using extensions are called extended functions
Note: extension applies both to making an operator work on a single argument that has an arbitrary rank, and also making it work with two arguments that may have different ranks. A non-extended operator is one that only works with one specific rank
I.14 extended addition
Let's implement tadd_ as an extended addition operator. Note: click view source to see the implementation of these operators.
We can do this with all mathematical operations: *, sqrt, etc
It looks like the order shouldn't matter. With our last definition, we are assuming t2 has a higher rank, but we can make it work so that the higher rank can be in either position
Not all math operations descend; for example, sum. The book writes sum¹ with a superscript to make it clear it always expects a tensor¹. Let's implement tsum_:
sum is the extended version of sum¹, which descends until it finds a tensor¹
Rule of sum: for a tensor t with rank r, the rank of sum(t) is r − 1
So far, we've implemented tadd_ and tsum_. These operators will be used for the rest of the page. We will need more operators, so let's go ahead and implement more: tsub_, tmul_, and tsqr_.
Chapter 3
Fitting: a well-fitted θ is one that finds the best fit for a given data set
Let's take this line equation again:
window.line = x => (w, b) => tadd_(tmul_(w, x), b)
We start with w (or θ_1) and b (or θ_2) both being 0.0
return log(output => {
output(line([2, 1, 4, 3])(0, 0));
});
The loss measures how far off our parameters are. The best fit would be loss as close to 0 as possible
How do you calculate loss?
First step: line-ys - predicted-line-ys, where predicted-line-ys is line(line-xs)(θ_1, θ_2). So it becomes:
line-ys - line(line-xs)(θ_1, θ_2)
That produces a tensor though; how do we get a scalar?
We sum it together, but also need to square it first to get rid of negative values: sum(sqr(ys - pred-ys))
We can create an l2loss function using the operators we've defined above (tsum_ etc) to implement this:
window.l2loss = target => (xs, ys) => {
return (...params) => {
const pred_ys = target(xs)(...params);
return tsum_(tsqr_(tsub_(ys, pred_ys)));
};
};
return log(output => {
const loss1 = l2loss(line)([2, 1, 4, 3], [1.8, 1.2, 4.2, 3.3])(0, 0);
const loss2 = l2loss(line)([2, 1, 4, 3], [1.8, 1.2, 4.2, 3.3])(0.0099, 0);
output(loss1);
output(loss2);
output("difference", loss2 - loss1);
});
θ is the parameter set; it has nothing to do with the line function. It's just a list or vector with the right number of parameters needed
Well, it does kind of have to do with line: parameters are the inputs needed for a kind of "transformation" function, for example w * x + b. You have an input value, the transformation function, and the parameters for the transformation. At least that's how I'm thinking about it
An expectant function is the (xs, ys) => ... piece: it's a function that expects a data set as arguments
An objective function is the function which takes a parameter set and returns a scalar representing the loss
In the above code we output two losses: the first one with θ_1 set to 0 and the second with θ_1 set to 0.0099. We increase it by a small number to figure out a rate of change (which is relative to this arbitrary small amount)
The rate of change here is the difference in loss divided by 0.0099, roughly -62.63
This helps "seed" our rate of learning because otherwise we'd have no idea how fast changing the parameters changes the loss
Given the rate of change derived from this arbitrary small value, we use another small value and multiply it by the rate of change. This is called the learning rate and is represented by ⍺. Let's use 0.01; it's basically a "step size"
The rate of change depends on θ. Now that we are using a new value for it, we need to get the new rate of change by using the 0.0099 constant again. The rate of change here is -25.12
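To make the arithmetic concrete, here's a sketch of one manual revision step (my own code, inlining the line and loss from earlier with b fixed at 0; with full floating-point precision the numbers differ slightly from the book's rounded ones):

```javascript
// One manual revision of θ_1 (w), with b fixed at 0.
const xs = [2, 1, 4, 3];
const ys = [1.8, 1.2, 4.2, 3.3];
const loss = w => {
  const pred = xs.map(x => w * x);
  return ys.map((y, i) => (y - pred[i]) ** 2).reduce((a, b) => a + b, 0);
};

const nudge = 0.0099; // arbitrary small amount used to estimate the rate
const alpha = 0.01;   // learning rate ⍺

let w = 0;
const rate = (loss(w + nudge) - loss(w)) / nudge; // ≈ -62.7
w = w - alpha * rate;                             // ≈ 0.627
```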
It's not great to derive the rate of change this way though
Chapter 4
Graph of loss where x is θ_1
return graph(x => l2loss(line)([2, 1, 4, 3], [1.8, 1.2, 4.2, 3.3])(x, 0), {
domainX: [-1, 5]
});
We also refer to the x-axis as weight
Rate of change is important. It's called gradient and we can define a function to find it for any function
A gradient function takes a function and a parameter set, and returns a list of gradients for each parameter in the parameter set
This is also known as ∇ (del)
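A crude numerical version of ∇ (finite differences, my own sketch; not what the page ends up using, since it builds automatic differentiation below) captures the idea:

```javascript
// Numerical ∇: estimate each parameter's rate of change by nudging it
// a tiny amount and measuring how much the function's output moves.
const gradient_of = (f, params, h = 1e-6) =>
  params.map((p, i) => {
    const bumped = params.slice();
    bumped[i] = p + h; // nudge only parameter i
    return (f(...bumped) - f(...params)) / h;
  });

gradient_of((x, y) => x * x + 3 * y, [2, 1]); // ≈ [4, 3]
```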
My own break: automatic differentiation
It turns out the book never defines ∇, which makes it difficult to continue to build running examples. It casually defines it as a function that gets the gradient (or rate of change) of a function at a specific value. I was annoyed that it never gave a definition.
After doing some research, it turns out that ∇ is essentially the derivative of a one-dimensional function. Defining this isn't simple, so the book assumes it's already provided. So far the book has done a great job building things up from scratch, so this was a confusing turning point; it should have done a better job explaining exactly what ∇ is and why it's not providing a definition, or at least pointed the reader to where they can find more information about it.
To define ∇ we need to implement automatic differentiation. This is a way of differentiating an expression automatically: we build up a graph of operators and use the rules of differentiation to get a new graph of operators representing the differentiated expression.
I'm using this observable as inspiration which in turn is inspired by micrograd from Andrej Karpathy.
This uses a neat technique called backwards (reverse-mode) differentiation, which is possible because of a few constraints: we don't need a full derivative, just the derivative with respect to a single variable, and we never need the derivative of a derivative.
The basic idea is we want to know for a given expression that uses a variable x, how much does changing x affect the result? We can figure this out with a few steps:
Construct a graph of operations, compute the final value along the way eagerly (the forward pass)
Given the final value, perform a backwards pass to compute the amount that each piece of the expression contributed to the final result. Each operation implements its own backwards calculation according to various mathematical rules
The value for the single variable x in the expression, given by the backwards pass, is the gradient, or rate of change, for that variable.
First we define a Value class that represents a single value in an equation, and then we define a couple of operations like vadd, vmul, vpow, and more to operate on these values. (The v prefix means we're operating on values)
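The real implementation is in the page's source; here's a minimal sketch in the style of micrograd (my own reconstruction, using the vadd/vmul/vpow names described above):

```javascript
// My own minimal reconstruction of a micrograd-style Value class.
class Value {
  constructor(data, children = [], backprop = () => {}) {
    this.data = data;          // computed during the forward pass
    this.grad = 0;             // filled in during the backward pass
    this._children = children; // inputs that produced this value
    this._backprop = backprop; // pushes this node's grad to its inputs
  }

  backward() {
    // Topologically sort the graph so each node's grad is complete
    // before it is propagated to its children.
    const order = [];
    const seen = new Set();
    const visit = v => {
      if (!seen.has(v)) {
        seen.add(v);
        v._children.forEach(visit);
        order.push(v);
      }
    };
    visit(this);
    this.grad = 1; // d(output)/d(output) = 1
    for (let i = order.length - 1; i >= 0; i--) order[i]._backprop();
  }
}

const vadd = (a, b) => {
  const out = new Value(a.data + b.data, [a, b], () => {
    a.grad += out.grad; // d(a+b)/da = 1
    b.grad += out.grad; // d(a+b)/db = 1
  });
  return out;
};

const vmul = (a, b) => {
  const out = new Value(a.data * b.data, [a, b], () => {
    a.grad += b.data * out.grad; // d(a*b)/da = b
    b.grad += a.data * out.grad; // d(a*b)/db = a
  });
  return out;
};

const vpow = (a, n) => {
  // n is a plain number exponent, not a Value
  const out = new Value(Math.pow(a.data, n), [a], () => {
    a.grad += n * Math.pow(a.data, n - 1) * out.grad; // power rule
  });
  return out;
};
```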
Using it looks like this. This is how we represent the equation (x - 4)² + 10:
const variable = new Value(x);
vadd(vpow(vadd(variable, new Value(-4)), 2), new Value(10));
Notice how there isn't really any difference between the x value and the other constants in here. They all operate as a Value. This simplifies our work and makes it more efficient, but it's important to remember they actually are different. x here is an actual variable and needs to be the single input into the equation; if that's the case, reading the x.grad value makes sense.
The other numbers like -4 and 10 are constants, and it wouldn't make any sense to read their grad values. It took me a while to understand this: if you just did new Value(4) to represent y = 4, which would simply render a horizontal line, and then ran the backwards pass and got the gradient, the value would be 1! The issue is we are assuming the value is the input variable, and if it were, 1 would be the correct slope because we'd actually be working with the y = x equation.
Let's visualize this. This graph shows the equation (x - 4)⁴ + 10, and we apply automatic differentiation at each whole number to visualize the gradient at that point.
const gradients = range(-4, 12).map(x => {
const target = new Value(x);
const top = vadd(vpow(vadd(target, new Value(-4)), 4), new Value(10));
top.backward();
return v => (v - target.data) * target.grad + top.data;
});
return graph(...gradients, x => Math.pow(x - 4, 4) + 10, {
domainX: [-5, 15],
colors: [...gradients.map(_ => "#a0a0a0"), "#e60049"]
});
Feels really good to understand this and be able to work with a real system written from scratch. I'll be able to continue to run examples from the book!
However, first we need to extend our tensor functions to support our new automatic differentiation system. Remember when we defined tmul_, tadd_, etc? That feels very similar to our new operations above, right? We can combine both together into new tensor operations that support both properties: extended math functions that work with tensors and also support automatic differentiation.
That works!
Chapter 4 (continued)
Revisions
4.23: define a revise function which helps revise parameters given a function
window.revise = function revise(func, revs, params) {
while (revs > 0) {
params = func(params);
revs--;
}
return params;
};
return log(output => {
output(revise(params => params.map(p => p - 3), 5, [1, 2, 3]), [
-14,
-13,
-12
]);
});
Now let's use this revise function to help adjust parameters to reduce loss
const learning_rate = 0.01;
const obj = l2loss(line)(tdual_([2, 1, 4, 3]), tdual_([1.8, 1.2, 4.2, 3.3]));
window.lineParams = revise(
params => {
const gs = gradient_of(obj, params);
return [
params[0] - gs[0] * learning_rate,
params[1] - gs[1] * learning_rate
];
},
1000,
[0, 0]
);
return log(output => output(lineParams));
We found our line! Let's plot it against the xs, ys dataset and see how it fits:
Hell yeah!
This optimization is gradient descent. We used a learning rate of 0.01 with our loss function and revised 1000 times to find the parameter set that best fits the dataset
So far we've hardcoded the number of parameters in our revise usage; let's generalize that. We can also generalize the other constants like the learning rate and objective function into a general gradient_descent function:
window.gradient_descent = function gradient_descent(obj, initialp, α, revs) {
return revise(
params => {
const gs = gradient_of(obj, params);
return gs.map((g, idx) => params[idx] - g * α);
},
revs,
initialp
);
};
return log(output => {
const obj = l2loss(line)(tdual_([2, 1, 4, 3]), tdual_([1.8, 1.2, 4.2, 3.3]));
output(gradient_descent(obj, [0, 0], 0.01, 1000));
});
Θ is the symbol used for the revised parameters passed to the revision function. (The capitalized version of θ)
Interlude II
Θ this is a good