Little learnings

date:: Jul 10th, 2024

A collection of notes as I work through The Little Learner.

Chapter 1

1.5: graph a line with w and b parameters

const f = (w, b) => x => w * x + b;

render(graph(f(100, 10)));
render(graph(f(10, 0)));

1.14: reverse the order to make x an argument and w and b parameters

// A version where w and b become parameters _after_ `x`
const f = x => (w, b) => w * x + b;

return graph(x => f(x)(2, 20));

x is the argument of the line, while w and b, which come after, are the parameters

1.19: plot the xs, ys dataset

const xs = [2, 1, 4, 3];
const ys = [1.8, 1.2, 4.2, 3.3];

return graph({
  marks: Plot.dot(zip(xs, ys), { x: n => n[0], y: n => n[1], fill: "black" }),
  domainX: [0, 5],
  domainY: [0, 5]
});

Rule of parameters: every parameter is a number

Given x and y (the arguments to a function), we can walk backwards and figure out the parameters, and then use those to predict other y values for a given x

θ is the parameter set (lowercase theta)

Given θ, the parameters in it are referred to as θ_1, θ_2, etc.

const f = x => (θ_1, θ_2) => θ_1 * x + θ_2;

Chapter 2

"Scalars" are real numbers

A "tensor" is a vector of scalars: [2.0, 1.0, 4.3, 4.2]

The book writes tensor^1 with a superscript

Tensors can be nested, and the superscript indicates the level of "nested"

"elements" are the individual values in the tensor

I think tensor^1 (with the 1 superscript) specifically means a vector of scalars, and higher tensors have tensors as elements

Within a tensor, all the nested tensors^m must have the same number of elements

A scalar is a tensor^0

9 is a tensor^0

[9, 9, 9] is a tensor^1

[[9, 9, 9], [9, 9, 9]] is a tensor^2

2.25: define a function that finds the rank of a tensor

window.scalarp = v => typeof v === "number";
window.rank = t => (scalarp(t) ? 0 : 1 + rank(t[0]));

return log(output => {
  output(rank(4), 0);
  output(rank([4, 1]), 1);
  output(rank([[4, 1], [3, 6]]), 2);
  output(rank([[[[[3]]]]]), 5);
});

2.37: define a function that finds the shape of a tensor t

window.shape = (t, acc = []) => {
  if (!scalarp(t)) {
    acc.push(t.length);
    shape(t[0], acc);
  }
  return acc;
};

return log(output => {
  output(shape(5), []);
  output(shape([[4, 1], [3, 6], [2, 9]]), [3, 2]);
  output(shape([[[[[3]]]]]), [1, 1, 1, 1, 1]);
});

Lists have members while non-scalar tensors have elements

// `rank` could also be defined as the length of `shape`
window.rank2 = t => shape(t).length;

return log(output => {
  output(rank2(4), 0);
  output(rank2([4, 1]), 1);
  output(rank2([[4, 1], [3, 6]]), 2);
  output(rank2([[[[[3]]]]]), 5);
});

Interlude I

I.8: addition of tensors

const tadd = (t1, t2) =>
  t1.map((v, idx) => (scalarp(v) ? v + t2[idx] : tadd(v, t2[idx])));

return log(output => {
  output(tadd([1, 2, 3], [4, 5, 6]), [5, 7, 9]);
  output(tadd([[1, 2], [1, 2]], [[3, 4], [3, 4]]), [[4, 6], [4, 6]]);
  output(tadd([[[[[5]]]]], [[[[[1]]]]]), [[[[[6]]]]]);
});

Currently, tensors must be of the same shape to add them with +

Making + work on tensors of arbitrary rank is called an extension of +

Functions built using extensions are called extended functions

Note: extension applies both to making an operator work on a single argument that has an arbitrary rank, and also making it work with two arguments that may have different ranks. A non-extended operator is one that only works with one specific rank

I.14: extended addition

Let's implement tadd_ as an extended addition operator.
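
Here's a minimal sketch of one way tadd_ could work (my own reconstruction, not necessarily the book's definition), reusing scalarp and rank from 2.25 and assuming the higher rank, if any, is always in t2:

// A sketch of extended addition: if t2 has the higher rank, descend into it
// until the ranks match, then add element-wise. `scalarp` and `rank` come
// from 2.25 above.
window.tadd_ = (t1, t2) => {
  if (scalarp(t1) && scalarp(t2)) return t1 + t2;
  if (rank(t1) < rank(t2)) return t2.map(v => tadd_(t1, v));
  return t1.map((v, idx) => tadd_(v, t2[idx]));
};

return log(output => {
  output(tadd_(1, [2, 3]), [3, 4]);
  output(tadd_([1, 2], [[3, 4], [5, 6]]), [[4, 6], [6, 8]]);
  output(tadd_([1, 2, 3], [4, 5, 6]), [5, 7, 9]);
});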

We can do this with all mathematical operations: *, sqrt, etc

It looks like the order shouldn't matter. With our last definition, we are assuming t2 has a higher rank, but we can make it work so that the higher rank can be in either position
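
Here's a sketch of how that symmetric version could look:

// A symmetric sketch: descend into whichever argument has the higher rank,
// so the lower-rank tensor can be in either position.
window.tadd_ = (t1, t2) => {
  if (scalarp(t1) && scalarp(t2)) return t1 + t2;
  if (rank(t1) < rank(t2)) return t2.map(v => tadd_(t1, v));
  if (rank(t1) > rank(t2)) return t1.map(v => tadd_(v, t2));
  return t1.map((v, idx) => tadd_(v, t2[idx]));
};

return log(output => {
  output(tadd_([[3, 4], [5, 6]], [1, 2]), [[4, 6], [6, 8]]);
  output(tadd_(10, [[1], [2]]), [[11], [12]]);
});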

Not all math operations descend; for example sum^1. sum^1 has a superscript to make it clear that it always expects a tensor^1. Let's implement tsum:
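
A minimal sketch, reading tsum as the non-extended sum^1 that only accepts a tensor^1:

// A sketch of sum^1: it only accepts a tensor^1 and adds up its elements.
window.tsum = t => t.reduce((acc, v) => acc + v, 0);

return log(output => {
  output(tsum([1, 2, 3]), 6);
});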

sum is the extended version of sum^1, which descends until it finds a tensor^1

Rule of sum: for a tensor t with rank r, the rank of sum(t) is r - 1
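
And a sketch of the extended version, tsum_, which descends until it reaches a tensor^1 and then applies tsum:

// A sketch of the extended sum: descend until we reach a tensor^1, then
// apply tsum, so the result's rank is one less than the input's.
window.tsum_ = t => (rank(t) === 1 ? tsum(t) : t.map(v => tsum_(v)));

return log(output => {
  output(tsum_([1, 2, 3]), 6);
  output(tsum_([[1, 2], [3, 4]]), [3, 7]);
});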

So far, we've implemented tadd_ and tsum_. These operators will be used for the rest of the page. We will need more operators, so let's go ahead and implement tsub_, tmul_, and tsqr_ as well.
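
Here's one possible sketch of all three, built around a hypothetical extend2 helper that turns a scalar operation into an extended one the same way tadd_ works:

// One possible sketch: `extend2` turns a scalar operation into an extended
// one, descending the same way tadd_ does.
const extend2 = op => {
  const ext = (t1, t2) => {
    if (scalarp(t1) && scalarp(t2)) return op(t1, t2);
    if (rank(t1) < rank(t2)) return t2.map(v => ext(t1, v));
    if (rank(t1) > rank(t2)) return t1.map(v => ext(v, t2));
    return t1.map((v, idx) => ext(v, t2[idx]));
  };
  return ext;
};

window.tsub_ = extend2((a, b) => a - b);
window.tmul_ = extend2((a, b) => a * b);
window.tsqr_ = t => tmul_(t, t);

return log(output => {
  output(tsub_([5, 6], [1, 2]), [4, 4]);
  output(tmul_(2, [1, 2, 3]), [2, 4, 6]);
  output(tsqr_([[1, 2], [3, 4]]), [[1, 4], [9, 16]]);
});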

Chapter 3

Fitting: a well-fitted θ is one that finds the best fit for a given data set

Let's take this line equation again:

window.line = x => (w, b) => tadd_(tmul_(w, x), b)

We start with w (or θ_0) and b (or θ_1) both being 0.0

return log(output => {
  output(line([2, 1, 4, 3])(0, 0));
});

The loss is how far away our parameters are from fitting the data. The best fit would have a loss as close to 0 as possible

How do you calculate loss?

First step: line-ys - predicted-line-ys, where predicted-line-ys is line(line-xs)(θ_1, θ_2). So it becomes:

line-ys - line(line-xs)(θ_1, θ_2)

That produces a tensor^1 though; how do we get a scalar?

We sum it together, but also need to square it to get rid of negative values

sum(sqr(ys - pred-ys))

We can create an l2loss function using the operators we've defined above (tsum_ etc) to implement this:

window.l2loss = target => (xs, ys) => {
  return (...params) => {
    const pred_ys = target(xs)(params[0], params[1]);
    return tsum_(tsqr_(tsub_(ys, pred_ys)));
  };
};

return log(output => {
  const loss1 = l2loss(line)([2, 1, 4, 3], [1.8, 1.2, 4.2, 3.3])(0, 0);
  const loss2 = l2loss(line)([2, 1, 4, 3], [1.8, 1.2, 4.2, 3.3])(0.0099, 0);
  output(loss1);
  output(loss2);
  output("difference", loss2 - loss1);
});

θ is the parameter set; it has nothing to do with the line function. It's just a list or vector with the right number of parameters

well, it does kind of have to do with line

Parameters are the inputs needed for a kind of "transformation" function, for example w * x + b. You have an input value, the transformation function, and the parameters for the transformation. At least that's how I'm thinking about it

An expectant function is the (xs, ys) => ... piece: it's a function that expects a data set as arguments

An objective function is the function which takes a parameter set and returns a scalar representing the loss

In the above code we output two losses: the first with θ_0 set to 0 and the second with θ_0 set to 0.0099. We increase it by a small number to figure out a rate of change (which is relative to this arbitrary small amount)

The rate of change here is -0.62 / 0.0099 = -62.63

This helps "seed" our rate of learning because otherwise we'd have no idea how fast changing the parameters changes the loss

Given the rate of change derived from this arbitrary small value, we use another small value and multiply it by this rate of change

This is called the learning rate and is represented by ⍺. Let's use 0.01

It's basically a "step size"

⍺ * -62.63 = 0.01 * -62.63 = -0.6263

The rate of change depends on θ_0. Now that we are using a new value for it (0.6263), we need to get the new rate of change by using the 0.0099 constant again. The rate of change here is -25.12
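
To sanity-check these numbers, here's a quick estimate using the l2loss from above (my own aside; rate is just a throwaway helper):

// Estimate the rate of change of the loss with respect to θ_0 by nudging it
// by 0.0099, just like the hand calculation above.
const rate = w => {
  const lossAt = v => l2loss(line)([2, 1, 4, 3], [1.8, 1.2, 4.2, 3.3])(v, 0);
  return (lossAt(w + 0.0099) - lossAt(w)) / 0.0099;
};

return log(output => {
  output(rate(0));      // about -62.7 (the -62.63 above comes from rounding the loss difference to -0.62)
  output(rate(0.6263)); // about -25.12
});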

It's not great to derive the rate of change this way though

Chapter 4

Graph of the loss where x is θ_0

return graph(x => l2loss(line)([2, 1, 4, 3], [1.8, 1.2, 4.2, 3.3])(x, 0), {
  domainX: [-1, 5]
});

We also refer to the x-axis as weight

The rate of change is important. It's called the gradient, and we can define a function to find it for any function

A gradient function takes a function and a parameter set, and returns a list of gradients for each parameter in the parameter set

This is also known as ∇ (del)

My own break: automatic differentiation

It turns out the book never defines ∇, which makes it difficult to continue building running examples. It casually describes it as a function that gets the gradient (or rate of change) of a function at a specific value. I was annoyed that it never gave a definition.

After doing some research, it turns out that ∇ is essentially the derivative of a one-dimensional function. Defining this isn't simple, so the book assumes it is already provided. So far the book has done a great job building things up from scratch, so this was a confusing turning point, and it should have done a better job explaining exactly what ∇ is and why it's not providing a definition. At least point the reader to where they can find more information about it.

To define ∇ we need to implement automatic differentiation. This is a way of differentiating an expression automatically: we build up a graph of operators and use the rules of differentiation to get a new graph of operators representing the differentiated expression.

I'm using this Observable notebook as inspiration, which in turn is inspired by micrograd from Andrej Karpathy.

This uses a neat technique called backwards differentiation, which is possible because of a few constraints: we don't need a full derivative, just the derivative with respect to a single variable, and we never need the derivative of a derivative.

The basic idea is that for a given expression that uses a variable x, we want to know how much changing x affects the result. We can figure this out with a few steps:

Construct a graph of operations, compute the final value along the way eagerly (the forward pass)

Given the final value, perform a backwards pass to compute the amount that each piece of the expression contributed to the final result. Each operation implements its own backwards calculation according to various mathematical rules

The value for the single variable x in the expression given by the backwards pass is the gradient, or rate of change, for that variable.

First we define a Value class that represents a single value in an equation, and then we define a couple of operations like vadd, vmul, vpow, and more to operate on these values. (The v prefix means we're operating on values.)
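
This is a minimal sketch of what Value, vadd, vmul, and vpow could look like, modeled on micrograd rather than copied from anywhere: each operation records its inputs and a small closure that pushes gradients back to them during the backward pass.

// A sketch of Value: data is the forward-pass result, grad is filled in by
// the backward pass, and backwardStep propagates grad to the children.
window.Value = class Value {
  constructor(data, children = [], backwardStep = () => {}) {
    this.data = data;
    this.grad = 0;
    this.children = children;
    this.backwardStep = backwardStep;
  }

  backward() {
    // Visit parents before children (reverse topological order), seeding the
    // output's gradient with 1.
    const order = [];
    const seen = new Set();
    const visit = v => {
      if (seen.has(v)) return;
      seen.add(v);
      v.children.forEach(visit);
      order.push(v);
    };
    visit(this);
    this.grad = 1;
    order.reverse().forEach(v => v.backwardStep());
  }
};

// d(a + b)/da = 1 and d(a + b)/db = 1
window.vadd = (a, b) => {
  const out = new Value(a.data + b.data, [a, b], () => {
    a.grad += out.grad;
    b.grad += out.grad;
  });
  return out;
};

// d(a * b)/da = b and d(a * b)/db = a
window.vmul = (a, b) => {
  const out = new Value(a.data * b.data, [a, b], () => {
    a.grad += b.data * out.grad;
    b.grad += a.data * out.grad;
  });
  return out;
};

// d(a^n)/da = n * a^(n - 1), where n is a plain number
window.vpow = (a, n) => {
  const out = new Value(Math.pow(a.data, n), [a], () => {
    a.grad += n * Math.pow(a.data, n - 1) * out.grad;
  });
  return out;
};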

Using it looks like this. This is how we represent the equation (x + -4)^2 + 10:

const variable = new Value(x);
vadd(vpow(vadd(variable, new Value(-4)), 2), new Value(10));

Notice how there isn't really any difference between the x value and the other constants in here. They all operate as a Value. This simplifies our work and makes it more efficient, but it's important to remember that they actually are different. x here is an actual variable and needs to be the single input into the equation, and if that's the case, reading its grad (variable.grad above) makes sense.

The other numbers like -4 and 10 are constants, and it wouldn't make any sense to read their grad values. It took me a while to understand this: if you just did new Value(4) to represent y = 4, which would simply render a horizontal line, and then ran the backwards pass and got the gradient, the value would be 1! The issue is that we are assuming the value is the input variable, and if it were, 1 would be the correct slope, because we'd actually be working with the y = x equation.
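
A tiny check of that point, using the Value sketch above:

// A lone constant treated as the output gets a gradient of 1, which only
// makes sense if we pretend it's the variable.
const c = new Value(4);
c.backward();

return log(output => {
  output(c.grad, 1);
});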

Let's visualize this. This graph shows the equation (x - 4)^4 + 10, and we apply automatic differentiation at each whole number to visualize the gradient at that point.

const gradients = range(-4, 12).map(x => {
  const target = new Value(x);
  const top = vadd(vpow(vadd(target, new Value(-4)), 4), new Value(10));
  top.backward();

  return v => (v - target.data) * target.grad + top.data;
});

return graph(...gradients, x => Math.pow(x - 4, 4) + 10, {
  domainX: [-5, 15],
  colors: [...gradients.map(_ => "#a0a0a0"), "#e60049"]
});

Feels really good to understand this and be able to work with a real system written from scratch. I'll be able to continue to run examples from the book!

However, first we need to extend our tensor functions to support our new automatic differentiation system. Remember when we defined tadd_, tmul_, and the other tensor operators? That feels very similar to our new operations above, right? We can combine the two into new tensor operations that support both properties: extended math functions that work with tensors and also support automatic differentiation.
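
Here's a rough sketch of how that combination could look. It only handles the scalar-vs-tensor and same-shape cases used on this page, and it also sketches tdual_ (wrap a tensor's scalars in Values) and gradient_of (wrap the parameters, run the backward pass, and read off each parameter's gradient), since both are used below. lift, scalarish, and extendV are my own helper names.

// A rough sketch of the combined operators: the scalar cases go through the
// Value operators above, so gradients flow through tensor expressions.
const lift = v => (v instanceof Value ? v : new Value(v));
const scalarish = v => scalarp(v) || v instanceof Value;

const extendV = op => {
  const ext = (t1, t2) => {
    if (scalarish(t1) && scalarish(t2)) return op(lift(t1), lift(t2));
    if (scalarish(t1)) return t2.map(v => ext(t1, v));
    if (scalarish(t2)) return t1.map(v => ext(v, t2));
    return t1.map((v, idx) => ext(v, t2[idx]));
  };
  return ext;
};

window.tadd_ = extendV(vadd);
window.tmul_ = extendV(vmul);
window.tsub_ = extendV((a, b) => vadd(a, vmul(b, new Value(-1))));
window.tsqr_ = t => tmul_(t, t);
window.tsum_ = t =>
  Array.isArray(t[0])
    ? t.map(v => tsum_(v))
    : t.reduce((acc, v) => vadd(acc, lift(v)), new Value(0));

// Wrap every scalar in a tensor as a Value so it can join the graph.
window.tdual_ = t => (scalarp(t) ? new Value(t) : t.map(v => tdual_(v)));

// Wrap the parameters as Values, run the objective function forward, run the
// backward pass, and read off each parameter's gradient.
window.gradient_of = (obj, params) => {
  const wrapped = params.map(p => new Value(p));
  const loss = obj(...wrapped);
  loss.backward();
  return wrapped.map(p => p.grad);
};

return log(output => {
  // The gradient of x * x at x = 3 should be 6
  output(gradient_of(x => tmul_(x, x), [3]), [6]);
});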

That works!

Chapter 4 (continued)

Revisions

4.23: define a revise function which helps revise parameters given a function

window.revise = function revise(func, revs, params) {
  while (revs > 0) {
    params = func(params);
    revs--;
  }
  return params;
};

return log(output => {
  output(revise(params => params.map(p => p - 3), 5, [1, 2, 3]), [
    -14,
    -13,
    -12
  ]);
});

Now let's use this revise function to help adjust parameters to reduce loss

const learning_rate = 0.01;
const obj = l2loss(line)(tdual_([2, 1, 4, 3]), tdual_([1.8, 1.2, 4.2, 3.3]));

window.lineParams = revise(
  params => {
    const gs = gradient_of(obj, params);
    return [
      params[0] - gs[0] * learning_rate,
      params[1] - gs[1] * learning_rate
    ];
  },
  1000,
  [0, 0]
);

return log(output => output(lineParams));

We found our line! Let's plot it against the xs, ys dataset and see how it fits:
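
Something like this works (a sketch, assuming graph accepts a function plus extra marks like the dot plot from 1.19):

// Draw the fitted line from lineParams on top of the xs, ys dots.
const xs = [2, 1, 4, 3];
const ys = [1.8, 1.2, 4.2, 3.3];
const [w, b] = lineParams;

return graph(x => w * x + b, {
  marks: Plot.dot(zip(xs, ys), { x: n => n[0], y: n => n[1], fill: "black" }),
  domainX: [0, 5],
  domainY: [0, 5]
});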

Hell yeah!

This optimization is called gradient descent. We used a learning rate of 0.01 with our loss function and revised 1000 times to find the parameter set that best fits the dataset

So far we've hardcoded the number of parameters in our revise usage; let's generalize that. We can also pull the other constants, like the learning rate and the objective function, into a general gradient_descent function:

window.gradient_descent = function gradient_descent(obj, initialp, α, revs) {
  return revise(
    params => {
      const gs = gradient_of(obj, params);
      return gs.map((g, idx) => params[idx] - g * α);
    },
    revs,
    initialp
  );
};
return log(output => {
  const obj = l2loss(line)(tdual_([2, 1, 4, 3]), tdual_([1.8, 1.2, 4.2, 3.3]));
  output(gradient_descent(obj, [0, 0], 0.01, 1000));
});

Θ is the symbol used for the revised parameters passed to the revision function. (The capitalized version of θ)

Interlude II

Θ^2 this is a good