# Little learnings

**date:** Jul 10th, 2024

A collection of notes as I work through The Little Learner.

## Chapter 1

#### 1.5: graph a line with $w$ and $b$ parameters

```
const f = (w, b) => x => w * x + b;
render(graph(f(1, 10)));
render(graph(f(10, 0)));
```

#### 1.14: reverse the order to make $x$ an argument and $w$ and $b$ parameters

```
// A version where w and b become parameters _after_ `x`
const f = x => (w, b) => w * x + b;
return graph(x => f(x)(2, 20));
```

`x` is the argument of the line, while `w` and `b`, which come after, are *parameters*

#### 1.19 plot the xs, ys dataset

```
const xs = [2, 1, 4, 3];
const ys = [1.8, 1.2, 4.2, 3.3];
return graph({
  marks: Plot.dot(zip(xs, ys), { x: n => n[0], y: n => n[1], fill: "black" }),
  domainX: [0, 5],
  domainY: [0, 5]
});
```

**Rule of parameters: every parameter is a number**

Given `x` and `y` (arguments to a function), we can walk backwards to figure out the parameters, then use those parameters to predict other `y` values for a given `x`

θ is the *parameter set* (lowercase theta)

Given θ, the parameters in it are referred to as θ$_1$, θ$_2$, etc

`const f = x => (θ_1, θ_2) => θ_1 * x + θ_2`

## Chapter 2

"Scalars" are real numbers

A "tensor" is a vector of scalars: `[2.0, 1.0, 4.3, 4.2]`

The book uses tensor$^1$ with a superscript

Tensors can be nested, and the superscript indicates the level of nesting

"elements" are the individual values in the tensor

I think tensor$^1$ (with the 1 superscript) specifically means a vector of scalars, and higher tensors have tensors as elements

**All tensors$^m$ within a tensor$^{m+1}$ must have the same number of elements**

A scalar is a tensor$^0$

`9` is a tensor$^0$

`[9, 9, 9]` is a tensor$^1$

`[[9, 9, 9], [9, 9, 9]]` is a tensor$^2$

#### 2.25: define a function that finds the rank of a tensor

```
window.scalarp = v => typeof v === "number";
window.rank = t => (scalarp(t) ? 0 : 1 + rank(t[0]));
return log(output => {
  output(rank(4), 0);
  output(rank([4, 1]), 1);
  output(rank([[4, 1], [3, 6]]), 2);
  output(rank([[[[[3]]]]]), 5);
});
```

#### 2.37 define a function that finds the shape of a tensor *t*

```
window.shape = (t, acc = []) => {
  if (!scalarp(t)) {
    acc.push(t.length);
    shape(t[0], acc);
  }
  return acc;
};
return log(output => {
  output(shape(5), []);
  output(shape([[4, 1], [3, 6], [2, 9]]), [3, 2]);
  output(shape([[[[[3]]]]]), [1, 1, 1, 1, 1]);
});
```

**Lists have members while non-scalar tensors have elements**

#### 2.42: how are `rank` and `shape` related?

```
// `rank` could also be defined as the length of `shape`
window.rank2 = t => shape(t).length;
return log(output => {
  output(rank2(4), 0);
  output(rank2([4, 1]), 1);
  output(rank2([[4, 1], [3, 6]]), 2);
  output(rank2([[[[[3]]]]]), 5);
});
```

## Interlude I

#### I.8: addition of tensors

```
const tadd = (t1, t2) =>
  t1.map((v, idx) => (scalarp(v) ? v + t2[idx] : tadd(v, t2[idx])));
return log(output => {
  output(tadd([1, 2, 3], [4, 5, 6]), [5, 7, 9]);
  output(tadd([[1, 2], [1, 2]], [[3, 4], [3, 4]]), [[4, 6], [4, 6]]);
  output(tadd([[[[[5]]]]], [[[[[1]]]]]), [[[[[6]]]]]);
});
```

Currently, tensors must be of the same shape to add them with `+`.

Making `+` work on tensors of arbitrary rank is called an *extension* of `+`. Functions built using extensions are called *extended functions*.

Note: extension applies both to making an operator work on a single argument of arbitrary rank, and to making it work with two arguments that may have different ranks. A non-extended operator only works with one specific rank
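As an illustration of extension (my own sketch, not the notebook's hidden operators): we can compare the ranks of the two arguments and descend into whichever one is higher, so the operator works on arbitrary and even mismatched ranks.

```javascript
// Extend a scalar operation to tensors of arbitrary (and possibly
// different) ranks by descending into the higher-ranked argument.
const rank = t => (Array.isArray(t) ? 1 + rank(t[0]) : 0);

const extend2 = scalarOp => {
  const op = (t1, t2) => {
    const r1 = rank(t1);
    const r2 = rank(t2);
    if (r1 === 0 && r2 === 0) return scalarOp(t1, t2); // both scalars
    if (r1 < r2) return t2.map(e => op(t1, e));        // descend into t2
    if (r1 > r2) return t1.map(e => op(e, t2));        // descend into t1
    return t1.map((e, i) => op(e, t2[i]));             // equal ranks: walk in lockstep
  };
  return op;
};

const tadd2 = extend2((a, b) => a + b);
```

Because it descends into whichever argument has the higher rank, `tadd2(1, [[1, 2], [3, 4]])` and `tadd2([[1, 2], [3, 4]], 1)` both produce `[[2, 3], [4, 5]]`.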

#### I.14 extended addition

Let's implement `tadd_` as an extended addition operator. *Note: click view source to see the implementation of these operators.*

We can do this with all mathematical operations: `*`, `sqrt`, etc.

It looks like the order shouldn't matter. With our last definition, we are assuming `t2` has the higher rank, but we can make it work so that the higher rank can be in either position.

Not all math operations descend; for example sum$^1$. sum$^1$ has a superscript to make it clear it always expects a tensor$^1$. Let's implement `tsum`:

sum is the extended version of sum$^1$, which descends until it finds a tensor$^1$

**Rule of sum: given a tensor $t$ with rank $r$, the rank of $sum(t)$ is $r - 1$**
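Since the notebook implementation is hidden, here's a minimal sketch of what a descending `sum` could look like (my own version; it assumes well-formed nested arrays):

```javascript
// Descend until we reach a tensor¹ (an array of numbers), then sum its
// elements. The rank drops by exactly one, per the rule of sum.
const scalarp = v => typeof v === "number";
const tsum = t =>
  scalarp(t[0])
    ? t.reduce((acc, v) => acc + v, 0) // tensor¹ → scalar
    : t.map(tsum);                     // higher ranks: descend
```

`tsum([[1, 2], [3, 4]])` gives `[3, 7]`: a tensor$^2$ went in, a tensor$^1$ came out.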

So far, we've implemented `tadd_` and `tsum_`. **These operators will be used for the rest of the page**. We will need more operators, so let's go ahead and implement `tsub_`, `tmul_`, and `tsqr_` as well (click view source for their implementations).

## Chapter 3

Fitting: a well-fitted θ is one that finds the best fit for a given data set

Let's take this line equation again:

`window.line = x => (w, b) => tadd_(tmul_(w, x), b)`

We start with `w` (or θ$^0$) and `b` (or θ$^1$) both being `0.0`

```
return log(output => {
  output(line([2, 1, 4, 3])(0, 0));
});
```

The **loss** is how far away our parameterized line is from the data. The best fit would have a `loss` as close to 0 as possible

How do you calculate loss?

First step: `line-ys - predicted-line-ys`, where `predicted-line-ys` is `line(line-xs)(Θ_1, Θ_2)`. So it becomes:

`line-ys - line(line-xs)(Θ_1, Θ_2)`

That produces a tensor$^1$ though; how do we get a scalar?

We sum it together, but also need to square it to get rid of negative values

`sum(sqr(ys - pred-ys))`

We can create an `l2loss` function using the operators we've defined above (`tsum_` etc.) to implement this:

```
window.l2loss = target => (xs, ys) => {
  return (...params) => {
    const pred_ys = target(xs)(params[0], params[1]);
    return tsum_(tsqr_(tsub_(ys, pred_ys)));
  };
};
return log(output => {
  const loss1 = l2loss(line)([2, 1, 4, 3], [1.8, 1.2, 4.2, 3.3])(0, 0);
  const loss2 = l2loss(line)([2, 1, 4, 3], [1.8, 1.2, 4.2, 3.3])(0.0099, 0);
  output(loss1);
  output(loss2);
  output("difference", loss2 - loss1);
});
```

θ is the *parameter set*; it has nothing to do with the `line` function. It's just a list or vector with the right number of parameters needed

Well, it does kind of have to do with `line`: parameters are the inputs needed for a kind of "transformation" function, for example `w*x + b`. You have an input value, the transformation function, and the parameters for the transformation. At least that's how I'm thinking about it

An *expectant* function is the `(xs, ys) => ...` piece: it's a function that expects a data set as arguments

An *objective* function is the function which takes a parameter set and returns a scalar representing the *loss*

In the above code we output two losses: the first with θ$^0$ set to `0` and the second with θ$^0$ set to `0.0099`. We increase it by a small number to figure out a *rate of change* (which is relative to this arbitrary small amount)

The rate of change here is $-0.62 / 0.0099 = -62.63$

This helps "seed" our rate of learning because otherwise we'd have no idea how fast changing the parameters changes the loss

Given the rate of change derived from this arbitrary small value, we use *another* small value and multiply it by this rate of change

This is called the **learning rate** and is represented by ⍺. Let's use `0.01`

It's basically a "step size"

$⍺ * -62.63 = -0.6263$

The rate of change depends on θ$^0$. Now that we are using a new value for it, we need to get the new rate of change by using the `0.0099` constant again. The rate of change here is `-25.12`

It's not great to derive the rate of change this way though
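As a self-contained sketch, the arithmetic above looks like this (plain numbers only, no tensor operators; the `0.0099` nudge is the text's arbitrary small value):

```javascript
// Estimate the rate of change of the loss with respect to θ⁰ (the weight)
// by nudging it a small amount and comparing the two losses.
const xs = [2, 1, 4, 3];
const ys = [1.8, 1.2, 4.2, 3.3];
const loss = (w, b) =>
  xs.reduce((acc, x, i) => acc + Math.pow(ys[i] - (w * x + b), 2), 0);

const nudge = 0.0099;
const rate = (loss(nudge, 0) - loss(0, 0)) / nudge;
// loss(0, 0) is 33.21 and the rate comes out near -62.7; the -62.63 above
// comes from rounding the loss difference to -0.62 before dividing
```

One revision step would then be θ$^0$ − ⍺ · rate, with ⍺ = `0.01`.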

## Chapter 4

Graph of the loss where the x-axis is θ$^0$

```
return graph(x => l2loss(line)([2, 1, 4, 3], [1.8, 1.2, 4.2, 3.3])(x, 0), {
  domainX: [-1, 5]
});
```

We also refer to the x-axis as **weight**

Rate of change is important. It's called **gradient** and we can define a function to find it for any function

A gradient function takes a function and a parameter set, and returns a list of gradients for each parameter in the parameter set

This is also known as ∇ (del)
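A naive numerical version of such a gradient function could look like this (a finite-difference sketch of my own with a hypothetical name; not how ∇ is actually computed):

```javascript
// Approximate the gradient of an objective function by nudging each
// parameter in turn and measuring how much the output changes.
const gradient_of_numeric = (f, params, h = 1e-6) =>
  params.map((p, i) => {
    const bumped = params.slice();
    bumped[i] = p + h; // nudge only this parameter
    return (f(...bumped) - f(...params)) / h;
  });
```

For `(w, b) => w * w + 3 * b` at `[2, 5]` this returns approximately `[4, 3]`, matching the derivatives `2w` and `3`.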

## My own break: automatic differentiation

It turns out the book never defines ∇, which makes it difficult to continue to build running examples. It casually describes it as a function that gets the gradient (or rate of change) of a function at a specific value. I was annoyed that it never gave a definition.

After doing some research, it turns out that ∇ is essentially the derivative of a one-dimensional function. Defining this isn't simple, so the book assumes it is already provided. So far the book has done a great job building things up from scratch, so this was a confusing turning point: it should have done a better job explaining exactly what ∇ is and why it's not providing a definition, or at least pointed the reader to where they can find more information about it.

To define ∇ we need to implement **automatic differentiation**. This is a way of differentiating an expression automatically: we build up a graph of operators and apply the rules of differentiation over that graph to compute the derivative.

I'm using this Observable notebook as inspiration, which in turn is inspired by micrograd from Andrej Karpathy.

This uses a neat technique called backwards differentiation, which is possible because of a few constraints: we don't need a *full* derivative, just the derivative with respect to a single variable, and we never need the derivative of a derivative.

The basic idea is we want to know, for a given expression that uses a variable *x*, how much does changing *x* affect the result? We can figure this out with a few steps:

1. Construct a graph of operations, eagerly computing the final value along the way (the forward pass)
2. Given the final value, perform a backwards pass to compute the amount that each piece of the expression contributed to the final result. Each operation implements its own backwards calculation according to various mathematical rules
3. The value for the single variable `x` in the expression given by the backwards pass is the gradient, or rate of change, for that variable

First we define a `Value` class that represents a single value in an equation, and then we define a couple of operations like `vadd`, `vmul`, `vpow`, and more to operate on these values. (The `v` prefix means we're operating on values.)
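Since the implementation lives behind "view source", here's a minimal sketch of what such a `Value` class and a few of its operations could look like (micrograd-style; the names `Value`, `vadd`, `vmul`, and `vpow` match the text, but the details are my assumption about the hidden code):

```javascript
// A minimal reverse-mode autodiff sketch, micrograd-style.
class Value {
  constructor(data, children = []) {
    this.data = data;          // computed eagerly on the forward pass
    this.grad = 0;             // filled in by the backward pass
    this.children = children;  // the Values this one was computed from
    this.backwardFn = () => {};
  }

  backward() {
    // Order the graph children-first, then propagate gradients from
    // this node (the root) back toward the inputs.
    const topo = [];
    const seen = new Set();
    const build = v => {
      if (seen.has(v)) return;
      seen.add(v);
      v.children.forEach(build);
      topo.push(v);
    };
    build(this);
    this.grad = 1;
    topo.reverse().forEach(v => v.backwardFn());
  }
}

const vadd = (a, b) => {
  const out = new Value(a.data + b.data, [a, b]);
  out.backwardFn = () => {
    // d(a + b)/da = 1 and d(a + b)/db = 1
    a.grad += out.grad;
    b.grad += out.grad;
  };
  return out;
};

const vmul = (a, b) => {
  const out = new Value(a.data * b.data, [a, b]);
  out.backwardFn = () => {
    // d(a * b)/da = b and d(a * b)/db = a
    a.grad += b.data * out.grad;
    b.grad += a.data * out.grad;
  };
  return out;
};

const vpow = (a, n) => {
  // n is a plain number, matching usage like vpow(..., 2)
  const out = new Value(Math.pow(a.data, n), [a]);
  out.backwardFn = () => {
    // d(a^n)/da = n * a^(n-1)
    a.grad += n * Math.pow(a.data, n - 1) * out.grad;
  };
  return out;
};
```

With this, $(x + -4)^2 + 10$ at $x = 6$ evaluates to $14$, and after `backward()` the gradient at `x` is $2(6 - 4) = 4$.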

Using it looks like this. This is how we represent the equation $(x + -4)^2 + 10$:

```
const variable = new Value(x);
vadd(vpow(vadd(variable, new Value(-4)), 2), new Value(10));
```

Notice how there isn't really any difference between the `x` value and the other constants here; they all operate as a `Value`. This simplifies our work and makes it more efficient, but it's important to remember they actually are different: `x` here is an actual variable and needs to be the single input into the equation, and if that's the case, reading the `x.grad` value makes sense.

The other numbers like `-4` and `10` are *constants*, and it wouldn't make any sense to read their `grad` value. It took me a while to understand this: if you just did `new Value(4)` to represent `y=4`, which would simply render a horizontal line, and then ran the backwards pass and got the gradient, the value would be 1! The issue is we are assuming the value is *the input variable*, and if it were, 1 would be the correct slope because we'd actually be working with the `y=x` equation.

Let's visualize this. This graph shows the equation $(x - 4)^2 + 10$ and we apply automatic differentiation at each whole number to visualize the gradient at that point.

```
const gradients = range(-4, 12).map(x => {
  const target = new Value(x);
  const top = vadd(vpow(vadd(target, new Value(-4)), 2), new Value(10));
  top.backward();
  return v => (v - target.data) * target.grad + top.data;
});
return graph(...gradients, x => Math.pow(x - 4, 2) + 10, {
  domainX: [-5, 15],
  colors: [...gradients.map(_ => "#a0a0a0"), "#e60049"]
});
```

Feels really good to understand this and be able to work with a real system written from scratch. I'll be able to continue to run examples from the book!

However, first we need to extend our tensor functions to support our new automatic differentiation system. Remember when we defined `tmul`, `tadd`, etc.? That feels very similar to our new operations above, right? We can combine both into new tensor operations that support both properties: extended math functions that work with tensors and also support automatic differentiation.

That works!

## Chapter 4 (continued)

Revisions

#### 4.23: define a `revise` function which helps revise parameters given a function

```
window.revise = function revise(func, revs, params) {
  while (revs > 0) {
    params = func(params);
    revs--;
  }
  return params;
};
return log(output => {
  output(revise(params => params.map(p => p - 3), 5, [1, 2, 3]), [
    -14,
    -13,
    -12
  ]);
});
```

Now let's use this `revise` function to help adjust parameters to reduce loss

```
const learning_rate = 0.01;
const obj = l2loss(line)(tdual_([2, 1, 4, 3]), tdual_([1.8, 1.2, 4.2, 3.3]));
window.lineParams = revise(
  params => {
    const gs = gradient_of(obj, params);
    return [
      params[0] - gs[0] * learning_rate,
      params[1] - gs[1] * learning_rate
    ];
  },
  1000,
  [0, 0]
);
return log(output => output(lineParams));
```

We found our line! Let's plot it against the xs, ys dataset and see how it fits:

Hell yeah!

This optimization is called **gradient descent**. We used a learning rate of `0.01` with our loss function and revised 1000 times to find the parameter set that fits the dataset best

So far we've hardcoded the number of parameters in our `revise` usage; let's generalize that. We can also generalize the other constants like the learning rate and objective function into a general `gradient_descent` function:

```
window.gradient_descent = function gradient_descent(obj, initialp, α, revs) {
  return revise(
    params => {
      const gs = gradient_of(obj, params);
      return gs.map((g, idx) => params[idx] - g * α);
    },
    revs,
    initialp
  );
};
```

```
return log(output => {
  const obj = l2loss(line)(tdual_([2, 1, 4, 3]), tdual_([1.8, 1.2, 4.2, 3.3]));
  output(gradient_descent(obj, [0, 0], 0.01, 1000));
});
```

Θ is the symbol used for the revised parameters passed to the revision function. (The capitalized version of θ)

## Interlude II

Θ$^2$ this is a good