Epoch 3: Micrograd, Part 3
Intro
Hi, I’m Mon. This is Epoch 3, Micrograd Part 3. In the previous epoch we learned how to create a neuron manually and introduced tensors and PyTorch; now we are going to learn how to build a simple neural network (nn) out of layers of neurons.
Let’s train manually with a neuron!
Let me go through a neuron first, an overview of its mechanism with a practical example:
A neuron computes
z = x1*w1 + x2*w2 + b
and we activate it with tanh (or another activation function):
y_hat = tanh(z)
Also for the Loss, let’s use squared error:
L = (y_hat - y_true)^2
y_hat is the prediction, the output that comes from our model; it changes every run as the weights and bias become slightly better trained, getting closer to y_true: the target output from our dataset, predefined, and our goal to reach!
Step 0
Initialize inputs and random weights & bias
w1: 0.0259 (learnable)
w2: 0.1967 (learnable)
b: 0.5 (learnable)
x1: 0.5
x2: 0.7
y_true = 1.0
lr = 0.1
Step 1: First Forward pass
Calculate y_hat with the inputs and the current weights and bias
z = (0.5 * 0.0259) + (0.7 * 0.1967) + 0.5
z = 0.65064
y_hat = tanh(0.65064) ≈ 0.572101
Note: tanh(x) squashes any number into the range -1 < y < 1
Step 2: Compute Loss
L = (y_hat - y_true)^2 ≈ 0.1831
Every iteration we check that this loss is going lower.
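The forward pass and loss above can be verified with a few lines of plain Python, using the numbers from Step 0:

```python
import math

# Step 0: inputs, weights, bias (values from the example above)
w1, w2, b = 0.0259, 0.1967, 0.5
x1, x2 = 0.5, 0.7
y_true = 1.0

# Step 1: forward pass
z = x1 * w1 + x2 * w2 + b   # 0.65064
y_hat = math.tanh(z)        # ~0.5721

# Step 2: loss (squared error)
L = (y_hat - y_true) ** 2   # ~0.1831
print(z, y_hat, L)
```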
Now we need to go backward: calculate the gradients of the loss (L) with respect to the parameters, then use them to update the weights and bias, then repeat the process of going forward, backward, and updating.
Step 3: Backward Pass: compute gradients
To compute the gradients of the loss (L) with respect to the parameters, we begin by calculating the gradient of L w.r.t. its direct parent (y_hat),
then the gradient of L w.r.t. z (the input of tanh),
then keep going backward, computing the gradient of L w.r.t. each parent by multiplying ∂L/∂child * ∂child/∂parent (the Chain Rule),
and continue the process until we reach the inputs.
3.1: Gradient L w.r.t y_hat
Each operation determines which formula to use for its local derivative; refer to Epoch 1.
∂L/∂y_hat = 2 * (y_hat - y_true)
∂L/∂y_hat ≈ -0.855798
Note: since y_true is a constant (not a learnable parameter), there is no need to calculate its gradient.
3.2: Gradient y_hat w.r.t z and L w.r.t z
∂y_hat/∂z = 1 - y_hat^2
∂y_hat/∂z ≈ 0.6727
From the Chain Rule, the gradient of L w.r.t z is:
∂L/∂z = ∂L/∂y_hat * ∂y_hat/∂z
∂L/∂z ≈ -0.5757
3.3: Gradient L w.r.t w1 and w2 and b
∂L/∂w1 = ∂L/∂z * x1 ≈ -0.288
∂L/∂w2 = ∂L/∂z * x2 ≈ -0.403
∂L/∂b = ∂L/∂z * 1 ≈ -0.5757
3.4: Update parameters (gradient descent) $$ \theta_{n+1} = \theta_n - \eta \cdot \frac{\partial L}{\partial \theta} $$
w1 = w1 - lr * ∂L/∂w1 ≈ 0.0259 - 0.1 * -0.288 ≈ 0.0547
w2 = w2 - lr * ∂L/∂w2 ≈ 0.1967 - 0.1 * -0.403 ≈ 0.237
b = b - lr * ∂L/∂b ≈ 0.5 - 0.1 * -0.5757 ≈ 0.55757
As you see, all numbers changed slightly! That’s learning!
w1: 0.0259 -> 0.0547
w2: 0.1967 -> 0.237
b: 0.5 -> 0.55757
Now if we do the forward pass again with the new weights and bias:
L = (tanh(0.5 * 0.0547 + 0.7 * 0.237 + 0.55757) - 1) ^ 2
≈ 0.13276
0.1831 -> 0.13276
So the loss also decreased, and is heading toward 0.
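The whole training step above (forward, backward, update, forward again) can be checked in plain Python; this is a sketch using the derivative formulas from steps 3.1–3.4:

```python
import math

w1, w2, b = 0.0259, 0.1967, 0.5
x1, x2, y_true, lr = 0.5, 0.7, 1.0, 0.1

def forward(w1, w2, b):
    return math.tanh(x1 * w1 + x2 * w2 + b)

# forward pass and loss
y_hat = forward(w1, w2, b)
old_loss = (y_hat - y_true) ** 2   # ~0.1831

# backward pass (chain rule, as in steps 3.1-3.3)
dL_dyhat = 2 * (y_hat - y_true)    # ~-0.8558
dyhat_dz = 1 - y_hat ** 2          # ~0.6727
dL_dz = dL_dyhat * dyhat_dz        # ~-0.5757
dL_dw1, dL_dw2, dL_db = dL_dz * x1, dL_dz * x2, dL_dz

# update parameters (gradient descent)
w1 -= lr * dL_dw1                  # ~0.0547
w2 -= lr * dL_dw2                  # ~0.237
b -= lr * dL_db                    # ~0.55757

# forward pass again: the loss decreased
new_loss = (forward(w1, w2, b) - y_true) ** 2   # ~0.1327
print(old_loss, "->", new_loss)
```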
Step 4: Repeat
Content: nn
Now let’s implement a neural network with layers in micrograd.
We learned how a neuron actually works, and we were able to build one from scratch in micrograd; now we want to build multiple layers of neurons in micrograd.
A neural network is made of an input layer, hidden layers, an output layer, and the wiring between them:
Input layer: this layer contains the input values; they are not computational neurons.
Hidden layers: these are the computation layers, located between the input and output layers.
In each hidden layer, neurons follow the same formula
z = x1*w1 + x2*w2 + b
Note: within one hidden layer, neurons receive the same inputs but each has its own weights;
in hidden layer 2, neurons receive different inputs than layer 1, because the inputs are now the outputs of layer 1.
e.g. hidden layer 1:
z1 = x1*w1 + x2*w2 + b1
hidden layer 2 (its inputs are layer 1 outputs y1 and y2):
z2 = y1*w3 + y2*w4 + b2
Wiring: describes the connections between layers. There are different wiring architectures, such as:
MLP: fully connected neurons between layers.
CNN: not fully connected; uses local connections (kernels).
RNN: has loops, has memory.
Transformer: uses attention.
Let’s build the nn manually with micrograd:
Firstly we begin with
Content: Neuron
class Neuron:
    def __init__(self, nin):
        self.w = [Value(random.uniform(-1,1)) for _ in range(nin)]
        self.b = Value(random.uniform(-1,1))

    def __call__(self, x):
        # w * x + b
        act = sum((wi*xi for wi, xi in zip(self.w, x)), self.b)
        out = act.tanh()
        return out

    def parameters(self):
        return self.w + [self.b]
nin: the number of inputs each neuron receives
self.w: creates nin random weight Values between -1 and 1
self.b: creates 1 random Value for the bias
__init__ of Neuron runs once to initialize the neuron
parameters returns a list of the trainable Values in this neuron
__call__ of Neuron is the forward pass
zip pairs up the weights with the inputs in order
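The weighted-sum line in __call__ can be tried in isolation with plain floats instead of Value objects; the numbers here are hypothetical, only the zip-and-sum pattern matters:

```python
w = [0.1, -0.2, 0.3]   # three weights (nin = 3)
x = [1.0, 2.0, 3.0]    # three inputs
b = 0.5                # bias

# zip pairs each weight with its input; sum starts from b,
# so this computes w1*x1 + w2*x2 + w3*x3 + b
act = sum((wi * xi for wi, xi in zip(w, x)), b)
print(act)  # approximately 1.1
```

Note that the generator expression must be wrapped in its own parentheses, because sum takes the start value (here b) as a second argument.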
Content: Layer
class Layer:
    def __init__(self, nin, nout):
        self.neurons = [Neuron(nin) for _ in range(nout)]

    def __call__(self, x):
        outs = [n(x) for n in self.neurons]
        return outs[0] if len(outs) == 1 else outs

    def parameters(self):
        return [p for neuron in self.neurons for p in neuron.parameters()]
nin: number of inputs each neuron receives
nout: number of neurons in this layer
[Neuron(nin) for _ in range(nout)]
Creates a list of neurons
e.g. nin = 3, nout = 4
self.neurons = [
Neuron(3),
Neuron(3),
Neuron(3),
Neuron(3),
]
Up to here, in our example we have initialized our layer with 4 neurons (by calling Layer), each neuron accepting 3 inputs (Layer calls Neuron).
call of Layer
[n(x) for n in self.neurons]
From the list of neurons we made, we give the same input x to each neuron.
Following our previous example of nin = 3, nout = 4,
imagine we pass x = [x1, x2, x3].
So at this point we have:
[tanh(w11x1 + w12x2 + w13x3 + b1),
tanh(w21x1 + w22x2 + w23x3 + b2),
tanh(w31x1 + w32x2 + w33x3 + b3),
tanh(w41x1 + w42x2 + w43x3 + b4)]
return outs[0] if len(outs) == 1 else outs
This just handles the case of a single neuron: it returns the single output Value instead of a one-element list.
[p for neuron in self.neurons for p in neuron.parameters()]
This is a nested comprehension:
it loops over all neurons in the layer, and for each neuron it extracts the parameters (w and b).
As a result it creates a flat list of all parameters, in order
e.g
[
w11, w12, w13, b1,
w21, w22, w23, b2,
w31, w32, w33, b3,
w41, w42, w43, b4
]
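The flattening comprehension can be seen with a toy list of “neurons”; here each neuron is just a plain list of parameter names, purely illustrative:

```python
# each inner list stands for one neuron's parameters() result
neurons = [
    ["w11", "w12", "w13", "b1"],
    ["w21", "w22", "w23", "b2"],
]

# same nested-comprehension shape as Layer.parameters()
params = [p for neuron in neurons for p in neuron]
print(params)  # ['w11', 'w12', 'w13', 'b1', 'w21', 'w22', 'w23', 'b2']
```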
Content: Wires (MLP)
class MLP:
    def __init__(self, nin, nouts):
        sz = [nin] + nouts
        self.layers = [Layer(sz[i], sz[i+1]) for i in range(len(nouts))]

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]
nin: number of inputs to the first layer
nouts: number of neurons in each layer, e.g. [4, 4, 1]
sz = [nin] + nouts
This sizes list gives us the number of inputs and outputs of each layer
e.g.
nin = 3
nouts = [4, 4, 1]
sz = [3, 4, 4, 1]
self.layers = [Layer(sz[i], sz[i+1]) for i in range(len(nouts))]
This creates a list of layers, where each layer knows its number of inputs and its number of neurons (= outputs).
That means:
Layer 0: inputs = 3, outputs = 4 also Layer(3, 4)
Layer 1: inputs = 4, outputs = 4 also Layer(4, 4)
Layer 2: inputs = 4, outputs = 1 also Layer(4, 1)
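The sz pairing can be checked on its own; this sketch prints the (inputs, outputs) pair each Layer would receive:

```python
nin = 3
nouts = [4, 4, 1]

sz = [nin] + nouts  # [3, 4, 4, 1]
# adjacent pairs of sizes become the (nin, nout) of each Layer
shapes = [(sz[i], sz[i + 1]) for i in range(len(nouts))]
print(shapes)  # [(3, 4), (4, 4), (4, 1)]
```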
note: Layer(3, 4) will be expanded to
Neuron(3),
Neuron(3),
Neuron(3),
Neuron(3),
def __call__(self, x):
    for layer in self.layers:
        x = layer(x)
    return x
Now it’s time to give x to the nn. Because of our MLP architecture, we loop over all layers: the first layer gets x (one or more inputs), then it calculates the outputs y, which, if you remember from Neuron, are computed as:
act = sum((wi*xi for wi, xi in zip(self.w, x)), self.b)
y = act.tanh()
Because of x = layer(x), our vector of outputs y (y1 from neuron 1, y2 from neuron 2, …) becomes the x of the next layer.
Important note
Because we use the vector of all y values from layer n as the input x for layer n + 1,
Each neuron output in layer n can influence all the outputs of layer n + k.
Example:
MLP: nin = 3, nouts = [4,4,1]
Initial input:
x = [x1, x2, x3]
Layer 0: Layer(3,4)
Input: [x1, x2, x3]
Neurons N1..N4 produce [y1, y2, y3, y4]
x = [y1, y2, y3, y4]
Layer 1: Layer(4,4)
Input: [y1, y2, y3, y4]
Neurons N5..N8 produce [y5, y6, y7, y8]
x = [y5, y6, y7, y8]
Layer 2: Layer(4,1)
Input: [y5, y6, y7, y8]
Neuron N9 produces single y9
x = y9
Final return → y9 (Value object)
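The x = layer(x) chaining above can be sketched with plain floats, a toy stand-in for the micrograd classes (random weights, math.tanh, all names hypothetical):

```python
import math
import random

random.seed(0)

def make_layer(nin, nout):
    # one (weights, bias) pair per neuron in the layer
    return [([random.uniform(-1, 1) for _ in range(nin)],
             random.uniform(-1, 1)) for _ in range(nout)]

def call_layer(layer, x):
    # each neuron: tanh(w . x + b), same inputs x, own weights
    return [math.tanh(sum(wi * xi for wi, xi in zip(w, x)) + b)
            for w, b in layer]

# MLP(3, [4, 4, 1]): layer sizes 3 -> 4 -> 4 -> 1
layers = [make_layer(3, 4), make_layer(4, 4), make_layer(4, 1)]

x = [1.0, 2.0, 3.0]
for layer in layers:
    x = call_layer(layer, x)   # outputs of this layer feed the next
    print(len(x))              # 4, then 4, then 1

y9 = x[0]  # final scalar prediction, inside (-1, 1) because of tanh
```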
Excellent! The tanh non-linearity was already applied inside each neuron, so the final output y9 is our prediction y_hat.
Now we calculate Loss (L):
L = (y_hat - y_true)^2
Then update the parameters (gradient descent)
with:
$$
\theta_{n+1} = \theta_n - \eta \cdot \frac{\partial L}{\partial \theta}
$$
Let’s implement this last step in our micrograd too.
We want a training loop that runs the forward pass to get y_hat, calculates the loss, calls backward() to compute the gradients, updates the parameters (trains), finds the new loss, and repeats these steps in a loop k times.
I call it “tuning a big radio with many knobs”: the better result comes from being more precise (a smaller learning rate with more steps).
xs = [
    [3.0, 3.0, 2.0],
    [2.5, 3.0, 1.0],
    [-2.0, -1.5, -1.0],
    [1.0, -3.0, 0.5],
]
ys = [1.0, -1.0, -1.0, 1.0]
n = MLP(3, [4, 4, 1])

for k in range(10):  #1
    ypred = [n(x) for x in xs]  #2
    loss = sum((y_hat - y_true)**2 for y_true, y_hat in zip(ys, ypred))  #3
    for p in n.parameters():  #5
        p.grad = 0.0  #6
    loss.backward()  #7
    for p in n.parameters():
        p.data += -0.05 * p.grad  #10
    print(k, loss.data)  #12

draw_dot(loss)  #24
Line 1: this training runs 10 loops (10 rounds of going forward, backward, and updating)
Line 2: calls n on each x in the xs list, where n is MLP(3, [4,4,1])
Line 3: calculates the sum of the squared errors over all samples
Line 5: Gather all the parameters from the MLP (all layers, all neurons)
Line 6: at each step (k) we must reset the gradients to zero, so they don’t accumulate across steps
Line 7: backward calculates all the gradients.
Note: we can call backward because loss is a Value object (refer to Epoch 1): the forward pass stored ._prev and .data,
and the backward pass stores .grad
Line 10: after each backward pass, it updates the parameters with:
$$ \theta_{n+1} = \theta_n - \eta \cdot \frac{\partial L}{\partial \theta} $$
Line 12:
0 2.094403631655075
1 2.0751115878578568
2 2.058205309464937
3 2.0417761466655073
4 2.024357050402771
5 2.0042442666008617
6 1.9787204164593886
7 1.942384309638119
8 1.8821567826349068
9 1.7582003851194667
Line 24: draws the computation graph of the final loss
You can access the code here:
https://github.com/auroramonet/memo/blob/main/codes/micrograd_3.ipynb
Resources
Micrograd - A Tiny Autograd Engine
A tiny scalar-valued autograd engine and a neural net library on top of it with PyTorch-like API, created by Andrej Karpathy.