Epoch1: Micrograd, Part 1

Intro

Hi! I’m Mon. I’ve decided to study AI, ML… almost every day with doing practical projects.

I write about them here.

This is Epoch 1: the very first step.

Content: Gradient

Building a neural network; a tiny autograd engine (micrograd) — an automatic differentiation engine with backpropagation.

Derivative (single input change → single output change): the rate of change.

Example: if $f(x)=x^2$, and $x$ changes by a fixed amount, say 1:

if $x=3$, then $f(x)=9$

if $x=4$, then $f(x)=16$

Derivative:

$\frac{16-9}{4-3} = \frac{7}{1.00} = 7.00$

Gradient: how a single output changes by adjusting multiple inputs.

Example: if $f(x,y) = x^2 + y^2$ at point (1,2):

Partial derivative w.r.t. x: $\frac{\partial f}{\partial x} = 2x = 2(1) = 2$

Partial derivative w.r.t. y: $\frac{\partial f}{\partial y} = 2y = 2(2) = 4$

Gradient vector: $\nabla f = \langle 2, 4 \rangle$

This points in the direction of steepest ascent from (1,2)

Extended description: We have a function with two inputs $f(x, y)$ (say it as “f of x comma y”). $f(x, y) = x^2 + y^2$ (say it as “x squared plus y squared”).

Partial derivative of f with respect to x: $\frac{\partial f}{\partial x}$ means: How fast does f change if I move x a tiny bit, while y stays frozen?

In this example the derivative of $x^2 \rightarrow 2x$, derivative of $y^2 \rightarrow 0$ (because y is treated as fixed).

How we calculate derivative: $f’(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$ (say it as: “The derivative of f at x equals the limit as h goes to zero of (f of x plus h minus f of x) divided by h”).

Derivatives are also written as $f’(x)$ or $\frac{df}{dx}$.

So we got $\frac{\partial f}{\partial x} = 2x$ (say it as “partial derivative of f with respect to x”).

E.g. $\frac{\partial f}{\partial x}$ at (1, 2) = 2 × 1 = 2. “At the point one comma two, the partial derivative with respect to x equals two” means: If I move a tiny bit in the x direction, the height goes up by about 2 per unit.

From the same function: $f(x, y) = x^2 + y^2$.

Derivative of $y^2 \rightarrow 2y$, derivative of $x^2 \rightarrow 0$ (x is frozen now).

So: $\frac{\partial f}{\partial y} = 2y$ (say it as “partial derivative of f with respect to y”).

E.g. $\frac{\partial f}{\partial y}$ at (1, 2) = 2 × 2 = 4. “At the point one comma two, the partial derivative with respect to y equals four”.

Gradient Vector: Putting all partial derivatives together into a vector. For our function $f(x, y) = x^2 + y^2$, the gradient is $\nabla f = \langle \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \rangle = \langle 2x, 2y \rangle$.

At point (1, 2): $\nabla f = \langle 2(1), 2(2) \rangle = \langle 2, 4 \rangle$

This vector points in the direction of steepest ascent. The gradient vector points exactly in the direction that maximizes the slope. That slope is the fastest rate of increase.

In neural networks, we want to go in the opposite direction (steepest descent) to minimize the loss function.

Weight: input.

Loss: output.

Backward: going from the final result back to the inputs to figure out how each one contributed.

Backpropagation: An algorithm for computing gradients by repeatedly applying Backward with respect to all weights, in order to find directions where the Loss decreases.

Forward + Backward + Update loop = train (update parameter and weight).

Content: Micrograd

  1. Let’s begin to build a Micrograd from scratch!
    Basically Micrograd is a tiny neural network library that computes gradients efficiently through a chain of operations to minimize the loss, to train.

  2. These are the concepts:

2.1 Functions of numbers: Micrograd treats operations like addition, multiplication, powers as functions it can track

2.2 Gradient vector: How to change each input to increase or decrease the output

2.3 Direction of steepest ascent / descent: Gradient points uphill
For training, we usually go opposite direction (downhill) to minimize loss

2.4 Backpropagation: A way to compute gradients efficiently through a chain of operations




Let’s go!

Make sure you have Python 3 in your terminal:

python3 --version

Create a workspace (virtual env):

mkdir micrograd_tutorial
cd micrograd_tutorial
python3 -m venv venv
source venv/bin/activate
pip install jupyter
jupyter notebook

It opens a browser tab, File > Create a new notebook:

We are going to implement:

Let’s begin with Gradient

1. First we create an object called “Value”

class Value:
    def __init__(self, data):
        self.data = data    # the number
        self.grad = 0.0     # gradient

For now it holds a number.

x = Value(3.0)
y = Value(4.0)
print(x.data, x.grad)  # 3.0 0.0

2. Track history for backprop

Every time you perform an operation, you want to remember how it was computed.

class Value:
    def __init__(self, data, _children=(), _op='', name=''):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)  # parents
        self._op = _op               # operation
        self.name = name
        self._backward = lambda: None  # will define later

    def __repr__(self):
        return f"Value(data={self.data}, grad={self.grad})"

# for better readability, if not using this, we can see the memory locations



# example
a = Value(3.0)
b = Value(4.0)
c = Value(a.data + b.data, (a, b), '+')

print(a.data, a.grad)
print(c.data, c._prev)

print("Parent values:")
for parent in c._prev:
    print(parent.data)


# 3.0 0.0
# 7.0 {Value(data=3.0, grad=0.0), Value(data=4.0, grad=0.0)}
# Parent values:
# 3.0
# 4.0

From this we can make graph of 3 things: result, parents, operations

let’s make it less manual work, by defining operations:

addition

def __add__(self, other):
    # ensure other is a Value so constants become part of the computation graph
    other = other if isinstance(other, Value) else Value(other)
    out = Value(
        self.data + other.data,   # numeric result
        (self, other),            # parents
        '+'                        # operation
    )

    def _backward():
        self.grad += 1.0 * out.grad
        other.grad += 1.0 * out.grad

    out._backward = _backward

    return out

multiplication

def __mul__(self, other):
    # ensure other is a Value so constants become part of the computation graph
    other = other if isinstance(other, Value) else Value(other)
    out = Value(
        self.data * other.data,   # numeric result
        (self, other),            # parents
        '*'
    )

    def _backward():
        self.grad += other.data * out.grad
        other.grad += self.data * out.grad

    out._backward = _backward

    return out

example:

a = Value(2.0)
b = Value(3.0)
c = a * b + b

Internally, this creates:

tmp._prev = {a, b}
c._prev = {tmp, b}
a._prev = {}
b._prev = {}

graph:

a ──┐
    ├─ (*) → tmp ───┐
b ──┘               ├─ (+) → c
                 b ─┘

tmp = a * b
c = tmp + b

important
_prev tells backprop where to go
_backward tells backprop what math to do

now lets make the backward:

def backward(self):
    topo = []
    visited = set()
    def build(v):
        if v not in visited:
            visited.add(v)
            for child in v._prev:
                build(child)
            topo.append(v)
    build(self)         # gathers nodes in topological order
    self.grad = 1.0      # seed
    for v in reversed(topo):
        v._backward()   # call stored backward function

same example

a = Value(2.0)
b = Value(3.0)

print(a.data) #2.0
print(b.data) #3.0

print(a._prev) #{}
print(b._prev) #{}

See the graph

print("c.data:", c.data)
print("c._prev:", c._prev)

print("tmp.data:", tmp.data)
print("tmp._prev:", tmp._prev)

# c.data: 9.0
# c._prev: {tmp, b}
# tmp.data: 6.0
# tmp._prev: {a, b}

When we run c.backward()
It begins to run build(v)

Step 1: Topological sort

  1. Mark the current node v as visited
  2. Recursively call itself on all parent nodes (v._prev)
  3. Add the node v to topo after visiting its parents

Effect: topological sorting of the computation graph ensures parents always appear before children.

Step by step example:

Result:

topo = [a, b, tmp, c]

Note: Parents always appear before children, so we reverse the topo in backprop with reversed(topo).


Step 2: Core logic of chain rule

$$ \frac{dL}{dx} = \frac{dL}{dy} \cdot \frac{dy}{dx} $$

Where:



Step 3: Seed gradient

Start at the final output:

c.grad = 1  # always seed with 1

Step 4: Local derivatives by operation

OperationLocal derivatives
Addition: c = a + b∂c/∂a = 1, ∂c/∂b = 1
Multiplication: c = a * b∂c/∂a = b, ∂c/∂b = a
Division: $c = \frac{a}{b}$$\frac{\partial c}{\partial a} = \frac{1}{b}$, $\frac{\partial c}{\partial b} = -\frac{a}{b^2}$
Power: $c = a^b$$\frac{\partial c}{\partial a} = b \cdot a^{b-1}$
Tanh: c = tanh(a)∂c/∂a = 1 - tanh(a)^2

Step 5: Example graph and forward pass

a ──┐
    ├─ (*) → tmp ───┐
b ──┘               ├─ (+) → c
                 b ─┘
a = 2
b = 3
tmp = a * b = 6
c   = tmp + b = 6 + 3 = 9

Step 6: Backprop step by step

$$ v.grad += \sum_{\text{children } c} \left( c.grad \cdot \frac{\partial c}{\partial v} \right) $$

  1. Backprop from c
∂c/∂tmp = 1
∂c/∂b   = 1
tmp.grad += c.grad * ∂c/∂tmp = 1 * 1 = 1
b.grad   += c.grad * ∂c/∂b   = 1 * 1 = 1
  1. Backprop through tmp = a * b
∂tmp/∂a = b = 3
∂tmp/∂b = a = 2
a.grad += tmp.grad * ∂tmp/∂a = 1 * 3 = 3
b.grad += tmp.grad * ∂tmp/∂b = 1 + (1 * 2) = 3

Step 7: Final gradients

a.grad   = 3
b.grad   = 3
tmp.grad = 1
c.grad   = 1

Step 8: Try to change inputs to see the gradient changes

in c = a * b + b

example 1:

Forward pass:

a: 2.0
b: 3.0
c: 9.0

Backward pass (gradients):

a.grad: 3.0
b.grad: 3.0
c.grad: 1.0

example 2:

Forward pass:

a: 2.0001
b: 3.01
c: 9.030301

Backward pass (gradients):

a.grad: 3.01
b.grad: 3.0001
c.grad: 1.0

Example 3:

epoch1-example

Next epoch will be about building a tiny neural network.

Resources

Micrograd - A Tiny Autograd Engine

GitHub: karpathy/micrograd

A tiny scalar-valued autograd engine and a neural net library on top of it with PyTorch-like API, created by Andrej Karpathy.

The spelled-out intro to neural networks and backpropagation: building micrograd