Deep Dive: Building GPT from scratch - part 5

learning from Andrej Karpathy

Hello and welcome back to the series on Starter AI. I’m Miko, this time writing from Tokyo.

Today, we’re picking up where we left off last week: we’re stabilising the neural network using batch normalization, and learning some helpful visualizations in the process.

The roadmap

The goal of this series is to implement a GPT from scratch, and to actually understand everything needed to do that. We’re following Andrej’s Zero To Hero videos. If you missed a previous part, catch up here:

  1. Neural Networks & Backpropagation part 1 - 2024/02/09

  2. Neural Networks & Backpropagation part 2 - 2024/02/16

  3. Generative language model - bigrams - 2024/02/23

  4. Generative language model - MLP - 2024/03/01

  5. Today: Generative language model - activations & gradients

To follow along, subscribe to the newsletter at starterai.dev. You can also follow me on LinkedIn.

Generative language model - activations & gradients

Today’s lecture is called “Building makemore Part 3: Activations & Gradients, BatchNorm”, and it builds on where we left off last week.

Last time we covered building a multilayer perceptron (MLP), following the Bengio et al. 2003 MLP language model paper. Before we move to more sophisticated networks, we’re spending today’s lecture on building a deeper understanding of activations and gradients, developing an intuition for which numbers make sense and which don’t, and learning how to visualise them.

The lecture is in two parts. The first part covers initialisation and the Batch normalization paper. The second part restructures the code to look like its PyTorch equivalent, and teaches us how to visualise activation, gradient and update statistics using basic histograms and plots, to better understand how well the training is going.

There are only a few new concepts in this lecture.


Fan-in and fan-out are the number of inputs to a layer and the number of outputs from it, respectively.

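Kaiming init paper - a paper analysing how the scale of activations and gradients changes as they pass through a deep network with ReLU-family nonlinearities, in both the forward and backward passes, and deriving a weight initialisation that keeps that scale under control. It’s implemented in torch.nn.init.kaiming_normal_, and it’s one of the most popular ways of initialising neural networks.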
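
To make the idea concrete, here is a minimal NumPy sketch (not the lecture's code) comparing unit-variance weights with weights scaled by gain / sqrt(fan_in) in front of a tanh layer. The layer sizes, the 0.97 saturation threshold and all variable names are my own assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

fan_in, fan_out = 200, 100   # illustrative layer sizes (assumption)
gain = 5 / 3                 # gain commonly used for tanh

x = rng.standard_normal((1000, fan_in))   # unit-Gaussian inputs

# Naive init: unit-variance weights, so pre-activations grow like sqrt(fan_in).
w_naive = rng.standard_normal((fan_in, fan_out))
# Kaiming-style init: the same weights scaled by gain / sqrt(fan_in).
w_scaled = w_naive * gain / np.sqrt(fan_in)

pre_naive = x @ w_naive
pre_scaled = x @ w_scaled

std_naive = pre_naive.std()
std_scaled = pre_scaled.std()

# Fraction of tanh outputs sitting in the flat tails (|tanh| > 0.97),
# where the local gradient is close to zero.
sat_naive = (np.abs(np.tanh(pre_naive)) > 0.97).mean()
sat_scaled = (np.abs(np.tanh(pre_scaled)) > 0.97).mean()

print(f"naive : std={std_naive:.2f}, saturated={sat_naive:.1%}")
print(f"scaled: std={std_scaled:.2f}, saturated={sat_scaled:.1%}")
```

With the scaled weights, the pre-activation standard deviation stays close to the gain instead of growing with the layer width, and far fewer tanh units end up saturated.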

Batch normalization paper - a technique that normalises a layer’s pre-activations across each batch to zero mean and unit variance, followed by a learned scale and shift. This keeps activations out of the saturated regions where gradients vanish, and stabilises the training of the whole network. It replaces some of the initialisation heuristics with formulas. The lecture covers this in detail.
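
As a rough sketch of the forward pass only (training mode, no running statistics, and using the biased batch variance), batch normalization can be written in a few lines of NumPy. The function name, shapes and the deliberately badly scaled input below are assumptions for illustration, not the lecture's implementation.

```python
import numpy as np

def batchnorm_forward(h, gamma, beta, eps=1e-5):
    """Normalise each feature over the batch, then apply a scale and shift."""
    mean = h.mean(axis=0, keepdims=True)     # per-feature batch mean
    var = h.var(axis=0, keepdims=True)       # per-feature (biased) batch variance
    h_hat = (h - mean) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * h_hat + beta              # gamma and beta are learned in practice

rng = np.random.default_rng(0)

# Hypothetical pre-activations: batch of 32 examples, 4 features, badly scaled.
h = 10.0 * rng.standard_normal((32, 4)) + 3.0
gamma = np.ones((1, 4))    # initial scale
beta = np.zeros((1, 4))    # initial shift

out = batchnorm_forward(h, gamma, beta)
out_mean = out.mean(axis=0)
out_std = out.std(axis=0)
print("mean per feature:", out_mean.round(4))
print("std per feature: ", out_std.round(4))
```

Whatever the scale of the incoming values, each feature leaves the layer with roughly zero mean and unit standard deviation at initialisation, which is why the network becomes less sensitive to how the weights were initialised.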

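Also, the magical 5/3, according to @leopetrini, comes from the average value of tanh^2(x) where x is Gaussian. That average is roughly 0.39, and its square root (about 0.63, close to 3/5) is how much tanh shrinks the standard deviation of its input, so a gain of 5/3 approximately undoes the squeeze.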
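
The claim is easy to check numerically. The sketch below, which assumes x is a unit Gaussian and uses a simple midpoint rule on [-10, 10] (my choice of integration method and limits), estimates the average of tanh^2(x) and the gain that would exactly undo the shrinkage.

```python
import math

def gaussian_pdf(x):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def expected_tanh_squared(lo=-10.0, hi=10.0, steps=200_000):
    """Midpoint-rule estimate of E[tanh(x)^2] for x ~ N(0, 1)."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        total += math.tanh(x) ** 2 * gaussian_pdf(x)
    return total * dx

mean_sq = expected_tanh_squared()
shrink = math.sqrt(mean_sq)   # factor by which tanh squeezes the std
exact_gain = 1.0 / shrink     # gain that would undo the squeeze

print(f"E[tanh^2(x)] = {mean_sq:.4f}")
print(f"shrink factor = {shrink:.4f}")
print(f"gain = {exact_gain:.4f} vs 5/3 = {5 / 3:.4f}")
```

The computed gain comes out a little below 5/3, which fits the reading of 5/3 as a convenient approximation rather than an exact constant.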

Video + timestamps

Part 1

00:04:19 Fixing the initial loss, removing the hockey stick appearance of the graph

00:12:59 Tanh quirks & how to work around them

00:27:53 Initialising the network - “Kaiming init” paper

01:04:50 Real example: resnet50 walkthrough

Part 2

01:18:35 PyTorch-ifying the code

01:26:51 Viz #1: forward pass activations statistics

01:30:54 Viz #2: backward pass gradient statistics

01:36:15 Viz #3: parameter activation and gradient statistics

01:39:55 Viz #4: update:data ratio over time

01:46:04 Bringing back batchnorm, looking at the visualizations

01:51:34 Summary
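
The update:data ratio from Viz #4 can be sketched as follows. This is a toy example, assuming plain SGD and using random numbers as a stand-in for real gradients; the helper name, learning rate and scales are hypothetical. In the lecture, Andrej suggests as a rough rule of thumb that this ratio should sit around 1e-3, i.e. about -3 on a log10 scale.

```python
import numpy as np

def update_to_data_ratio(param, grad, lr):
    """log10 of (std of the SGD update) / (std of the parameter values)."""
    return float(np.log10((lr * grad).std() / param.std()))

rng = np.random.default_rng(0)
lr = 0.1                                     # hypothetical learning rate
w = rng.standard_normal((100, 100)) * 0.1    # hypothetical weight matrix

ratios = []
for step in range(5):
    # Stand-in for a real gradient from backpropagation (assumption).
    grad = rng.standard_normal(w.shape) * 0.001
    ratios.append(update_to_data_ratio(w, grad, lr))
    w -= lr * grad                           # plain SGD update

print([round(r, 2) for r in ratios])
```

In a real training loop you would record this value for every parameter tensor at every step and plot it over time: values far above -3 suggest the updates are too large relative to the weights, and values far below suggest the parameters are learning too slowly.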


I really liked this lecture - Andrej took a quick detour on our quest of making makemore, to remove a little bit of the fog around the initialisation, turning the whole process from very artisanal to more engineering-based.

We covered the Kaiming init paper as well as the Batch normalization paper, both of which make for far more predictable outcomes.

And plotting the different ratios and distributions to confirm things look reasonable makes me feel much better about the whole thing :)

What’s next

Next week, we’re following Andrej into another rabbit hole - that of backpropagation.

As always, subscribe to this newsletter at starterai.dev to get the next parts in your mailbox!

Share with a friend

If you like this series, please forward to a friend!


How did you like it? Was it easy to follow? What should I change for the next time?

Please reach out on LinkedIn and let me know!
