Deep Dive: Building GPT from scratch - part 6

learning from Andrej Karpathy

Hello and welcome back to the series on Starter AI. It’s Miko again, writing from rainy London.

Today, we’re going on a side quest to understand backpropagation better. This is the most maths-y part of the series so far - you’ve been warned! You’re going to need at least a couple of hours for this one, so brace yourself. Let’s go!

The roadmap

The goal of this series is to implement a GPT from scratch, and to actually understand everything needed to do that. We’re following Andrej’s Zero To Hero videos. If you missed a previous part, catch up here:

To follow along, subscribe to the newsletter at starterai.dev. You can also follow me on LinkedIn.

Generative language model - backpropagation

Today’s lecture is called “Building makemore Part 4: Becoming a Backprop Ninja”. Just like the name implies, we’re taking one more detour before finishing up makemore, to build a better understanding of how backpropagation works.

The video’s motivation is explained in Andrej’s blog post called Yes you should understand backprop. It boils down to this: if you rely only on ready-made implementations like PyTorch’s, without knowing the low-level mechanics, it’s very easy to shoot yourself in the foot and introduce subtle bugs. The post gives a few counter-intuitive examples, including the dead neurons and vanishing gradients that we already saw last time.

If you paid attention, you know we’ve already spent some time on that in the first two parts, building micrograd. However, that worked on the level of single values, and today we’ll bring it closer to real life by implementing it on the level of tensors. Also, not that long ago, it was common to write your own backpropagation by hand, before everyone started using frameworks for that, as illustrated in the following frame from the lecture:

Throughout the lecture, we’ll be following this notebook. The video is designed to be paused: to get the most out of it, stop at each step and try to work out the calculation yourself before watching the solution.


This is the first part of the series where calculus really comes in handy. If you haven’t seen derivatives before, you’re probably best off pausing here and learning a bit about them. Don’t worry - like everything else, all you need to know is available on the internet. Try Khan Academy if you want to start from scratch, or this BBC Bitesize if you want a quick refresher.

Bessel’s correction - a method of correcting the bias that appears when you estimate a variance from a sample instead of the whole population. In the lecture, it boils down to normalizing with 1/(n-1) instead of 1/n, so nothing too scary 🙂
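To make that concrete, here’s a minimal sketch in plain Python. It isn’t code from the lecture: the helper name `variance` and the sample values are made up for illustration. It computes the variance both ways and cross-checks the result against the standard library, where `statistics.pvariance` divides by n and `statistics.variance` divides by n-1.

```python
import statistics


def variance(xs, bessel=True):
    """Variance of xs: divides by n-1 when bessel is True, by n otherwise."""
    n = len(xs)
    mean = sum(xs) / n
    squared_diffs = sum((x - mean) ** 2 for x in xs)
    return squared_diffs / (n - 1 if bessel else n)


# Hypothetical sample, chosen only so the numbers are easy to follow.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

biased = variance(sample, bessel=False)   # normalizes with 1/n
unbiased = variance(sample, bessel=True)  # normalizes with 1/(n-1)

# Cross-check against the standard library's two variance functions.
print(biased, statistics.pvariance(sample))
print(unbiased, statistics.variance(sample))
```

The two results differ by a factor of n/(n-1), so the gap shrinks as n grows and matters most for small samples.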

Other than that, there isn’t much else that’s new, so let’s jump straight to it!

Video + timestamps

00:13:01 Exercise 1: backprop the whole graph, subexpression by subexpression

01:26:31 Exercise 2: cross entropy loss backward pass

01:36:37 Exercise 3: batch norm layer backward pass

01:50:02 Exercise 4: putting it all together
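If you want a taste of Exercise 2 before watching, here’s a rough sketch in plain Python rather than PyTorch tensors, so it is not the notebook’s code. It assumes the loss is the mean cross entropy over a batch, in which case the gradient with respect to the logits is the softmax probabilities with 1 subtracted at the target class, divided by the batch size. All function names and numbers are mine, and the finite-difference check stands in for comparing against a framework’s autograd.

```python
import math


def softmax(logits):
    """Softmax of one row of logits."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(batch_logits, targets):
    """Mean negative log-probability of the target classes."""
    n = len(batch_logits)
    return -sum(
        math.log(softmax(row)[t]) for row, t in zip(batch_logits, targets)
    ) / n


def cross_entropy_grad(batch_logits, targets):
    """Analytic gradient w.r.t. the logits: (softmax - one_hot) / n."""
    n = len(batch_logits)
    grads = []
    for row, t in zip(batch_logits, targets):
        probs = softmax(row)
        probs[t] -= 1.0  # subtract the one-hot target
        grads.append([p / n for p in probs])
    return grads


def numerical_grad(batch_logits, targets, h=1e-6):
    """Central finite differences, nudging one logit at a time."""
    grads = []
    for row in batch_logits:
        grad_row = []
        for j in range(len(row)):
            original = row[j]
            row[j] = original + h
            up = cross_entropy(batch_logits, targets)
            row[j] = original - h
            down = cross_entropy(batch_logits, targets)
            row[j] = original  # restore the logit
            grad_row.append((up - down) / (2 * h))
        grads.append(grad_row)
    return grads


# Hypothetical batch: 2 examples, 3 classes.
logits = [[1.0, 2.0, 0.5], [0.1, -1.0, 3.0]]
targets = [1, 2]

analytic = cross_entropy_grad(logits, targets)
numeric = numerical_grad(logits, targets)
max_diff = max(
    abs(a - b) for ra, rb in zip(analytic, numeric) for a, b in zip(ra, rb)
)
print("max difference:", max_diff)
```

A handy sanity check: each row of the gradient sums to zero, because the softmax probabilities sum to 1 and exactly 1 is subtracted at the target.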


OK, so things got a little maths-y today, as we watched Andrej guide us through a manual, step-by-step calculation of the derivatives, using tensors of various dimensions.

This removes some of the magic that PyTorch provides, and gives you the sweet reassurance that - should all copies of PyTorch randomly disappear one day - you’ll be able to train some neural nets!

Although you’re unlikely to do it manually like this very often in real life, knowing how things work under the hood is the best way to learn and to avoid pitfalls.
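As one example of what such a manual derivation ends up looking like, below is a sketch of a batch norm backward pass for a single feature column, again in plain Python with names of my own choosing. It assumes the variance is computed with Bessel’s correction (1/(n-1)) and uses a made-up weighted-sum loss, then checks the analytic formula against finite differences. Treat it as an illustration under those assumptions, not as the lecture’s implementation.

```python
import math


def batchnorm_forward(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature column across the batch, then scale and shift."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)  # Bessel's correction
    inv_std = 1.0 / math.sqrt(var + eps)
    xhat = [(x - mean) * inv_std for x in xs]
    ys = [gamma * v + beta for v in xhat]
    return ys, xhat, inv_std


def batchnorm_backward(dys, xhat, inv_std, gamma=1.0):
    """Gradient of the loss w.r.t. the inputs, given dL/dy for each output."""
    n = len(dys)
    sum_dy = sum(dys)
    sum_dy_xhat = sum(dy * v for dy, v in zip(dys, xhat))
    return [
        gamma * inv_std / n * (n * dy - sum_dy - n / (n - 1) * v * sum_dy_xhat)
        for dy, v in zip(dys, xhat)
    ]


def loss(xs, weights):
    """Hypothetical loss: a weighted sum of the outputs, so dL/dy_i = w_i."""
    ys, _, _ = batchnorm_forward(xs)
    return sum(w * y for w, y in zip(weights, ys))


# Made-up batch of 4 values for one feature, and made-up loss weights.
xs = [0.5, -1.2, 3.0, 0.7]
weights = [1.0, -2.0, 0.5, 3.0]

_, xhat, inv_std = batchnorm_forward(xs)
analytic = batchnorm_backward(weights, xhat, inv_std)

# Central finite differences, nudging one input at a time.
h = 1e-6
numeric = []
for i in range(len(xs)):
    up = xs[:i] + [xs[i] + h] + xs[i + 1:]
    down = xs[:i] + [xs[i] - h] + xs[i + 1:]
    numeric.append((loss(up, weights) - loss(down, weights)) / (2 * h))

max_diff = max(abs(a - b) for a, b in zip(analytic, numeric))
print("max difference:", max_diff)
```

The gradients sum to zero here, which makes sense: shifting every input by the same amount leaves the normalized output unchanged.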

What’s next

Next time we’re done with the detours, and we’re coming back to finish up makemore, complete with a convolutional neural network.

As always, subscribe to this newsletter at starterai.dev to get the next parts in your mailbox!

Share with a friend

If you like this series, please forward to a friend!


How did you like it? Was it easy to follow? What should I change for the next time?

Please reach out on LinkedIn and let me know!
