Deep Dive: Building GPT from scratch - part 3

learning from Andrej Karpathy

Hello and welcome back to the series on Starter AI. I’m Miko, writing this from snowy Hokkaido!

In the previous two parts, we got our first exposure to building AI - we saw Andrej implement micrograd, build a tiny neural network using it, and compare it to PyTorch, the industry standard.

Today, it’s time to start on Generative AI. We’ll follow Andrej as he builds makemore - a generative language model that creates name-like words - implementing increasingly robust models over the consecutive videos.

You’re going to need your favourite beverage, and about 2 hours of your time.

The roadmap

The goal of this series is to implement a GPT from scratch, and to actually understand everything needed to do that. We’re following Andrej’s Zero To Hero videos. If you missed a previous part, catch up here:

  1. Neural Networks & Backpropagation part 1 - 2024/02/09

  2. Neural Networks & Backpropagation part 2 - 2024/02/16

  3. Today:  Generative language model - bigrams

To follow along, subscribe to the newsletter at starterai.dev. You can also follow me on LinkedIn.

Generative language model - bigrams

Today’s lecture is called “The spelled-out intro to language modeling: building makemore”. It’s the first in a series of five, building up language models in increasing order of sophistication.

We’re focusing on bigrams - a super simple approach that looks at the probability of one character following another. The lecture is in two parts: first, a simple statistical approach that explicitly calculates a table of all possible character-pair counts; and second, achieving a similar result by training a neural network to do the job.
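To make the first, counting-based approach concrete, here’s a minimal sketch in PyTorch. It’s not the notebook code verbatim - the `words` list below is a toy stand-in for names.txt - but the structure (count matrix, normalised rows, sampling) is the same idea:

```python
import torch

# Toy stand-in for Andrej's names.txt -- in the lecture, `words` holds ~32k names.
words = ["emma", "olivia", "ava", "isabella", "sophia"]

# Build the vocabulary: the characters that occur, plus '.' as a start/end marker.
chars = sorted(set("".join(words)))
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi["."] = 0
itos = {i: s for s, i in stoi.items()}
V = len(stoi)  # 27 with the full dataset

# Count how often each character follows each other character.
N = torch.zeros((V, V), dtype=torch.int32)
for w in words:
    chs = ["."] + list(w) + ["."]
    for ch1, ch2 in zip(chs, chs[1:]):
        N[stoi[ch1], stoi[ch2]] += 1

# Turn counts into row-wise probabilities; the +1 is the "model smoothing"
# from the lecture, which keeps every probability non-zero.
P = (N + 1).float()
P /= P.sum(1, keepdim=True)

# Sample a name: start at '.', keep sampling the next character until '.' again.
g = torch.Generator().manual_seed(2147483647)
ix = 0
out = []
while True:
    ix = torch.multinomial(P[ix], num_samples=1, generator=g).item()
    if ix == 0:
        break
    out.append(itos[ix])
print("".join(out))
```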

You’ll also see how to implement a single layer neural net using matrix multiplication.
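Here is that trick in miniature: one-hot encoding a character index and multiplying it by a weight matrix is the same as selecting a row of that matrix, which is all a single linear layer does here. A small sketch (the shapes assume the 27-character vocabulary; the variable names are illustrative):

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g)        # single layer: 27 inputs -> 27 neurons

xs = torch.tensor([0, 5, 13])                 # a batch of input character indices
xenc = F.one_hot(xs, num_classes=27).float()  # shape (3, 27): one 1 per row, rest 0

logits = xenc @ W                             # (3, 27) @ (27, 27) -> (3, 27)

# Multiplying a one-hot row by W just picks out the corresponding row of W:
print(torch.allclose(logits[1], W[5]))        # True
```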

As usual, here’s some helpful context to add to your warm cache before jumping into the video.

Context

Bigrams (or more generally N-grams) - a subsequence of length 2 (or N) of a sequence.

Broadcasting semantics (PyTorch, NumPy) - a set of rules to define how to deal with tensors of different sizes during arithmetic operations.

Maximum likelihood estimation (MLE) - a method of estimating the parameters of an assumed probability distribution.

Model smoothing - a technique Andrej describes (adding a small fake count to every bigram) so that no probability is exactly zero and the loss stays finite.

One-hot encoding - a vector with a single 1 and all other entries 0. There is also the inverse - one-cold - where it’s all 1s other than a single 0.

Logits (log counts, explained in the lecture) - the raw, real-valued outputs of a model, anywhere between -inf and +inf; named after the logit function, which transforms probabilities into real numbers.

Softmax (explained in the lecture) - a mathematical function to transform real numbers into a probability distribution.
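To tie the last few entries together, here is roughly how logits, softmax and broadcasting meet in the code - a sketch, not the notebook verbatim: exponentiating the logits gives something count-like, and normalising each row turns them into a probability distribution. That row-wise normalisation is exactly where the broadcasting rules (and the keepdim detail Andrej dwells on) matter:

```python
import torch

logits = torch.randn(3, 27)                    # raw outputs of the layer: any real numbers

counts = logits.exp()                          # strictly positive, interpretable as "counts"
probs = counts / counts.sum(1, keepdim=True)   # (3, 27) / (3, 1) broadcasts across columns
print(probs.sum(1))                            # each row sums to 1 -- this exp + normalise is softmax

# counts.sum(1, keepdim=True) has shape (3, 1), so the division stretches it across
# the 27 columns -- the broadcasting semantics linked above.
# In the lecture's 27x27 case, dropping keepdim gives shape (27,), which still
# broadcasts but aligns with the wrong dimension and silently normalises the
# wrong way -- the subtle bug Andrej demonstrates.
```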

And with that cached, let’s jump straight into it. The code is available at https://github.com/karpathy/makemore and the Jupyter notebook lives here.

Video + timestamps

Part 1 - simple bigram statistical approach

00:06:24 Exploring the bigrams in the dataset

00:18:19 Visualising the bigram tensor using matplotlib

00:24:02 Generating some names - aka sampling from the model

00:36:17 Efficiency! Vectorized normalisation of the rows, tensor broadcasting

00:50:14 Evaluating the model - loss function

01:00:50 Model smoothing

Part 2 - neural network approach

01:02:57 Neural network approach intro

01:10:01 One-hot encodings

01:13:53 Implementing a single layer of neurons with matrix multiplication

01:18:46 The softmax

01:26:17 Summary, reference to micrograd

01:35:49 Vectorized loss

01:38:36 Training the net with PyTorch

01:47:49 Bonus notes

01:54:31 Sampling the neural net

Summary

I hope you enjoyed this lecture as much as I did.

We saw two approaches to predicting the next character in name-like words, both based on bigrams. The first literally calculates the probabilities by counting occurrences in the training data (+ super simple, - not flexible); the second uses the most basic neural net possible (a single layer) and applies the techniques we learnt in the previous parts to train it to a similar quality of predictions.
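For the curious, the whole neural-net approach condenses to a loop like the sketch below. It is not the notebook code verbatim (the xs/ys pairs here are hand-made and the learning rate is arbitrary), but it shows how the pieces fit: the forward pass is one-hot + matrix multiply + softmax, the loss is the average negative log likelihood (the MLE objective from the context list), and the update is plain gradient descent as in the earlier micrograd parts.

```python
import torch
import torch.nn.functional as F

# Training pairs: xs[i] is a character index, ys[i] is the character that follows it.
# (In the lecture these are built from all bigrams in the dataset; this is a tiny hand-made example.)
xs = torch.tensor([0, 5, 13, 13, 1])
ys = torch.tensor([5, 13, 13, 1, 0])

g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g, requires_grad=True)

for step in range(100):
    # Forward pass: one-hot -> logits -> softmax probabilities
    xenc = F.one_hot(xs, num_classes=27).float()
    logits = xenc @ W
    counts = logits.exp()
    probs = counts / counts.sum(1, keepdim=True)

    # Average negative log likelihood of the correct next characters (the MLE objective)
    loss = -probs[torch.arange(len(xs)), ys].log().mean()

    # Backward pass and a simple gradient-descent update
    W.grad = None
    loss.backward()
    W.data += -10 * W.grad

print(loss.item())
```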

The resulting model is not great, which makes sense - the only information a bigram model encodes is the character immediately before. But as a starting point, it provides a valuable baseline to improve on.

What’s next

Next time we’ll be building on the neural net approach - adding more information to the model in the hope of generating better predictions.

As always, subscribe to this newsletter at starterai.dev to get the next parts in your mailbox!

Share with a friend

If you like this series, please forward to a friend!

Feedback

How did you like it? Was it easy to follow? What should I change for the next time?

Please reach out on LinkedIn and let me know!
