How does tokenization work?

Tokenization is character-level: each unique character in the training data is mapped to a unique integer token. For instance, 'a' might be mapped to the token 1, 'b' might be mapped to the token 2, and so on.
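The mapping can be sketched in a few lines of Python. This is a minimal, hedged illustration of character-level tokenization in general, not the project's exact code; the sample text and variable names are made up.

```python
# Sample training text (illustrative only).
text = "hello world"

# Every unique character gets its own integer token.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> token
itos = {i: ch for ch, i in stoi.items()}      # token -> character

def encode(s):
    """Turn a string into a list of integer tokens."""
    return [stoi[c] for c in s]

def decode(tokens):
    """Turn a list of integer tokens back into a string."""
    return "".join(itos[t] for t in tokens)
```

Encoding and then decoding a string round-trips exactly, since the mapping is one-to-one.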

What is each hyperparameter?

We divide hyperparameters up into several groups depending on the way they are used in the pipeline. Some hyperparameters directly impact model architecture ("Architectural Hyperparameters"), while others impact the way the model is trained ("Training Hyperparameters"). Below is an explanation of each hyperparameter, divided into these various groups.

Architectural Hyperparameters:

Context Size: the size of the context window for the NanoGPT model. This plays the same role as "Block Size" for the Bigram model, but for NanoGPT it directly shapes the model's architecture.

Number of Embeddings: the size of the embedding dimension used to represent each token.

Number of Transformers: the number of transformer blocks applied sequentially in the model.

Number of Heads: the number of self-attention heads per transformer block.

Head Size: the size of the key, query, and value embeddings in each head of self-attention.

Dropout: the rate of dropout (i.e. randomly setting certain nodes to output 0) throughout the model.
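To make the relationships between these concrete, here is a sketch of how the architectural hyperparameters might be grouped. The names and values are illustrative stand-ins, not the project's actual defaults; the one real constraint shown is that head size is typically derived so the concatenated head outputs match the embedding dimension.

```python
# Hypothetical architectural hyperparameters for a small
# NanoGPT-style model (illustrative values, not project defaults).
arch = {
    "context_size": 256,   # tokens of context the model attends over
    "n_embeddings": 384,   # embedding dimension per token
    "n_transformers": 6,   # transformer blocks applied sequentially
    "n_heads": 6,          # self-attention heads per block
    "dropout": 0.2,        # rate of randomly zeroing node outputs
}

# Head size is commonly chosen so that concatenating all heads'
# outputs reproduces the embedding dimension.
head_size = arch["n_embeddings"] // arch["n_heads"]
```

With 6 heads and a 384-dimensional embedding, each head's key, query, and value embeddings would be 64-dimensional.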


Training Hyperparameters:

Batch Size: how many sequences of training data the model is fed during a single training iteration.

Block Size: how many tokens are contained in each sequence of training data. This is akin to "Context Size" for the NanoGPT model, but doesn't impact the architecture of the Bigram model.

Learning Rate: the step size used when adjusting model weights during a single training iteration.

Eval Interval: how many training iterations pass before we log the model's train and validation loss. Note that this is not a "true" hyperparameter: it affects only logging, not training.

Max Iterations: how many iterations to train the model for.
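The training hyperparameters can be seen together in a skeleton of the training loop. This is a hedged sketch of the loop's structure only: get_batch and train_step are hypothetical stand-ins for the project's real data-loading and optimization code, and the values are illustrative.

```python
# Illustrative training hyperparameters (not project defaults).
batch_size = 32
learning_rate = 1e-3
eval_interval = 100
max_iters = 300

def get_batch(batch_size):
    # Stand-in: would return `batch_size` (input, target) sequences,
    # each of length `block_size`, sampled from the training data.
    return [(i, i + 1) for i in range(batch_size)]

def train_step(batch, learning_rate):
    # Stand-in: would run a forward pass, backpropagate, and update
    # weights by a step scaled by `learning_rate`; returns the loss.
    return 1.0 / (1 + len(batch))

logged = []
for it in range(max_iters):
    loss = train_step(get_batch(batch_size), learning_rate)
    if it % eval_interval == 0:
        # Eval Interval only controls how often we log;
        # it does not change the training itself.
        logged.append((it, loss))
```

Note how Max Iterations bounds the loop, Batch Size and Learning Rate feed each step, and Eval Interval only gates the logging branch.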

I'm still confused.

Check out Andrej Karpathy's excellent video on building NanoGPT from scratch, which was a major source of inspiration for this project!