06 July 2023

Generative AI with Large Language Models: Part 1

To learn more about Generative AI, I went through the free version of a three-week course. I'm glad I did, because it covers a lot that blogs and Wikipedia don't. I'm writing these blog posts as a summary I can refer to later, since going through a blog post is faster than finding information across multiple videos.

LICENSE: Images shown in this blog post and the following parts of this series are screenshots taken from the Generative AI course. You may not use or distribute these images or the text content for commercial purposes. The course was created by DeepLearning.ai and is licensed under a Creative Commons license.

Here are some of the main points:

Generative AI is a subset of Machine Learning; the models behind it learn by finding statistical patterns in massive datasets. This is a (not very accurate) representation of how massive current Large Language Models (LLMs) are:

 

  • Prompt: The text you pass to an LLM.
  • Context window: The full amount of text, or memory, that is available for the prompt.
  • Completion: The output of the LLM.

LLMs can be used for:

  • Chatbots.
  • Writing tasks: Composing essays or summarizing conversations.
  • Translation tasks: between different languages and from natural language to machine code.
  • Named entity recognition: Retrieving specific information from text.

The capabilities of LLMs can be augmented by connecting them to external data sources or invoking APIs.

Transformers to the rescue

In 2017, the Transformer architecture arrived, and it could capture the context and meaning of language much better than Recurrent Neural Networks (RNNs) ever could.

  • Scaling: Can be efficiently scaled to use multi-core GPUs.
  • Data parallelism: Can process data in parallel, thus handling much larger data sets.
  • Attention: Can establish relevance of meaning and position of words in sentences.

Self attention

This is an attention map that shows attention weights between each word and every other word. Here, the word "book" is strongly connected with, or paying attention to, the words "teacher" and "student". This is called self-attention, and the ability to learn attention in this way across the whole input significantly improves the model's ability to encode language.
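As a rough illustration, here is a minimal NumPy sketch of scaled dot-product self-attention. The sentence length, dimensions, and random weight matrices are all made up purely to show how an attention map like the one in the screenshot is computed.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of token embeddings X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # project inputs to queries, keys and values
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # similarity of every token with every other token
    weights = softmax(scores)                   # the attention map: each row sums to 1
    return weights @ V, weights                 # weighted sum of values, plus the map itself

# Toy input: 4 tokens with 8-dimensional embeddings, all randomly initialised.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
_, attention_map = self_attention(X, Wq, Wk, Wv)
print(attention_map.round(2))                   # rows show how much each token attends to the others
```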

Overview of Encoder-Decoder

This section gives a simplified view of the steps involved in processing words with the encoder-decoder model.


Tokenising: As a first step, words or parts of words are converted into numeric tokens and fed to the model. Once a tokenizer has been used to train the model, the same tokenizer must be used when generating (decoding) text.
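A quick sketch of what this looks like using the Hugging Face transformers library; the course doesn't prescribe a tokenizer, and the t5-small checkpoint here is just an example choice.

```python
from transformers import AutoTokenizer

# Any pretrained tokenizer works as an example; "t5-small" is just one choice.
tokenizer = AutoTokenizer.from_pretrained("t5-small")

token_ids = tokenizer.encode("The teacher taught the student.")
print(token_ids)                    # a list of integer token IDs
print(tokenizer.decode(token_ids))  # the same tokenizer must be used to turn IDs back into text
```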

Word Embedding: Next, token IDs are mapped into a multi-dimensional vector (embedding) space, where each token occupies a unique location. The intuition is that these vectors learn to encode the meaning and context of individual tokens in the input sequence (much like word2vec did).


Token IDs (342, 879, etc.) mapped to vectors
A simplified 3D vector space; actual embedding spaces have many more dimensions

Instead of 512 dimensions, if the vectors were just 3-dimensional, the vector space would look like the diagram above. The angles and proximity between vectors help the model mathematically "understand" relationships between words, and hence language.
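To make the "angle and proximity" idea concrete, here is a tiny sketch with made-up 3-dimensional vectors (real models use hundreds of dimensions); cosine similarity measures the angle between them.

```python
import numpy as np

# Made-up 3-dimensional "embeddings" purely for illustration.
embeddings = {
    "teacher": np.array([0.9, 0.1, 0.3]),
    "student": np.array([0.8, 0.2, 0.4]),
    "banana":  np.array([0.1, 0.9, 0.2]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: closer to 1 means more related."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["teacher"], embeddings["student"]))  # related words -> high similarity
print(cosine_similarity(embeddings["teacher"], embeddings["banana"]))   # unrelated words -> lower similarity
```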

Positional encoding: Preserves information about word order, i.e. the relevance of a word's position in the sentence; it is added to the token embeddings.
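For the curious, this is a small sketch of the sinusoidal positional encoding described in the original Transformer paper; individual models may instead use learned positional embeddings.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Positional encodings as described in the original Transformer paper."""
    positions = np.arange(seq_len)[:, None]                    # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                         # (1, d_model)
    angles = positions / np.power(10000, (2 * (dims // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                      # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                      # odd dimensions use cosine
    return pe

# The encoding is simply added to the token embeddings, preserving word-order information.
print(sinusoidal_positional_encoding(seq_len=4, d_model=8).round(2))
```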

 

Self attention: Applies self-attention weights that model the relationships between the tokens in the input sequence. This allows the model to "attend" to different parts of the input to better capture the contextual dependencies between words. The transformer architecture uses multi-headed self-attention: multiple sets of self-attention weights, or heads, are learned in parallel, independently of each other. The number of attention heads included in the attention layer varies from model to model, but numbers in the range of 12-100 are common. The intuition is that each head learns a different aspect of language. For example, one head may capture the relationship between the people entities in a sentence, another may focus on the activity of the sentence, while yet another may focus on other properties such as whether the words rhyme. It's important to note that you don't dictate ahead of time what aspects of language the attention heads will learn. The weights of each head are randomly initialized, and given sufficient training data and time, each will automatically learn a different aspect of language.
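Continuing the earlier sketch, here is a hedged illustration of how multi-headed attention splits the projections into independent heads; a real implementation also applies a final output projection, which is omitted here.

```python
import numpy as np

def multi_head_self_attention(X, Wq, Wk, Wv, n_heads):
    """Split the Q/K/V projections into n_heads independent attention heads (illustrative only)."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Reshape to (n_heads, seq_len, d_head) so every head attends independently.
    Q, K, V = (m.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2) for m in (Q, K, V))
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)             # softmax per head
    out = weights @ V                                          # (n_heads, seq_len, d_head)
    return out.transpose(1, 0, 2).reshape(seq_len, d_model)    # concatenate the heads again

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(multi_head_self_attention(X, Wq, Wk, Wv, n_heads=2).shape)   # (4, 8)
```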

Feed-forward network: The attention output is processed through a fully-connected feed-forward network. The output of this layer is a vector of logits, one for every token in the tokenizer's vocabulary, proportional to each token's probability score.


Softmax output: The logits are sent to the softmax layer, where they are normalized into a probability score for each word. This output includes a probability for every word in the vocabulary, so there are likely to be thousands of scores here. One token will have a score higher than the rest: this is the most likely predicted token.
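A tiny sketch with a made-up five-token vocabulary, showing how logits become probabilities and how the most likely token is picked.

```python
import numpy as np

vocab = ["I", "love", "machine", "learning", "<eos>"]   # tiny made-up vocabulary
logits = np.array([1.2, 0.3, 4.1, 0.7, -0.5])           # made-up output of the feed-forward layer

probs = np.exp(logits - logits.max())
probs /= probs.sum()                                    # softmax: scores now sum to 1

print(dict(zip(vocab, probs.round(3))))
print("most likely token:", vocab[int(np.argmax(probs))])   # greedy choice: "machine"
```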

Example: Sequence to sequence prediction

If you are translating from French to English, the input is first tokenized, and the tokens are fed into the encoder side: through the embedding layer, into the multi-headed attention layers, and then through a feed-forward network to the output of the encoder. At this point, the data that leaves the encoder is a deep representation of the structure and meaning of the input sequence. This representation is inserted into the middle of the decoder to influence the decoder's self-attention mechanisms.

Next, a start-of-sequence token is added to the input of the decoder. This triggers the decoder to predict the next token, which it does based on the contextual understanding provided by the encoder. The output of the decoder's self-attention layers is passed through the decoder's feed-forward network and a final softmax output layer. At this point, we have our first token.

You'll continue this loop, passing the output token back to the input to trigger the generation of the next token, until the model predicts an end-of-sequence token. At this point, the final sequence of tokens can be detokenized into words, and you have your output.
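Here is a hedged sketch of that loop using the Hugging Face transformers library. The t5-small checkpoint and the English-to-French prompt are just illustrative choices (T5's built-in translation tasks start from English); the point is the encode-once, decode-token-by-token structure.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Encoder side: tokenize the whole input sequence and encode it once.
inputs = tokenizer("translate English to French: I love machine learning.", return_tensors="pt")
with torch.no_grad():
    encoder_out = model.get_encoder()(**inputs)

    # Decoder side: start with the start-of-sequence token and loop until end-of-sequence.
    decoder_ids = torch.tensor([[model.config.decoder_start_token_id]])
    for _ in range(40):                                        # hard cap on generated tokens
        logits = model(encoder_outputs=encoder_out,
                       decoder_input_ids=decoder_ids).logits
        next_id = logits[0, -1].argmax()                       # greedy decoding: most likely token
        decoder_ids = torch.cat([decoder_ids, next_id.view(1, 1)], dim=1)
        if next_id.item() == model.config.eos_token_id:
            break

print(tokenizer.decode(decoder_ids[0], skip_special_tokens=True))
```

In practice you would simply call model.generate(), which wraps exactly this kind of loop.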


  • Encoder: Encodes prompts with contextual understanding and produces one vector per input token.
  • Decoder: Accepts input tokens and generates new tokens.

Types of models


  • Encoder only: Normally used for sequence-to-sequence tasks where the input and output sequences are the same length. With some modification, these models can also perform classification and sentiment analysis. Example: BERT.
  • Encoder-decoder: Sequence-to-sequence tasks where the input and output lengths can differ. Can be scaled to perform general text generation. E.g. BART and FLAN-T5.
  • Decoder only: The GPT family, BLOOM, Jurassic, LLaMA and more.

Transformer summary

The "Attention Is All You Need" paper replaces RNNs and convolutional neural networks (CNNs) with an entirely attention-based mechanism.

  • Multi-head self-attention: Allows the model to attend to different parts of the input sequence.
  • Feed-forward network: Applies a point-wise fully connected layer to each position separately and identically. 
  • Positional encoding: Encodes the position of each token in the input sequence, enabling the model to capture the order of the sequence without the need for recurrent or convolutional operations.

The Transformer model also uses residual connections and layer normalization to facilitate training and prevent overfitting.

Details of the Encoder Decoder Transformer

Prompting and prompt engineering

  • Zero-shot inference: Including your input data within the prompt, with no examples of the task (large models can handle this easily).
  • One-shot or few-shot inference: Providing a single example (one-shot) or multiple examples (few-shot) of the task within the prompt. Smaller models need this, and going beyond 5 or 6 examples generally does not yield better results. See the sketch below.
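As a quick illustration, here is a made-up one-shot sentiment prompt assembled in Python; the review text and labels are invented, and any instruction-following LLM could be given a prompt shaped like this.

```python
# A made-up one-shot prompt: the single worked example shows the model the task format.
example = (
    "Classify this review: I loved this movie!\n"
    "Sentiment: Positive\n\n"
)
new_input = "Classify this review: The plot made no sense at all.\nSentiment:"

prompt = example + new_input   # zero-shot would send new_input alone; few-shot adds a handful of examples
print(prompt)
```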

Configuring model output parameters

These parameters are set at inference time and control things like the maximum number of tokens in the completion and how creative the output is.

Max tokens: Caps the number of tokens the model will generate; generation can stop earlier if the model predicts an end-of-sequence token.

Most large language models by default will operate with greedy decoding, where the model will always choose the word with the highest probability. It works very well for short generation but is susceptible to repeated words or repeated sequences of words. 

Random sampling is the easiest way to introduce some variability. The model chooses an output word at random using the probability distribution to weight the selection. 

Top k: Restricts sampling to the k highest-probability tokens. This keeps some randomness while preventing the selection of highly improbable completion words, which in turn makes the generated text more likely to sound reasonable and make sense.

 

Top p: Limits the random sampling to the predictions whose combined probabilities do not exceed p.

 

Temperature: It influences the shape of the probability distribution that the model calculates for the next token. The higher the temperature, the higher the randomness. The value is a scaling factor that's applied within the final softmax layer of the model that impacts the shape of the probability distribution of the next token. In contrast to the top k and top p parameters, changing the temperature actually alters the predictions that the model will make. If you choose a low value of temperature, say less than one, the resulting probability distribution from the softmax layer is more strongly peaked with the probability being concentrated in a smaller number of words. The model will select from this distribution using random sampling.
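A rough NumPy sketch of how these settings interact when sampling a single next token. This is purely illustrative and not how any particular inference library implements it; the vocabulary and logits are made up.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Illustrative sampling with temperature, top-k and top-p filtering."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float) / temperature   # temperature reshapes the distribution
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    if top_k is not None:                                    # keep only the k most likely tokens
        kth_largest = np.sort(probs)[-top_k]
        probs = np.where(probs >= kth_largest, probs, 0.0)
    if top_p is not None:                                    # keep the smallest set whose cumulative probability <= p
        order = np.argsort(probs)[::-1]
        keep = np.cumsum(probs[order]) <= top_p
        keep[0] = True                                       # always keep the single most likely token
        mask = np.zeros_like(probs, dtype=bool)
        mask[order[keep]] = True
        probs = np.where(mask, probs, 0.0)
    probs /= probs.sum()                                     # re-normalize after filtering
    return int(rng.choice(len(probs), p=probs))

vocab = ["cake", "donut", "banana", "apple"]                 # made-up 4-token vocabulary
logits = [2.0, 1.0, 0.7, 0.1]                                # made-up logits from the final layer
print(vocab[sample_next_token(logits, temperature=0.7, top_k=3, rng=np.random.default_rng(0))])
```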


 

 

Pre-training: A deep statistical representation of language is developed in the pre-training phase when the model learns from gigabytes, terabytes, and even petabytes of unstructured textual data. In this self-supervised learning step, the model internalizes the patterns and structures present in the language. These patterns then enable the model to complete its training objective, which depends on the architecture of the model. Model weights get updated to minimize the loss of the training objective. Pre-training also requires a large amount of compute and the use of GPUs. Data needs processing to increase quality, address bias, and remove other harmful content. As a result of this data quality curation, often only 1-3% of tokens are used for pre-training. 

LLM pre-training at a high level

Encoder only model pre-training: Also known as autoencoding models, these are pre-trained using masked language modeling (MLM). Tokens in the input sequence are randomly masked, and the training objective is to predict the masked tokens in order to reconstruct the original sentence (a denoising objective). Autoencoding models build bi-directional representations of the input sequence: the model has an understanding of the full context of a token, not just of the words that come before it. Encoder-only models are suited to tasks that benefit from this bi-directional context, such as sentence classification, sentiment analysis, and token-level tasks like named entity recognition or word classification. E.g. BERT and RoBERTa.
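A toy sketch of what a masked-language-modelling training example looks like; the sentence and the number of masked tokens are made up for illustration.

```python
import random

# The sentence, tokenisation and number of masked positions are all made up for illustration.
tokens = "the teacher taught the student with the book".split()

random.seed(0)
masked_positions = sorted(random.sample(range(len(tokens)), k=2))   # real MLM masks ~15% of tokens
masked = ["[MASK]" if i in masked_positions else t for i, t in enumerate(tokens)]
targets = {i: tokens[i] for i in masked_positions}

print(" ".join(masked))   # the model's input, with two tokens hidden
print(targets)            # the training objective: predict the original tokens at those positions
```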

 

Decoder-only model pre-training: Called autoregressive models, these are pre-trained using causal language modeling (CLM). The training objective is to predict the next token based on the previous sequence of tokens (also called full language modelling). The input sequence is masked so that the model can only see the tokens leading up to the token in question; it has no knowledge of the end of the sentence. The model then iterates over the input sequence, predicting the next token one position at a time. In contrast to the encoder architecture, the context is unidirectional. By learning to predict the next token from a vast number of examples, the model builds up a statistical representation of language. Used for text generation; larger models show strong zero-shot inference abilities and can often perform a range of tasks well. E.g. GPT and BLOOM.
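A toy sketch of how causal language modelling turns a single sentence into a series of next-token prediction examples.

```python
# Every prefix of the sequence becomes a training example whose target is the next token;
# the context is strictly left-to-right (unidirectional).
tokens = "the teacher taught the student".split()

for i in range(1, len(tokens)):
    context, target = tokens[:i], tokens[i]
    print(f"input: {' '.join(context):30} -> predict: {target}")
```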


Sequence to sequence model pre-training: This type of model uses both the encoder and decoder parts. The exact pre-training objective varies from model to model. T5 pre-trains the encoder using span corruption, which masks random sequences of input tokens. Those masked sequences are then replaced with a unique sentinel token (special tokens added to the vocabulary that do not correspond to any actual word from the input text). The decoder is then tasked with reconstructing the masked token sequences auto-regressively, so the output is the sentinel token followed by the predicted tokens. You can use sequence-to-sequence models for translation, summarization, and question-answering. Another well-known encoder-decoder model is BART.
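A toy sketch of span corruption; here "<X>" stands in for a sentinel token (T5's actual sentinels are special tokens such as <extra_id_0>), and the chosen span is fixed rather than random.

```python
# A span of input tokens is replaced with a single sentinel token, and the decoder
# learns to emit the sentinel followed by the missing tokens.
tokens = "the teacher taught the student with the book".split()
span = slice(2, 4)                                  # pretend tokens 2-3 were randomly chosen

encoder_input = tokens[:span.start] + ["<X>"] + tokens[span.stop:]
decoder_target = ["<X>"] + tokens[span]

print("encoder input :", " ".join(encoder_input))   # the teacher <X> student with the book
print("decoder target:", " ".join(decoder_target))  # <X> taught the
```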


Computational challenges

Running out of memory is a common issue. Here's the math:

  • 1 parameter (weight) = 4 bytes (32 bit float)
  • 1 billion parameters need 4 × 10^9 bytes = 4 GB of GPU RAM at 32-bit full precision, just for the weights.

Training also requires:

  • 2 Adam optimizer states = 8 bytes per weight.
  • Gradients = 4 bytes per weight.
  • Activations and temp memory = 8 bytes per weight (worst case). 

Total: Approx. 24 bytes per weight, so roughly 24 GB of GPU RAM to train a model with 1 billion weights at full precision, about 6x the 4 GB needed just to store them.
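The same back-of-the-envelope estimate as a few lines of Python:

```python
# Reproducing the estimate above for a 1-billion-parameter model.
params = 1_000_000_000

weights     = 4 * params   # FP32 weights
adam_states = 8 * params   # two Adam optimizer states
gradients   = 4 * params
activations = 8 * params   # activations and temp memory, worst case

print(f"inference (weights only): {weights / 1e9:.0f} GB")
print(f"training (worst case)   : {(weights + adam_states + gradients + activations) / 1e9:.0f} GB")
```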

Quantization

Quantization reduces the memory required. Instead of FP32, FP16 (16-bit float) can be used, and 8-bit formats are also possible. Quantization also speeds up calculations. BF16 is a popular format, though it isn't well suited for integer calculations.

 

 

At 8-bit precision, the weights of a 1-billion-parameter model fit in roughly 1 GB of GPU RAM, a quarter of the full-precision requirement. Today's models are 175 to 500 billion weights in size, so memory quickly becomes the limiting factor.
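A minimal sketch of symmetric 8-bit quantization, just to show where the 4x memory saving comes from; real quantization schemes are more sophisticated (per-channel scales, calibration, and so on), and the weight values here are random.

```python
import numpy as np

# Symmetric 8-bit quantization of a block of FP32 weights (values are random, for illustration).
weights_fp32 = np.random.default_rng(0).normal(scale=0.2, size=1000).astype(np.float32)

scale = np.abs(weights_fp32).max() / 127.0                 # map the observed range onto [-127, 127]
weights_int8 = np.round(weights_fp32 / scale).astype(np.int8)
dequantized = weights_int8.astype(np.float32) * scale      # what the model "sees" at compute time

print(f"FP32 storage: {weights_fp32.nbytes} bytes, INT8 storage: {weights_int8.nbytes} bytes")
print(f"max rounding error: {np.abs(weights_fp32 - dequantized).max():.5f}")
```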

Multi-GPU compute strategies

PyTorch's Distributed Data Parallel (DDP) method requires model parameters, gradients and optimizer states to fit onto a single GPU.

DDP

If the model is too big for DDP, use model sharding via Fully Sharded Data Parallel (FSDP). FSDP is built on the ZeRO (Zero Redundancy Optimizer) technique, which shards model parameters, gradients and optimizer states across GPUs instead of replicating them.

FSDP

Scaling laws and compute-optimal models

One petaflop/s-day: 1,000,000,000,000,000 (one quadrillion) floating point operations per second, sustained for one day, which is roughly 8.64 × 10^19 operations in total.

It takes roughly 8 NVIDIA V100 GPUs running at full efficiency for a day to deliver 1 petaflop/s-day.

It takes roughly 2 NVIDIA A100 GPUs to deliver the same 1 petaflop/s-day.
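The arithmetic behind the unit and the V100 figure, using an assumed peak throughput for the V100 (the 125 teraflop/s figure is a vendor number for mixed-precision tensor cores, not something from the course):

```python
# Arithmetic behind the petaflop/s-day unit and the V100 figure above.
ops_per_day = 1e15 * 60 * 60 * 24                       # 10^15 ops/s sustained for one day
print(f"{ops_per_day:.2e} floating point operations")   # ~8.64e19

v100_peak = 125e12   # assumed V100 tensor-core peak throughput, ops/s
print(f"V100s needed to sustain 1 petaflop/s: {1e15 / v100_peak:.0f}")   # ~8
```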

Number of petaflops/s-days to pre-train various LLMs (y axis is logarithmic)

There have been some interesting papers examining how neural networks scale. The Chinchilla paper is particularly famous because it pointed out that many LLMs may be over-parameterised (they have more parameters than they need to learn the language) and under-trained, so they would benefit from being trained on more data.

Optimal data size is 20 times the number of weights

 

Smaller models that are better trained, considering Chinchilla laws, can perform better
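Applying the rough 20x rule of thumb from the Chinchilla paper to a few example model sizes (the numbers are approximate):

```python
# The rough Chinchilla rule of thumb: compute-optimal training data is about 20 tokens per parameter.
for params in (1e9, 70e9, 175e9):
    tokens = 20 * params
    print(f"{params / 1e9:>6.0f}B parameters -> ~{tokens / 1e9:,.0f}B training tokens")
```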

Pre-training for domain adaptation

For example, when training models to summarize legal text, you may need to account for phrases like "mens rea" or "res judicata", or even ordinary words that mean something entirely different in a legal context. Such terms, and similarly specialised ones in the medical domain, may not occur frequently in general training datasets scraped from the web. A doctor may write "1 tab po qid pc & hs", which a pharmacist understands as "1 tablet by mouth 4 times a day after meals and at bedtime". ChatGPT could explain it, but it demonstrates the varied types of information an LLM can encounter.

The BloombergGPT model for finance is one such domain-specific LLM. It was pre-trained on a mix of financial and general-purpose data, but because financial data is limited, it was trained on fewer tokens than the Chinchilla scaling laws recommend for its parameter count.

Kaplan and Chinchilla scaling laws compared for BloombergGPT and other LLMs


The BloombergGPT project is a good illustration of pre-training a model for increased domain-specificity, and the challenges that may force trade-offs against compute-optimal model and training configurations.

 

This series is continued in Part 2.
