4 What are LLMs anyway?
As I mentioned before, there are terms that are used more or less interchangeably when talking about LLMs in higher education. Sometimes people refer to them as “AI” or “Artificial Intelligence.” Other times, people refer to them as “genAI” or “generative Artificial Intelligence.” Usually, these terms are used in reference to LLMs or “large language models,” which are a type of genAI, which is in turn a type of AI.
I am not going to describe the technical aspects of LLMs here, but we will explore the big features and components of LLMs to better understand their reach and limitations, as well as to evaluate their outputs.
Large Language Models specialize in generating text following a prompt. This text generation is based on a combination of mathematical and statistical models that are trained on data, loads and loads of data. But before diving into the moving pieces of LLMs, let’s look at a predecessor of modern-day LLMs.
4.1 Predictive text
We have all used auto-correct or some sort of predictive text assistance on our phones, in email, or even in our favorite text editor. These tools have become ubiquitous across devices and platforms.
One of the early versions of predictive text is based on a mathematical and statistical concept called Markov chains. In short, a system has the Markov property if its current state is determined only by its previous state. In reference to text prediction, this means that suggesting the next word in a text only requires knowledge of the current (previous) word.
This property might seem very simplistic and perhaps not useful. However, it is impressive how much modern-day technology is based on this simple assumption. The Markov property makes systems lightweight, meaning that not a lot of resources are needed in order to implement them.
Roughly speaking, Markov chains describe systems that have a finite number of possible states. These systems transition from one state to another at each time step, and a system is allowed to stay in its current state. The transitions occur according to certain probabilities. Much of the practical work focuses on accurately describing and computing these probabilities.
For example, for our simple predictive text model, the system being described is a text (essay, SMS, email, etc.) viewed as a sequence of words. Our Markov chain aims to describe the transition from one word to the next. For this, we need to know the probability of going from one specific word to the next. If my text contains the word “I,” there is a high probability that the next word is “am” and a low probability that the next word would be “I” again.
Coming up with these probabilities can be challenging, and there is no single standard way to do it. Akin to different chefs having different recipes for the same dish, different implementations of Markov chains might compute or estimate probabilities in different ways.
A common way to estimate probabilities for predictive text (as above) is to use frequency counts on reference corpora (reference texts). The main idea is to tally how often the word “am” follows the word “I” in a collection of reference texts and to compare it to the total number of word pairs that have “I” as the first word.
Notice that this depends on the reference corpora. If our reference text were the lyrics of Avril Lavigne’s song “I’m with You,” the word “am” doesn’t show up as a possible follow-up to the word “I”; however, the word “I” does!
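To make this concrete, here is a minimal sketch in Python of how such transition probabilities could be estimated by counting adjacent word pairs. The tiny corpus and the function name are made up for illustration; real implementations add many refinements (smoothing, handling punctuation, and so on).

```python
from collections import Counter, defaultdict

def bigram_probabilities(text):
    """Estimate P(next word | current word) by counting adjacent word pairs."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    # Turn the raw counts into probabilities for each current word.
    return {
        word: {nxt: c / sum(followers.values()) for nxt, c in followers.items()}
        for word, followers in counts.items()
    }

corpus = "i am happy and i am here and i think i am ready"
probs = bigram_probabilities(corpus)
print(probs["i"])  # {'am': 0.75, 'think': 0.25} for this toy corpus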
4.2 Knowledge and prediction
Just relying on the previous word to predict the next word can feel overly simplistic. Notice that the same idea of Markov chains can be applied to bi-grams (pairs of words) instead of single words. We can estimate the probability of the next word given the two previous words. In general, it is possible to extend this idea to n-grams, sequences of \(n\) words, as the input for predicting the next word.
As we would expect, taking more words than just the previous one leads to better results in predictive text. The price to pay is not only keeping track of a longer sequence of words for predictions, but also considering many more possible combinations when estimating the transition probabilities.
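The counting idea extends directly to longer contexts. In the sketch below (again with a made-up corpus), the “state” is a pair of words rather than a single word, which is what makes the number of possible combinations grow so quickly.

```python
from collections import Counter, defaultdict

def trigram_probabilities(text):
    """Estimate P(next word | two previous words): the state is now a pair of words."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        counts[(w1, w2)][w3] += 1
    return {
        state: {nxt: c / sum(f.values()) for nxt, c in f.items()}
        for state, f in counts.items()
    }

probs = trigram_probabilities("i am happy and i am here and i am happy again")
print(probs[("i", "am")])  # roughly {'happy': 0.67, 'here': 0.33} for this toy corpus
```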
Enlarging the context window for a predictive system requires more attention to the estimation of the transition probabilities. These probabilities can then be thought of as the knowledge that the system has with respect to certain corpora. The predictive system emerges with two important components: a) the knowledge it has, and b) the capacity to predict the next word given an input.
Each of these components has different strategies and algorithms involved, which can differ from implementation to implementation, but the essence is roughly the same:
- Knowledge is represented by defining probabilities.
- Prediction is obtained by taking an input sequence of words, interpreting the probabilities, and introducing random choices.
Notice that probability and randomness are important parts of the system. There are some theoretical and practical reasons for this, but one of the main points for us is the intrinsic stochastic nature of predictive text.
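As a sketch of the prediction side, suppose we already have a probability table like the one estimated earlier (here hard-coded with made-up numbers). The next word can be drawn at random, weighted by those probabilities, which is exactly where the stochastic behavior comes from: running the same code twice can give different suggestions.

```python
import random

# Knowledge: transition probabilities (hard-coded, made-up values for illustration).
knowledge = {
    "i":  {"am": 0.75, "think": 0.25},
    "am": {"happy": 0.5, "here": 0.3, "ready": 0.2},
}

def predict_next(word, knowledge):
    """Sample the next word at random, weighted by the transition probabilities."""
    options = knowledge[word]
    return random.choices(list(options), weights=list(options.values()))[0]

# Repeated calls can give different answers: the prediction is stochastic.
print([predict_next("i", knowledge) for _ in range(5)])  # e.g. ['am', 'am', 'think', 'am', 'am']
```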
4.3 Large Language Models
Notice that the above description doesn’t take into account the language’s grammar. This approach infers the structure of the language from the reference corpora, also known as the training data.
There are different language models used by linguists and computer scientists, each with strengths and weaknesses depending on the use case. It is important to notice that LLMs are a special type of language model that arose as an improvement on our simple predictive text system.
Besides the technical mathematical differences, we can focus on how LLMs define their knowledge and predict the next words.
As opposed to our predictive text system from before, LLMs are called large language models due to the amount of training data. Before, we could have the lyrics of a song, a few Wikipedia pages, or a book as possible corpora for estimating transition probabilities. In the case of LLMs, training data reaches the level of the entire internet. Yeah, the entire internet. Even beyond this, many of the current LLMs are trained on data sets including all digitized books, music, movies, etc. There have been multiple copyright lawsuits addressing the unauthorized usage of copyrighted material in training some of the biggest models.
On the prediction side, one of the most relevant differences from our earlier example is the dynamic context window used for predicting words. The main principle remains the same: given a sequence of words, which word is most likely to follow? However, LLMs take the entire word sequence given in a prompt, as opposed to just a fixed context window. This remains true when prompting follow-ups: the LLM doesn’t take only the new request as the context window, but also the initial prompt, together with the output it itself produced in response.
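A rough sketch of this growing context is shown below: each turn, the new request is appended to everything that came before, and the whole history is what the model actually conditions on. The generate_reply function is a hypothetical stand-in, not a real API.

```python
def generate_reply(context: str) -> str:
    # Stand-in for a real model call; assume it returns text generated from the context.
    return "(model reply would go here)"

conversation = []

def ask(user_message: str) -> str:
    conversation.append(f"User: {user_message}")
    # The model sees the entire history, not just the latest message.
    context = "\n".join(conversation)
    reply = generate_reply(context)
    conversation.append(f"Model: {reply}")
    return reply

# Each follow-up is answered using the initial prompt, the earlier output, and the new request.
ask("Summarize the Markov property in one sentence.")
ask("Now give an example involving predictive text.")
```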
A technical note: LLMs are trained not on words but on tokens. These can be words, chunks of words, punctuation marks, mathematical symbols, coding symbols and instructions, etc. For example, when writing “7x8=”, the LLM will predict that the most probable follow-up is a “5” and then a “6”.
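Tokenization can be inspected directly. For example, assuming the openly available tiktoken library (the tokenizer used by some OpenAI models) is installed, a sketch like the following shows how a short string is split into tokens; the exact split depends on the tokenizer.

```python
import tiktoken  # assumes: pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("7x8=56")
# Each id corresponds to a token: a word piece, a symbol, or a chunk of digits.
print(ids)
print([enc.decode([i]) for i in ids])  # e.g. ['7', 'x', '8', '=', '56'] (exact split may vary)
```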
4.4 Training
Using the entire internet to train an LLM requires not only a lot of data, but also a lot of computing. Even though the specific mathematical details of training an LLM and estimating transition probabilities in a Markov chain are different, the core principle remains the same: we need to represent knowledge on how to predict the next word given a sequence of previous words. This knowledge is represented by numbers which are stored and are known as the parameters of the model. At the time of writing this document, current models range from a few billion to over a trillion parameters (\(10^9\)–\(10^{12}\)).
All of these numbers are the result of solving or estimating mathematical equations based on the tokenization of the training data. This means first cleaning and breaking the training data into a sequence of tokens. In order to feed these tokens into mathematical models, they need to be represented as numbers, more specifically as vectors, which can be thought of as arrays of numbers.
This process is called encoding. We can think of encoding as a mathematical dictionary that equates each token with a different vector. But how is this dictionary built?
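Before getting to how the dictionary is built, here is a deliberately simplified sketch of what it could look like: literally a dictionary from tokens to vectors. The numbers below are made up, and real models learn vectors with hundreds or thousands of entries.

```python
import numpy as np

# A toy, fixed "dictionary" from tokens to vectors (real encodings are learned, not hand-written).
encoding = {
    "i":     np.array([0.2, -1.0, 0.7]),
    "am":    np.array([0.1,  0.3, -0.5]),
    "happy": np.array([0.9,  0.8,  0.4]),
}

sentence = ["i", "am", "happy"]
vectors = [encoding[token] for token in sentence]
print(vectors[0])  # the vector standing in for the token "i"
```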
The challenge is to make this dictionary as useful as possible. This means that these numbers should represent tokens and how they interact with each other in a useful way. An initial strategy addressing this problem was to have a universal dictionary that could be used for every type of language model. However, a practical issue occurs with polysemic words, as in the sentence “Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.” Here, humans are able to understand the meaning of such a sentence because the word “buffalo” changes meaning depending on its position in the sentence. In order to allow for this flexibility, LLMs don’t assume a given encoding for a word, but compute the encoding for each word in the sequence, allowing it to be different in different places. This allows the context to be updated as more information is included in the text.
This is important not only for words that can have multiple meanings, but also for words that can change meaning depending on other words:
- Yeah!
- Yeah, right.
The ability to change the encoding of tokens depending on the context is referred to as auto-encoding. This adds to the computation needed for training LLMs.
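The toy sketch below is only meant to illustrate the idea of context-dependence: the same token ends up with different vectors depending on its neighbors. It blends each token’s fixed vector with the average of its neighbors’ vectors; this is not how transformers actually compute contextual encodings, and all numbers and the mixing rule are made up.

```python
import numpy as np

# Toy fixed vectors for a tiny vocabulary (made-up numbers).
base = {"yeah": np.array([1.0, 0.0]), "right": np.array([0.0, 1.0]), "!": np.array([0.5, 0.5])}

def contextual_encoding(tokens, mix=0.3):
    """Toy context-dependent encoding: blend each token's vector with its neighbors' average.
    Only an illustration of context-dependence, not how LLMs actually compute it."""
    vectors = [base[t] for t in tokens]
    out = []
    for i, v in enumerate(vectors):
        neighbors = [vectors[j] for j in (i - 1, i + 1) if 0 <= j < len(vectors)]
        context = np.mean(neighbors, axis=0)
        out.append((1 - mix) * v + mix * context)
    return out

# The token "yeah" gets different vectors in "yeah !" versus "yeah right".
print(contextual_encoding(["yeah", "!"])[0])
print(contextual_encoding(["yeah", "right"])[0])
```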
After the auto-encoding stage, most LLMs use a combination of neural networks to compute possible next tokens. In simple terms, neural networks are mathematical functions that depend on certain parameters. The parameters are determined by minimizing the error between the predicted tokens and the actual tokens present in the training data.
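In very rough terms, “minimizing the error” means looking at the probability the model assigned to the token that actually came next and pushing that probability up. A common way to measure this is the cross-entropy, sketched below with made-up numbers.

```python
import numpy as np

# Probability the model assigned to the token that actually followed, at four training positions.
# (Made-up numbers: values near 1 are good predictions, values near 0 are poor ones.)
prob_of_correct_token = np.array([0.60, 0.05, 0.90, 0.30])

# Cross-entropy loss: the average negative log-probability of the correct tokens.
loss = -np.mean(np.log(prob_of_correct_token))
print(loss)  # training adjusts the parameters to make this number smaller
```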
These are highly intensive computational tasks due to the number of parameters (between \(10^9\) and \(10^{12}\)). For this, many computers grouped in data centers are used, and the training process takes a few months of non-stop computation.
At the time of writing, some of the newest models took between 1.5 and 3 months to train, using between 10,000 and 25,000 GPUs (graphics processing units, which are especially efficient at matrix multiplication), and consuming an estimated 5-60 GWh of energy. For reference, this would be equivalent to the annual electricity consumption of a small town with 50,000 homes. Once the training stage is completed, the energy consumption per use drops and is comparable to a regular internet action, such as visiting Wikipedia or posting something on Facebook.