$n$-gram Models¶

An $n$-gram model is the simplest kind of language model: a probabilistic model of language, meaning it assigns probabilities to sequences of words.

Think about the following sentence fragment:

The cat sat on the _____

What word should fill in the blank?

If you said

mat

then we agree! But why do we agree?

Well, you might argue that our internal language models have both assigned mat a high probability, given the preceding context.

Because we considered 5 words to predict the 6th word, we may have accessed something in our minds that behaves like a 6-gram language model - that is, an $n$-gram model with $n=6$.

Mathematical definition of probability¶

An $n$-gram language model defines a probability distribution over all known words, given some preceding context.

Let's define all of these terms mathematically, and very carefully!

Probability distributions¶

A probability distribution $p(x)$ is a function that takes an event, $x$, and outputs a probability $p(x)$. For instance, here is a probability distribution describing the process of flipping a fair coin:

$$ p(x=\text{HEADS}) = 1/2 $$ $$ p(x=\text{TAILS}) = 1/2 $$

Easy enough, right? We expect half of all coin flips to be heads, and the other half to be tails. Notice that the probabilities are between 0 and 1, and that they sum to 1:

In [ ]:
def p_coin_flip(x):
    if x == "HEADS":
        return 1/2
    elif x == "TAILS":
        return 1/2
    else:
        return 0

print('p(HEADS):', p_coin_flip("HEADS"))
print('p(TAILS):', p_coin_flip("TAILS"))

s = 0
for outcome in ["HEADS", "TAILS"]:
    s += p_coin_flip(outcome)
print('     sum:', s)
p(HEADS): 0.5
p(TAILS): 0.5
     sum: 1.0

Another simple example is rolling a fair die (like in a board game):

$$ p(x=1) = 1/6 $$ $$ p(x=2) = 1/6 $$ $$ p(x=3) = 1/6 $$ $$ p(x=4) = 1/6 $$ $$ p(x=5) = 1/6 $$ $$ p(x=6) = 1/6 $$

There are six faces on the die, and we have defined $x$ to mean "how many dots are on the side facing up after rolling it." All the sides are equally likely to end up face-up, so all six events have the same probability. Notice again that the probabilities are between 0 and 1, and that they sum to 1.

In [ ]:
def p_die6(x):
    if x == 1:
        return 1/6
    elif x == 2:
        return 1/6
    elif x == 3:
        return 1/6
    elif x == 4:
        return 1/6
    elif x == 5:
        return 1/6
    elif x == 6:
        return 1/6
    else:
        return 0

print('p(1):', p_die6(1))
print('p(2):', p_die6(2))
print('p(3):', p_die6(3))
print('p(4):', p_die6(4))
print('p(5):', p_die6(5))
print('p(6):', p_die6(6))

s = 0
for outcome in range(1, 7): # 1, 2, 3, 4, 5, 6
    s += p_die6(outcome)
print(' sum:', s)
p(1): 0.16666666666666666
p(2): 0.16666666666666666
p(3): 0.16666666666666666
p(4): 0.16666666666666666
p(5): 0.16666666666666666
p(6): 0.16666666666666666
 sum: 0.9999999999999999

Properties that probability distributions must satisfy¶

Let's call the set of all possible events $\mathcal{X}$ (sometimes called the support). For a coin flip,

$$ \mathcal{X} = \{\text{HEADS}, \text{TAILS}\} $$

For a die roll,

$$ \mathcal{X} = \{1, 2, 3, 4, 5, 6\} $$

Then, we need $p(x)$ to satisfy two properties. First, probabilities are restricted in their possible values:

$$ \forall x \in \mathcal{X} : 0 \leq p(x) \leq 1 $$

This means that for every possible outcome $x$, we need to define $x$'s probability to be between 0 and 1. A probability of 0 means it is impossible, while a probability of 1 means it is certain. Of course, nothing can be less likely than impossible, and nothing can be more likely than certain!

Second, probabilities need to be compatible with each other:

$$ \sum_{x\in\mathcal{X}} p(x) = 1 $$

This means that something must happen! If we consider every possible outcome, SOME outcome MUST happen - so the probabilities add up to 1. We are 100% certain that something will happen.

In [ ]:
def is_valid_probability_distribution(p, outcomes):

    """
    Check if a probability distribution `p` with support `outcomes` is valid.
    """

    # sum up total probability - it should be exactly 1
    s = 0

    # consider all outcomes in the support
    for outcome in outcomes:

        # make sure the probability assigned is valid
        if p(outcome) < 0 or p(outcome) > 1:
            return False # if not, the distribution is invalid!

        s += p(outcome) # if it is, keep going - add up the contribution

    # check if the sum is close to 1 - we can have precision mistakes :(
    return abs(s - 1) < 0.00001

print(
    '         coin flip:',
    is_valid_probability_distribution(p_coin_flip, ["HEADS", "TAILS"])
)

print(
    'six-sided die roll:',
    is_valid_probability_distribution(p_die6, [1, 2, 3, 4, 5, 6])
)
         coin flip: True
six-sided die roll: True

Conditional probability¶

Conditional probabilities tell us about how a probability distribution changes depending on what information we know.

Let's start by thinking about coin flips and dice.

Imagine that we carry out the following process:

  • Flip a coin. Assign the outcome of the coin flip to the variable $x$.
  • If heads, roll a 4-sided die.
  • If tails, roll a 6-sided die.
  • Assign the outcome of the die roll to the variable $y$.

Then, $p(x)$ looks familiar:

$$ p(x=\text{H}) = 1/2 $$ $$ p(x=\text{T}) = 1/2 $$

What about $p(y)$? Things are a little more complicated here. It's a bit easier to start by talking about $p(y|x)$, read as "the probability of $y$ given $x$."

If the coin is heads, 5 and 6 become impossible, because we are rolling the 4-sided die:

$$ p(y=1|x=\text{H}) = 1/4 $$ $$ p(y=2|x=\text{H}) = 1/4 $$ $$ p(y=3|x=\text{H}) = 1/4 $$ $$ p(y=4|x=\text{H}) = 1/4 $$ $$ p(y=5|x=\text{H}) = 0 $$ $$ p(y=6|x=\text{H}) = 0 $$

If the coin is tails, we have the 6-sided die:

$$ p(y=1|x=\text{T}) = 1/6 $$ $$ p(y=2|x=\text{T}) = 1/6 $$ $$ p(y=3|x=\text{T}) = 1/6 $$ $$ p(y=4|x=\text{T}) = 1/6 $$ $$ p(y=5|x=\text{T}) = 1/6 $$ $$ p(y=6|x=\text{T}) = 1/6 $$
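If it helps to see this in code, here is a minimal sketch of the conditional distribution above, in the same style as p_coin_flip and p_die6 and using the H/T abbreviations from the math (the name p_die_given_coin is just for illustration):

In [ ]:
def p_die_given_coin(y, x):
    """p(y|x): distribution over die rolls y, given the coin flip outcome x."""
    if x == "H": # 4-sided die: only 1-4 are possible
        return 1/4 if y in (1, 2, 3, 4) else 0
    elif x == "T": # 6-sided die: all of 1-6 are equally likely
        return 1/6 if y in (1, 2, 3, 4, 5, 6) else 0
    else:
        return 0

# for each fixed value of x, the probabilities over y should sum to (roughly) 1
print(sum(p_die_given_coin(y, "H") for y in range(1, 7)))
print(sum(p_die_given_coin(y, "T") for y in range(1, 7)))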

Notice how the context of the coin flip influenced our expectations regarding what numbers we might see? But we still don't quite know what $p(y)$ looks like...

Joint probability¶

Joint probabilities tell us about how likely it is that multiple events occur together. We write this as $p(x,y)$, which is read "probability of $x$ and $y$." In general,

$$ p(x,y) = p(y|x) * p(x) $$

Think about this. In order to get to the point where we roll the 4-sided die, we first need to get heads ($p(x=\text{H})=1/2$) when we flip the coin. Therefore, we will only ever consider the 4-sided die half the time.

Now, within this subset of the possible events that can happen, we can consider our $p(y|x=\text{H})$ from above. The total probability of, for instance, getting heads and then rolling a 3 is the probability of getting heads (1/2) times the probability of now getting a 3 given we just got heads (1/4).

In the above example, we have:

$$ p(x=\text{H},y=1) = p(y=1|x=\text{H}) * p(x=\text{H}) = 1/4 * 1/2 = 1/8 $$ $$ p(x=\text{H},y=2) = p(y=2|x=\text{H}) * p(x=\text{H}) = 1/4 * 1/2 = 1/8 $$ $$ p(x=\text{H},y=3) = p(y=3|x=\text{H}) * p(x=\text{H}) = 1/4 * 1/2 = 1/8 $$ $$ p(x=\text{H},y=4) = p(y=4|x=\text{H}) * p(x=\text{H}) = 1/4 * 1/2 = 1/8 $$ $$ p(x=\text{H},y=5) = p(y=5|x=\text{H}) * p(x=\text{H}) = 0 * 1/2 = 0 $$ $$ p(x=\text{H},y=6) = p(y=6|x=\text{H}) * p(x=\text{H}) = 0 * 1/2 = 0 $$

$$ p(x=\text{T},y=1) = p(y=1|x=\text{T}) * p(x=\text{T}) = 1/6 * 1/2 = 1/12 $$ $$ p(x=\text{T},y=2) = p(y=2|x=\text{T}) * p(x=\text{T}) = 1/6 * 1/2 = 1/12 $$ $$ p(x=\text{T},y=3) = p(y=3|x=\text{T}) * p(x=\text{T}) = 1/6 * 1/2 = 1/12 $$ $$ p(x=\text{T},y=4) = p(y=4|x=\text{T}) * p(x=\text{T}) = 1/6 * 1/2 = 1/12 $$ $$ p(x=\text{T},y=5) = p(y=5|x=\text{T}) * p(x=\text{T}) = 1/6 * 1/2 = 1/12 $$ $$ p(x=\text{T},y=6) = p(y=6|x=\text{T}) * p(x=\text{T}) = 1/6 * 1/2 = 1/12 $$
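As a quick check, we can compute these joint probabilities with the product rule in code. This sketch reuses the p_die_given_coin function sketched above; the names p_coin and p_joint are just for illustration:

In [ ]:
def p_coin(x):
    # p(x): a fair coin, abbreviating HEADS/TAILS to H/T as in the math above
    return 1/2 if x in ("H", "T") else 0

def p_joint(x, y):
    # p(x, y) = p(y|x) * p(x)
    return p_die_given_coin(y, x) * p_coin(x)

print(p_joint("H", 3)) # should be 1/8 = 0.125
print(p_joint("H", 5)) # should be 0
print(p_joint("T", 3)) # should be 1/12, about 0.083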

Overall, we can calculate $p(y)$ by summing up all the different ways we can get each possible value of $y$. There are two ways to get a 1, for instance:

  • We can get heads and then roll a 1 on the 4-sided die (probability $1/2 * 1/4 = 1/8$)
  • We can get tails and then roll a 1 on the 6-sided die (probability $1/2 * 1/6 = 1/12$)

$$ p(y=1) = 1/8 + 1/12 \approx 0.208 $$ $$ p(y=2) = 1/8 + 1/12 \approx 0.208 $$ $$ p(y=3) = 1/8 + 1/12 \approx 0.208 $$ $$ p(y=4) = 1/8 + 1/12 \approx 0.208 $$ $$ p(y=5) = 0 + 1/12 \approx 0.083 $$ $$ p(y=6) = 0 + 1/12 \approx 0.083 $$
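Before simulating, we can verify these numbers exactly by summing the joint probabilities over both coin outcomes (again using the sketch functions above):

In [ ]:
# p(y) = p(H, y) + p(T, y), i.e. sum the joint probability over all values of x
for y in range(1, 7):
    print(y, p_joint("H", y) + p_joint("T", y))
# expect roughly 0.208 for y = 1, 2, 3, 4 and roughly 0.083 for y = 5, 6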

Let's simulate this and see if it comes out that way:

In [ ]:
import random

# dictionary with keys 1...6, each starting at 0 (we will count observations here)
p_y = dict.fromkeys([1, 2, 3, 4, 5, 6], 0)

N = 1000 # number of observations we will make

for i in range(N): # 0, ..., N-1
    coin_flip = random.randint(0, 1) # randomly choose 0 or 1
    if coin_flip == 0:
        die_roll = random.randint(1, 4) # randomly choose 1, 2, 3, or 4
    else:
        die_roll = random.randint(1, 6) # randomly choose 1, ..., 6
    p_y[die_roll] += 1 # count our observation

for key, value in p_y.items():
    print(key, value / N) # divide by N to get observed probability
1 0.202
2 0.196
3 0.232
4 0.199
5 0.085
6 0.086

Learning a unigram model¶

Coin flips and dice are boring. Let's talk about natural language.

Let $p(w|c)$ mean the probability of observing a word $w$ given some preceding context $c$. $c$ can be anything we want - it can be the preceding word, the preceding 2 words, 3 words, ... or even nothing at all!

Let's start with the simplest case, where $c$ is nothing at all - this is called a unigram model. Why? Because every word is considered totally independently of any other word - it is a 1-gram!

Mathematically speaking, the unigram distribution is defined as follows. Consider a sequence of $n$ words:

$$ w_1, w_2, w_3, ... w_{n-2}, w_{n-1}, w_n $$

The unigram distribution is defined such that:

$$ p(w_n | w_1, w_2, ... w_{n-1}) = p(w_n) $$

That is, the preceding context is totally irrelevant - it makes no difference what words we have seen in the past!

This will be a very bad model of language, obviously. Nonetheless, let's train one!

The NLTK library is a great tool to explore this. Here is the Brown corpus, one of the first corpora developed for computational linguistics research:

In [ ]:
import nltk
from nltk.corpus import brown

try:
    words = brown.words()
except LookupError: # the corpus isn't available locally yet
    nltk.download('brown')
    words = brown.words()

words
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
Out[ ]:
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

Here is a simple way to define probabilities of words. Just count how many times each word occurs, and divide by the total number of words!

For instance, in:

the cat sat on the mat

We have 6 words, and 2 of them are "the," so we'll assign $p(\text{the}) = 2/6$.
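Here is that toy example as a quick code sketch (the variable names are just for illustration):

In [ ]:
toy_words = "the cat sat on the mat".split() # ['the', 'cat', 'sat', 'on', 'the', 'mat']
p_the = toy_words.count("the") / len(toy_words)
print(p_the) # 2/6, about 0.333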

In [ ]:
unigram_counts = {} # dictionary, a mapping

for word in words: # iterate over all the words
    if word in unigram_counts: # if we have already seen the word
        unigram_counts[word] += 1 # add 1 to its count
    else: # if this is a new word
        unigram_counts[word] = 1 # add a new entry to the dictionary, value 1

# we can access any word's count like this:
print(f'  The: {unigram_counts["The"]}')
print(f'hello: {unigram_counts["hello"]}')
  The: 7258
hello: 4

Let's now divide the counts by the sum of all the counts, giving us probabilities:

In [ ]:
# iterate over key-value pairs in dict
for key, value in unigram_counts.items():
    # re-assign the values with the normalized values
    # len(words) is the number of words in the corpus
    unigram_counts[key] = value / len(words)
In [ ]:
def p_unigram(word):
    if word in unigram_counts:
        return unigram_counts[word]
    else:
        return 0

# the corpus is case-sensitive, so we add the capitalized and lowercase variants
print(f'  p(The): {p_unigram("The")+p_unigram("the")}')
print(f'p(hello): {p_unigram("hello")+p_unigram("Hello")}')
  p(The): 0.06025790739171472
p(hello): 8.611840246918683e-06

We can also confirm that our probability distribution is valid! Calling .keys() on our dictionary gives us the support (and .values() gives the probabilities); we cast both to lists for later convenience.

In [ ]:
unigram_support = list(unigram_counts.keys())
unigram_probs = list(unigram_counts.values())
is_valid_probability_distribution(p_unigram, unigram_support)
Out[ ]:
True

Great! Let's try to generate some text from our unigram distribution.

In [ ]:
def generate_unigrams(sequence_length: int):
    sequence = [] # initialize empty sequence
    for i in range(sequence_length): # sample this many words
        choice = random.choices(
            unigram_support, # choose from the support
            weights=unigram_probs # given their probabilities
        )
        sequence += choice # add on to the sequence
    return sequence
In [ ]:
' '.join(generate_unigrams(20))
Out[ ]:
'a all a Administration one subdivision , in critical displayed of Forsythe and example walked on Report when . Minutes'

Not very good, right?

Let's try higher order $n$-grams...

Learning a bigram model¶

It's not too hard to turn this corpus into a bigram model - we just need to be careful about:

  • indexing correctly
  • accessing counts correctly
In [ ]:
# nested dictionary
bigram_counts = {}

# a corpus of N words contains N-1 bigrams
for i in range(len(words) - 1):

    # our bigram is "word1 word2"
    word1, word2 = words[i], words[i+1]

    # every word gets its own dictionary!
    if word1 not in bigram_counts:
        bigram_counts[word1] = {}

    # if it's a new bigram, set up a count
    if word2 not in bigram_counts[word1]:
        bigram_counts[word1][word2] = 1
    else: # otherwise just count
        bigram_counts[word1][word2] += 1
In [ ]:
print(bigram_counts['Fulton']['County'])
6

So, the bigram "Fulton County" occurs 6 times in the Brown corpus. Great!

How do we get probabilities from this dictionary of dictionaries?

Well, it depends on what we want.

Joint probabilities from the counts¶

Remember that the joint probability $p(x,y)$ tells us the overall probability of seeing $x$ and $y$ together, out of all possible combinations of $x$ and $y$.

If we look at any particular count in our dictionary of dictionaries, we know how many times we observed that bigram. Of course, the sum of all the counts is the total number of bigrams we saw.

So, we can divide the two to make a claim about "what proportion of all bigrams are this particular bigram?" That is our joint probability.

In [ ]:
s = 0 # total number of bigrams observed
for word1 in bigram_counts:
    for word2 in bigram_counts[word1]:
        s += bigram_counts[word1][word2]

bigram_joint_probs = {} # new dictionary of dictionaries
for word1 in bigram_counts: # for each word1
    bigram_joint_probs[word1] = {} # set up a new dictionary
    for word2 in bigram_counts[word1]: # for each word2
        # compute p(x,y) and store it
        bigram_joint_probs[word1][word2] = bigram_counts[word1][word2] / s

# p(Fulton County)
print(bigram_joint_probs['Fulton']['County'])

# earlier, we found that "Fulton County" occurs 6 times
# we also know that a corpus of N words has N-1 bigrams:
print(6 / (len(words) - 1))
5.1671085979825885e-06
5.1671085979825885e-06

Conditional probabilities from the counts¶

That's all fine and dandy, but an $n$-gram model uses conditional probabilities. How do we get those?

The right way to think about this is that each word gets its own probability distribution.

Consider a single dictionary from the counts:

In [ ]:
i = 0 # easy way to just look at a few entries
for word2 in bigram_counts['Fulton']:
    print(word2, bigram_counts['Fulton'][word2])
    i += 1
    if i == 5:
        break
County 6
Superior 2
legislators 2
taxpayers 1
ordinary's 1

We can think about this dictionary as defining a probability distribution conditioned on having just seen the word "Fulton."

We just need to normalize the counts within each dictionary:

In [ ]:
bigram_cond_probs = {} # new dictionary of dictionaries
for word1 in bigram_counts: # for each word1
    bigram_cond_probs[word1] = {} # set up a new dictionary
    s = 0
    for word2 in bigram_counts[word1]: # for each word2
        s += bigram_counts[word1][word2] # add up the counts
    for word2 in bigram_counts[word1]: # for each word2
        # compute p(y|x) and store it
        bigram_cond_probs[word1][word2] = bigram_counts[word1][word2] / s

# p(County|Fulton)
print(bigram_cond_probs['Fulton']['County'])

# sanity check! it should be:
# count(Fulton County) / count(Fulton)
count_fulton_county = bigram_counts['Fulton']['County']
count_fulton = sum(bigram_counts['Fulton'].values())
print(count_fulton_county / count_fulton)
0.35294117647058826
0.35294117647058826

Ok! Let's generate!

We'll sample a starting word from our good ol' unigram model, and from there, we'll condition on whatever word we last sampled to produce the next one!

In [ ]:
def generate_bigrams(sequence_length: int):

    sequence = generate_unigrams(1) # initialize sequence with a unigram gen.

    for i in range(sequence_length - 1): # sample this many words

        last_word_dict = bigram_cond_probs[sequence[-1]]

        choice = random.choices(
            list(last_word_dict.keys()), # choose from the support
            weights=list(last_word_dict.values()) # given their probabilities
        )
        sequence += choice # add on to the sequence

    return sequence
In [ ]:
' '.join(generate_bigrams(20))
Out[ ]:
'up good look at four hundred years , and affied unto him . `` I hold up to show a'

This sentence probably didn't make much sense as a whole - but parts of it are meaningful, right!? Isn't that odd?

Higher-order models¶

3-grams?

In [ ]:
trigram_counts = {}

# count
for i in range(len(words) - 2):

    bigram_context = (words[i], words[i + 1])

    if bigram_context not in trigram_counts:
        trigram_counts[bigram_context] = {}

    third_word = words[i + 2]

    if third_word not in trigram_counts[bigram_context]:
        trigram_counts[bigram_context][third_word] = 1
    else:
        trigram_counts[bigram_context][third_word] += 1

# normalize
for bigram_context in trigram_counts:
    s = 0
    for third_word in trigram_counts[bigram_context]:
        s += trigram_counts[bigram_context][third_word]
    for third_word in trigram_counts[bigram_context]:
        trigram_counts[bigram_context][third_word] /= s
In [ ]:
def generate_trigrams(sequence_length: int):

    sequence = generate_bigrams(2) # initialize the sequence with a bigram generation

    for i in range(sequence_length - 2):

        bigram_context = (sequence[-2], sequence[-1])

        # trigram_counts was normalized in place above, so this holds probabilities
        context_dict = trigram_counts[bigram_context]

        choice = random.choices(
            list(context_dict.keys()), # choose from the support
            weights=list(context_dict.values()) # given their probabilities
        )

        sequence += choice # add on to the sequence

    return sequence
In [ ]:
' '.join(generate_trigrams(20))
Out[ ]:
'of Af . Thus if the revenues from any kind of case studies of fluids for hydraulically operated equipment ,'

4-grams???

In [ ]:
four_gram_counts = {}

# count
for i in range(len(words) - 3):

    trigram_context = (words[i], words[i + 1], words[i + 2])

    fourth_word = words[i + 3]

    if trigram_context not in four_gram_counts:
        four_gram_counts[trigram_context] = {}

    if fourth_word not in four_gram_counts[trigram_context]:
        four_gram_counts[trigram_context][fourth_word] = 1
    else:
        four_gram_counts[trigram_context][fourth_word] += 1

for trigram_context in four_gram_counts:
    s = 0
    for fourth_word in four_gram_counts[trigram_context]:
        s += four_gram_counts[trigram_context][fourth_word]

    for fourth_word in four_gram_counts[trigram_context]:
        four_gram_counts[trigram_context][fourth_word] /= s
In [ ]:
def generate_four_grams(sequence_length: int):

    sequence = generate_trigrams(3) # initialize the sequence with a trigram generation

    for i in range(sequence_length - 3):

        trigram_context = (sequence[-3], sequence[-2], sequence[-1])

        # four_gram_counts was normalized in place above, so this holds probabilities
        context_dict = four_gram_counts[trigram_context]

        choice = random.choices(
            list(context_dict.keys()), # choose from the support
            weights=list(context_dict.values()) # given their probabilities
        )

        sequence += choice # add on to the sequence

    return sequence
In [ ]:
' '.join(generate_four_grams(20))
Out[ ]:
"Iliad has two words for the Hound Of Heaven's Pursuit ) by judicial fiat . They didn't . The Department's"

Wow. This looks like actual text, right?
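You may have noticed that the trigram and 4-gram code is nearly identical. Here is a hedged sketch of a pair of functions that builds and samples from an $n$-gram model for any $n \geq 2$ (the names build_ngram_probs and generate_ngrams are just for illustration; this is not a cell we ran above):

In [ ]:
import random

def build_ngram_probs(words, n):
    # map each (n-1)-word context tuple to a dict of next-word probabilities
    probs = {}
    for i in range(len(words) - n + 1):
        context = tuple(words[i:i + n - 1])
        next_word = words[i + n - 1]
        probs.setdefault(context, {})
        probs[context][next_word] = probs[context].get(next_word, 0) + 1
    for context in probs: # normalize each context's counts into probabilities
        total = sum(probs[context].values())
        for next_word in probs[context]:
            probs[context][next_word] /= total
    return probs

def generate_ngrams(probs, n, sequence_length):
    # start from a random context that actually occurs in the corpus
    sequence = list(random.choice(list(probs.keys())))
    while len(sequence) < sequence_length:
        context = tuple(sequence[-(n - 1):])
        # like the trigram/4-gram code above, this raises a KeyError if we wander
        # into a context that never appears with a following word in the corpus
        context_dict = probs[context]
        sequence += random.choices(
            list(context_dict.keys()),
            weights=list(context_dict.values())
        )
    return sequence

# for example:
# five_gram_probs = build_ngram_probs(words, 5)
# print(' '.join(generate_ngrams(five_gram_probs, 5, 20)))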

Concluding Remarks¶

$n$-grams are really cool, and are still in use for research purposes all the time! I am currently working on research regarding how well neural networks (like ChatGPT) can learn different $n$-gram distributions!

They also capture the basic idea behind how ChatGPT works, believe it or not! Generative transformer models like ChatGPT use a far more complex statistical framework that models this behavior much more robustly than simple counting.

What LLMs do is try to learn the right probability distribution (each dictionary within a dictionary) for EVERY possible context! Many clever mathematical and engineering breakthroughs make this possible, and I won't cover them here.

Discussion Questions¶

  • Do you think that humans are $n$-gram models? Why or why not?
  • Were you surprised by how well $n$-grams seem to mimic real text? Or do you think they are bad at this?
  • Many linguists find it upsetting that simple statistical models are so accurate. Others take comfort in the fact that statistical models struggle to accurately predict certain phenomena. Do you lean one way or the other after this exercise?
  • Did you enjoy this exercise? What questions do you still have?

Things for you to explore...¶

...if you are new to Python:¶

Think Python is an excellent, free book to learn Python.

NLTK, the Python library we used, has a short book which I have not read all the way through, but which has some nice tutorials.

...if you are interested in machine learning and deep learning:¶

The scikit-learn Python library has just about every ML algorithm you could possibly dream of, along with great tutorials.

PyTorch is probably my favorite Python library ever. It is used to build deep learning models, and I use it constantly for all sorts of fun research and experiments.

HuggingFace is a great collection of Python libraries with open-source AI models, datasets, evaluation metrics, and lots more.

...if you are interested in natural language processing (NLP):¶

Jurafsky and Martin is the bible of NLP. It is a free book, and is regularly updated. This notebook covers a lot of the same material as Chapter 3.

spaCy and Stanza (Stanford's NLP tools) are excellent for performing "classic" NLP tasks.

...if you want to learn about cutting-edge research:¶

I like Google Scholar the best to look up papers. A lot of the excellent NLP work is published in ACL venues.

And all that is the tip of the iceberg. If you are ever curious about:

  • computer science
  • linguistics
  • mathematics
  • physics
  • graduate school (master's or PhD)
  • Georgetown

Feel free to reach out! drd92@georgetown.edu will be my email for at least the next 5 years while I complete my PhD.