BERT's word embeddings

BERT is one of the seminal works in natural language processing (NLP). A team of Google researchers trained BERT to encode sequences of text tokens (sub-words) into useful numerical representations. These representations can then be used for many important NLP tasks, such as question answering or natural language inference.

Models like BERT build on earlier work such as word2vec and GloVe, which are algorithms for learning numerical vectors as representations of words.

In this notebook, I look at the static word embeddings learned by BERT. That is, BERT learns a vector for every token in its vocabulary; these live in the bottom layer of the model, before the encoder layers turn them into contextualized representations. Where do these vectors sit in relation to each other? To get a sense of this, I project them down to 2, 3, and 4 dimensions with PCA and plot the results interactively.

In [ ]:
import os
import pandas as pd
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer
from sklearn.decomposition import PCA
import plotly.express as px
import plotly.graph_objects as go

# Output directory for the exported interactive Plotly figures
os.makedirs('../plotly', exist_ok=True)
In [ ]:
vocab = AutoTokenizer.from_pretrained(
    "google-bert/bert-base-uncased",
).vocab
# Sort tokens by their vocabulary index, so that row i of the embedding
# matrix corresponds to words_in_order[i].
words_in_order = sorted(vocab.keys(), key=lambda x: vocab[x])

# The static (input) word embedding matrix: one row per vocabulary token.
X = AutoModelForMaskedLM.from_pretrained(
    "google-bert/bert-base-uncased",
    torch_dtype=torch.float16,
    device_map="cpu",
    attn_implementation="sdpa"
).bert.embeddings.word_embeddings.weight.detach()
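
As a quick sanity check, the embedding matrix should have one row per vocabulary entry. (The shapes here assume the standard bert-base-uncased configuration, with a 30,522-token vocabulary and 768-dimensional embeddings; a different checkpoint will give different numbers.)

In [ ]:
# Each row of X is the static embedding of the token at that vocabulary index.
# For bert-base-uncased this should print 30522 and torch.Size([30522, 768]).
print(len(words_in_order), X.shape)
assert X.shape[0] == len(words_in_order)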
In [ ]:
dims = (2, 3, 4)

pcas = dict.fromkeys(dims)

# Project the full embedding matrix down to 2, 3, and 4 principal components.
for dim in dims:
    pca = PCA(n_components=dim)
    pcas[dim] = pca.fit_transform(X)
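
How much of the embedding space do these few components actually capture? An optional check: after the loop, `pca` still refers to the 4-component fit, and its per-component variance ratios give an idea of how much the leading components explain (the exact percentages depend on the checkpoint).

In [ ]:
# Fraction of the total variance captured by each of the first four principal components,
# using the 4-component fit left over from the loop above.
print(pca.explained_variance_ratio_)
print(f'Cumulative: {pca.explained_variance_ratio_.cumsum()}')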
In [ ]:
dfs = dict.fromkeys(dims)

# Build one DataFrame per projection: the token string plus one column per
# principal component, so the word can be shown on hover in the plots below.
for dim in dims:
    d = {
        'word': words_in_order
    }
    for i in range(1, dim+1):
        d[f'pc{i}'] = pcas[dim][:, i-1].tolist()

    dfs[dim] = pd.DataFrame(d)
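
A quick look at one of the resulting tables: one row per token with its projected coordinates, which also makes it easy to look up where any individual word lands (here 'king', chosen arbitrarily).

In [ ]:
# One row per vocabulary token, with its coordinates in the 2-component projection.
print(dfs[2].head())

# Projected position of a single token, e.g. 'king'.
print(dfs[2][dfs[2].word == 'king'])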
In [ ]:
legend_dict = dict(
    orientation='h',
    y=-0.15,
)

marker_dict = dict(
    size=3,
    opacity=0.4,
)

layout = go.Layout(
    margin = go.layout.Margin(
        b=20,
        t=50,
    )
)
In [ ]:
fig = px.scatter(
    dfs[2],
    x='pc1',
    y='pc2',
    title='BERT: PCA in 2 dimensions',
    hover_data={'word': True, 'pc1': False, 'pc2': False},
    height=450,
    width=800
)
fig.update_layout(legend=legend_dict, margin=layout.margin, title_x=0.5)
fig.update_traces(marker=marker_dict)
fig.show()
fig.write_html('../plotly/bert_pca_2.html')
In [ ]:
fig = px.scatter_3d(
    dfs[3],
    x='pc1',
    y='pc2',
    z='pc3',
    title='BERT: PCA in 3 dimensions',
    hover_data={'word': True, 'pc1': False, 'pc2': False, 'pc3': False},
    height=450,
    width=800
)
fig.update_layout(legend=legend_dict, margin=layout.margin, title_x=0.5)
fig.update_traces(marker=marker_dict)
fig.show()
fig.write_html('../plotly/bert_pca_3.html')
In [ ]:
fig = px.scatter(
    dfs[3],
    x='pc1',
    y='pc2',
    color='pc3',
    title='BERT: PCA in 3 dimensions (third component as color)',
    hover_data={'word': True, 'pc1': False, 'pc2': False, 'pc3': False},
    height=450,
    width=800
)
fig.update_layout(legend=legend_dict, margin=layout.margin, title_x=0.5)
fig.update_traces(marker=marker_dict)
fig.show()
fig.write_html('../plotly/bert_pca_3_color.html')
In [ ]:
fig = px.scatter_3d(
    dfs[4],
    x='pc1',
    y='pc2',
    z='pc3',
    color='pc4',
    title='BERT: PCA in 4 dimensions (fourth component as color)',
    hover_data={'word': True, 'pc1': False, 'pc2': False, 'pc3': False, 'pc4': False},
    height=450,
    width=800
)
fig.update_layout(legend=legend_dict, margin=layout.margin, title_x=0.5)
fig.update_traces(marker=marker_dict)
fig.show()
fig.write_html('../plotly/bert_pca_4.html')

References

  • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., NAACL 2019)
  • Efficient Estimation of Word Representations in Vector Space (Mikolov et al., ICLR 2013)
  • GloVe: Global Vectors for Word Representation (Pennington et al., EMNLP 2014)