Principal component analysis of BERT's static embeddings

BERT’s word embeddings

BERT is one of the seminal works in natural language processing (NLP). A team of Google researchers trained BERT to encode sequences of text tokens (sub-words) into useful numerical representations. These representations could then be used for many important NLP tasks, like question answering or logical inference.

Models like BERT build on earlier work such as word2vec and GloVe, which are algorithms for learning numerical vectors as representations of words.

In this notebook, I look at the static word embeddings learned by BERT. That is, BERT has learned a vector for every token in its vocabulary: the bottom layer of the model, before these vectors are passed through the encoder layers to produce contextualized representations. Where do these static vectors live in relation to each other?
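As a concrete starting point, here is a minimal sketch of how the static embedding table can be read off the model, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (both assumptions on my part, not prescribed by the notebook below):

```python
# A minimal sketch of extracting BERT's static (input) embedding table,
# assuming the Hugging Face transformers library and bert-base-uncased.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# The word-piece embedding table is the model's input embedding layer:
# one row per vocabulary entry, before any encoder layers or position information.
with torch.no_grad():
    embeddings = model.get_input_embeddings().weight.detach().clone()

print(embeddings.shape)  # (vocab_size, hidden_size), e.g. (30522, 768)

# Map a few tokens back to their rows as a sanity check.
for token in ["dog", "cat", "##ing"]:
    idx = tokenizer.convert_tokens_to_ids(token)
    print(token, idx, embeddings[idx][:5])
```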

Figures

Below are a few projections of BERT’s static embedding space.
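For reference, a projection like these can be computed with ordinary PCA. Here is a short sketch using scikit-learn and matplotlib (my choice of libraries, not necessarily what the notebook uses), continuing from the `embeddings` tensor and `tokenizer` in the extraction sketch above:

```python
# A sketch of projecting the static embeddings onto their top two principal
# components, assuming scikit-learn and matplotlib.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# `embeddings` and `tokenizer` come from the extraction sketch above.
X = embeddings.numpy()  # (vocab_size, hidden_size)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)  # (vocab_size, 2)
print("explained variance ratio:", pca.explained_variance_ratio_)

# Scatter the full vocabulary, then label a handful of tokens for orientation.
plt.figure(figsize=(8, 8))
plt.scatter(X_2d[:, 0], X_2d[:, 1], s=1, alpha=0.2)
for token in ["the", "dog", "cat", "paris", "london", "##ing", "[CLS]", "[SEP]"]:
    idx = tokenizer.convert_tokens_to_ids(token)
    plt.annotate(token, (X_2d[idx, 0], X_2d[idx, 1]))
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("BERT static embeddings, first two principal components")
plt.show()
```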

Notebook

Below is a notebook you can use to reproduce these projections yourself.

Download this Jupyter notebook



