“Embedding” has become a central concept in machine learning. The term is pervasive. I’ve been thinking a bit about how we got here, and about what the word even means anymore.
You used to be able to read ML papers that didn’t even use the word “embedding”. I swear. We don’t even have to go back that far. When Tomas Mikolov invented word2vec and wrote about it with his colleagues circa 2013, his two most highly cited papers didn’t even use the word [1, 2]. He wrote about vectors and representations. It’s word2vec, not word2embed. Fast forward five years and, in the ELMo paper, word2vec vectors are “traditional word type embeddings” [3]. We can find the word 24 times in 15 pages.¹ A year later the BERT paper came out and had embeddings 32 times in 16 pages [4]. These are rookie numbers compared to some of the papers I’ll cite further down.
I jest. Overall this is a good trend. It’s useful to consolidate on a term and for people everywhere to have an intuitive understanding of what it means. Before this, papers had to distinguish their “vectors” and explain what they meant by “representations” or “spaces” of various sorts.² Now they don’t even define “embedding”. The word itself may have changed meaning, from referring to a specific mathematical concept to now having a purely ML meaning for an audience that might be mostly unaware of the mathematical predecessor. I’d argue that the ML use is distinct enough to be its own meaning of the word. Vocabulary evolves.
But can many of us now actually define embeddings? Do we agree on what it means? The downside of a word that’s understood in itself without reference to a definition is that the word’s meaning can change. It’s now remarkably hard to pin down what an embedding is.
(Avoiding) falsehoods about embeddings
Let’s talk about rules that we can’t use to narrow down embeddings.
Embeddings are not restricted to one modality of data. While “word embeddings” are very common, we now have image embeddings, entity embeddings, position embeddings, product embeddings, whatever. Any input can be embedded.
Embeddings are not restricted to early layers of a neural network. Although it is common to quickly transform text or entities into their own embedding spaces at the lowest layers of a neural network, these are not the only types of neural network layers that we call embeddings. Often we’ll see the penultimate layer in a larger model considered as an embedding and used for vector search.
Embeddings are not necessarily one of the last learned layers of a network. See the above description of early layers.
Embedding layers are not fundamentally different types of neural network layers. Most typically, they are traditional linear layers, aside from sometimes using a lookup table instead of a one-hot encoded vector.
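As a quick illustration, here is a minimal sketch (assuming PyTorch; the sizes are invented for the example) showing that an embedding layer’s lookup produces exactly what a plain matrix multiply against one-hot vectors would:

```python
# Minimal sketch (assuming PyTorch): an embedding lookup is just a linear map over one-hot inputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, dim = 10, 4
emb = nn.Embedding(vocab_size, dim)     # lookup-table version
ids = torch.tensor([3, 7])              # two token ids

one_hot = F.one_hot(ids, vocab_size).float()
via_matmul = one_hot @ emb.weight       # (2, 10) @ (10, 4) -> (2, 4)

print(torch.allclose(emb(ids), via_matmul))  # True: same numbers, the lookup just skips the multiply
```

The lookup table is an optimization over the one-hot multiply, not a different kind of layer.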
Embeddings don’t have to use embedding layers as defined in frameworks like PyTorch or Keras. See the earlier example of considering a penultimate layer—usually created as a traditional linear layer without the lookup functionality of built-in embedding layers—as an embedding.
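For instance, here’s a sketch (assuming torchvision is available; the model choice is arbitrary) of pulling the penultimate layer out of an ordinary image classifier and treating its output as an embedding:

```python
# Sketch (assuming torchvision): repurpose a classifier's penultimate activations as embeddings.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=None)   # any classifier works; in practice you'd load pretrained weights
model.fc = nn.Identity()                # drop the final classification layer
model.eval()

with torch.no_grad():
    images = torch.randn(2, 3, 224, 224)    # stand-ins for real images
    embeddings = model(images)              # shape (2, 512): never built as an "embedding layer", used as one anyway
```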
Embeddings don’t have to specifically refer to an atomic unit. “Word embeddings” led to “sentence embeddings” and “paragraph embeddings”, and more recent language models use a context window that doesn’t need any such boundary. Furthermore, embeddings can refer to more abstract internal state, such as ELMo’s “deep contextualized” embeddings which can be “a function of all of the internal layers” [3].
A single embedding doesn’t have to refer to one single type of input. We now have “deep multimodal embeddings” formed in deep layers out of various inputs connected together [5].³
Embeddings don’t have to be from a neural network at all. These days they are, but that’s because neural networks are the predominant model type for multidimensional data. While the embedding concept doesn’t make as much sense in a decision tree, there are other model types that have analogies to neural network embeddings.
Embeddings don’t have to be directly comparable in vector space. Famously from word2vec and a million examples since, “vector("King") - vector("Man") + vector("Woman") results in a vector that is closest to the vector representation of the word Queen” [1]. This type of output is very common and useful for vector search with highly optimized cosine similarity (or another distance metric). However, embeddings do not always have this property. When used in early layers, the resulting vectors merely need to be useful for the rest of the neural network. Even when used in later layers, embeddings may not have meaningful distance relationships unless specifically tuned for that property. See the Sentence-BERT paper, which shows that the last layer of BERT performs poorly when two items are compared with distance metrics [6].⁴ You could argue that this layer of BERT is just a normal neural network layer and not an embedding, but you’d be behind general usage of the term. The Sentence-BERT paper explicitly says these are “known as BERT embeddings” and their use yields “bad sentence embeddings” [6]. In this terminology, it’s their subsequent usage by other ML practitioners, not their construction or purpose, that defines them as embeddings.
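To make the analogy arithmetic concrete, here’s a toy sketch with made-up vectors (not real word2vec weights) that happen to be arranged so the trick works:

```python
# Toy sketch of "king - man + woman ~ queen"; the vectors are invented so the analogy holds.
import numpy as np

vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.1, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
    "apple": np.array([0.1, 0.9, 0.2]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

query = vectors["king"] - vectors["man"] + vectors["woman"]
candidates = [w for w in vectors if w not in {"king", "man", "woman"}]
print(max(candidates, key=lambda w: cosine(query, vectors[w])))  # queen
```

Nothing guarantees this works for an arbitrary layer’s outputs; it works here only because the space was arranged, or in real models trained, to support it.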
Embeddings don’t have to translate a high-dimensional vector into a low-dimensional vector. This one is more debatable; many definitions disagree on this point. But consider, for example, the entity embeddings paper for categorical variables, in which day of week is embedded from 7 values into 6 dimensions, months from 12 to 6, years in the corpus from 3 to 2, and price promotions from 2 to 1 [7].⁵ While these are all reductions in dimensions, they certainly aren’t all reductions from high dimensions to low dimensions, even relatively. The point of embeddings in that paper is representation rather than dimensionality. For another example, early GPT models have a “position embedding” with 768 dimensions from a context length of 1024, and surely we couldn’t simultaneously rate 1024 as high dimensional and 768 as low dimensional. More generally, considering how float dimensions are already much more expressive than ordinal or one-hot encoded inputs, there isn’t anything fundamentally stopping modelers from even increasing the number of dimensions. It is fairly normal for neural network designs to take a small number of features as input and immediately widen to more neurons in the next layer, and one could nowadays describe that as transforming features into a “feature embedding” with higher dimensionality.
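A sketch of that kind of setup (assuming PyTorch, with the sizes quoted above from the paper) makes it obvious that nothing dramatically “high-dimensional” is being compressed:

```python
# Sketch (assuming PyTorch) of entity embeddings with the modest sizes reported in [7].
import torch
import torch.nn as nn

day_emb   = nn.Embedding(7, 6)    # day of week: 7 values -> 6 dimensions
month_emb = nn.Embedding(12, 6)   # month: 12 -> 6
year_emb  = nn.Embedding(3, 2)    # years present in the corpus: 3 -> 2
promo_emb = nn.Embedding(2, 1)    # price promotion flag: 2 -> 1

day, month, year, promo = (torch.tensor([v]) for v in (2, 10, 1, 1))
features = torch.cat([day_emb(day), month_emb(month), year_emb(year), promo_emb(promo)], dim=-1)
print(features.shape)  # torch.Size([1, 15]) -- concatenated and passed to the rest of the network
```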
Putting most of that together, embeddings may or may not be directly comparable with distance metrics, they may or may not compress dimensionality, they may come from any layer in a neural network or plausibly from other model types, they might refer to single inputs or to a combination of all features, and something can become known as an embedding retroactively due to usage.
Does that leave us with anything for a meaningful definition?
We can start to see how pretty much any layer in a neural network, of pretty much any model purpose, can be considered an embedding. We’re not terribly far off from that usage.
Most papers don’t bother to define embeddings in general; instead they describe the properties of the embeddings they want to use. Some want compression, some want cosine similarity.
Google has a course that valiantly attempts a definition: “An embedding is a relatively low-dimensional space into which you can translate high-dimensional vectors”. This definition has significant overlap with common use of the term, but neither contains all uses nor is fully contained by them. Not all transformations that are now described as embeddings involve dimensionality reduction from high-dimensional vectors to low-dimensional vectors. In the other direction, this definition isn’t exclusive enough. If I use hashing algorithms to encode strings into small numbers, transforming long representations into a shorter, fixed-length one, is that an embedding? Is every dimensionality reduction an embedding?
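Here’s the kind of thing I mean, a throwaway example using Python’s hashlib: it maps arbitrarily long strings down to a small fixed-size code, which satisfies the “high-dimensional to low-dimensional” framing, yet nothing about it is learned and nobody would call it an embedding.

```python
# A fixed-size, dimension-reducing encoding that nobody calls an embedding, because nothing is learned.
import hashlib

def tiny_code(s: str, buckets: int = 256) -> int:
    digest = hashlib.sha256(s.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % buckets

print(tiny_code("embedding"), tiny_code("representation"))
```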
For such a commonly used concept, “embedding” has become hard to define. We now intrinsically relate to the concept of embeddings because they have become so prominent in machine learning, so we don’t need a definition to know what they are. It’s like the widely repeated phrase “If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck”. We have had a generation of ML practitioners raised since we started calling these things “embeddings”.
I much prefer the definition that Roy Keyes created:
Embeddings are learned transformations to make data more useful
There we go. It emphasizes “learned”. While I suppose it is possible for someone to hand-define a custom dimensional space for some input, it doesn’t seem feasible and these days it isn’t done. For all of my earlier citations, it is fundamentally important that their embeddings are learned by a model. The algorithm discovers useful relationships in the data by learning how to define its embedding space. “Transformations” emphasizes that the data changes. And “more useful” is helpful too: it reminds us that these transformations exist for a purpose and are of central importance in the learning process.
Pushing the boundary
There are some trade-offs with writing the shortest definition of something. For one, in most uses embeddings are about the space of resulting features, rather than the function which does the transformation. Yet I wouldn’t want to use the word “space”, because that’s a less common term. If someone needs to learn what a “space” is, the easiest way to teach them might be by describing embeddings, which would be quite circular. The simplest suitable modification I could make to Roy Keyes’s definition is “Embeddings are how data changes when applying learned transformations to make the data more useful”, and that’s starting to get a bit wordy. It also might not be quite right, since it can be interpreted as referring to specific data rather than the space itself.
I’m also unsure of how the definition could be more restricted. Here’s a question to ponder: is a probability, from a model, an embedding? By Google’s definition it could be, since we translate inputs from more dimensions down to a single one. Roy Keyes’s definition might still fit. The probability is certainly learned, it is derived from a transformation from the original inputs, and it comes from data. The most debatable part might be “more useful”. We could say that embeddings are more useful than their inputs considering how successful transfer learning has been at applying embeddings to other tasks. A probability might only be applicable to a single task. But there is a long successful history of applying existing probabilities to new problems. Further, for some forms of data, a particular probabilistic output might be the only purpose for the data. The data might not be useful for anything else, such that the probability might genuinely be more useful than the input data.
So, if we’re too strictly bound to that definition, probabilities can be embeddings. Now that I’ve written it down, I’m sure it’ll happen, if it hasn’t already. But please, let’s not. It fails the duck test. If it looks like an embedding, gets used like an embedding, and gets called an embedding, then it probably is an embedding.
We’ve already stretched the definition of embeddings far enough that it’s become our general purpose word for describing the space of pretty much any neural network layer. If we push it any further then people might give up on using the word entirely, and either cycle back to older terms (I’m all for “latent spaces”) or, I fear, invent new ones.
Embeddings usually refer to a space that retains much of the information from the source data, compressing information into fewer dimensions and transforming it to make it more useful for downstream layers or tasks. A probability is too lossy. The simplest modification to the definition I could come up with is “Embeddings are learned transformations to make data more useful without much information loss”.
Putting both changes together, we get:
Embeddings are how data changes when applying learned transformations to make the data more useful without much information loss
That’s not the shortest definition anymore, nor the cleanest. By the time you get to the end of the sentence you might forget where the sentence started. Besides, I’ll have to revise it once people actually invent “probability embeddings”, which could be any day now.⁶
So I think I’ll just stick with “Embeddings are learned transformations to make data more useful”, which is a joy to say and captures the breadth of how we use the word now.
[1] Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. NIPS.
[2] Mikolov, T., Chen, K., Corrado, G.S., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. International Conference on Learning Representations.
[3] Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep Contextualized Word Representations. North American Chapter of the Association for Computational Linguistics.
[4] Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805.
[5] Sung, J., Lenz, I., & Saxena, A. (2017). Deep multimodal embedding: Manipulating novel objects with point-clouds, language and trajectories. 2017 IEEE International Conference on Robotics and Automation (ICRA), 2794-2801.
[6] Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv:1908.10084.
[7] Guo, C., & Berkhahn, F. (2016). Entity Embeddings of Categorical Variables. arXiv:1604.06737.
1. Counting all variations. “Embedding”, “embeddings”, “embed”, etc.
2. Not that these words aren’t still heavily used.
3. 68 embedding references in 8 pages, but who’s counting?
4. 87 embedding references in 11 pages. Okay, I’m counting.
5. 96 in 9.
6. I actually searched for this.