Vector embeddings explained

Posted by Venkatesh Subramanian on September 10, 2024 · 6 mins read

What are embeddings?
Embeddings are a way of representing unstructured data such as text, sentences, documents, and images using floating point numbers. High-dimensional, complex data is compressed into a lower-dimensional vector while retaining the most important semantic and syntactic information. However, as with lossy compression, some fine-grained details or nuances are inevitably lost in the process.
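To get a concrete feel for what an embedding looks like, here is a minimal sketch using the open-source sentence-transformers library. The model name "all-MiniLM-L6-v2" is just one commonly used choice, not a recommendation specific to this article:

```python
# A minimal sketch: turning text into a fixed-length vector of floats.
# Assumes the sentence-transformers package is installed; "all-MiniLM-L6-v2"
# is one commonly used model, not the only option.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

embedding = model.encode("The cat sat on the mat.")
print(embedding.shape)   # e.g. (384,) -- a 384-dimensional vector
print(embedding[:5])     # first few floating point components
```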

Examples of Cat, King, Queen, Man et al.!
Let’s take the example of representing an animal like a cat in just 2 dimensions- say the number of legs and diet. In this case the x,y point representing the cat in the coordinate system will be relatively close to the x,y point representing a dog, as both have 4 legs and a similar carnivorous diet. Whereas the points representing herbivores like cows, or birds like crows, will sit farther away in a different region.
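Here is a toy sketch of that 2-dimensional picture. The coordinates are invented purely for illustration:

```python
# A toy sketch of the 2-D example: (number of legs, "carnivore score" for diet).
# The coordinates below are made up purely for illustration.
import math

animals = {
    "cat":  (4, 0.9),
    "dog":  (4, 0.8),
    "cow":  (4, 0.0),
    "crow": (2, 0.5),
}

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean(animals["cat"], animals["dog"]))   # small: cat and dog sit close together
print(euclidean(animals["cat"], animals["crow"]))  # larger: crow sits farther away
```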
Expand the thinking from 2 dimensions to n dimensions, and you can now represent any entity, based on its "n" features, anywhere in this high-dimensional space. The distance between any two entities can be calculated using the same techniques we use in 2- or 3-dimensional spaces, so it is easy to figure out which entities are similar and how similar they are.
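One common measure in higher dimensions is cosine similarity, which compares the direction of two vectors rather than their raw distance. A small sketch, using made-up feature vectors:

```python
# Cosine similarity generalises the 2-D intuition to any number of dimensions:
# vectors pointing in similar directions score close to 1.0.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 5-dimensional "feature" vectors, just to show the mechanics.
cat = np.array([0.9, 0.1, 0.8, 0.3, 0.7])
dog = np.array([0.8, 0.2, 0.9, 0.4, 0.6])
car = np.array([0.0, 0.9, 0.1, 0.8, 0.0])

print(cosine_similarity(cat, dog))  # high: similar entities
print(cosine_similarity(cat, car))  # low: dissimilar entities
```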
A cat in higher dimensional space may have attributes such as size, animal vs object, domestic vs wild, relationship with human, quadruped vs biped, predatory index, nocturnal vs diurnal, feline traits index, social behaviors, speed or agility, skin texture, sharpness of teeth, aggression etc.
The actual significance of each dimension emerges from data and is often complex to directly interpret.
AI models convert a word such as “cat” into an embedding using a combination of training on large datasets and mathematical transformations. The training data provides enough examples of how a word like “cat” appears in various contexts, and of the relationships between words that are used together. Initially the AI model assigns a random vector to each unique word. Over time, these vectors get updated based on how words co-occur in text. Cat and dog may end up with similar embeddings because they have a lot in common and appear in similar contexts, such as being beloved pets or four-legged domestic creatures in our homes.
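As a rough sketch of this co-occurrence-driven training, here is what it might look like with the open-source gensim library (assuming gensim 4.x; the tiny corpus below is invented, so the learned vectors are only illustrative):

```python
# A rough sketch of co-occurrence-driven training with gensim's Word2Vec.
# Assumes gensim 4.x is installed; the tiny corpus is invented, so the
# resulting vectors are only illustrative.
from gensim.models import Word2Vec

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "loved", "pets"],
]

# Each word starts with a random vector that is nudged, epoch by epoch,
# to better predict the words that appear around it.
model = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, epochs=50)

print(model.wv["cat"][:5])                # first few components of the learned vector
print(model.wv.similarity("cat", "dog"))  # co-occurring words drift closer together
```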
Models use optimisation algorithms such as gradient descent to minimise the prediction error, refining the embedding values over time. Consider the classic example of “King - Man” being nearly equal to “Queen - Woman”. The model learns that there is a similar difference, or vector offset, between “King” and “Man” as there is between “Queen” and “Woman”. It also learns that “King” and “Queen” are similar, as they are both types of royalty; likewise “Man” and “Woman” are similar, as they are both human genders. The vector pointing from “Man” to “King” represents the concept of “male royalty”, while the vector pointing from “Woman” to “Queen” represents the concept of “female royalty”. Hence the mathematical subtraction above works, and the model uses this to encode the relationship in the embedding space. The same approach generalises to other sets of words such as cities, countries, and professions, which allows the model to perform analogical reasoning and other relational tasks in a mathematical way.
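A sketch of that arithmetic using pretrained GloVe vectors fetched through gensim's downloader (this assumes gensim is installed and the "glove-wiki-gigaword-50" vectors can be downloaded; any pretrained word vectors would work):

```python
# A sketch of the "King - Man + Woman ≈ Queen" arithmetic using pretrained
# GloVe vectors (assumes gensim is installed and the vectors can be fetched).
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")

# Equivalent to: vector("king") - vector("man") + vector("woman"),
# then finding the word whose embedding lies closest to the result.
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # typically returns "queen" as the nearest word with these vectors
```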

So why do we embed versus use raw text?
Searching through a mathematical vector space is much faster than parsing large raw documents, so you gain a lot of runtime efficiency.
Clustering, classification, and recommendation tasks are also significantly more efficient in a lower-dimensional vector space than over raw data. The tradeoff with embeddings is a loss of detail in return for speed. However, you can use embeddings side by side with the raw data: use the embeddings to search and plot, then use the raw text to extract granular details once the data has been located, as the sketch below shows!
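A minimal sketch of that hybrid pattern, assuming a small in-memory collection of documents and an embedding model (sentence-transformers again, as an assumption; in a real application the document vectors would typically live in a vector database):

```python
# A minimal sketch of the hybrid pattern: search with embeddings, then
# return the raw text of the best matches for detailed inspection.
# Assumes sentence-transformers is installed; in production the vectors
# would usually live in a vector database rather than a Python list.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Cats are small domestic felines that enjoy sleeping.",
    "Queens and kings are members of a royal family.",
    "Gradient descent minimises a model's prediction error.",
]
doc_vectors = model.encode(documents)  # one embedding per raw document

def search(query: str, top_k: int = 2):
    q = model.encode(query)
    # Cosine similarity between the query vector and every document vector.
    scores = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    best = np.argsort(scores)[::-1][:top_k]
    # Step back from vector space to the raw text for the fine-grained details.
    return [(float(scores[i]), documents[i]) for i in best]

print(search("royalty and monarchs"))
```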

Some popular use cases where these techniques are used:

  • Search engines such as Google, Bing etc.
  • Legal document search and discovery such as LexisNexis.
  • E-commerce search and recommendations such as Amazon and Walmart.
  • Document search in enterprise knowledge base.
  • Customer service chatbots.
  • Media recommendations such as Netflix.
  • Research paper retrieval and search such as Semantic Scholar.

Summarizing
Vector embeddings and raw data analysis complement each other perfectly in real-world applications. Embeddings compress raw, unstructured, noisy data into dense, meaningful representations that power efficient search, while the raw data itself preserves the nuances and context once the application is ready to zoom in from the vector space into the actual atoms and bits of the data. While we have used examples of simple words, you can extrapolate this to sentences, documents, blog posts, images, multimedia, and more.
In a future article we will dive into the power of vector databases to store and process these embeddings.

