In an era increasingly dominated by generative AI and large language models, it is worth stepping back to review the architecture that powers all this magic: the Transformer.
No, we’re not referring to shape-shifting robots, but rather to the revolutionary framework that fuels the likes of Siri, Google Translate, and ChatGPT.
Back in 2017, a paper titled “Attention Is All You Need” by Vaswani et al. marked a pivotal turning point in the field of artificial intelligence. It introduced a novel architecture that would evolve into the Transformers we recognize today. Departing from the constraints of sequential processing, the paper introduced the concept of self-attention, enabling models to focus on different elements of the input simultaneously. This laid the foundation for groundbreaking AI models like BERT and GPT, reshaping the landscape of natural language processing and extending its influence far beyond.
The Transformer model comprises two core components: the Encoder and the Decoder, which can be fused to form the Encoder-Decoder architecture for tasks like machine translation.
Let’s delve into each component.
The Encoder: Responsible for processing the input sequence and converting it into continuous representations known as “encoder hidden states”. These states are vector embeddings that capture the contextual information of the input tokens, including their semantics as well as their order in the sequence, which is injected through positional encoding.
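To make the positional part concrete, here is a minimal NumPy sketch of the sinusoidal positional encoding described in “Attention Is All You Need”; the function name, sequence length, and model dimension below are illustrative choices, not fixed by the architecture.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of sinusoidal position encodings."""
    positions = np.arange(seq_len)[:, None]               # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                     # (1, d_model)
    # Each pair of dimensions oscillates at a different wavelength.
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                       # (seq_len, d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])            # even dimensions: sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])            # odd dimensions: cosine
    return encoding

# Token embeddings simply have the positional encoding added to them
# before entering the first encoder layer.
token_embeddings = np.random.randn(9, 512)                 # e.g. 9 tokens, d_model = 512
encoder_input = token_embeddings + sinusoidal_positional_encoding(9, 512)
```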
The encoder comprises multiple layers, each featuring two primary sub-layers:
a) Multi-head self-attention layer: This layer computes attention weights among all input tokens, enabling each token to focus on relevant counterparts in the sequence. Multiple sets of learnable weight matrices, or “heads,” allow the model to capture different aspects of token relationships. Think of them as distinct readers, each with expertise in specific aspects of the sequence.
For example, imagine you have a sentence, “The quick brown fox jumps over the lazy dog”. One head might be good at understanding word proximity, focusing on how each word relates to its neighboring words. Another head might be excellent at recognizing subject-verb relationships, capturing the connection between “fox” and “jumps.” Together, these heads understand the sentence from various angles.
b) Feed-Forward Neural Network Layer: After self-attention, the encoder applies an independent feed-forward neural network to each token, further refining the contextual representations (a minimal sketch of both sub-layers follows below).
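Here is a minimal NumPy sketch of one encoder layer showing the two sub-layers; the weight matrices are randomly initialized stand-ins, and the residual connections and layer normalization that the full architecture also uses are omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, num_heads, Wq, Wk, Wv, Wo):
    """x: (seq_len, d_model). Every token attends to every other token, per head."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    Q, K, V = x @ Wq, x @ Wk, x @ Wv                        # each (seq_len, d_model)
    # Split into heads: (num_heads, seq_len, d_head), so each "reader" works separately.
    split = lambda m: m.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    weights = softmax(scores, axis=-1)                       # attention weights per head
    heads = weights @ Vh                                     # (heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo                                       # mix the heads back together

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN, applied to each token independently."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

# Toy dimensions and random weights purely for illustration.
seq_len, d_model, num_heads, d_ff = 9, 64, 8, 256
x = np.random.randn(seq_len, d_model)
Wq, Wk, Wv, Wo = (np.random.randn(d_model, d_model) * 0.1 for _ in range(4))
W1, b1 = np.random.randn(d_model, d_ff) * 0.1, np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model) * 0.1, np.zeros(d_model)

attended = multi_head_self_attention(x, num_heads, Wq, Wk, Wv, Wo)
layer_output = feed_forward(attended, W1, b1, W2, b2)        # (seq_len, d_model)
```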
The Decoder: Taking the encoder hidden states as input, the decoder generates the output sequence token by token, employing both self-attention and cross-attention mechanisms. Similar to the encoder, the decoder comprises multiple layers, with each layer containing three sub-layers:
a) Masked Multi-Head Self-Attention Layer: This layer lets each decoder token attend to other decoder tokens while excluding future tokens. This “masking” ensures the model has no access to future data during decoding, which is crucial for auto-regressive generation. Auto-regression is just a fancy way of saying that each output token is predicted based on the tokens generated before it (illustrated in the sketch after this list).
b) Multi-Head Cross-Attention Layer: This layer lets the decoder focus on the encoder hidden states, so that the relevant parts of the input sequence can be drawn on while generating each output token.
c) Feed-Forward Neural Network Layer: Analogous to the encoder, the decoder also employs feed-forward neural networks to refine the contextual representations.
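Below is a minimal NumPy sketch of the first two decoder sub-layers; it uses a single head and skips the learned projection matrices, residual connections, and layer normalization, so it illustrates the masking and cross-attention wiring rather than being a faithful implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, mask=None):
    """Single-head scaled dot-product attention with an optional additive mask."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if mask is not None:
        scores = scores + mask            # masked positions become -inf -> weight 0
    return softmax(scores) @ V

tgt_len, src_len, d_model = 5, 9, 64
decoder_x = np.random.randn(tgt_len, d_model)        # tokens generated so far
encoder_states = np.random.randn(src_len, d_model)   # encoder hidden states

# a) Masked self-attention: an upper-triangular -inf mask hides future positions,
#    so token i can only attend to tokens 0..i.
causal_mask = np.triu(np.full((tgt_len, tgt_len), -np.inf), k=1)
self_attended = attention(decoder_x, decoder_x, decoder_x, mask=causal_mask)

# b) Cross-attention: queries come from the decoder, keys and values from the
#    encoder, letting every output position look at the whole input sequence.
cross_attended = attention(self_attended, encoder_states, encoder_states)
```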
Self-Attention Mechanisms:
The self-attention described above operates on query, key, and value vectors derived from the input sequence it processes.
Let’s use the analogy of searching a database of documents for certain content, say on renewable energy. Each document is represented as a vector in a high-dimensional space, capturing its unique semantic content. You represent your project’s focus on renewable energy sources as a query vector in the same high-dimensional space. This query vector captures the essence of what you’re looking for. By computing the similarity (dot product, cosine similarity, etc.) between your query vector and the key vectors of all documents, you determine how closely each document matches your project’s focus, and the best-matching documents’ content, their value vectors, is what you retrieve. The Transformer’s self-attention works in a similar manner on its input data.
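Here is a toy NumPy version of that analogy; the key, value, and query vectors are invented numbers purely for illustration, not taken from any real model.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

# Each row is a document's key vector (its topic signature); the values hold the
# content representation we actually want to retrieve. The numbers are made up.
keys = np.array([[0.9, 0.1, 0.0],    # doc 0: mostly about solar power
                 [0.1, 0.8, 0.1],    # doc 1: mostly about fossil fuels
                 [0.7, 0.2, 0.1]])   # doc 2: mixed, leaning renewable
values = np.random.randn(3, 4)       # content vectors for the three documents

# The query expresses what we are searching for: renewable energy.
query = np.array([1.0, 0.0, 0.2])

scores = keys @ query / np.sqrt(query.shape[0])   # scaled dot-product similarity
weights = softmax(scores)                         # attention weights over documents
result = weights @ values                         # weighted blend of matching content

print(np.round(weights, 2))   # docs 0 and 2 get more weight than doc 1
```

In self-attention the same thing happens for every token at once: each token’s query is compared against the keys of all tokens, and the resulting weights blend their value vectors into a new, context-aware representation.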
The Transformer’s impact spans a plethora of use cases. In sentiment analysis, the encoder processes input text via self-attention and feed-forward networks for accurate predictions. ChatGPT uses the decoder to generate text-based responses in conversations, while image captioning unites the encoder’s visual feature representation with the decoder’s contextual text generation, blending visual and linguistic understanding seamlessly.
As AI’s evolution gains momentum, Transformers stand as the transformative cornerstone, reshaping our world in ways once deemed unimaginable.