LLM

How to scale Large Language Models (LLMs) to infinite context?

Sarfraz Nawaz

CEO and Founder of Ampcome


Author :

Sarfraz Nawaz

Sarfraz Nawaz is the CEO and founder of Ampcome, which is at the forefront of Artificial Intelligence (AI) Development. Nawaz's passion for technology is matched by his commitment to creating solutions that drive real-world results. Under his leadership, Ampcome's team of talented engineers and developers craft innovative IT solutions that empower businesses to thrive in the ever-evolving technological landscape. Ampcome's success is a testament to Nawaz's dedication to excellence and his unwavering belief in the transformative power of technology.

Topic

LLM

Imagine having an assistant who keeps forgetting the tasks and discussions of previous days. She may handle your current requests, but she cannot execute complex tasks that depend on past briefings or meeting notes.

This is what happens with current Large Language Models. LLMs can be great at question answering, and generating code, images, and videos. However, their limited context window hinders their ability to understand complex tasks or maintain a coherent conversation.

Given a long query, the LLM may fail to process the entire prompt, miss crucial details, and produce inaccurate results.

In a long conversation, the LLM may fail to build on earlier ideas and start hallucinating.

This has restricted the abilities of LLMs and their effective applications across industries.

A recent breakthrough from Google introduces a revolutionary technique called “infini-attention” that scales the context window of LLMs to infinitely long inputs with bounded memory and computation.

This blog aims to decode this “infini-attention” technique.


What is Google’s Infini-attention technique?

Google researchers released a paper, “Leave No Context Behind”, in which they introduced a novel technique called “infini-attention”. The technique aims to scale the context length of Large Language Models with bounded memory and computation.

With a slight modification to the Transformer architecture, the technique enables the LLM to handle infinitely long contexts, process longer queries efficiently, and generate more accurate results.

That’s not all. The technique also boosts the memory of the LLM by introducing compressive memory. By combining compressive memory with local attention, the researchers found a way for LLMs to generate highly context-relevant output.

Let’s understand the whole process.

The infini-attention technique integrates a compressive memory into the standard Transformer attention mechanism.

Unlike the standard attention mechanism, where the memory footprint grows quadratically with the sequence length, the infini-attention technique maintains a fixed number of parameters to store and retrieve information.
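
To make the scaling difference concrete, here is a rough back-of-the-envelope sketch in Python. It counts entries per attention head only. The head size of 128 is an assumed value for illustration, and the function names are hypothetical. The memory size of d_key × d_value + d_key follows the paper's description of the compressive memory as a matrix plus a normalization vector.

```python
def attention_matrix_entries(seq_len: int) -> int:
    """Entries in one head's full attention score matrix (seq_len x seq_len)."""
    return seq_len * seq_len


def compressive_memory_entries(d_key: int, d_value: int) -> int:
    """Entries in one head's compressive memory: a d_key x d_value matrix
    plus a d_key normalization vector. Independent of sequence length."""
    return d_key * d_value + d_key


# Assumed head dimensions, chosen only for illustration.
D_KEY = 128
D_VALUE = 128

for seq_len in (1_000, 10_000, 100_000, 1_000_000):
    print(
        f"tokens={seq_len:>9,}  "
        f"attention entries={attention_matrix_entries(seq_len):>17,}  "
        f"memory entries={compressive_memory_entries(D_KEY, D_VALUE):,}"
    )
```

The attention column grows a hundredfold for every tenfold increase in tokens, while the memory column stays the same on every row.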

This is achieved by combining the local attention mechanism with a long-term linear attention mechanism, which the compressive memory makes possible. Together they enable the continuous collection, storage, and updating of the entire context history.
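
The local half of that combination is ordinary causal dot-product attention applied to one segment at a time. The following is a minimal pure-Python sketch of that idea, not the paper's implementation. The function name and the tiny example inputs are hypothetical, and a real model would use batched tensor operations instead of lists.

```python
import math
from typing import List


def local_causal_attention(
    queries: List[List[float]],
    keys: List[List[float]],
    values: List[List[float]],
) -> List[List[float]]:
    """Scaled dot-product attention within one segment.

    Position t may only attend to positions 0..t (causal mask).
    """
    d_key = len(keys[0])
    d_value = len(values[0])
    outputs = []
    for t, query in enumerate(queries):
        # Scores against the current and earlier positions only.
        scores = [
            sum(q * k for q, k in zip(query, keys[s])) / math.sqrt(d_key)
            for s in range(t + 1)
        ]
        # Numerically stable softmax.
        peak = max(scores)
        weights = [math.exp(score - peak) for score in scores]
        total = sum(weights)
        outputs.append(
            [
                sum(w * values[s][j] for s, w in enumerate(weights)) / total
                for j in range(d_value)
            ]
        )
    return outputs


# Hypothetical two-token segment.
segment_output = local_causal_attention(
    queries=[[1.0, 0.0], [0.0, 1.0]],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[1.0, 2.0], [3.0, 4.0]],
)
```

Because the score list for a segment of length n has up to n entries per position, this part is kept local to a segment rather than applied to the whole history.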

In other words, instead of discarding the old key-value (KV) states, infini-attention stores them in the compressive memory. When processing subsequent sequences, it retrieves values from that memory using the attention query states.
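
A minimal sketch of such a store-and-retrieve memory follows, assuming the linear-attention style update described in the paper: the memory matrix accumulates σ(K)ᵀV, a normalization vector accumulates σ(K), and σ is ELU + 1. The class and method names are hypothetical, and the zero-denominator guard for an empty memory is an addition made for this example.

```python
import math
from typing import List


def elu_plus_one(x: float) -> float:
    """sigma(x) = ELU(x) + 1, which is always positive."""
    return x + 1.0 if x > 0 else math.exp(x)


class CompressiveMemory:
    """Fixed-size memory: a d_key x d_value matrix M and a d_key vector z."""

    def __init__(self, d_key: int, d_value: int) -> None:
        self.M = [[0.0] * d_value for _ in range(d_key)]
        self.z = [0.0] * d_key

    def update(self, keys: List[List[float]], values: List[List[float]]) -> None:
        """Fold one segment's KV states into memory without growing it."""
        for key, value in zip(keys, values):
            for i, k in enumerate(key):
                sigma_k = elu_plus_one(k)
                self.z[i] += sigma_k
                for j, v in enumerate(value):
                    self.M[i][j] += sigma_k * v

    def retrieve(self, query: List[float]) -> List[float]:
        """Read a value for one query: sigma(q) M / (sigma(q) . z)."""
        sigma_q = [elu_plus_one(q) for q in query]
        denom = sum(sq * z for sq, z in zip(sigma_q, self.z))
        d_value = len(self.M[0])
        if denom == 0.0:
            # Empty memory (guard added for this sketch).
            return [0.0] * d_value
        return [
            sum(sq * self.M[i][j] for i, sq in enumerate(sigma_q)) / denom
            for j in range(d_value)
        ]


# Hypothetical usage: store one key-value pair, then query it.
memory = CompressiveMemory(d_key=2, d_value=2)
memory.update(keys=[[1.0, 0.0]], values=[[2.0, 3.0]])
recalled = memory.retrieve([0.5, -0.5])
```

With a single stored pair, any query recalls that pair's value. As more segments are folded in, the recalled value becomes a weighted blend of everything stored, which is why the memory is called compressive.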

The final output combines the values retrieved from long-term memory with the local attention context.
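
In the paper, this combination is a weighted blend controlled by a learned gating scalar passed through a sigmoid. The sketch below illustrates that blend for a single token. The function names and the example numbers are hypothetical, and in a trained model the gate value would be learned per attention head, not set by hand.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def combine_memory_and_local(
    memory_values: List[float],
    local_values: List[float],
    beta: float,
) -> List[float]:
    """Blend long-term and local outputs for one token.

    gate = sigmoid(beta); output = gate * memory + (1 - gate) * local.
    """
    gate = sigmoid(beta)
    return [
        gate * m + (1.0 - gate) * l
        for m, l in zip(memory_values, local_values)
    ]


# Hypothetical values: beta = 0 weighs both sources equally.
blended = combine_memory_and_local(
    memory_values=[2.0, 4.0],
    local_values=[0.0, 0.0],
    beta=0.0,
)
```

A large positive beta leans on long-term memory, and a large negative beta leans on the current segment, so the model can learn how much each source should contribute.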

Let’s understand this in simpler language.

LLMs are like our forgetful friends. They do pay attention to your recent conversations. But whenever you bring up past conversations, they tend to zone out.

This makes it difficult for them to process complex conversations, and frustrating for you to have a meaningful conversation with them.

In short, LLMs with a limited context window and memory often hallucinate and fail to execute complex tasks when conversations stretch long.

Now Google's research paper proposes a solution: infini-attention, which uses compressive memory to store and update the context history and retrieve it when necessary.

Imagine this compressive memory as a box where your friend stores summaries or notes of your past conversations. Now when your friend needs to refer to a past conversation, he simply opens the box and gets the info. He then uses that info in the context of the present conversation to offer you better suggestions or hold a meaningful conversation.

Similarly, when the LLM needs to refer to any past conversations, it uses a special attention query to retrieve information and combine it with local attention to produce coherent and context-relevant output.

Local attention covers the conversation happening at the present moment. To generate relevant output, the LLM needs not only insights from past conversations but also attention to the present conversation or query.

By combining information from the memory box and the current conversation, the LLM can understand things much better. This allows it to process even super long conversations, just like you can have a deep discussion with a friend who remembers everything!

Here is the link to the “Leave No Context Behind” paper.


How does infinite context length open doors to new LLM applications?

Infini-attention techniques in LLMs hold the potential to unlock a new wave of LLM applications that were previously limited by context restrictions.

Here are some exciting possibilities:

Advanced Question Answering

LLMs could become adept at answering complex questions that require reasoning across vast amounts of text. Imagine a system that can analyze legal documents, medical literature, or historical archives to answer intricate queries with pinpoint accuracy.

Real-time Conversation with Context

Chatbots and virtual assistants could transcend their current limitations. By remembering past interactions within a conversation, they could provide more personalized and relevant responses, fostering a natural flow of dialogue.

Enhanced Document Summarization

Infini-attention could enable LLMs to not only summarize factual information but also capture the essence of arguments, opinions, and the overall sentiment within lengthy documents. This would be invaluable for researchers, journalists, and anyone needing to grasp the core ideas of extensive texts.

Code Generation with Deeper Understanding

LLMs could analyze entire codebases and generate more comprehensive and relevant code snippets. They might even be able to identify potential bugs or suggest improvements based on the broader context of the code's functionality.

Personalized Education and Training

Imagine educational tools that tailor learning experiences based on a student's entire learning history. LLMs could track a student's progress across different subjects, identify areas needing improvement, and suggest personalized learning materials that consider the full context of their knowledge.

Creative Writing with Richer Context

LLMs could become powerful partners in creative writing. By referencing a vast database of literary works and understanding the nuances of storytelling, they could assist writers in crafting narratives with richer context, character development, and plot consistency.

These are just a few examples, and the possibilities are truly endless. As Infini-attention technology matures, we can expect even more innovative applications to emerge, transforming how we interact with information, learn, and create.


Conclusion

There has been other research on scaling the context length and improving the memory of LLMs. For example, LongRoPE by Microsoft extends the context window of pre-trained Large Language Models (LLMs) to 2048k tokens (about 2 million tokens).

Another example is MEGALODON by Meta. This architecture overcomes the limitations of Transformers in handling long sequences, which stem from their quadratic complexity and poor length extrapolation. Megalodon builds on the MEGA architecture, incorporating elements like exponential moving average with gated attention. It further enhances these features by introducing a complex exponential moving average (CEMA), a timestep normalization layer, and a normalized attention mechanism, thereby boosting both capability and stability.

However, among these approaches, the results Google reports for “infini-attention” stand out.

According to the paper, the approach scales naturally to million-length input sequences and outperforms baselines on long-context language modelling benchmarks and book summarization tasks.

The Google researchers demonstrated the technique with 1B and 8B models on a 1M-sequence-length passkey retrieval task and a 500K-length book summarization task.

Is finding the right tech partner to unlock AI benefits for your business proving difficult?

Ampcome is here to help. With decades of experience in data science, machine learning, and AI, I have led my team to build top-notch tech solutions for reputed businesses worldwide.

Let’s discuss how to propel your business!

If you are into AI, LLMs, Digital Transformation, and the Tech world – do follow Sarfraz Nawaz on LinkedIn.
