Researchers at MIT have found a solution to the problem of AI chatbots deteriorating during long conversations, enabling the models to chat nonstop without crashing or slowing down.
When users converse continuously with chatbots like ChatGPT, the large language models powering the technology can begin to break down, degrading the quality of their responses. At times, they even hallucinate facts.
However, some researchers have identified the root cause and discovered a way to allow conversations to flow without the need to restart the software.
Their approach modifies the key-value cache, which is essentially the conversation memory at the core of many large language models. In some methods, when the cache exceeds its capacity, it evicts the earliest data entries, which can cause the model to fail. By preserving those initial data points in memory instead, the researchers enabled the chatbot to keep conversing without any significant issues.
A new technique named “StreamingLLM” can handle infinite text input without any drop in accuracy by using key tokens that guide the model’s decisions and caching recent tokens.
The result: 22x faster inference. https://t.co/RDeTUZ6up6
— Brian Roemmele (@BrianRoemmele) October 3, 2023
By using a technique known as StreamingLLM, the researchers were able to ensure the model stayed efficient even during conversations that extended beyond four million words. Compared to another approach that prevents crashes by frequently re-evaluating portions of previous conversations, StreamingLLM proved to be over 22 times quicker.
As a result, this could help chatbots sustain lengthy conversations without the need for constant reboots, which means that the AI assistants are far more effective for activities such as copywriting, editing, or code generation.
Why are AI chatbots crashing?
Large language models transform user queries into token representations, using an attention mechanism to generate new text by assessing how these tokens relate to each other within an “attention map.”
This process, crucial for producing human-like text, relies on storing recent tokens in a “KV cache.” However, the cache’s capacity limitations and the subsequent massive size of the attention map can slow down computations and degrade performance when the cache overflows, as seen when encoding complex documents like academic papers.
Researchers have attempted to address these issues with a “sliding cache” strategy, which replaces the oldest tokens with new ones, though this often results in a significant drop in text quality as soon as tokens are removed.
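To illustrate the idea (this is a simplified sketch, not the researchers’ actual implementation), a sliding cache behaves like a fixed-size window that silently discards the earliest tokens once it fills up:

```python
from collections import deque

def sliding_cache(tokens, capacity):
    """Naive sliding-window cache: once capacity is reached, the
    oldest token is silently discarded -- the eviction behavior that
    causes the drop in text quality described above."""
    cache = deque(maxlen=capacity)  # deque drops the oldest entry automatically
    for t in tokens:
        cache.append(t)
    return list(cache)

print(sliding_cache(list(range(10)), capacity=4))
# → [6, 7, 8, 9]
```

Note how the very first tokens are the first to go; StreamingLLM’s key change is to exempt them from eviction.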
A new approach detailed in the paper keeps the first token in the cache so that model performance holds up even after the cache limit is exceeded. The strategy is counterintuitive, since the first word of a long text or book seems unrelated to the words being generated much later, and investigating why it works gave the researchers insight into the underlying mechanism and into how large language model efficiency can be improved.
The lead author of the StreamingLLM paper, graduate student Guangxuan Xiao, said, “Now, with this method, we can persistently deploy these large language models. We could use these chatbots in some new applications by making a chatbot that we can always chat with and that can always respond to us based on our recent conversations.”
Feeling incredibly excited and proud that StreamingLLM made it to MIT’s homepage and has been accepted by ICLR 2024! Can’t wait to show it in Vienna! https://t.co/3F6jcYU0lm
— Guangxuan Xiao (@Guangxuan_Xiao) February 13, 2024
The co-authors include electrical engineering and computer science associate professor Song Han, who is also a member of the MIT-IBM Watson AI Lab and a distinguished scientist at NVIDIA; Meta AI research scientists Yuandong Tian and Mike Lewis; and Carnegie Mellon University assistant professor Beidi Chen.
The first token
The researchers call this pivotal first token an “attention sink.”
Han added: “We need an attention sink, and the model decides to use the first token as the attention sink because it is globally visible — every other token can see it. We found that we must always keep the attention sink in the cache to maintain the model dynamics.”
During the development of StreamingLLM, researchers found that positioning four attention sink tokens at the start of the sliding cache achieves the best performance.
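Conceptually, the eviction policy can be sketched as pinning a handful of sink tokens while the rest of the cache slides (a minimal illustration; the function name and signature are hypothetical, not from the paper’s codebase):

```python
def streaming_cache(tokens, capacity, num_sinks=4):
    """StreamingLLM-style eviction sketch: pin the first `num_sinks`
    tokens as attention sinks, and fill the remaining cache slots
    with the most recent tokens."""
    if len(tokens) <= capacity:
        return list(tokens)  # cache not yet full; keep everything
    # Keep the sinks plus a sliding window of the newest tokens.
    return tokens[:num_sinks] + tokens[-(capacity - num_sinks):]

print(streaming_cache(list(range(100)), capacity=8))
# → [0, 1, 2, 3, 96, 97, 98, 99]
```

The cache stays at a fixed size no matter how long the conversation runs, which is why inference speed does not degrade.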
Despite this success, the model cannot remember words that are no longer stored in the cache. The researchers plan to address this limitation by investigating methods to retrieve evicted tokens or to let the model memorize previous conversations.
Featured image: Canva
The post AI chatbots could converse all day without crashing, new research finds appeared first on ReadWrite.