Popular Science description of my Bachelor Thesis
Cutting costs for AI chatbots by up to 92% without sacrificing too much quality – that is the potential impact of our study on compressing the context sent to Large Language Models (LLMs). These sophisticated models, like the one powering ChatGPT (a widely used free AI chatbot), can engage in human-like conversations, providing users with a seamless and interactive experience. However, to maintain context and coherence, the chatbot must include the entire conversation history in every input to the LLM. This creates significant challenges as the conversation grows longer, such as the chatbot reaching the maximum input length of the LLM, or runaway costs from a constantly growing conversation.
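To make the cost problem concrete, here is a minimal illustrative sketch (not code from the thesis) of how a chatbot resends the entire conversation history with every request, so the input grows with every turn:

```python
# Illustrative sketch: each turn, the chatbot rebuilds its LLM input from
# the full conversation history, so the prompt grows with every exchange.

def build_prompt(history):
    """Concatenate the whole conversation into a single LLM input."""
    return "\n".join(f"{role}: {text}" for role, text in history)

history = []
lengths = []
for turn in range(3):
    history.append(("user", f"question {turn}"))
    history.append(("assistant", f"answer {turn}"))
    # The prompt contains every previous turn -- and with its length,
    # the per-request cost grows too.
    lengths.append(len(build_prompt(history)))
```

Since most LLM APIs charge per input token, this linear growth in prompt size translates directly into linearly growing cost per message.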
To address these issues, we analyzed and compared four different context compression techniques.
For the larger dataset, we employed a cutting-edge method called LLM-as-a-Judge, where a more capable LLM compared the answers produced by the different summarization techniques pairwise. All the methods proved fairly consistent, with space savings ranging from 90% to 95%. The FullContext method achieved an impressive 77% of the recollection performance of including the whole history, while simultaneously reducing the input size by 92%. The other methods achieved between 53% and 65% of the quality of the control group. Method 1 (only user queries) performed the worst, but we suspect this method has greater potential for compressing coding- and programming-related conversations, which needs further study.
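The simplest technique above, keeping only user queries, can be sketched in a few lines. This is a hedged illustration of the idea behind Method 1; the thesis's actual implementation may differ:

```python
# Sketch of the "only user queries" idea (Method 1 in the study):
# discard assistant turns and keep just what the user asked.

def user_queries_only(history):
    """Return only the user's messages from a (role, text) history."""
    return [(role, text) for role, text in history if role == "user"]

# Hypothetical example conversation, purely for illustration.
history = [
    ("user", "How do I reverse a list in Python?"),
    ("assistant", "Use reversed() or slicing: my_list[::-1] ..."),
    ("user", "And in place?"),
    ("assistant", "Call my_list.reverse(), which mutates the list ..."),
]
compressed = user_queries_only(history)
saving = 1 - sum(len(t) for _, t in compressed) / sum(len(t) for _, t in history)
```

Because assistant answers are usually much longer than user questions, dropping them saves a large share of the input; the trade-off is that the model loses its own earlier answers, which may matter less in some domains (such as coding) than in others.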
These findings have significant implications for enterprises leveraging LLMs in their AI products. A 92% reduction in input size could lead to substantial cost savings. However, it is important to note that most summarization techniques also use LLMs, which means there is still a cost to produce the summaries. Therefore, this approach is most applicable when conversations continue for longer after the compression has been applied.
Moreover, the study's results also address the challenge of LLMs' input length limit. The evaluated context compression techniques can effectively expand this window by replacing the original context with a summarized version, allowing a large context to be compressed into one that fits within the input length limit of any given LLM chatbot.
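One way this can work in practice is sketched below. This is an assumption-laden illustration, not the thesis's method: the oldest turns are dropped and replaced with a summary, where `summarize` stands in for what would be an LLM call in a real system.

```python
# Hedged sketch: fit a long conversation inside an input limit by
# replacing the oldest turns with a summary. `summarize` is a trivial
# placeholder for a real LLM summarization call.

def prompt_len(turns):
    return sum(len(text) for _, text in turns)

def summarize(turns):
    return ("summary", f"[{len(turns)} earlier turns summarized]")

SUMMARY_BUDGET = 60  # rough character allowance reserved for the summary

def fit_to_limit(history, max_chars):
    kept = list(history)
    dropped = []
    # Pop the oldest turns until the rest, plus a summary, fits the limit.
    while len(kept) > 1 and prompt_len(kept) + SUMMARY_BUDGET > max_chars:
        dropped.append(kept.pop(0))
    if dropped:
        kept.insert(0, summarize(dropped))
    return kept

long_history = [("user", "x" * 100) for _ in range(10)]
fitted = fit_to_limit(long_history, max_chars=300)
```

Real systems would measure tokens rather than characters and would invoke an LLM for the summary, but the control flow, compressing just enough history to fit the window, is the same.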
The study's results could make AI chatbots more accessible and affordable for businesses, enabling wider adoption and innovation. As AI continues to advance and integrate into our daily lives, research in this area becomes increasingly important. By addressing the challenges of efficiency and cost, we can ensure that AI technology remains sustainable and beneficial to society in the long run. Our study contributes to this ongoing effort, paving the way for more accessible, efficient, and context-aware AI conversations.
Read the full Bachelor Thesis: