Anthropic just released a chunking technique that vastly improves RAG performance! 🔥
Quick reminder: Retrieval-Augmented Generation (RAG) is a widely used technique for improving your LLM chatbot's answers to user questions.
It goes like this: instead of generating an answer straight away, you add a prior step called Retrieval, which retrieves relevant documents from your knowledge base through semantic search and appends the top-K documents to the prompt. ➡️ As a result, the LLM's answer is grounded in that context.
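For the curious, here is a minimal sketch of that retrieval step in Python. The names are placeholders: `embed` stands for whatever embedding model you use, and `chunk_embeddings` for a pre-computed (num_chunks, dim) array from your vector index.

```python
import numpy as np

def retrieve_top_k(question, chunks, chunk_embeddings, embed, k=5):
    """Return the k chunks whose embeddings are closest to the question (cosine similarity)."""
    q = embed(question)
    q = q / np.linalg.norm(q)
    docs = chunk_embeddings / np.linalg.norm(chunk_embeddings, axis=1, keepdims=True)
    top = np.argsort(docs @ q)[::-1][:k]
    return [chunks[i] for i in top]

def build_prompt(question, retrieved_chunks):
    """Append the retrieved chunks to the prompt so the answer is grounded in them."""
    context = "\n\n".join(retrieved_chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer using the context above."
```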
⚠️ The difficulty with this retrieval step is that when you split your documents into chunks for retrieval, each chunk loses the context of the surrounding document. So important chunks can be missed.
💡 Anthropic's just-released blog post shows that you can add a bit of context to each chunk with a single LLM call per chunk. You then embed the original chunk plus that added context, so the embedding is much more representative of the chunk within its document!
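Here is a rough sketch of what that per-chunk call could look like with the official `anthropic` Python SDK. The prompt wording is paraphrased from their post, and the model choice is just an example:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def contextualize_chunk(document: str, chunk: str) -> str:
    """One LLM call: get a short context situating `chunk` within `document`,
    then return context + chunk, which is what you embed instead of the bare chunk."""
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # example model choice
        max_tokens=150,
        messages=[{
            "role": "user",
            "content": (
                f"<document>\n{document}\n</document>\n"
                f"Here is a chunk from that document:\n<chunk>\n{chunk}\n</chunk>\n"
                "Give a short, succinct context that situates this chunk within the "
                "overall document, to improve search retrieval of the chunk. "
                "Answer with the succinct context only."
            ),
        }],
    )
    context = response.content[0].text
    return f"{context}\n{chunk}"
```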
🤔 Isn't that crazy expensive? It would have been before, but not so much anymore with their new prompt caching feature, which makes thousands of requests sharing the same long prompt prefix (here, the full document) much cheaper. They give an indicative price tag of only $1.02 per million document tokens processed!
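And a sketch of the same call with prompt caching: the large document block is marked with `cache_control` so that every per-chunk call after the first reuses the cached prefix instead of paying full input-token price. Again this assumes the `anthropic` SDK; check their docs for model-specific minimum cacheable prompt lengths.

```python
import anthropic

client = anthropic.Anthropic()

def contextualize_chunk_cached(document: str, chunk: str) -> str:
    """Same per-chunk call, but the document prefix is cached across chunks."""
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=150,
        messages=[{
            "role": "user",
            "content": [
                {
                    # Identical for every chunk of this document -> cached after the first call.
                    "type": "text",
                    "text": f"<document>\n{document}\n</document>",
                    "cache_control": {"type": "ephemeral"},
                },
                {
                    # Only this part changes from one call to the next.
                    "type": "text",
                    "text": (
                        f"Here is a chunk from that document:\n<chunk>\n{chunk}\n</chunk>\n"
                        "Give a short, succinct context situating this chunk within the "
                        "overall document. Answer with the context only."
                    ),
                },
            ],
        }],
    )
    return f"{response.content[0].text}\n{chunk}"
```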
✅ And this vastly improves retrieval performance on their benchmark!
Read their blog post 👉 https://www.anthropic.com/news/contextual-retrieval