This is a terrific explanation - thank you. When it comes to "non-parametric memory", is this typically prefetched, so the LLM sees it as a local cache of recent information, or is it a real-time lookup based on the user's query?
Thanks! I'm glad you found this useful. I think the answer depends. In most scenarios I'd expect a real-time lookup, since you can't predict the exact wording of a user's query ahead of time, right? But it seems like this is an active area of research. I just came across this post that describes a "semantic" cache to speed up LLMs: https://portkey.ai/blog/reducing-llm-costs-and-latency-semantic-cache/
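To make the idea concrete, here's a rough sketch of what a semantic cache might look like. This is my own illustration rather than Portkey's actual implementation: it assumes a sentence-transformers embedding model and a simple cosine-similarity threshold for deciding whether a new query is "close enough" to a previously answered one to reuse the cached response.

```python
import numpy as np
from sentence_transformers import SentenceTransformer


class SemanticCache:
    """Cache LLM responses keyed by query *meaning*, not exact text."""

    def __init__(self, threshold: float = 0.9):
        # Small, commonly used embedding model; any encoder would do.
        self.model = SentenceTransformer("all-MiniLM-L6-v2")
        self.threshold = threshold
        self.embeddings: list[np.ndarray] = []  # unit-normalized query vectors
        self.responses: list[str] = []

    def _embed(self, text: str) -> np.ndarray:
        vec = self.model.encode(text)
        # Normalize so a dot product equals cosine similarity.
        return vec / np.linalg.norm(vec)

    def get(self, query: str) -> str | None:
        """Return a cached response if a semantically similar query was seen."""
        if not self.embeddings:
            return None
        q = self._embed(query)
        sims = np.stack(self.embeddings) @ q
        best = int(np.argmax(sims))
        return self.responses[best] if sims[best] >= self.threshold else None

    def put(self, query: str, response: str) -> None:
        self.embeddings.append(self._embed(query))
        self.responses.append(response)


cache = SemanticCache(threshold=0.85)
cache.put("What is non-parametric memory?", "Knowledge fetched at query time from an external store...")
print(cache.get("Explain non-parametric memory"))  # likely a cache hit
print(cache.get("How do I bake bread?"))  # miss -> None, so fall through to a live retrieval + LLM call
```

The point is that even though you can't predict the exact words, queries with similar meaning land near each other in embedding space, so a semantic cache can short-circuit the full retrieval-plus-generation pipeline for repeated questions.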