
RAG would still be useful for cost savings, assuming they charge per token. Plus, I'm guessing using the full context length would be slower than using RAG to pull what you need into a smaller prompt.
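A rough back-of-the-envelope sketch of that saving, with made-up prices and token counts (not any vendor's actual rates):

```python
# Compare the prompt-side cost of stuffing everything into context
# vs. retrieving only relevant chunks. All numbers are illustrative.

PRICE_PER_1K_PROMPT_TOKENS = 0.01  # hypothetical $/1K prompt tokens

def prompt_cost(num_tokens: int, price_per_1k: float = PRICE_PER_1K_PROMPT_TOKENS) -> float:
    """Dollar cost of sending num_tokens as prompt at the given rate."""
    return num_tokens / 1000 * price_per_1k

full_context = prompt_cost(1_000_000)  # dump the whole corpus into context
rag_context = prompt_cost(4_000)       # retrieve only the relevant chunks

print(f"full context: ${full_context:.2f}")  # $10.00 per request
print(f"RAG context:  ${rag_context:.2f}")   # $0.04 per request
```

Even if retrieval occasionally misses a relevant chunk, a ~250x per-request difference is what the CFO will look at.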


This is going to be the real differentiator.

HN is very focused on technical feasibility (which remains to be seen!), but in every LLM opportunity, the CIO/CFO/CEO are going to be concerned with the cost modeling.

The way that LLMs are billed now, if you can densely pack the context with relevant information, you will come out ahead commercially. I don't see this changing with the way that LLM inference works.

Maybe this changes with managed vector search offerings that are opaque to the user. The context goes to a preprocessing layer, an efficient cache understands which parts haven't been embedded (new bloom filter use case?), embeds the other chunks, and extracts the intent of the prompt.
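That embed-cache idea could look roughly like this. The `BloomFilter` class and `chunks_to_embed` helper are made up for illustration; a real service would size the filter properly and accept a small false-positive rate (a chunk wrongly skipped could be caught by a second exact-match check):

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter; k index positions derived from one sha256 digest."""
    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bitset stored in a Python big int

    def _positions(self, item: str):
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.num_hashes):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item: str) -> bool:
        return all(self.bits >> pos & 1 for pos in self._positions(item))

seen = BloomFilter()

def chunks_to_embed(chunks):
    """Return only chunks the filter has (probably) not seen before."""
    fresh = [c for c in chunks if c not in seen]
    for c in fresh:
        seen.add(c)
    return fresh

print(chunks_to_embed(["chunk-a", "chunk-b"]))  # both are new: embed both
print(chunks_to_embed(["chunk-b", "chunk-c"]))  # only chunk-c, barring false positives
```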


Agreed with this.

The leading AI (in terms of cognitive power) will generally cost more per token than lower-cognitive-power AI.

That means that at a given budget you can choose more cognitive power with fewer tokens, or less cognitive power with more tokens. For most use cases, there's no real point in giving up cognitive power to include useless tokens that have no hope of helping with a given question.

So then you're back to the question of: how do we reduce the number of tokens, so that we can get higher cognitive power?
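The tradeoff at a fixed budget is just arithmetic; the prices here are invented for illustration:

```python
# At a fixed per-request budget, token price determines how much
# context you can afford. All figures are made up.

BUDGET = 0.10          # dollars per request
STRONG_PRICE = 0.00005 # $/prompt token, stronger model
WEAK_PRICE = 0.00001   # $/prompt token, weaker model

strong_tokens = round(BUDGET / STRONG_PRICE)  # 2,000 tokens of the strong model
weak_tokens = round(BUDGET / WEAK_PRICE)      # 10,000 tokens of the weak model

print(strong_tokens, weak_tokens)  # 2000 10000
```

Spending the budget on the strong model only pays off if retrieval can squeeze the useful information into those 2,000 tokens, which is exactly the information-retrieval problem.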

And that's the entire field of information retrieval, which is the most important part of RAG.


> The way that LLMs are billed now, if you can densely pack the context with relevant information, you will come out ahead commercially. I don't see this changing with the way that LLM inference works.

Really? To my understanding, the compute needed to generate each token grows linearly with the context length, and doesn't OpenAI's billing reflect that by separating prompt and output tokens?
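With separate prompt and output rates (the numbers below are illustrative, not actual OpenAI pricing), trimming the context dominates the bill even when the answer length stays the same:

```python
# Sketch of split billing: prompt tokens and output tokens priced
# separately. Rates are hypothetical.

PROMPT_PRICE_PER_1K = 0.0025
OUTPUT_PRICE_PER_1K = 0.0100  # output typically priced higher

def request_cost(prompt_tokens: int, output_tokens: int) -> float:
    """Total dollar cost of one request under split billing."""
    return (prompt_tokens / 1000 * PROMPT_PRICE_PER_1K
            + output_tokens / 1000 * OUTPUT_PRICE_PER_1K)

# Same 500-token answer, very different context sizes:
print(request_cost(100_000, 500))  # context cost dominates
print(request_cost(2_000, 500))    # retrieval-trimmed context
```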



