The two errors, then, were that the LLM hallucinated something, and that a human trusted the LLM without reasoning about its answer. The fix for this common pattern is to reason about LLM outputs before making use of them.
A big problem now, both internally within a company and externally, is that official support channels are being replaced by chatbots, and you really have no option but to trust their output, because a human expert is no longer available.
If I post a question to the internal payment team's forum about a critical processing issue and some "payments bot" replies to me, should I be at fault for trusting the answer?
I know this is happening with external customer support, but is this really happening internally at big companies? Preventing you from talking to a human in the correct department about an issue feels like a bomb waiting to explode.
There is at least an effect where chatbots have become the primary line of support. Even if you are not prevented from talking to a human, the managers of the humans you would talk to have decided that, since the chatbot is there, it is inappropriate for their people to spend much time supporting coworkers in other departments when the chatbot can do it.
So to a degree, corporate politics can sort of discourage it.
I'm sure it is. Thankfully I don't work for a company this large any more, but when I was employed by a multinational with 30K+ employees, our IT department was outsourced to India and you had to get through a couple layers of phone tree/webchat hell to actually talk to a real person. I could easily see companies of this size replacing their support with LLM nonsense.
Teams are heavily incentivized to incorporate AI in their internal workflows. At Meta it is a requirement, and will come up in your performance review if you fail to do so.
Yes, of course, and the company which removes human experts should expect things to fail in the manner that things usually fail when you remove your internal experts.
1. Check frequency (from checking every single output down to occasional spot checks).
2. Check thoroughness (from antagonistic, in-depth review down to a high-level skim).
I'd agree that, if you're at the heavy end of both dimensions, the system is not generating any value.
A lot of folks are taking calculated (or, I guess in some cases, reckless) risks right now by relaxing one or both of those dimensions. I'd argue that in many situations the risk is small and worth it. In many others, not so much.
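For what it's worth, those two knobs can be framed as parameters of a review policy. A toy sketch in Python (the names and numbers are mine and purely illustrative, not from any real tool):

    import random

    # Toy model of the two dimensions: how often you check, and how hard you check.
    def should_review(check_rate: float) -> bool:
        """Spot-check: review a given output with probability check_rate."""
        return random.random() < check_rate

    def review_cost(depth: str) -> float:
        """Rough relative effort of one review, as a fraction of doing the task yourself."""
        return {"high_level": 0.2, "antagonistic": 1.0}[depth]

    def expected_effort(check_rate: float, depth: str) -> float:
        """Expected human effort per LLM output under this policy."""
        return check_rate * review_cost(depth)

    print(expected_effort(1.0, "antagonistic"))  # ~1.0: you might as well do the work yourself
    print(expected_effort(0.1, "high_level"))    # 0.02: cheap, but most errors slip through

At one corner (check everything, antagonistically) the expected effort approaches redoing the work yourself, which is the "no value" case; dialing either knob down trades effort for risk.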
If "the level of awareness that created a problem, cannot be used to fix the problem", then you're asking too much if you expect a human to reason about an LLM output when they are the ones that asked an LLM to do the thinking for them to begin with.
This feels like a rediscovery/rewording of Kernighan's Law:
"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." ~ Brian Kernighan
When organizational incentives penalize NOT using AI, and the bottom x% is regularly fired, are you really surprised LLM outputs aren't being scrutinized?
It's more like: the LLM "hallucinated" (I hate that term) and automatically posted the information to the forum. It sounds like the human didn't get a chance to reason about it, at least not the original human who asked the LLM for an answer.
I'm not in AI, but is what's happening here that it's building output from the long tail of its training data? Instead of branching down the more common probability paths, something in this interaction sent it off into the data wilderness?
So I asked AI to give it a good name, and it said “statistical wandering” or “logical improv”.
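If it helps to make that intuition concrete, here is a minimal sketch of what "branching down less common probability paths" looks like at the sampling step (my own illustration with made-up token scores, not how any particular product actually decodes):

    import math, random

    def sample_next_token(logits: dict, temperature: float = 1.0):
        """Sample the next token from a softmax over the scores (higher temperature flattens the distribution)."""
        toks = list(logits)
        scaled = [logits[t] / temperature for t in toks]
        m = max(scaled)
        exps = [math.exp(s - m) for s in scaled]
        total = sum(exps)
        probs = [e / total for e in exps]
        return random.choices(toks, weights=probs, k=1)[0]

    # Hypothetical next-token scores: two "common" continuations and one rare tail continuation.
    logits = {"cite the payments wiki": 5.0, "quote the documented fee": 4.0, "invent a fee code": 0.5}

    print(max(logits, key=logits.get))                  # greedy decoding: always the most likely path
    print(sample_next_token(logits, temperature=1.5))   # sampling can occasionally wander into the tail

Whether that is actually what happened in this case is anyone's guess; real decoders layer top-k/top-p truncation and other tricks on top of this, but the basic tail-wandering behavior is the same.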