> A well-trained LLM that lacks any malevolent data...
The scale needed to produce an LLM that is fluent enough to be convincing precludes fine-grained filtering of input data. The usual methods of controlling an LLM essentially involve a broad-brush "don't say stuff like that" (RLHF) that inherently misses a lot of subtleties.
Beyond that, defining malevolent data is extremely difficult. Therapists often go along with things a patient says because otherwise they break rapport, but they have to balk once the patient dives into destructive delusions. A therapy transcript can't easily be labeled with "here's where you have to stop", to name just one problem.
It's remarkable how many people are uncritically talking of "malevolent data" as if it were a well-defined concept that everyone knows is the source of bad things.
A simple Google search reveals ... this very thread as a primary source on the topic of "malevolent data" (ha, ha). But it should be noted that all other sources mentioning the phrase define it as data intentionally modified to produce a bad effect. It seems clear the problems of badly behaved LLMs don't come from this. Sycophancy, notably, doesn't just appear out of "sycophantic data" cleverly inserted by the association of allied sycophants.
I don't find it very remarkable that when one person makes up a term that's pretty easy to understand, other people in the same conversation use the same term.
In the context of this conversation, it was a response to someone talking about malevolent human therapists, and worried about AIs being trained to do the same things. So that means it's text where one of the participants is acting malevolently in those same ways.
For me, hearing this fantastical talk of "malevolent data" is like hearing people who know little about chemistry or engines saying "internal combustion cars are fine as long as we don't run them on 'carbon-filled-fuel'". Otherwise, see my comment above.
Sure, it's not literally impossible. There are ICE cars that run on hydrogen. But you can't practically adapt an existing gasoline car to run on hydrogen. My point is that mobilizing terminology gives people with no knowledge of details the illusion they can speak reasonably about the topic.
> But you can't practically adapt an existing gasoline car to run on hydrogen.
You can do it pretty practically. Figuring out a supply is probably worse than the conversion itself.
> My point is that mobilizing terminology gives people with no knowledge of details the illusion they can speak reasonably about the topic.
"mobilizing terminology"? They just stuck two words together so they wouldn't have to say "training data that has the same features as a conversation with a malevolent therapist" or some similar phrase over and over. There's no expertise to be had, and there's no pretense of expertise either.
And the idea of filtering it out is understandable to a normal person: straightforward and a ton of work.