Except you can't be sure it isn't producing nonsense when you do this, and generally the model(s) will be overconfident. This has been studied; see e.g. https://openreview.net/pdf?id=E6LOh5vz5x
> An alternative way to obtain uncertainty estimates from LLMs is to prompt them directly. One benefit of this approach is that it requires no access to the internals of the model. However, this approach has produced mixed results: LLMs can sometimes verbalize calibrated confidence levels (Lin et al., 2022a; Tian et al., 2023), but can also be highly overconfident (Xiong et al., 2024). Interestingly, Xiong et al. (2024) found that LLMs typically state confidence values in the range of 80-100%, usually in multiples of 5, potentially in imitation of how humans discuss confidence levels. Nevertheless, prompting strategies remain an important tool for uncertainty quantification, along with measures based on the internal state (such as MSP).
GP is obviously wrong, and probably isn't aware of calibration, and/or of the fact that it isn't even clear how to calibrate frontier models in the way we'd need, given how complex and expensive the training is, and how tricky calibration becomes in e.g. mixture-of-experts and chain-of-thought approaches.
I suspect that introducing the calibration concept might be a case of too much too soon for some people.
As far as I understand it, the various probability matrices boil down to: what token has the highest likelihood of coming next, given this set of input tokens. Which then all gets chucked away and rebuilt when the most likely token is appended to the input set.
Objective assessment of internal state - again, to my non-expert eye - doesn't appear to have any way of surfacing to the user.
Big if: assuming my rough working understanding is more or less correct - your calibration point makes a lot of sense to me. I'm not sure it would make sense to someone who e.g. imagines some form of active thinking process that is intellectualising about whether to output this or that token.
Common misconception. As far as we know, LLMs are not calibrated, i.e. their output "probabilities" are not in fact necessarily correlated with the actual error rates, so you can't use e.g. the softmax values to estimate confidence. That is why it is more accurate to talk about e.g. the model "logits", "softmax values", "simplex mapping", "pseudo-probabilities", or, even more agnostically, just "output scores", unless you actually have strong evidence of calibration.
To get calibrated probabilities, you actually need to use calibration techniques, and it is extremely unclear if any frontier models are doing this (or even how calibration can be done effectively in fancy chain-of-thought + MoE models, and/or how to do it in RLVR- and RLHF-based training regimes). I suppose if you get into things like conformal prediction, you could ensure some calibration, but this is likely too computationally expensive and/or has other undesirable side-effects.
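For anyone wondering what a "calibration technique" even looks like concretely: the classic minimal example is temperature scaling, where you fit a single scalar T on held-out labelled data so that softmax(logits / T) tracks observed accuracy. A rough sketch (all names and numbers here are illustrative, and it assumes a plain classification setting, which is exactly what frontier LLM inference is not):

    import numpy as np
    from scipy.optimize import minimize_scalar

    def nll(T, logits, labels):
        # negative log-likelihood of the temperature-scaled softmax
        z = logits / T
        z = z - z.max(axis=1, keepdims=True)  # numerical stability
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(labels)), labels].mean()

    def fit_temperature(logits, labels):
        # fit a single scalar T > 0 on a held-out calibration set
        res = minimize_scalar(nll, bounds=(0.05, 20.0),
                              args=(logits, labels), method="bounded")
        return res.x

    # logits: (n_examples, n_classes) array from the model on held-out data
    # labels: (n_examples,) array of correct class indices
    # T = fit_temperature(logits, labels)
    # calibrated = softmax(logits / T)  # hopefully closer to true error rates

None of this transfers cleanly to free-form generation with CoT, sampling, MoE routing, etc., which is the point above.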
EDIT: Oh, and there are also anomaly detection approaches, which attempt to identify when we are in outlier space using various metrics (e.g. distances) over the embeddings, but even getting actual probabilities here is tricky. This is why it is so hard to get models to say they "don't know" with any kind of statistical certainty: that information isn't generally actually "there" in the model, in any clean sense.
I don't know if we are talking past each other, but I don't think this conversation is about absolute probabilities? The question is about relative uncertainty, and the softmax values are just fine for that.
It is too computationally expensive, which is why nobody does this for production inference. But there are alignment tools to extract these latent-space probabilities for researchers in the frontier labs.
> The question is about relative uncertainty, and the softmax values are just fine for that.
They really aren't, especially if you consider the chain-of-thought / recursive application case, and also that you can't even assume e.g. a difference of 0.1 in softmax values means the same relative difference from input to input, or that e.g. a 0.9 is always "extremely confident", etc. You really have no idea unless you are testing the calibration explicitly on calibration data.
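To make that concrete: the same logit gap can produce wildly different softmax gaps depending on the rest of the vocabulary and on the (arbitrary) sampling temperature, so a fixed gap like 0.1 carries no stable meaning across inputs. A toy illustration with made-up numbers:

    import numpy as np

    def softmax(z, temp=1.0):
        z = np.asarray(z, dtype=float) / temp
        z = z - z.max()  # numerical stability
        e = np.exp(z)
        return e / e.sum()

    # identical 1.0 logit gap between the 3.0 and 2.0 tokens, but the
    # softmax gap collapses once other tokens crowd the distribution:
    print(softmax([3.0, 2.0]))                 # [0.73, 0.27] -> gap ~0.46
    print(softmax([3.0, 2.0, 2.9, 2.8, 2.7]))  # now ~0.26 vs ~0.10 -> gap ~0.17

    # same input, different temperature: ranking unchanged, gap isn't
    print(softmax([3.0, 2.0], temp=0.5))       # [0.88, 0.12] -> gap ~0.76
    print(softmax([3.0, 2.0], temp=2.0))       # [0.62, 0.38] -> gap ~0.24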
> But there are alignment tools to extract out these latent-space probabilities for researchers in the frontier labs
You can get embeddings; if you can get calibrated probabilities, you'll need to provide a citation, because that would be a huge deal for all sorts of applications.
Relative probabilities. That means comparing 2+ alternatives, and we're only talking about the model's worldview, not objective reality. The math for that is relatively straightforward. "Yes" could be 0.9, and OK, that means nothing. But if we artificially constrain outputs to "Yes" and "No", and calculate the softmax for Yes to be 0.7 and No to be 0.3, that does lead to a straightforward probability calculation. [Not the naïve calculation you would expect, because of how softmax is computed. But you can derive an equation to convert it into normalized probabilities.]
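E.g., a minimal sketch with made-up numbers, assuming you can read the two tokens' logits out of the model (via an API's logprobs field or similar), and taking "convert into normalized probabilities" to mean renormalizing over just those two tokens:

    import math

    # hypothetical logits read off the model for the two allowed tokens
    logit_yes, logit_no = 1.2, 0.35

    # two-way softmax over just these tokens; equivalently, take the
    # full-vocab softmax values p_yes, p_no and renormalize:
    #   p_yes / (p_yes + p_no)
    p_yes = math.exp(logit_yes) / (math.exp(logit_yes) + math.exp(logit_no))
    p_no = 1.0 - p_yes
    print(p_yes, p_no)  # ~0.70 / ~0.30 -- the model's *relative* preference

Whether that number tracks real-world correctness is, of course, exactly the calibration dispute.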
And now I'm certain we're talking past each other. I'm not talking about calibrated probabilities at all. Just the notion of "how confident do I feel about this?", which is what I interpreted the question above to be about. You can get that out of an LLM, with some work.
> But if we artificially constrain outputs to "Yes" and "No", and calculate the softmax for Yes to be 0.7 and No to be 0.3, that does lead to a straightforward probability calculation. [Not the naïve calculation you would expect, because of how softmax is computed. But you can derive an equation to convert it into normalized probabilities.]
There is nothing straightforward about this, and no, there is no such formula.
> I'm not talking about calibrated probabilities at all. Just the notion of "how confident do I feel about this?"
If all you care about is vibes / feels, sure. If you actually need numerical guarantees and quantitative estimates - if your "feelings" about confidence are supposed to mean something and rigorously justify decisions - you need calibration. If you aren't talking about calibration in these discussions, you are missing probably the most core technical concept that addresses these issues seriously.
We're talking about artificial intelligence: making computers think the way people do. People are notoriously miscalibrated on their own self-assessed probabilities too.
Finding a way to objectively calibrate a sense of "how confident do I feel about this?" would be fantastic. But let's not move the goalposts. It would still be incredibly useful to have a machine that merely matches the equivalent statement of confidence or uncertainty that a human would assign to their mental model, even if badly calibrated.
IMO it is you who are moving the goalposts, most likely in an attempt to hide the fact you were unaware of calibration before this discussion.
> It would still be incredibly useful to have a machine that merely matches the equivalent statement of confidence or uncertainty that a human would assign to their mental model, even if badly calibrated.
If human feelings are badly calibrated, they are useless here too, so no, I don't agree. Things like "confidence" only matter if they are actually tied to real outcomes in a consistent way, and that means calibration.
> I'm pretty sure they are actively trained to avoid it.
I'm not sure who is doing what training exactly, but I can say that (inconsistently!) some of my attempts to get it to solve problems that have not yet actually been solved, e.g. the Collatz conjecture, have it saying it doesn't know how to solve the problem.
Other times it absolutely makes stuff up; fortunately for me, my personality includes actually testing what it says, so I didn't fall into the sycophantic honey trap and take it seriously when it agreed with my shower thoughts, and definitely didn't listen when it identified a close-up photo of some Solanum nigrum growing next to my tomatoes as also being tomatoes.
> Besides, like, what would you do if you asked your $200/mo AI something and it blanked on you?
I'd rather it said "IDK" than made some stuff up. Them making stuff up is, as we have seen from various news stories about AI, dangerous.
"Well-unknown" questions are maybe the one situation where LLMs will say "I don't know", simply because of all the overwhelming statements in its training data referring to the question as unknown. It'd be interesting to see how LLMs would adapt to changing facts. Suppose the Collatz conjecture was proven this year, and the next the major models got retrained. Would they be able to reconcile all the new discussion with the previous data?
It's not hard to get them to say "I don't know", and they will do so regularly. It's hard to get them to say "I don't know" reliably (i.e. to say it when they don't actually know and to not say it when they do know). And in general even for statements or tasks they do 'know' (i.e. normally get right), they will occasionally get wrong.
Except you can reject the very (stupid) question / framing, in which case your options are either to close the tab or to answer in some particular response style, neither of which makes the data more informative. This kind of clumsy stuff is just dumb given what we know now; edutainment distraction for the HN crowd.
There was a time when there were no separate names for blue and green in the Japanese language. Some languages right now have concepts of fundamental colors like navy blue and light blue, where English rolls it into a single "blue". Naming colors is highly cultural and changes over time. The idea that colors have boundaries is fascinating from both psychological as well as linguistic perspectives.
The framing seems stupid if you take the naive perspective that your language's way of dividing colors is the only valid one. Exercises like this and discussions that follow help expand perspectives.
Yes, very annoying. We know from extensive work in psychometrics that single-item, binary forced-choice questions produce junk responses that are heavily contaminated with response styles (answer in the most socially desirable way, select the response closest to the mouse/finger, select the same response as last time, select a random response, etc.). Just give people an out ("Disagree with the question / premises", "Prefer not to answer", "Unsure / Can't decide", etc.) and make sure you have e.g. a 5-7 point Likert-type scale for multiple items, or up to an 11-point scale for single items.
This kind of site / demo does none of the above, and so can't even be trusted for directional effects (the direction of response may simply be due to the type of people responding, etc.).
This is the wrong way to do it, psychometrically; see here: https://news.ycombinator.com/item?id=47929056. You need to give people gradations, or you get junk responses / abandonment, and your instrument doesn't measure what you think.
Wrong way to do it. We know from psychometrics that forced binaries like this just create junk (people disagree with the question, so just choose a forced answer based on some heuristic for each such question like "closest to my mouse / finger" or "most socially desirable" or "same as last time"). So you aren't measuring what you think when you force choice like this.
If you're going to go with linguistic self-report and a single item, you really want something like an 11-point Likert scale. A smart design might get e.g. a person's rating of "blue-ness vs. green-ness" on an 11-point scale, then determine the optimal cutpoint via e.g. clustering, logistic regression, or some other method, to really get something meaningful.
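E.g., a minimal sketch of the cutpoint step, on entirely fabricated data (it assumes you've shown stimuli sampled along a green-to-blue continuum and collected one participant's 11-point blueness ratings; the hue encoding and noise model are made up for illustration):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # fabricated data: hue of each stimulus (0 = pure green, 1 = pure blue)
    # and a participant's 11-point blueness rating (0..10) for each
    rng = np.random.default_rng(0)
    hue = rng.uniform(0, 1, 300)
    rating = np.clip(np.round(10 * hue + rng.normal(0, 1.5, 300)), 0, 10)

    # binarize at the scale midpoint ("leaning blue") and fit a logistic curve
    y = (rating > 5).astype(int)
    model = LogisticRegression().fit(hue.reshape(-1, 1), y)

    # the 50% crossover is where intercept + coef * hue = 0
    cutpoint = -model.intercept_[0] / model.coef_[0, 0]
    print(f"estimated blue/green cutpoint at hue ~ {cutpoint:.2f}")

The point is that the binary "blue or green?" boundary falls out of the analysis as an estimated cutpoint, instead of being forced on the respondent up front.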
Is it really junk though? There are several comments in this thread like “people tell me I call stuff blue that they think is green and this quiz confirms that.”
Forced binary choices on single-item, self-report questions produce scientific junk, absolutely. This kind of design / approach encourages not only magnitude errors, but also sign errors (you can't even trust the direction of the observed effect).
IMO, unless you grew up under a rock, it seems obvious that you will have experienced different people pointing at the same colour and uttering very different colour labels (pink vs. red, blue vs. green, black vs. deep blue/purple, etc.) from the labels you might have applied yourself. Differing/shared colour perception isn't exactly a rare topic (it's almost the canonical stoner topic, and common online too), so I'd be a bit surprised if this demo is actually introducing anyone to the concept. Any excitement is surely from other implications people think the demo has.
But unfortunately there are no interesting implications from what this site shows. Yes, it demonstrates the boring fact that different people assign different colour labels to the same physical stimuli (while falsely assuming that everyone's monitors/screens render them identically, too), but if you didn't already know this... I'm not sure what social context you could possibly have grown up in.
Sapir-Whorf and its ilk (if we don't have the language/concept, we can't perceive the difference/thing) are widely disproven and debunked, and don't even pass the smell test (learning new concepts and perceiving new things would be impossible). That kind of thinking is so tedious, and decades out of date relative to modern cognitive science, neuroscience, psychology, etc.
But that is wrong. This doesn't test colour perception or vision; it tests verbal classification of colour percepts into a forced binary. Everyone could be perceiving the colour qualia 100% identically, but simply choosing different linguistic cutpoints, meaning you can't say this is about vision / perception at all (it may just be about language use).
Agreed, there is no clear premise. That different people looking at the same object will use different colour words is a triviality that anyone over, say, 10 years old knows. If that's the premise of the site, it is boring. People are getting excited because they think it implies something about differences in vision or perception... but it doesn't; that requires much more cleverness to test.
Asinine and meaningless. It forces a classification on something that anyone with fully functioning colour vision would obviously classify as "aquamarine" or "turquoise" or the like.
This has nothing at all to do with colour perception, or, if actual differences in perception are involved, this test fails to distinguish those from individual differences in assignment to linguistic categories.
EDIT: To actually test something like this, you need to make an assumption that cannot easily be tested or supported by evidence.
E.g. say we could all agree that, generally, blue + orange is a more pleasant pairing than blue + green. One might then imagine a series of images using orange + varying interpolations between blue and green, with the prompt being "is this combination of colours more or less aesthetically pleasing than the last". The average cutpoint could then be interpreted as a subjective judgement of where e.g. teals become "more blue", from an aesthetic / complementary standpoint. But this test does nothing of the sort.
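A crude sketch of generating that stimulus series (naive linear sRGB interpolation, purely for illustration; a real study would interpolate in a perceptually uniform space like CIELAB):

    # generate blue -> green interpolations to pair with a fixed orange
    def lerp(a, b, t):
        # naive componentwise linear interpolation in sRGB
        return tuple(round(a[i] + (b[i] - a[i]) * t) for i in range(3))

    BLUE, GREEN = (0, 0, 255), (0, 255, 0)
    ORANGE = (255, 165, 0)

    steps = 11
    for i in range(steps):
        t = i / (steps - 1)
        swatch = lerp(BLUE, GREEN, t)
        # each (swatch, ORANGE) pair would be shown with the prompt
        # "is this combination more or less pleasing than the last?"
        print(f"t={t:.1f}  rgb{swatch}  paired with rgb{ORANGE}")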