One point that I think is under-discussed in the AI bias area:
While it is true that using an algorithmic process to select candidates may introduce discrimination against protected groups, it seems to me that it should be much easier to detect and prove than with previous processes with human judgement in the loop.
You can just subpoena the algorithm, feed test data to it, and make observations. You can even feed it synthetic data, such as swapping “stereotypically black” names into real resumes from candidates of other races, or in this case adding “uses a wheelchair” to a resume. (Of course in practice it’s more complex, but hopefully this makes the point.)
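To make the name-swap idea concrete, here is a minimal sketch of a paired test. Everything in it is hypothetical: `toy_model` stands in for whatever model was actually subpoenaed (with a bias planted on purpose so there is something to find), and the names and resume templates are made up for illustration.

```python
from statistics import mean

def toy_model(resume: str) -> int:
    """Hypothetical stand-in for the model under test; returns a 0-100 score."""
    score = 50
    if "Python" in resume:
        score += 30
    if "Jamal" in resume:  # planted bias, only here so the test has something to detect
        score -= 20
    return score

def paired_gaps(score, templates, name_a, name_b):
    """Score each resume twice, identical except for the name.

    Returns score(name_b version) - score(name_a version) per resume.
    """
    return [
        score(t.format(name=name_b)) - score(t.format(name=name_a))
        for t in templates
    ]

# Illustrative resumes; only the {name} slot changes between the two runs.
templates = [
    "{name}. 5 years of Python, BSc in CS.",
    "{name}. 2 years of Java, self-taught.",
]

gaps = paired_gaps(toy_model, templates, "Greg Baker", "Jamal Jones")
print(gaps, mean(gaps))
```

Because the two versions of each resume differ only in the name, any nonzero gap is attributable to the name. A real audit would need many more resumes and more than one pair of names.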
With a human, you can’t really do an A/B test to determine if they would have prioritized a candidate if they hadn’t included some signal; it’s really easy to rationalize away discrimination at the margins.
So while most AI/ML developers are not currently strapping their models to a discrimination-tester, I think the end-state could be much better when they do.
(I think a concrete solution would be to regulate these models to require a certification with some standardized test framework to show that developers have actually attempted to control these potential sources of bias. Google has done some good work in this area: https://ai.google/responsibilities/responsible-ai-practices/... - though there is nothing stopping model-sellers from self-regulating and publishing this testing first, to try to get ahead of formal regulation.)
>With a human, you can’t really do an A/B test to determine if they would have prioritized a candidate if they hadn’t included some signal; it’s really easy to rationalize away discrimination at the margins.
Which is part of the reason that discrimination doesn't have to be intentional for it to be punishable. This is a concept known as "disparate impact". The Supreme Court has issued decisions[1] that a policy which negatively impacts a protected class and has no justifiable business related reason for existing can be deemed discrimination regardless of the motivations behind that policy.
Justifiable business reason is still a high bar. For example, if there is no evidence in either direction for a claim, there is no justifiable business reason, even if the claim is somewhat intuitive. So if you want to require high-school diplomas because you think people who have them will do the job better, you had better track that data for years and be prepared to demonstrate it if sued. If you want to use IQ tests because you anticipate smarter people will do the job better, you had better have IQ test results for your previous employee population demonstrating the correlation before imposing the requirement.
EDIT: my parent edited and replaced their entire comment, it originally said "you can't use IQ tests even if you prove they lead to better job performance". I leave my original comment below for posterity:
This is not true. The IQ tests in the mentioned Griggs v. Duke Power Co. (and similar cases) were rejected as disparate impact specifically because the company provided no evidence that they lead to better performance. To quote the majority opinion of Griggs:
> On the record before us, neither the high school completion requirement nor the general intelligence test is shown to bear a demonstrable relationship to successful performance of the jobs for which it was used. Both were adopted, as the Court of Appeals noted, without meaningful study of their relationship to job performance ability.
They said "it’s really easy to rationalize away discrimination at the margins." My reply was pointing out that there is little legal protection in rationalizing away discrimination at the margins because tests for disparate impact require the approach to also stand up holistically which can't easily be rationalized away.
I think perhaps you are looking at a different part of the funnel; disparate impact seems to be around the sort of requirements you are allowed to put in a job description. Like “must have a college degree”.
However, the sort of insidious discrimination at the margin I was imagining is something like “equally good resumes (both meet all requirements), but one has a female/stereotypically-black name”. Interpreting resumes is not a science, and humans apply judgement to pick which ones feel good, which leaves a lot of room for hidden bias to creep in.
My point was that I think algorithmic processes are more testable for these sorts of bias; do you feel that existing disparate impact regulations are good at catching/preventing this kind of thing? (I’m aware of some large-scale research on name-bias on resumes but it seems hard to do in the context of a single company.)
>disparate impact seems to be around the sort of requirements you are allowed to put in a job description.
That is a common example, but it is much broader than what goes on a job ad. For example, I have heard occasional rumblings about how whiteboard interviews are a hiring practice that would not stand up to these laws (IANAL).
>My point was that I think algorithmic processes are more testable for these sorts of bias
Yes, this is true, but that doesn't really matter. If there is consistent discrimination happening at the margins, that will be evident holistically. If that is evident holistically and there is no justification for it, that is all we need. We don't need to run resumes through an algorithm to show that discrimination is happening at an individual level. We just need to show that a policy negatively impacts a protected group and that the policy is not related to job performance.
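As a rough sketch of what "evident holistically" could look like in numbers: compare selection rates across groups and take their ratio. The outcome data below is invented, and the 0.8 threshold is the EEOC "four-fifths" rule of thumb, which is not mentioned in this thread and is only a screening heuristic, not a legal finding. The business-justification question still has to be answered separately.

```python
from collections import Counter

def selection_rates(outcomes):
    """outcomes: iterable of (group, was_selected) pairs -> {group: selected / total}."""
    totals, selected = Counter(), Counter()
    for group, was_selected in outcomes:
        totals[group] += 1
        selected[group] += int(was_selected)
    return {g: selected[g] / totals[g] for g in totals}

def impact_ratio(rates):
    """Lowest group selection rate divided by the highest."""
    return min(rates.values()) / max(rates.values())

# Invented hiring outcomes for two groups, for illustration only.
outcomes = (
    [("A", True)] * 30 + [("A", False)] * 70
    + [("B", True)] * 12 + [("B", False)] * 48
)

rates = selection_rates(outcomes)
ratio = impact_ratio(rates)
FOUR_FIFTHS = 0.8  # assumption: rule-of-thumb threshold, not a legal test by itself
print(rates, round(ratio, 2), "flag for review" if ratio < FOUR_FIFTHS else "ok")
```

Note that this works on aggregate outcomes only, so it applies equally to a human process and an algorithmic one.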
>do you feel that existing disparate impact regulations are good at catching/preventing this kind of thing?
I think the bigger problem than the regulations is that there is an inherent bias against these types of cases actually being pursued. First, this is difficult to identify as an individual, so people don't know when it is happening. Additionally, people fear the retribution that would come from pursuing this legally. People don't want to be viewed as pariahs by future employers, so they often will simply move on even if their accusations are valid.
Yes, but a holistic test requires a realistic counterfactual. That's the problem. There is no way to evaluate that counterfactual for a human interviewer.
It is true that extreme bias/discrimination will be evident, but smaller bias/discrimination, particularly in an environment where the pool is small (say, black women for engineering roles) is extremely hard to prove for a human interviewer. Your sample size is just going to be too small. On the other hand, if you have an ML algorithm, you can feed it arbitrary amounts of synthetic data, and get precise loadings on protected attributes.
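A minimal sketch of the synthetic-data point, under loud assumptions: `black_box` is a hypothetical model with a 5-point penalty planted on the protected attribute, and the only other feature is a made-up "years of experience". Because the protected flag is assigned independently of experience, the difference in mean scores estimates the loading, and the estimate tightens as you feed in more synthetic candidates.

```python
import random

def black_box(years: int, flagged: bool) -> float:
    """Hypothetical model under test; the 5-point penalty is planted for the demo."""
    return 10.0 * years - (5.0 if flagged else 0.0)

def estimate_loading(model, n: int, seed: int = 0) -> float:
    """Mean score with the protected flag minus mean score without it."""
    rng = random.Random(seed)
    with_flag, without_flag = [], []
    for i in range(n):
        years = rng.randint(0, 10)   # synthetic job-relevant attribute
        flagged = (i % 2 == 0)       # assigned independently of years
        bucket = with_flag if flagged else without_flag
        bucket.append(model(years, flagged))
    return sum(with_flag) / len(with_flag) - sum(without_flag) / len(without_flag)

# Small samples give noisy estimates; large synthetic samples approach -5.
for n in (50, 5_000, 100_000):
    print(n, round(estimate_loading(black_box, n), 2))
```

With a human interviewer you are stuck near the top row of that output; with a model you can pick the bottom one.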
If you ever intend to study law, become involved in a situation dealing with disparate impact, or are on the receiving end of disparate impact, knowing the legal definition may be helpful too. The DoJ spells out[1] the legal definition of disparate impact as follows:
ELEMENTS TO ESTABLISH ADVERSE DISPARATE IMPACT UNDER TITLE VI
Identify the specific policy or practice at issue; see Section C.3.a.
Establish adversity/harm; see Section C.3.b.
Establish disparity; see Section C.3.c.
Establish causation; see Section C.3.d.
My point is that by the plain meaning of words you're right, disparate impact means any two groups impacted differently, regardless of anything else. In law, it means that an employment, housing, etc. policy has a disproportionately adverse impact on members of a protected class compared to non-members of that same class. It's much more specific and narrowly defined.
I agree that discrimination would be a lot easier to objectively prove after the fact, but it also would be far easier to occur in the first place, since many hiring managers would blindly "trust the AI" without a second thought.
From my experience working on projects where we trained models, the model is usually obviously and completely broken on the first attempt and requires a lot of iteration to get to a decent state. “Trust the AI” is not a phrase anyone involved would utter. It’s more like: trust that it is wrong for any edge case we didn’t discover yet. Can we constrain the possibility space any more?
"Trust the AI" could mean uploading a resume to a website and getting a "candidate score" from somebody else's model.
Because I'll tell you, there's millions of landlords and they blindly trust FICO when screening candidates. Maybe not as the only signal, but they do trust it without testing it for edge cases.
Definitely could be so, particularly in these early days where frameworks and best-practices are very immature. Inasmuch as you think this is likely, I suspect you should favor regulation of algorithmic processes instead of voluntary industry best-practices.
There is a very real danger of models being biased in a way that doesn't show up when you apply these crude hacks to inputs. It seems to me we have to be much more deliberate, much more analytical, and much more thorough in testing models if we want to substantially reduce or even eliminate discrimination.
Yes, you can A/B test the model if you can design reasonable experiments. You still don't have a general discrimination test, because you have to define what a reasonable input distribution is and what reasonable outputs are.
If an employer is looking to hire an engineer with a CS degree from a top-tier university, and they use an AI model to evaluate resumes, and it selects black candidates at a rate very similar to their share of graduates from those programs, is the model discriminatory?
There are still hard problems here because any natural baseline you use for a model may in fact be wrong and designing a reasonable distribution of input data is almost impossibly hard as well.
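To illustrate why the choice of baseline matters, here is a tiny sketch with made-up numbers (none of these shares are real statistics). The same model output looks proportionate against one baseline and badly skewed against another.

```python
def representation_ratio(selected_share: float, baseline_share: float) -> float:
    """Group's share among selected candidates divided by its share in the baseline."""
    return selected_share / baseline_share

# Illustrative shares only; assumptions, not measured data.
selected_share = 0.06
baselines = {
    "graduates of the target programs": 0.06,
    "general population": 0.13,
}

for name, share in baselines.items():
    print(name, round(representation_ratio(selected_share, share), 2))
```

Nothing in the arithmetic tells you which baseline is the right one; that is the hard part.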
Yes, in practice it’s actually way more complex than I gestured at. The Google bias toolkit I linked does discuss in much more detail, but I am not a data scientist and haven’t used it; I’d be interested in expert opinions. (They also have some very good non-technical articles discussing the general problems of defining “fairness” in the first place.)
I don’t think it’s adequate to attempt to prevent discrimination. Freedom from discrimination is core to our fundamental human rights. It’s necessary to succeed at preventing discrimination.
“We applied best practices in the field to limit discrimination” should not be an adequate legal defence if the model can be shown to discriminate.
To clarify further, just because you tried to prevent discrimination doesn’t mean you should be off the hook for the material harms of discrimination to a specific individual. Otherwise people don’t have a right to be protected against discrimination; they only have a right to people ‘trying’ to prevent discrimination. We shouldn’t want to weaken rights that much, even if it means we have to be cautious in how we adopt new technologies.
> With a human, you can’t really do an A/B test to determine if they would have prioritized a candidate if they hadn’t included some signal; it’s really easy to rationalize away discrimination at the margins.
Not for individual candidates, no. But you can introduce a parallel anonymized interview process and compare the results.
The problem with AI is that when it does make discriminatory decisions on hiring, it does so systematically and mechanically. Incidentally, "systemic" and "discrimination" are two words you never want to see consecutively on a letter from the EEOC or OFCCP.
The reason you never want to see those words together is that isolated discrimination may result in a single lawsuit but systemic discrimination is a basis for class action.
It's under-discussed, as is any empirical study of ML systems, i.e., treating them as targets of analysis.
As soon as you do this, they're revealed to exploit only statistical coincidences and highly fragile heuristics embedded within the data provided. And likewise, they are pretty universally discriminatory when human data is involved.