Spain has become reliant on an algorithm to score how likely a domestic violence victim may be abused again and what protection to provide — sometimes leading to fatal consequences.
I really have a hard time deciding if that is the scandal the article makes it out to be (although there is some backpedaling going on). The crucial point is: 8% of the decisions turn out to be wrong or misjudged. The article seems to want us to think that the use of the algorithm is to blame. Yet, is it? Is there evidence that a human would have judged those cases differently?
Is there evidence that the algorithm does a worse job than humans? If not, then the article devolves into blatant fear mongering and the message turns from "algorithm is to blame for deaths" into "algorithm unable to predict the future in 100% of cases", which of course it can't...
The article mentions that one woman (Stefany González Escarraman) went for a restraining order the day after the system deemed her at "low risk" and the judge denied it referring to the VioGen score.
One was Stefany González Escarraman, a 26-year-old living near Seville. In 2016, she went to the police after her husband punched her in the face and choked her. He threw objects at her, including a kitchen ladle that hit their 3-year-old child. After police interviewed Ms. Escarraman for about five hours, VioGén determined she had a negligible risk of being abused again.
The next day, Ms. Escarraman, who had a swollen black eye, went to court for a restraining order against her husband. Judges can serve as a check on the VioGén system, with the ability to intervene in cases and provide protective measures. In Ms. Escarraman’s case, the judge denied a restraining order, citing VioGén’s risk score and her husband’s lack of criminal history.
About a month later, Ms. Escarraman was stabbed by her husband multiple times in the heart in front of their children.
It also says:
Spanish police are trained to overrule VioGén’s recommendations depending on the evidence, but accept the risk scores about 95 percent of the time, officials said. Judges can also use the results when considering requests for restraining orders and other protective measures.
You could argue that the problem isn't so much the algorithm itself as it is the level of reliance upon it. The algorithm isn't unproblematic though. The fact that it just spits out a simple score: "negligible", "low", "medium", "high", "extreme" is, IMO, an indicator that someone's trying to conflate far too many factors into a single dimension. I have a really hard time believing that anyone knowledgeable in criminal psychology and/or domestic abuse would agree that 35 yes or no questions would be anywhere near sufficient to evaluate the risk of repeated abuse. (I know nothing about domestic abuse or criminal psychology, so I could be completely wrong.)
Apart from that, I also find this highly problematic:
[The] victims interviewed by The Times rarely knew about the role the algorithm played in their cases. The government also has not released comprehensive data about the system’s effectiveness and has refused to make the algorithm available for outside audit.
Could a human have judged it better? Maybe not. I think a better question to ask is, "Should anyone be sent back into a violent domestic situation with no additional protection, no matter the calculated risk?" And as someone who has been on the receiving end of that conversation and later narrowly escaped a total-family-annihilation situation, I would say no...no one should be told that, even though they were in a terrifying, life-threatening situation, they will not be provided protection, and no further steps will be taken to keep them from being injured again, or from being killed next time. But even without algorithms, that happens constantly...the only thing the algorithm accomplishes is that the investigator / social worker / etc doesn't have to have any kind of personal connection with the victim, so they don't have to feel some kind of way for giving an innocent person a death sentence because they were just doing what the computer told them to.
Final thought: When you pair this practice with the ongoing conversation around the legality of women seeking divorce without their husband's consent, you have a terrifying and consistently deadly situation.
the only thing the algorithm accomplishes is that the investigator / social worker / etc doesn’t have to have any kind of personal connection with the victim
This even works for people pulling the trigger. Following orders, dura lex, sed lex, et cetera ad infinitum.
IMO this place is far more of an echo chamber than Reddit. Both places have their share of team-based opinions, but Reddit's diversity IMO is better at surfacing it.
An algorithm is never to blame. Some pencil-necked desk jockey decided the criteria for getting help that the algorithm was built on; the blame is entirely on them.
That said, I doubt it would make any difference if a human was in the loop. An algorithm is still an algorithm, even if it's applied by a human. We usually just call that a "policy" though. People have been murdered by the paper sea for decades before we started calling it "algorithms".
Since 2007, about 0.03 percent of Spain’s 814,000 reported victims of gender violence have been killed after being assessed by VioGén, the ministry said. During that time, repeat attacks have fallen to roughly 15 percent of all gender violence cases from 40 percent, according to government figures.
“If it weren’t for this, we would have more homicides and gender-based violence,” said Juan José López Ossorio, a psychologist who helped create VioGén and works for the Interior Ministry.
So no, not a scandal, it seems it is helping, but perhaps could be better. At least that's my read.
It reminds me of the debate around self driving cars. Tesla has a flawed implementation of self driving tech, that's trying to gather all the information it needs through camera inputs vs using multiple sensor types. This doesn't always work, and has led to some questionable crashes where it definitely looks like a human driver could have avoided the crash.
However, even with Tesla's flawed self-driving, they're supposed to have far fewer wrecks than human drivers. According to Tesla's safety report, Teslas in self-driving mode average 5-6 million miles per accident vs 1-1.5 million miles for Tesla drivers not using self-driving (the US average is 500-750k miles per accident).
So a system like this doesn't have to be perfect to do a far better job than people can, but that doesn't mean it won't feel terrible for the unlucky people who things go poorly for.
Teslas in self-driving mode tend to be used on main roads, while most accidents per mile happen on small side streets. People are also much safer where Teslas are driven than these statistics suggest.
My impression from the article is more that they're not doing any kind of garbage-in assessment: nobody is making sure they're getting answers about the right person (eg: some women date more than one guy) and some women don't feel safe giving accurate answers to the police, and there aren't good failsafes available for when it's wrong; you're forced to hire legal counsel and pursue a change via the courts.
The article is not about how the AI is responsible for the death. It's likely that the woman would have died in the counterfactual.
The question is not "how effective is AI"? The question is should life or death decisions be made by an electrified Oracle at Delphi. You must answer this question before "is AI effective" becomes relevant.
If somebody was adjudicating traffic court with Tarot cards, would you ask: well how accurate are the cards compared to a judge?
Decisions should be made by whomever or whatever is most effective. That's not even a debate. If the tarot cards were right more often than the judge, fire the judge and get me a deck. Because the judge is clearly ineffective.
You can't privilege an approach just because it sounds more reasonable. It also has to BE more reasonable. It's crazy to say "I'm happy being wrong because I'm more comfortable with the process"
The trick of course is to find fair ways to measure effectiveness accurately and make sure it's repeatable. That's a rabbit hole of challenges.
The crucial point is: 8% of the decisions turn out to be wrong or misjudged.
The article says:
Yet roughly 8 percent of women who the algorithm found to be at negligible risk and 14 percent at low risk have reported being harmed again, according to Spain’s Interior Ministry, which oversees the system.
Granted, neither "negligible" nor "low risk" means "no risk", but I think 8% and 14% are far too high for those categories.
Furthermore, there's this crucial bit:
At least 247 women have also been killed by their current or former partner since 2007 after being assessed by VioGén, according to government figures. While that is a tiny fraction of gender violence cases, it points to the algorithm’s flaws. The New York Times found that in a judicial review of 98 of those homicides, 55 of the slain women were scored by VioGén as negligible or low risk for repeat abuse.
So in the 98 murders they reviewed, the algorithm put more than 50% of them at negligible or low risk for repeat abuse. That's a fucking coin flip!
So in the 98 murders they reviewed, the algorithm put more than 50% of them at negligible or low risk for repeat abuse. That’s a fucking coin flip!
This is not at all how you interpret that number.
Let's say there is a group of 100 people, and 8 of them get killed. If you randomly split them into two groups of 50 and predicted that everyone in one group would be killed, you would expect to be right about 4 and wrong about 46.
But say you predict that 96 will be fine and 4 will be murdered, and all 4 of those are murdered. Well, of the 8 killed, you would only have been right about 50%.
Just a coin flip? Even though you were right 96% of the time?
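To make the arithmetic in that example concrete, here is a minimal sketch using only the made-up numbers from the comment above (100 people, 8 killed, 4 flagged, all 4 flagged correctly). None of these figures come from VioGén itself; the variable names are just for illustration.

```python
# Illustrative numbers from the comment above, NOT real VioGén data.
total = 100          # people assessed
killed = 8           # people who were later killed
flagged = 4          # people predicted to be murdered
true_positives = 4   # flagged people who were in fact killed

false_negatives = killed - true_positives          # killed but scored as fine
false_positives = flagged - true_positives         # flagged but not killed
true_negatives = total - killed - false_positives  # scored as fine and were fine

# Share of all 100 predictions that were correct.
accuracy = (true_positives + true_negatives) / total
# Share of the 8 victims that the prediction caught.
recall = true_positives / killed
# Share of the 8 victims that had been scored as fine.
victims_scored_fine = false_negatives / killed

print(f"accuracy: {accuracy:.0%}")
print(f"victims caught: {recall:.0%}")
print(f"victims scored as fine: {victims_scored_fine:.0%}")
```

The two percentages answer different questions: one is measured over everybody who was assessed, the other only over the people who ended up dead. Looking only at the reviewed homicides gives you the second kind of number.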
You'll get that result without an algorithm as well unfortunately. A domestic violence interview often doesn't result in you getting the truth of what happens because the victim is often economically and emotionally dependent on their partner. It's helpful to have an algorithm that makes you ask the right questions but there's still no way I know of to get the right answers of those questions from a victim 100 percent of the time.
The algorithm itself is just a big "whatever". The key issue here is that some assumptive piece of shit decided to conclude, based on partial information, that those women would be safe in the future.
Not because they can't be done right and you can't teach people to use them.
But because there's a slippery slope of human nature where people want to offload the burden of decision to a machine, an oracle, a die, a set of bird intestines. The genie is out and they will do that again and again, but in a professional organization, like police, one can make a decision of creating fewer opportunities for such catastrophes.
The rule is that people shouldn't use machines above their brains, as one other commenter says, and they should only use this in a logical OR with their own judgment made earlier, as another commenter says, but the problem is in human nature and I'd rather not introduce this particular point of failure to police, politics, anything juridical and military.
Even when given the best and most sophisticated tools and equipment available, police will manage to fuck things up at every opportunity because they're utterly incompetent.
But the system seems to be better than police officers. Which is entirely believable. Humans have all kinds of biases that make the decisions we make far less than desirable.
Per the article, it has decreased the risk of repeated violence and, according to an expert, it's the best system we have. Why would you want to go back to a worse system? This is using our brains in an attempt to overcome our biases.
The way to use these kinds of systems is to have the judge come to an independent decision, then, after that's keyed in, the AI spits out theirs and whichever predicts more danger is then acted on.
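As a rough sketch of that "whichever predicts more danger" rule: the five level names are the ones the article lists, but the `Risk` enum and the `combined_risk` function are hypothetical names made up for this example, not anything the real system is known to do.

```python
from enum import IntEnum


class Risk(IntEnum):
    """The five levels named in the article, ordered from least to most danger."""
    NEGLIGIBLE = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    EXTREME = 4


def combined_risk(human: Risk, algorithm: Risk) -> Risk:
    """Return the more dangerous of the two assessments.

    The human assessment is meant to be recorded before the algorithm's
    score is revealed, so the score can only raise the level, never lower it.
    """
    return max(human, algorithm)


# The judge sees serious danger, the score says negligible: the judge's view stands.
print(combined_risk(Risk.HIGH, Risk.NEGLIGIBLE).name)
# The judge sees little danger, the score says high: the score raises the level.
print(combined_risk(Risk.LOW, Risk.HIGH).name)
```

Under this rule a low score could never be cited as a reason to deny protection, since it can only add to what the human already decided.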
Relatedly, the way you have an AI select people and companies to get spot-checked by tax investigators is not to show investigators the AI scores, but mix in AI suspicions among a stream of randomly selected people.
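A minimal sketch of that blinding idea, assuming a simple list of case IDs. The function name `build_audit_queue` and its parameters are hypothetical; no real tax authority's process is being described here.

```python
import random


def build_audit_queue(flagged, population, n_random, seed=None):
    """Mix model-flagged cases with randomly chosen ones and shuffle them.

    Investigators get the queue without any scores, so they cannot tell
    which cases the model suspected and which are random controls.
    """
    rng = random.Random(seed)
    flagged_set = set(flagged)
    # Draw the random controls only from cases the model did not flag.
    pool = [case for case in population if case not in flagged_set]
    controls = rng.sample(pool, n_random)
    queue = list(flagged) + controls
    rng.shuffle(queue)
    return queue


queue = build_audit_queue(["a", "b"], ["a", "b", "c", "d", "e", "f"], n_random=3, seed=1)
print(len(queue))
```

A side benefit of the random controls is that they give you a baseline to compare the model's hit rate against.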
Relatedly, the way you have AI involved in medical diagnoses is not to tell the human doctor results, but suggest additional tests to be made. The "have you ruled out lupus" approach.
And from what I've heard the medical profession actually got that right from the very beginning. They know what priming and bias is. Law enforcement? I fear we'll have to ELI5 them the basics for the next five hundred years.
I don't think there's any AI involved. The article mentions nothing of the sort, it's at least 17 years old (according to the article) and the input is 35 yes/no questions, so it's probably just some points assigned for the answers and maybe some simple arithmetic.
Edit: Upon a closer read I discovered the algorithm was much older than I first thought.
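To illustrate what "points assigned for the answers and some simple arithmetic" could look like, here is a purely hypothetical sketch. The weights and cutoffs below are invented, since the real algorithm has not been released for audit; only the 35 yes/no inputs and the five labels come from the article.

```python
from bisect import bisect_right

# The five labels the article says the system outputs.
LABELS = ["negligible", "low", "medium", "high", "extreme"]

# INVENTED cutoffs: a score below 3 is "negligible", 3 to 7 is "low", and so on.
CUTOFFS = [3, 8, 15, 24]

# INVENTED weights: one point per "yes". A real system would presumably
# weight some questions more heavily than others.
WEIGHTS = [1] * 35


def risk_label(answers, weights=WEIGHTS):
    """Sum the weights of all 'yes' answers and map the total to a label."""
    if len(answers) != len(weights):
        raise ValueError("expected one answer per question")
    score = sum(w for yes, w in zip(answers, weights) if yes)
    return LABELS[bisect_right(CUTOFFS, score)]


# Five "yes" answers out of 35 under these made-up numbers.
answers = [True] * 5 + [False] * 30
print(risk_label(answers))
```

Whatever the real weights are, a scheme like this can only see what the 35 answers capture, which is the concern raised elsewhere in the thread.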
Sounds like an expert system then (just judging by the age), which was AI before the whole machine learning craze. In any case, you need to take the same kind of care when integrating them into whatever real-world structures there are.
Medicine used them with quite some success, the problem being that they take a long time to develop because humans need to input expert knowledge, and then they get outdated quite quickly.
Back to the system though: 35 questions is not enough for these kinds of questions. And that's not an issue of number of questions, but things like body language and tone of voice not being included.
so it’s probably just some points assigned for the answers and maybe some simple arithmetic.
Why yes, that's all that machine learning is, a bunch of statistics :)
Having worked in making software for almost 3 decades, including in Finance both before and after the 2008 Crash, this blind reliance on algorithms for law enforcement and victim protection scares the hell out of me.
An algorithm is just an encoding of whatever the people who made it think will happen: it's like using those actual people directly, only worse because by need an algorithm has a fixed set of input parameters and can't just ask more questions when something "smells fishy" as a person would.
Also, making judgements by "entering something in a form" has a tendency to close people's thinking - instead of pondering on it and using their intuition to, for example, notice from the way people are talking that they're understating the gravity of the situation, people filling in a form tend to do it mindlessly, like a box-ticking exercise - and that's not even going into the whole "As long as I just fill the form my ass is covered" effect: when the responsibility is delegated to the algorithm, people play it safe and don't dispute the results even when their instincts say otherwise.
For anybody who has experience with modelling, using computer algorithms within human processes and with how users actually treat such things (the "computer says" effect) this shit really is scary at many levels.
It's a sentiment at least as old as the first things that we now call computers.
On two occasions I have been asked, "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" … I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
About 20 new cases of gender violence arrive every day, each requiring investigation. Providing police protection for every victim would be impossible given staff sizes and budgets.
I think machine-learning is not the key part, the quote above is. All these 20 people a day come to the police for protection, a very small minority of them might be just paranoid, but I'm sure that most of them had some bad shit done to them by their partner already and (in an ideal world) would all deserve some protection.
The algorithm's "success" is defined in the article as reducing the probability of repeat attacks, especially the ones eventually leading to death.
The police are trying to focus on the ones who are deemed to be the most at risk. A well-trained algorithm can help reduce the risk vs the judgement of the possibly overworked or inexperienced human handling the complaint? I'll take that. But people are going to die anyway. Just, hopefully, a bit less of them and I don't think it's fair to say that it's the machine's fault when they do.
The computer response should be treated as just an indication, and in all cases a human needs to decide whether to override it.
Otherwise we’ll all become useless pieces of a simulation
I went to the bank to ask for a loan and it got rejected because the computer said I didn’t meet the parameters by just 40 euro. Ah ok, I told the clerk, just lower the amount that I’m asking for or spread it over a longer period. No, because after the quote is done and I signed the authorization for the algorithm to perform the credit score, it can’t do it again for 3 months. What?? Call a supervisor and let them override it, 40 euro is so minimal that it’s not that big an issue. No, impossible. So that means every single employee in the bank is just an interface to the computer and can be fired at will?
Despite this article, I'm still not convinced that the algorithms aren't better. The policy states that people need to use their best judgement and can override the algorithms. The article argues that the algorithms are being over relied on. The article mentions in passing, however, that the statistics were worse before the algorithm was introduced.
The point of the matter is, best judgement can be shitty. Your average cop has no idea what questions to ask without a list, or how important each one is according to the research. Some suggestions are to continue using the tool but have people like psychologists administer it. The only way you could reasonably have a psych on call for every police station is to make it a remote interview, which frankly doesn't seem better to me.
In the end, the unstated problem is resources and how best to utilize them to prevent the violence. I'm sure Spain's policy could be improved but shoring it up with an algorithm is a good practice.
Algorithms aren't AI. They're standardization measures in cases like this. Hell you don't even need computers for many of them. We use tons in healthcare to classify risk, decide on treatment options, and even decide on how much medication to give. They're particularly present in psychiatric care.
I have no issues with using ML to predict outcomes. It's going to be wrong sometimes, so will humans. The system just needs review and input from humans understanding the model.