Zorba the Hutt (zorbathut) wrote,

on decisions, accuracy, spam, and the american justice system

As humans, we often find it necessary to make decisions based on inadequate data. For example, "which car should I buy?", "is this person serious?", or perhaps, "who wrote the book of love?". (Incidentally, it is my educated position on the last question that Murphy wrote the book of love.) Obviously these questions can be difficult, and often there is no perfect answer. But that's OK because we're humans and we don't need perfect answers, we'll settle for Good Enough answers.

Usually.

There is a specific type of question that shows up very often. The yes/no question. Given a large-but-insufficient amount of data, decide between two results. No second-guessing. No "well, sort of." Answer the question, dammit, and you'd better be right about it.

Example: Is this email spam? How about this one? This one over here?

This is not an easy question. Humans can't do it. (No, really. They can't. There are too many borderline cases.) Computers are basically fucked from the very beginning. None of this stops us from trying, of course, and that's what this is all about.

One very good way of organizing decisions on questions like these is to set up some sort of continuous scale, from "yes" to "no", and then plot our result on it. And often, a good way to start this scale is to start assigning point values. Bright red text is worth 1.0 Spam Points. A little link that says "unsubscribe me" is worth 0.3 Spam Points. A sentence involving the words "viagra", "huge", or "barely legal" is worth 3.0 Spam Points. And so we process a large email, and determine that its final value is 7.8 Spam Points, and so we consign it to the junk pile. Or maybe it's only 0.2 Spam Points, and so that's a good email.
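Here's a rough sketch of that scoring idea in Python. The point values are the ones above; everything else (the function name, the `bright_red_text` flag, splitting sentences on punctuation, plain substring matching) is made up for illustration and is not how any real filter is implemented.

```python
import re

# Trigger words from the example above; substring matching is an assumption.
TRIGGER_WORDS = ("viagra", "huge", "barely legal")

def spam_points(body, bright_red_text=False):
    """Add up Spam Points for one email using the toy rules above."""
    text = body.lower()
    points = 0.0
    if bright_red_text:
        points += 1.0   # bright red text
    if "unsubscribe me" in text:
        points += 0.3   # little unsubscribe link
    # 3.0 points for each sentence containing at least one trigger word.
    for sentence in re.split(r"[.!?]+", text):
        if any(word in sentence for word in TRIGGER_WORDS):
            points += 3.0
    return points

sample = "Viagra at low low prices! Huge savings today. Unsubscribe me"
print(spam_points(sample, bright_red_text=True))
```

Two trigger sentences, one unsubscribe link, and red text: straight to the junk pile.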

But what about the middle cases? For example, maybe your loving and slightly odd midwest grandma is prone to sending long rambling emails about her and the girls playing with their new 18" steel-hard tools on the farm. ("The farm equipment outlet was having a sale! Huge discounts like this seem barely legal.") Obviously we now have a problem. Perhaps Grandma's email is 5.7 Spam Points, so now we'll claim 5.7 and below is not-spam, and 7.8 and above is spam, and you can quickly see what happens here. We narrow down the gap further and further and eventually we end up with a single number that is our Dividing Line and then, whoops, in comes a spam where the spammer used bright *blue* text instead of bright *red* text and we've got spam marked as a real email and our poor enduser is suckered into buying $50 of v1agr4 at low low prices.

Incidentally, this is roughly what SpamAssassin does.

I mean, the Spam Points thing. Not the buying of viagra.

With spam, we have a bit of an escape. We can set up a range inside which we're Not Sure and send it to the user for further review. But inevitably, the outlying fringes of both spam and not-spam will end up, not in the No Man's Land in the middle, but in the actual incorrectly-marked segments. So we've sort of patched up our problem, but in the end we've created a new problem, which has the unfortunate property that it's identical to our old problem, to which we respond "well, fuck."
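The Not Sure range is easy enough to write down. This is a minimal sketch that borrows the 5.7 and 7.8 cutoffs from Grandma's example; the function and parameter names are hypothetical.

```python
def classify(points, ham_at_or_below=5.7, spam_at_or_above=7.8):
    """Three-way verdict: anything between the two cutoffs goes to the user."""
    if points >= spam_at_or_above:
        return "spam"
    if points <= ham_at_or_below:
        return "not spam"
    return "not sure"

for score in (0.2, 5.7, 6.5, 7.8):
    print(score, classify(score))
```

Note that a spam scoring 0.2 still sails through as "not spam". The band in the middle only catches the mail that happens to land in it.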

There's no solution to this. But I'll talk a little about a common fallacy.

If you're trying to improve a process, you need some way of knowing you've improved it. Many people never seem to understand this, and go on long reorganizational sprees based on vague whims, only to undo it all next week because the latest fad has changed. Real Designers don't do this, or at least, when they do, it works. And the only way to tell if it works is to design a useful metric.

Many people - and these are statistics you should be very, very wary of indeed - will speak of "accuracy". Our product is 74% accurate! Yeah, well ours is 98% accurate! This is a totally meaningless number. No, really. It is. Let's imagine we send in a test suite of one thousand emails, and 980 of them are spam. And our carefully designed spam filter blocks every last one of the thousand, the 20 real emails included. Correctly categorized emails: 980. Total emails: 1000. Accuracy: 98%. Happy customers: 0.
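If you want to check that arithmetic, here is the block-everything "filter" run against that test suite. The labels and variable names are just for illustration.

```python
# Test suite from the example: 980 spam, 20 real emails.
labels = ["spam"] * 980 + ["real"] * 20

# The "filter": call everything spam.
predictions = ["spam"] * len(labels)

correct = sum(p == t for p, t in zip(predictions, labels))
accuracy = correct / len(labels)

# Real emails that actually made it to the inbox.
real_delivered = sum(
    1 for p, t in zip(predictions, labels) if t == "real" and p == "real"
)

print(f"accuracy: {accuracy:.0%}, real emails delivered: {real_delivered}")
```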

Even better, let's take a product that's 50% accurate, and use it for our email for a few months with whatever proportion of spam the real world contains. What percentage of the emails that arrive in your inbox will be spam? What percentage of important emails will be lost? I'll let you figure out the answer to this. Come back when you have a number for me.

Really, what we need is false positives and false negatives. How many items were flagged as spam that aren't? How many items weren't flagged as spam, but are? Our cheerful "let's call everything spam" filter up there gets an impressive 0% false negatives and an equally impressive (in an entirely different way) 100% false positives. Now we're getting somewhere. What if I told you that our "50% accurate" filter was 80% false negatives and 20% false positives? We'll miss 20% of our important mail, and 80% of the spam will get through. Note, however, that we can't tell you what percentage of your inbox will be spam - if you get 100 times as much spam as real mail, you'll still be mostly sifting through spam.
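To put numbers on that last point, here's a small sketch. The 80% and 20% rates are the ones above; the mail volumes (10,000 spam to 100 real, i.e. 100 times as much spam) and the function name are assumptions picked for the example.

```python
def inbox_after_filter(n_spam, n_real, false_neg_rate, false_pos_rate):
    """Return (spam reaching the inbox, real mail reaching the inbox, real mail lost)."""
    spam_in_inbox = n_spam * false_neg_rate   # spam the filter missed
    real_lost = n_real * false_pos_rate       # real mail wrongly flagged
    real_in_inbox = n_real - real_lost
    return spam_in_inbox, real_in_inbox, real_lost

spam_in, real_in, lost = inbox_after_filter(10_000, 100, 0.80, 0.20)
spam_share = spam_in / (spam_in + real_in)
print(f"{spam_share:.1%} of the inbox is spam; {lost:.0f} real emails lost")
```

With those volumes the inbox comes out at roughly 99% spam, even though the false positive and false negative rates haven't changed at all.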

Let's step back a minute. Let's say that the Spam Point system we designed before is giving us 5% false positives and 5% false negatives. But we work in a corporate environment, and those false positives really suck. Those could be important emails! What can we do?

Well, it's really simple. Remember our number line, with points plotted on it, where the threshold for "spam vs not spam" is somewhere around 6.8 Spam Points? We shift the threshold up a bit, and make it harder for things to be flagged as spam. Maybe that gets us down to 2% false positives. But, now that "spam" is a harder designation to get, we're up to 8% false negatives. There ain't no such thing as a free lunch.
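You can watch the tradeoff happen on a toy data set. The ten scores below are invented purely for illustration (so the percentages won't match the ones above), and `error_rates` is a hypothetical helper.

```python
# Invented Spam Point scores for ten emails whose true labels we know.
SPAM_SCORES = [9.1, 8.0, 7.4, 7.0, 6.9]
REAL_SCORES = [0.2, 1.5, 3.0, 5.7, 7.1]

def error_rates(threshold):
    """Return (false positive rate, false negative rate) at a given threshold.

    An email is flagged as spam when its score is at or above the threshold.
    """
    false_pos = sum(score >= threshold for score in REAL_SCORES)
    false_neg = sum(score < threshold for score in SPAM_SCORES)
    return false_pos / len(REAL_SCORES), false_neg / len(SPAM_SCORES)

for threshold in (6.8, 7.2):
    fp, fn = error_rates(threshold)
    print(f"threshold {threshold}: {fp:.0%} false positives, {fn:.0%} false negatives")
```

Raising the threshold from 6.8 to 7.2 rescues the one real email that scored 7.1, and lets two spams through in exchange.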

Most good spam filters are tunable in this way, so you can tell them just how much you care about your personal email reaching you vs. all the spam reaching you.

Clearly, though, we can't actually improve the filter by just turning the threshold up and down. We can decide which side of the tradeoff better suits us, but we can't really do anything good without penalizing ourselves also.

Now we're finally getting to the point of this whole entry.

Justice is another good example of this same problem. You have an iffy set of facts (the defendant drives a black convertible, the suspect was observed driving a dark convertible. The defendant just bought a new vacation home, the suspect made away with over 2 million dollars. But - whoops - the defendant has dark hair, and the suspect caught on tape is blonde!) and you need to make a decision. Guilty or not guilty?

Obviously it'd be nice to get perfect results. But, let's be honest here, it's just not going to happen. Many crimes never even get good suspects attached to them - and every once in a while, there's going to be someone who was in the wrong place at the wrong time in the wrong clothes buying exactly the wrong vacation to the Cayman Islands. These things happen. And, for one thing, we've got to choose our cutoff point. 0% false positives? Okay, that's possible, but you're going to get basically 99% false negatives. 0% false negatives? Sure, maybe - once you arrest the vast majority of the country, as well as all the other countries.

It's just not feasible.

And so most people attempt to get the balance shifted one way or another. More lenient judges, less lenient judges. It's not really as binary as I've made it out to be - obviously "one month in jail" is quite far away from "the death penalty". And, just to make the whole war even harder, people don't agree on what punishments are appropriate for what crimes. Or even which crimes are, in fact, crimes. So it's basically a mess.

But, stepping back again, there's one other way you can improve a filter like this. You can improve the algorithm. Perhaps we discover that most spammers are using bright blue text instead of bright red text. Or maybe we add a little chunk of code such that people you've emailed before are automatically whitelisted and marked as "not spam". Or maybe we get really clever and realize that "unsubscribe me" and "horse-sized" are relatively innocuous on their own, but when you combine the two, hoo baby! And suddenly our filter has changed from 5%/5% to 3%/3%.
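Two of those improvements, sketched out. The whitelist check and the combination rule are as described above; the 4.0-point bonus, the function name, and the addresses are made up for the example.

```python
def adjusted_points(base_points, body, sender, contacts):
    """Apply a whitelist and one combination rule on top of a base score."""
    # People you've emailed before are never spam.
    if sender in contacts:
        return 0.0
    text = body.lower()
    # Each phrase is innocuous alone; together they earn a (hypothetical) bonus.
    if "unsubscribe me" in text and "horse-sized" in text:
        return base_points + 4.0
    return base_points

contacts = {"grandma@example.com"}
print(adjusted_points(5.7, "Huge discounts!", "grandma@example.com", contacts))
print(adjusted_points(0.3, "Horse-sized results! Unsubscribe me", "x@example.net", contacts))
```

Unlike moving the threshold, changes like these can pull both error rates down at once, because they change where emails land on the number line rather than where the line is cut.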

As we say in the business, "Woot!"

And this, here, is where I'm going with this entire Frankensteinian entry!

How do we improve the American Justice System in some way other than nudging the cutoff points back and forth?

Unfortunately I have absolutely no idea. And that's where this entry ends.

Aren't you glad you read this journal?

-------

As a footnote, some of you may have realized that the *reason* I'm thinking about all this is that I'm playing with something similar at work. But aside from saying, no, it's not a spam filter or an automated jury, I'm not going to say what it is. You're all just going to have to wait.

Also, the project I was on when I joined is Google Desktop, but they were going into crunch release mode almost exactly when I got there. Which is no fun. Now I'm on something else.