Fraud detection at GoCardless (YC 11). We use the same tech for training and production classification: logistic regression (sklearn).
-------
Training/retraining
- we train on an ad-hoc basis, every few months right now, moving to a more frequent and regular schedule as we streamline the process
- training is done locally, in memory (we're "medium data", so no need for a distributed process), using a version-controlled IPython notebook
- we extract the parameters from the model and preprocessors fit during retraining, and dump them to a JSON config file in the production classification repo (sketch below)
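The dump step looks roughly like this - a minimal sketch, not our actual code: the feature names are hypothetical and the fitting is stubbed with toy data (Imputer was sklearn's imputation class at the time; it has since been replaced by SimpleImputer):

    import json

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import Imputer, MinMaxScaler

    # Toy stand-ins for the real feature matrix and fraud labels.
    X = np.array([[1.0, 200.0], [3.0, 50.0], [2.0, 120.0], [4.0, 90.0]])
    y = np.array([0, 1, 0, 1])

    # Pipeline order: min-max scaling, imputation, then classification.
    scaler = MinMaxScaler().fit(X)
    imputer = Imputer(strategy="mean").fit(scaler.transform(X))
    clf = LogisticRegression().fit(imputer.transform(scaler.transform(X)), y)

    config = {
        "features": ["feature_a", "feature_b"],  # hypothetical names
        "scaling": {"min": scaler.data_min_.tolist(),
                    "max": scaler.data_max_.tolist()},
        "imputation": {"fill_values": imputer.statistics_.tolist()},
        "model": {"coefficients": clf.coef_[0].tolist(),
                  "intercept": float(clf.intercept_[0])},
    }

    with open("fraud_model_config.json", "w") as f:
        json.dump(config, f, indent=2, sort_keys=True)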
-------
Production classification
- we classify activity in our system on a nightly cron*
- as part of the cron we instantiate the model using the config dumped from the retraining process (sketch after this list). This means the config is fully readable in the git history (amazing for debugging by the wider eng. team if something goes wrong)
- classifications and P(fraud) get sent to the GoCardless core payments service, which then decides whether to escalate cases for manual review
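Instantiating from config is then just reading those numbers back in - a pure-numpy sketch (p_fraud and the field names follow the hypothetical config shape from the training sketch above, not necessarily our production code):

    import json

    import numpy as np

    with open("fraud_model_config.json") as f:
        config = json.load(f)

    mins = np.array(config["scaling"]["min"])
    maxs = np.array(config["scaling"]["max"])
    fills = np.array(config["imputation"]["fill_values"])
    coefs = np.array(config["model"]["coefficients"])
    intercept = config["model"]["intercept"]

    def p_fraud(x):
        """Min-max scale, impute missing values, then logistic regression."""
        x = (np.asarray(x, dtype=float) - mins) / (maxs - mins)
        x = np.where(np.isnan(x), fills, x)  # impute in scaled space
        return 1.0 / (1.0 + np.exp(-(np.dot(coefs, x) + intercept)))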
-------
* We're a payments company, processing Direct Debit bank-to-bank transfers. Inter-bank Direct Debit payments happen slowly (typically >2 days) so we don't need a live service for classifications.
Quite simple as production ML services go, but there are currently only 2 people working on this (we're hiring!).
When using sklearn, I've seen a lot of folks just pickle the model and use that as the interchange format. I like the human-readable interchange format you are using better. I assume you just rolled your own. Why not something like PMML?
Yep, we made our own. I hadn't heard of PMML before - quite cool! What we've made is a bit more readable for what we're using it for, though, IMO. Looks like this:
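(Feature names and numbers here are made up for illustration; the real ones stay private:)

    {
      "features": ["feature_a", "feature_b"],
      "scaling": {"min": [0.0, 0.0], "max": [3650.0, 20.0]},
      "imputation": {"fill_values": [0.21, 0.05]},
      "model": {
        "coefficients": [-1.34, 2.07],
        "intercept": -3.2
      }
    }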
Sadly not. I'd be totally up for open sourcing if there's clear demand. If you can find it, send me an email at angus@{company_I_work_at}.com
Note that it's very tied to our use case right now: it's only compatible with Logistic Regression, it currently assumes fixed hyperparameters (we'll change this in future), and it assumes a production pipeline of min-max scaling, then imputation, then classification.
PMML is fairly verbose and limited to a particular set of models. It's often easier to pickle the models and then keep tagged versions. I think a human readable format could be created, but since most models are just a pile of numbers it's unclear what is gained.
For Logistic Regression we find a human-readable config makes a lot of sense. It's pretty intuitive if there aren't too many features - if the model starts behaving weirdly, we can sometimes track it down to a change in a single feature (especially when viewing recent git diffs).
Can't really talk about features on here. Any smart fraudster should be watching every single thing I say :)
We're using logistic regression not because it performs the best, but because it's the most understandable. When cases get flagged for manual review, people need to know exactly what seems dodgy about the account, and with Logistic Regression you can read off the exact contribution of each feature to the final fraud probability. Seeing as the features mean something real and tangible (unlike in neural nets), a manual reviewer immediately knows which aspects of someone's behaviour are out of the ordinary when presented with a new case (we have a really nice internal UI for this). This saves several minutes per case, which really adds up.
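Concretely, the log-odds are a linear sum, so each feature's contribution is just its coefficient times its (scaled) value. A minimal sketch of that read-out, reusing the hypothetical config values from above:

    import numpy as np

    # Hypothetical names/values, as loaded from the JSON config.
    feature_names = ["feature_a", "feature_b"]
    coefs = np.array([-1.34, 2.07])
    intercept = -3.2

    def explain(x_scaled):
        """Print each feature's contribution to the log-odds of fraud."""
        contributions = coefs * np.asarray(x_scaled, dtype=float)
        p = 1.0 / (1.0 + np.exp(-(intercept + contributions.sum())))
        for name, c in sorted(zip(feature_names, contributions),
                              key=lambda pair: -abs(pair[1])):
            print("%s: %+.3f" % (name, c))
        print("P(fraud) = %.3f" % p)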
Performance-wise Logistic Regression is good, but it can't automatically learn non-linear relationships between a feature's value and the propensity for fraud, and it can't learn that two features together should indicate a probability of fraud greater than the sum of their parts*. If this becomes a problem for us we'll start looking into nonlinear models whose inner workings are still somewhat communicable to the manual review team.
* You can alter feature definitions manually to capture non-linearities (e.g. a feature like "user_has_done_x_and_has_done_y_too"), but this is very, very manual, and potentially needs to be rewritten/manually re-optimised on every retrain. We don't do this.
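(Such a hand-rolled interaction feature is just a conjunction column computed before fitting - hypothetical sketch:)

    import numpy as np

    # Hypothetical binary behaviour flags.
    user_has_done_x = np.array([1, 0, 1, 0])
    user_has_done_y = np.array([1, 1, 0, 0])

    # Conjunction feature: lets the linear model weight the combination
    # beyond the sum of the two individual effects.
    user_has_done_x_and_y = user_has_done_x * user_has_done_y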
Just a note on the human readability of models: for sure, a GLM gives you a human-readable representation for "free", but there are many ways to get the same kind of readability for neural networks. Great article, though, cheers!