Fraud detection at GoCardless (YC 11). We use the same tech for training and production classification: logistic regression (sklearn).
-------
Training/retraining
- we train on an ad-hoc basis, every few months right now, moving to a more frequent and regular schedule as we streamline the process
- training is done locally, in memory (we're "medium data", so no need for a distributed process), using a version-controlled IPython notebook
- we extract the parameters from the model and preprocessors fit during retraining, and dump them to a JSON config file in the production classification repo (sketch below)
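The dump step looks roughly like this - a minimal sketch, not our actual code: the feature names are hypothetical and the fitting is stubbed with toy data (Imputer was sklearn's imputation class at the time; it has since been replaced by SimpleImputer):

    import json

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import Imputer, MinMaxScaler

    # Toy stand-ins for the real feature matrix and fraud labels.
    X = np.array([[1.0, 200.0], [3.0, 50.0], [2.0, 120.0], [4.0, 90.0]])
    y = np.array([0, 1, 0, 1])

    # Pipeline order: min-max scaling, imputation, then classification.
    scaler = MinMaxScaler().fit(X)
    imputer = Imputer(strategy="mean").fit(scaler.transform(X))
    clf = LogisticRegression().fit(imputer.transform(scaler.transform(X)), y)

    config = {
        "features": ["feature_a", "feature_b"],  # hypothetical names
        "scaling": {"min": scaler.data_min_.tolist(),
                    "max": scaler.data_max_.tolist()},
        "imputation": {"fill_values": imputer.statistics_.tolist()},
        "model": {"coefficients": clf.coef_[0].tolist(),
                  "intercept": float(clf.intercept_[0])},
    }

    with open("fraud_model_config.json", "w") as f:
        json.dump(config, f, indent=2, sort_keys=True)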
-------
Production classification
- we classify activity in our system on a nightly cron*
- as part of the cron we instantiate the model using the config dumped from the retraining process (sketch after this list). This means the config is fully readable in the git history (amazing for debugging by the wider eng. team if something goes wrong)
- classifications and P(fraud) get sent to the GoCardless core payments service, which then decides whether to escalate cases for manual review
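Instantiating from config is then just reading those numbers back in - a pure-numpy sketch (p_fraud and the field names follow the hypothetical config shape from the training sketch above, not necessarily our production code):

    import json

    import numpy as np

    with open("fraud_model_config.json") as f:
        config = json.load(f)

    mins = np.array(config["scaling"]["min"])
    maxs = np.array(config["scaling"]["max"])
    fills = np.array(config["imputation"]["fill_values"])
    coefs = np.array(config["model"]["coefficients"])
    intercept = config["model"]["intercept"]

    def p_fraud(x):
        """Min-max scale, impute missing values, then logistic regression."""
        x = (np.asarray(x, dtype=float) - mins) / (maxs - mins)
        x = np.where(np.isnan(x), fills, x)  # impute in scaled space
        return 1.0 / (1.0 + np.exp(-(np.dot(coefs, x) + intercept)))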
-------
* We're a payments company, processing Direct Debit bank-to-bank transfers. Inter-bank Direct Debit payments happen slowly (typically >2 days) so we don't need a live service for classifications.
Quite simple as production ML services go, but there are currently only 2 people working on this (we're hiring!).
When using sklearn, I've seen a lot of folks just pickle the model and use that as the interchange format. I like the human-readable interchange format you are using better. I assume you just rolled your own. Why not something like PMML?
Yep, we made our own. I hadn't heard of PMML before - quite cool! What we've made is a bit more readable for what we're using it for, though, IMO. Looks like this:
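(Feature names and numbers here are made up for illustration; the real ones stay private:)

    {
      "features": ["feature_a", "feature_b"],
      "scaling": {"min": [0.0, 0.0], "max": [3650.0, 20.0]},
      "imputation": {"fill_values": [0.21, 0.05]},
      "model": {
        "coefficients": [-1.34, 2.07],
        "intercept": -3.2
      }
    }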
Sadly not. I'd be totally up for open sourcing if there's clear demand. If you can find it, send me an email at angus@{company_I_work_at}.com
Note that it's very tied to our use case right now: it's only compatible with Logistic Regression, it currently assumes fixed hyperparameters (we'll change this in future), and it assumes a production pipeline of min-max scaling, then imputation, then classification.
PMML is fairly verbose and limited to a particular set of models. It's often easier to pickle the models and then keep tagged versions. I think a human readable format could be created, but since most models are just a pile of numbers it's unclear what is gained.
For Logistic Regression we find a human-readable config makes a lot of sense. It's pretty intuitive if there aren't too many features - if the model starts behaving weirdly, we can sometimes track it down to a change in a single feature (especially when viewing recent git diffs).
Can't really talk about features on here. Any smart fraudster should be watching every single thing I say :)
We're using logistic regression not because it performs the best, but because it's the most understandable. When cases get flagged for manual review, people need to know exactly what seems dodgy about the account, and with Logistic Regression you can read off the exact contribution of each feature to the final fraud probability. Seeing as the features mean something real and tangible (unlike in neural nets), a manual reviewer immediately knows which aspects of someone's behaviour are out of the ordinary when presented with a new case (we have a really nice internal UI for this). This saves several minutes per case, which really adds up.
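Concretely, the log-odds are a linear sum, so each feature's contribution is just its coefficient times its (scaled) value. A minimal sketch of that read-out, reusing the hypothetical config values from above:

    import numpy as np

    # Hypothetical names/values, as loaded from the JSON config.
    feature_names = ["feature_a", "feature_b"]
    coefs = np.array([-1.34, 2.07])
    intercept = -3.2

    def explain(x_scaled):
        """Print each feature's contribution to the log-odds of fraud."""
        contributions = coefs * np.asarray(x_scaled, dtype=float)
        p = 1.0 / (1.0 + np.exp(-(intercept + contributions.sum())))
        for name, c in sorted(zip(feature_names, contributions),
                              key=lambda pair: -abs(pair[1])):
            print("%s: %+.3f" % (name, c))
        print("P(fraud) = %.3f" % p)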
Performance-wise Logistic Regression is good, but it can't automatically learn non-linear relationships between a feature's value and the propensity for fraud, and it can't learn that two features together should indicate a probability of fraud greater than the sum of their parts*. If this becomes a problem for us we'll start looking into nonlinear models whose inner workings are still somewhat communicable to the manual review team.
* You can alter feature definitions manually to capture non-linearities (e.g. a feature like "user_has_done_x_and_has_done_y_too"), but this is very, very manual, and potentially needs to be rewritten/manually re-optimised on every retrain. We don't do this.
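(Such a hand-rolled interaction feature is just a conjunction column computed before fitting - hypothetical sketch:)

    import numpy as np

    # Hypothetical binary behaviour flags.
    user_has_done_x = np.array([1, 0, 1, 0])
    user_has_done_y = np.array([1, 1, 0, 0])

    # Conjunction feature: lets the linear model weight the combination
    # beyond the sum of the two individual effects.
    user_has_done_x_and_y = user_has_done_x * user_has_done_y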
Just a note on the human readability of models: for sure, a GLM gives you a human-readable representation for "free", but there are many ways to get the same kind of readability for neural networks. Great article, though, cheers!