Every day, trillions of dollars change hands in financial transactions of every sort. Hidden in this ocean of money are fraudulent actors who attempt to launder illicit funds by making their provenance appear legitimate. No one can manually verify every transaction, so a number of machine learning algorithms try to spot the fraudulent ones. The problem is how we evaluate which algorithm is better than the others.
Ada Matilde Gige found an interesting problem in the standard evaluation workflow of money laundering detectors during her master’s work, supervised by Lasse Alsbirk and myself. This resulted in the paper I’m writing about today: “Evaluating fraud detection algorithms in a decentralized scenario”, which appeared a couple of days ago in the journal Royal Society Open Science.
The problem lies in the information the algorithms receive during their training phase. One issue with working in this field is that data is hard to come by: because it involves financial transactions, real-world data is tightly controlled by the financial institutions. The solution is to use synthetic data. Luckily, IBM has created a handy and quite large dataset that simulates a world economy – many different banks, currencies, countries – and is free to use. In it, they injected fraudulent transactions following money laundering patterns and labeled them as such. Naturally, it became the cornerstone of the evaluation workflow for anti-money laundering detectors.
I’m going to oversimplify the machine learning pipeline, but essentially it works by splitting the data into two parts: the training set and the test set. You give the algorithm, say, 80% of the transactions, selected at random, to train on. The algorithm will hopefully learn the patterns by seeing which transactions were actually fraudulent and which ones were not. To know whether it actually learned, you then give it the remaining 20% of the transactions, which it has never seen before, hide whether they were fraudulent or not, and ask the algorithm to guess.
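As a rough illustration of the split described above, here is a minimal Python sketch. The toy transaction list, the `split_train_test` helper, and the fixed seed are all assumptions made for the example, not part of the paper or of the IBM dataset.

```python
import random


def split_train_test(transactions, train_fraction=0.8, seed=0):
    """Shuffle the transactions and cut them into a training and a test set."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    shuffled = list(transactions)      # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy data (hypothetical): pairs of (transaction_id, is_fraud).
transactions = [(tx_id, tx_id % 10 == 0) for tx_id in range(100)]

train, test = split_train_test(transactions)

# At test time the labels are hidden: the detector only sees the ids.
hidden_test = [tx_id for tx_id, _ in test]
true_labels = {tx_id: is_fraud for tx_id, is_fraud in test}
```

The detector would be trained on `train`, asked to guess on `hidden_test`, and its guesses compared against `true_labels`.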
And the algorithms guess quite well! In one version of the IBM dataset, the state of the art has a precision of 85% – meaning 85% of all flagged transactions are actually fraudulent – and a recall of 74% – meaning it captures 74% of all fraudulent transactions: a fraudulent transaction has only about one chance in four of passing unnoticed.
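To make the two measures concrete, here is a small sketch of how precision and recall are computed from counts. The counts below are hypothetical: they were picked only because they reproduce the 85% and 74% figures, and they do not come from the paper.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Precision: share of flagged transactions that are truly fraudulent.
    Recall: share of truly fraudulent transactions that were flagged."""
    flagged = true_positives + false_positives
    fraudulent = true_positives + false_negatives
    return true_positives / flagged, true_positives / fraudulent


# Hypothetical counts chosen to match the quoted figures:
# 740 flagged transactions, 850 fraudulent ones, 629 in both groups.
precision, recall = precision_recall(
    true_positives=629, false_positives=111, false_negatives=221
)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```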
However, Ada and Lasse argued that this standard machine learning pipeline does not work for the money laundering case. The IBM dataset contains data from different banks and different countries. If you take a random 80% sample of all transactions to train on, you’re de facto assuming that there exists an international agency that can access data from all the banks in all the countries in the world. Does this agency exist? No. Should it exist? Probably not, as it would wield an amount of power over the world that even the purest of the angels would be tempted to abuse.
What happens in the real world is that the banks will only analyze the transactions that they process directly. They will flag some transactions as suspicious and send them to the national authority. The national authority will receive only the suspicious data from all the banks. Since this is still more than can be managed manually, they will also run some sort of machine learning, to identify the most suspicious transactions. It is a two-step pipeline with some non-trivial interactions between two machine learning algorithms. This is the realistic pipeline the money laundering detectors should be evaluated with.
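The two-step structure can be sketched as follows. This is not the paper's implementation: the `fit_threshold` toy learner, the `two_step_flags` helper, and the handful of transactions are assumptions made up to show the flow of data, with each bank training only on its own transactions and the authority seeing only what the banks report.

```python
def fit_threshold(train):
    """Toy learner (an assumption, not a real detector): flag every amount at
    least as large as the smallest fraudulent amount seen in training."""
    fraud_amounts = [tx["amount"] for tx in train if tx["fraud"]]
    if not fraud_amounts:
        return lambda tx: False  # never saw fraud, so it never flags anything
    cutoff = min(fraud_amounts)
    return lambda tx: tx["amount"] >= cutoff


def two_step_flags(train, test, fit_bank, fit_authority):
    """Return the test transactions still flagged after both steps."""
    # Step 1: one model per bank, trained only on that bank's transactions.
    banks = {tx["bank"] for tx in train}
    bank_models = {
        b: fit_bank([tx for tx in train if tx["bank"] == b]) for b in banks
    }

    def reported(txs):
        return [tx for tx in txs
                if tx["bank"] in bank_models and bank_models[tx["bank"]](tx)]

    # Step 2: the authority trains and predicts only on reported transactions.
    authority_model = fit_authority(reported(train))
    return [tx for tx in reported(test) if authority_model(tx)]


def tx(tx_id, bank, amount, fraud):
    return {"id": tx_id, "bank": bank, "amount": amount, "fraud": fraud}


# Hypothetical toy data: bank B has no fraud at all in its training set.
train = [tx(1, "A", 10, False), tx(2, "A", 20, False), tx(3, "A", 90, True),
         tx(4, "A", 95, True), tx(5, "B", 10, False), tx(6, "B", 50, False)]
test = [tx(101, "A", 92, True), tx(102, "A", 30, False),
        tx(103, "B", 95, True), tx(104, "A", 91, False)]

flagged_ids = [t["id"] for t in
               two_step_flags(train, test, fit_threshold, fit_threshold)]
fraud_ids = {t["id"] for t in test if t["fraud"]}
recall = len(fraud_ids & set(flagged_ids)) / len(fraud_ids)
```

In this toy run, bank B reports nothing because it never saw a fraudulent example, so the fraudulent transaction 103 never reaches the authority and the end-to-end recall is one half.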
What happens if we do? In the same scenario as before, precision takes a bit of a hit: it goes down from the original 85% to 75%. This is unfortunate, but not critical. The real problem is with recall, which plummets from 74% to 24%. Now a fraudulent transaction has three chances in four of passing unnoticed: the failure rate has roughly tripled!
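The arithmetic behind that comparison is simply one minus the recall. A tiny sketch, using only the two recall figures quoted above (the `miss_rate` helper is a name introduced here for illustration):

```python
def miss_rate(recall):
    """Share of fraudulent transactions that pass unnoticed."""
    return 1.0 - recall


one_step = miss_rate(0.74)   # about 0.26, roughly one in four
two_step = miss_rate(0.24)   # about 0.76, roughly three in four
ratio = two_step / one_step  # a bit under 3

print(f"{one_step:.2f} -> {two_step:.2f}, a factor of {ratio:.1f}")
```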
And here’s another serving of food for thought. Remember that in this realistic pipeline we have to apply two detectors: one for the banks and one for the central authority. If we were to apply the state-of-the-art method, the one that performs best in the one-step evaluation, at both steps, we would get an even worse result, with recall going down to 7%. To get to 24% we need either the banks or the central authority to use a simpler method.
Why is this happening? Likely, it’s a combination of two things. At the first step, each bank can only use its own data. This means that each independent training set is much smaller, and therefore an overly complex state-of-the-art method might not be able to learn enough from so little data. At the second step, the central authority no longer receives a random sample of transactions, but a selected sample of the most suspicious ones. This likely breaks the assumptions on which a money laundering detector is founded.
The good news is that all of this is fixable, although it requires a lot of coordination between all the stakeholders. Banks can partner and use federated learning, each keeping its own data while exploiting what the algorithm has learned from the other banks’ data. Central authorities should coordinate with the banks and agree on which algorithms each should be using, because if these actors don’t talk and each uses the state-of-the-art method, this will likely result in a worse outcome. Finally, machine learning practitioners should use our updated two-step evaluation pipeline to get a proper picture of how a machine learning algorithm would perform when used in the real world.
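For readers unfamiliar with federated learning, here is a minimal sketch of one common flavor, federated averaging, where each bank shares only model parameters and never its transactions. The post does not say which federated scheme would be appropriate, so the `federated_average` helper and the numbers below are purely illustrative assumptions.

```python
def federated_average(local_updates):
    """Combine per-bank model weights into one global model.

    local_updates: list of (weights, n_samples) pairs, one per bank.
    Each bank's weights count in proportion to its number of training samples.
    """
    total_samples = sum(n for _, n in local_updates)
    n_weights = len(local_updates[0][0])
    return [
        sum(weights[i] * n for weights, n in local_updates) / total_samples
        for i in range(n_weights)
    ]


# Hypothetical local models from two banks of different sizes.
updates = [
    ([0.25, 1.0], 100),  # small bank: weights learned on 100 transactions
    ([0.5, 2.0], 300),   # larger bank: weights learned on 300 transactions
]
global_weights = federated_average(updates)
```

Only the weight vectors and the sample counts leave each bank, which is what lets the raw transaction data stay where it is.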



































