Best Practices in Prototyping

Common Prototyping Problem – Information Leakage

Information leakage – when information from the future is incorrectly and inadvertently used as part of an AI/ML prediction task – is one of the most pernicious problems affecting AI/ML prototyping efforts, and it routinely confounds data scientists.

Good models use relevant, available information – inputs – about past and current states to make an inference – a prediction – about the future (e.g., is a customer going to churn?) or about other data the model does not have access to (e.g., is a customer engaging in money laundering?).

Information leakage occurs when a model inadvertently has access to future information presented as model inputs. In a customer attrition prediction problem, for example, the leakage can be overt – say, the customer has already closed their account but not yet completed the final transaction – or more nuanced – say, the customer engages in a type of transaction that is only available to those who have already engaged services with other businesses.

The information leakage challenge exists because most AI/ML problems have a strong temporal element. Data therefore have to be carefully represented over time. But most real-world data sets are complex, come from disparate databases, are updated at differing frequencies and time granularities, and follow complex business rules. Often, no one individual at a company understands all the data in scope for a problem. Plus, data scientists are often unaware of the underlying data complexities.
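To make the temporal element concrete, one common safeguard is to assemble training features strictly from records time-stamped at or before a prediction cutoff, and to attach labels observed only after that cutoff. The sketch below illustrates the idea in pandas; the tables, column names, and cutoff date are hypothetical placeholders rather than a prescription for any particular system.

```python
import pandas as pd

# Hypothetical transaction history and churn outcomes (illustrative data only).
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3],
    "event_time": pd.to_datetime(
        ["2023-01-05", "2023-04-20", "2023-02-11", "2023-07-02", "2023-03-15"]
    ),
    "amount": [120.0, 80.0, 45.0, 60.0, 200.0],
})
outcomes = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "churned_after_cutoff": [0, 1, 0],
})

CUTOFF = pd.Timestamp("2023-06-30")  # features may only use data up to this date

# Keep only history visible at the cutoff, then aggregate it into features.
visible = transactions[transactions["event_time"] <= CUTOFF]
features = (
    visible.groupby("customer_id")
    .agg(txn_count=("amount", "size"), total_spend=("amount", "sum"))
    .reset_index()
)

# Labels come from outcomes observed after the cutoff; nothing dated later than
# the cutoff ever appears on the feature side.
training_set = features.merge(outcomes, on="customer_id", how="inner")
print(training_set)
```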

Information leakage often presents itself in the form of terrific model results during prototyping efforts, but poor results when models are transferred into production.

It can be easy to diagnose information leakage if prototyping results are “too good to be true.” However, in some cases information leakage can occur even when results seem to be reasonable.

One example of information leakage comes from IBM. In 2010, data scientists at IBM developed a machine learning model to predict which potential customers would purchase IBM software products. The inputs to the model included information about each potential customer as of 2006, and the goal was to predict who would become a customer by 2010. However, the IBM team did not have access to historical snapshots of customer websites from 2006. They instead used current website data from 2010 as an input to train the model, reasoning that it could substitute for “real” 2006 data.


Figure 32: Overview of IBM machine learning model built in 2010 to predict which customers would purchase IBM products

At first IBM was pleased to see very good results from the machine learning model. But upon analyzing the relative weights of feature contributions to model predictions, the team quickly realized a disappointing fact: The top distinguishing characteristic that caused the model to predict which customers would purchase IBM products was the customer website data. At first, that seemed to be a reasonable input to the model. But because the website data was current as of 2010 when the model was trained, the data included names of IBM products that customers had already purchased. Put simply, the IBM team accidentally included labels identifying who became a customer in their training data.


Figure 33: IBM team had information leakage since model inputs – website data – included explicit labels of outputs (who became customers by 2010)

Another example of information leakage comes from our own work in AI/ML-based fraud detection. In this case, we were seeking to predict cases of electricity theft using information from smart connected electricity meters, work order systems, electricity grid systems, customer information systems, and fraud investigation systems. The data volumes and data complexity were significant.

One of the first prototype versions of our models had terrific performance. But we soon realized that the model used a specific work order code – one of scores of codes – to predict a theft event from the official fraud database. It turned out that the fraud investigation system was time-delayed: the work order code was an early entry that some investigators made after a fraud event – so it was not predictive – to mark a specific customer as a potential fraud case before the official adjudicated database entry appeared.

This kind of issue can be incredibly complex to debug, especially in feature spaces with many thousands of features and data from dozens of databases. One approach to addressing information leakage of this nature is to “mask out” a buffer time period before labels – that is, to exclude information that falls within close temporal proximity, say two to three days, of the label being predicted. The specific configuration of the mask requires an understanding of the business problem to be solved and the nuances of the data sets and databases.
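A minimal sketch of that masking idea, assuming each feature record and each label carries a customer identifier and a timestamp (the table names, columns, and three-day buffer below are illustrative assumptions): records that fall inside the buffer window immediately before a customer's label time are dropped before features are built.

```python
import pandas as pd

BUFFER = pd.Timedelta(days=3)  # exclude records within 3 days of the label

# Hypothetical work-order records and adjudicated fraud labels.
work_orders = pd.DataFrame({
    "customer_id": [10, 10, 11, 11],
    "event_time": pd.to_datetime(
        ["2023-05-01", "2023-05-29", "2023-04-10", "2023-04-12"]
    ),
    "order_code": ["A17", "F02", "A17", "B05"],
})
labels = pd.DataFrame({
    "customer_id": [10, 11],
    "label_time": pd.to_datetime(["2023-05-31", "2023-06-15"]),
    "is_fraud": [1, 0],
})

# Attach each record's label time, then drop anything inside the buffer window
# immediately before the label (those records may reflect the outcome itself).
merged = work_orders.merge(labels, on="customer_id", how="inner")
masked = merged[merged["event_time"] <= merged["label_time"] - BUFFER]
print(masked[["customer_id", "event_time", "order_code", "is_fraud"]])
```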

Other approaches involve programmatically analyzing the correlations between variables and labels and closely examining those that appear to be “too good to be true.”
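One simple programmatic screen is sketched below with synthetic, hypothetical data: compute each candidate feature's correlation with the label and flag anything above a threshold for manual review. The column names and the threshold are assumptions for illustration; the appropriate association measure and cutoff depend on the data at hand.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Synthetic feature table with one deliberately leaky column for illustration.
label = rng.integers(0, 2, size=n)
df = pd.DataFrame({
    "tenure_months": rng.normal(24, 6, size=n),
    "monthly_spend": rng.normal(80, 20, size=n),
    "post_event_flag": label + rng.normal(0, 0.05, size=n),  # near-copy of the label
    "label": label,
})

# Flag features whose absolute correlation with the label looks "too good to be true".
THRESHOLD = 0.9
correlations = (
    df.drop(columns="label").corrwith(df["label"]).abs().sort_values(ascending=False)
)
suspects = correlations[correlations > THRESHOLD]
print("Correlation with label:")
print(correlations)
print("Review these columns for possible leakage:")
print(list(suspects.index))
```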

Finally, examining the feature contributions – the explainability – of the AI/ML algorithm can provide valuable clues about potential information leakage events.
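As an illustration of that kind of check, the sketch below fits a scikit-learn classifier on synthetic data containing a deliberately leaky feature and inspects permutation importances; a single feature that carries almost all of the importance is a classic signature worth investigating. The data set and feature names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 2000

# Synthetic data set with one leaky feature (a noisy copy of the label).
y = rng.integers(0, 2, size=n)
X = pd.DataFrame({
    "tenure_months": rng.normal(24, 6, size=n),
    "monthly_spend": rng.normal(80, 20, size=n),
    "post_event_flag": y + rng.normal(0, 0.05, size=n),  # leakage
})

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# A single feature dominating the importances suggests it may encode the label.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for name, score in sorted(
    zip(X.columns, result.importances_mean), key=lambda item: -item[1]
):
    print(f"{name:>16}: {score:.3f}")
```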