There are many analytics successes, but we rarely hear about failures. Here are some stories and examples of what NOT to do in business analytics.
Below are insights from two leading analytics professionals, Richard Boire and Ted Senator, in response to my request for examples of the Decline Effect in Business Analytics - failure to replicate (KDnuggets News 12:n14).
Richard Boire (Boire Filler Group) writes about a case from many years ago (he noted that the error was not his):
This case is a classic example of how missing one minute detail in the data mining process led to disaster for a well-known Canadian bank. An external supplier (not us) built a logistic response model for the acquisition of new customers for one of the bank's products. The model worked very well on the validation results.
This model was then implemented and actioned on in a subsequent marketing campaign. During development, the modeling tools generated both the solution and the validation results.
During the scoring process, however, the tool did not generate scores automatically. The user had to take the output equation from the model development process and write a scoring routine to score a given list of bank customers, manually multiplying coefficients by variables. The equation then had to be transformed into a logistic function, and as part of this transformation the user had to multiply the entire equation by -1. The user forgot this multiplication by -1 when scoring the list of eligible customers. Guess what happened: names with the highest scores were actually the worst names, and vice versa for the lowest scores. The campaign went out targeting the names with the highest scores, which ultimately produced horrific results.
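The mechanics of this bug can be sketched in a few lines (a hypothetical reconstruction, not the supplier's actual code): forgetting the -1 inside the logistic exponent yields 1 - p instead of p, which exactly reverses the ranking of names.

```python
import math

def logistic_score(coefs, intercept, x, flip_sign=True):
    """Score one customer with a logistic model.

    coefs: model coefficients; x: matching variable values.
    flip_sign=True applies the required multiplication by -1 inside
    the exponent; flip_sign=False reproduces the reported error.
    """
    z = intercept + sum(b * v for b, v in zip(coefs, x))
    exponent = -z if flip_sign else z          # the forgotten -1
    return 1.0 / (1.0 + math.exp(exponent))

# Hypothetical one-variable model: higher value -> more likely to respond
coefs, intercept = [2.0], -1.0
good = logistic_score(coefs, intercept, [1.5])                    # correct
bad = logistic_score(coefs, intercept, [1.5], flip_sign=False)    # buggy
# bad == 1 - good: the best prospects get the lowest scores, so
# ranking by the buggy score targets exactly the wrong names.
```

Since the buggy score is a strictly decreasing function of the correct one, every "top decile" selection under the bug is in fact a bottom-decile selection.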
When the supplier ran the back-end analysis against a random control group of names promoted across all model deciles, they flipped the sign the right way and validated that the model worked quite well. Unfortunately, this did not appease the client, since the bulk of the campaign names were the supposedly targeted names in the top few deciles - who were in fact the worst names. From a net eligible universe of 500M names, the client ended up losing well in excess of $100M.
This scenario might have been prevented by checks and balances in the implementation process. Comparing score distributions and model variable means within the targeted deciles between model development and the current list scoring run would have caught this error: the user would have noticed significant changes in both, investigated the scoring code in detail, and caught and corrected the omitted multiplication by -1. They say the devil is in the details, but in data mining the devil is in the data.
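The check Boire describes can be automated. Below is a minimal sketch (function names, thresholds, and data are illustrative, not from the case): record the mean of a model variable among the top-scoring names at development time, then compare it against the same statistic on the current scoring run; a large shift signals a possible coding error such as a flipped sign.

```python
def top_decile_variable_mean(scores, variable, frac=0.1):
    """Mean of a model variable among the top-scoring `frac` of names."""
    ranked = sorted(zip(scores, variable), reverse=True)
    k = max(1, int(len(ranked) * frac))
    return sum(v for _, v in ranked[:k]) / k

def drift_check(dev_mean, current_scores, variable, tolerance=0.25):
    """Compare the top-decile variable mean at scoring time against the
    value recorded at model development; return True if it shifted by
    more than `tolerance` (relative), signalling a possible coding error."""
    cur = top_decile_variable_mean(current_scores, variable)
    return abs(cur - dev_mean) > tolerance * abs(dev_mean)

# Illustration: a variable the model weights positively.
var = [float(i) for i in range(100)]
good_scores = var[:]              # scores monotone in the variable
dev_mean = top_decile_variable_mean(good_scores, var)
flipped = [-s for s in good_scores]   # the sign-flip bug
# drift_check(dev_mean, flipped, var) fires: the "top" decile now
# holds the names with the lowest variable values.
```

A flipped sign moves the extremes of every model variable to the opposite end of the score ranking, so this one-number check per variable is enough to catch it before the campaign mails.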
A similar thing happened during the first KDD Cup in 1997, where the goal was to select a subset of lapsed donors to contact. An entry from one well-known company selected the worst possible candidates for mailing - their results were significantly worse than random! Apparently their data miners switched a sign somewhere. Fortunately for them, the names of the contestants were kept anonymous.
Ted Senator (a leading researcher in AI/Data Mining, currently VP at SAIC, formerly at DARPA and FINCEN) wrote regarding the Decline Effect. The views he expressed are his personal views and do not necessarily represent the views of SAIC or any of its customers.
It seems to me that there are two different effects here with potentially similar manifestations: (1) overfitting and (2) feedback. These effects are not unrelated, but they are distinct, especially with respect to the types of techniques appropriate to mitigate them.
Overfitting occurs when researchers assume that the future will be like the past and don't account for the fact that the actual data being fit should be thought of as a sample from a space of possible distributions rather than *the* actual distribution. A manifestation of overfitting is, of course, what we call "concept drift" - which may reflect a changing target concept, or a stationary concept with different manifestations, but is typically not assumed to be adversarial.
Feedback occurs in adversarial domains, such as fraud detection. (Pedro Domingos and colleagues introduced this idea to the data mining community in 2004; see Adversarial Classification, by Nilesh Dalvi, Pedro Domingos, Mausam, Sumit Sanghai, and Deepak Verma.)
When the subjects of analysis become aware of the capabilities of a detection system based on a model, they consciously adapt their behavior to avoid, minimize, or reduce their likelihood of being detected. Often, as I explained in my KDD2009 workshop paper, inducing this change in behavior is a far more beneficial effect of deploying the detection model than the actual detections themselves, because the modified behaviors are (1) easier to detect and (2) more complicated and therefore more difficult to execute, reducing the population of people capable of the bad behavior and/or the likelihood that the bad behavior will achieve its intended effects.
What I observed in my work at NASD Regulation is that when we deployed a new model into our surveillance systems, the normalized number of detections started at about 0.5, then increased to its maximum as we gained experience with the model's results, typically over several months. The number of detections then decreased to about 0.1 of the maximum as users adapted their behavior to avoid triggering the detectors - after word spread of follow-up enforcement actions based on these new detectors. We considered this a major benefit, since our real goal was not to detect more fraud but to reduce the amount of fraud in the market. Many colleagues in other organizations that have built fraud detection systems have told me they observed similar effects. I mentioned this phenomenon in my KDD2000 paper.
Two more points:
1. The stock market or other trading environments, where an information advantage or market inefficiency disappears after people become aware of its existence and act on it, is another such "adversarial" domain that isn't in the general category of fraud detection.
2. One reason we have insisted on more rigorous evaluation criteria for "discovery" papers in the application track of KDD is that we often get papers that purport to have discovered something of import in some domain (e.g., a trading system that, based on back testing, could produce excess profits). Such papers are often examples of overfitting and therefore a potential embarrassment to the KDD community. The solution was to insist on "external validation" - which could be publication in a peer-reviewed journal in the field of the purported discovery, or reliance on the purported discovery by domain experts. So in the trading rule example, we would accept a paper that discovered an effect that passed peer review in a finance journal, or a paper that discovered an effect based on which significant amounts of money were being invested. But we would reject a paper as lacking external validation if the discovered knowledge hadn't achieved either of these measures.