CMPS-MC24 Literature Survey: Successful Applications of Data Mining
Matthew Sparkes
January, 2004
Abstract
This literature survey will focus on both the successful applications of data mining and those areas where it is proposed that it could be successful, and what techniques have been used in those instances.
1 Introduction
Data mining is a technique of data analysis that allows the user to discover previously unknown relationships within data and to apply them usefully to a problem. It has become very important in many industries because their conventional means of analysis are inadequate for the vast amount of data that they store. It allows organisations to become more profitable and efficient by furnishing them with information that could not ordinarily be discovered. This paper examines different fields in which data mining is currently being used and the extent to which it has been successful.
2 Successful Applications of Data Mining
2.1 Data Mining in Criminal Investigations
Data mining has been used in many fields of criminal investigation, as outlined here. The scope of each type of investigation varies greatly, so each is discussed separately in this section.
2.1.1 Data Mining in Forensic Investigations
There are many different databases used in forensic investigations. The police can make use of vast records of fingerprints, DNA, and even records as varied as shoe prints. These records all take one of three forms: images taken from a crime scene, images taken from suspects, or images to be used as control samples (e.g. a shoe print from each popular shoe type). [Geradts] The analysis technique used depends upon the sort of image in question. For fingerprint analysis, measures such as Euclidean distance and the Henry classification scheme are used. Searches based upon one type of data are simple enough; although some advanced analysis techniques have been developed for matching, they are all structured similarly.
Data mining, it has been suggested, can take this analysis to another level by looking for patterns across all of these different databases. This has not been implemented in the Geradts paper [Geradts], but the paper does state that it would be of huge benefit. It would enable police to discover links that would be impossible to find manually, and it could conceivably solve currently unsolved crimes, as the DNA database did when it was brought online.
Another paper describes an implemented system called Coplink, in use in the American legal system. It uses a ‘concept space’, which is a database with weighted associations, that is, an evaluated relevance between all records. This allows users to discover other appropriate records. For example, if a shoe print is taken at a crime scene, then other crimes of a similar nature where a similar shoe print was found would be flagged. Likewise, known criminals with a record of that crime who are known to own that type of shoe would also be highlighted. [Hauck 02]
2.1.2 Data Mining in Credit Card Fraud Detection
The spending pattern of most card users is surprisingly predictable: the weekly shop, the average monthly spending of disposable income and where it is spent do not vary drastically. Data mining can be used to find anomalous data such as a larger than usual payment, or a payment to an unlikely recipient. This is often a sign that credit card fraud is occurring. One suggested method of finding anomalous transactions is to use several algorithms to analyse each set of transactions and then to use a metaclassifier to collate these analyses. The metaclassifier then flags certain transactions as illegitimate far more accurately than any one algorithm could. [Chan 99] One limitation of this approach is that the algorithms do not learn in any dynamic sense; the paper states that adaptive algorithms must be used, as spending patterns change and criminals also adapt and learn. This multi-algorithm approach proved to be very effective in experiments and has also lent itself well to finding illegal intrusions into networks, as the two data sets (network logs and transaction logs) are inherently similar. [Chan 99]
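The idea of collating several analyses with a metaclassifier can be illustrated with a minimal sketch. It is not the method of [Chan 99]: the three base classifiers, their thresholds, the field names and the accuracy-weighted vote are all assumptions made purely for illustration, and the transactions are invented.

```python
from typing import Callable, Dict, List, Tuple

Transaction = Dict[str, float]
Classifier = Callable[[Transaction], bool]  # True means "looks fraudulent"

# Hypothetical base classifiers; the thresholds are illustrative only.
def large_amount(tx: Transaction) -> bool:
    return tx["amount"] > 5 * tx["avg_amount"]

def unusual_merchant(tx: Transaction) -> bool:
    return tx["merchant_seen_before"] == 0

def odd_hour(tx: Transaction) -> bool:
    return tx["hour"] < 5

def train_meta(classifiers: List[Classifier],
               labelled: List[Tuple[Transaction, bool]]) -> List[float]:
    """Weight each base classifier by its accuracy on labelled transactions."""
    weights = []
    for clf in classifiers:
        correct = sum(1 for tx, is_fraud in labelled if clf(tx) == is_fraud)
        weights.append(correct / len(labelled))
    return weights

def meta_classify(classifiers: List[Classifier], weights: List[float],
                  tx: Transaction) -> bool:
    """Flag tx when the weighted vote for fraud exceeds half the total weight."""
    fraud_weight = sum(w for clf, w in zip(classifiers, weights) if clf(tx))
    return fraud_weight > sum(weights) / 2

def tx(amount, hour, seen):
    return {"amount": amount, "avg_amount": 100.0,
            "hour": hour, "merchant_seen_before": seen}

# Invented labelled history: (transaction, was it fraud?)
labelled = [(tx(900, 3, 0), True), (tx(40, 14, 1), False),
            (tx(120, 12, 0), False), (tx(700, 2, 1), True)]

classifiers = [large_amount, unusual_merchant, odd_hour]
weights = train_meta(classifiers, labelled)
suspicious = meta_classify(classifiers, weights, tx(650, 4, 1))
ordinary = meta_classify(classifiers, weights, tx(50, 13, 0))
```

Because the weights are fixed once computed, the sketch shares the limitation noted above: it would have to be retrained on fresh labelled data to follow changing spending patterns.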
2.1.3 Data Mining in Computer Network Security
The main aim of mining data for computer security is to examine logs to find unusual patterns such as an irregular log-in time. This can often help to discover illegal intrusions into the network. This is a successful application of data mining and, helpfully, the target data is generated by a computer, so no data cleansing needs to be performed.
Research into an entirely visual representation of network activity has been conducted, based on the premise that humans can take in visual data at 150 Mb/s. [Yurcik et al] The idea is that one screen can show the state of many thousands of computers connected to the network, representing each as a two-pixel square of varying colour. The information for this display is taken from network logs that have been mined for any anomalous data that should be highlighted.
Another visual system, called Mining Alarming Incidents from Data Streams (MAIDS), has been developed. It applies clustering techniques to a stream of network information (currently synthesised logs) and uses pie charts to represent traffic and the classification assigned to it by the system (illegal, legal, etc.). [Dora et al] This is a particularly interesting package, as tests have shown it to be very well suited not just to detecting illegal intrusions but also to monitoring any constant stream of structured data. This would enable it to be implemented in a real-time monitoring scenario.
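The simplest case mentioned above, spotting an irregular log-in time, can be sketched as follows. This is an illustrative assumption rather than the approach of either cited system: the log format (user, hour of day), the two-hour tolerance and the user names are all hypothetical.

```python
from collections import defaultdict

def circular_hour_distance(a, b):
    """Distance between two hours on a 24-hour clock (23 and 1 are 2 apart)."""
    d = abs(a - b) % 24
    return min(d, 24 - d)

def build_profiles(history):
    """Collect the set of hours at which each user has logged in before."""
    profiles = defaultdict(set)
    for user, hour in history:
        profiles[user].add(hour)
    return profiles

def is_irregular(profiles, user, hour, tolerance=2):
    """Flag a log-in that is not within `tolerance` hours of any earlier one."""
    usual = profiles.get(user)
    if not usual:
        return True  # no history for this user: treat as unusual
    return min(circular_hour_distance(hour, h) for h in usual) > tolerance

# Invented log history: (user, hour of log-in)
history = [("alice", 9), ("alice", 10), ("alice", 17), ("bob", 23), ("bob", 0)]
profiles = build_profiles(history)

alice_at_3am = is_irregular(profiles, "alice", 3)
alice_at_11am = is_irregular(profiles, "alice", 11)
bob_at_1am = is_irregular(profiles, "bob", 1)
```

Measuring distance around the clock matters for users such as the hypothetical "bob", whose usual log-ins straddle midnight.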
2.2 Data Mining in Tax Administration
There are many areas of tax administration where data mining has proved useful. Perhaps the most successful application has been in selecting audit targets. As governments have only a limited amount of resources with which to enforce tax payment, they must choose carefully whom to audit in order to maximise their return. [Micci 04] This paper outlines a technique for achieving this, although it merely uses Clementine, an existing commercial data mining package, and breaks no new ground with its approach. It is, however, a perfect example of a simple use for data mining: taking existing data about tax offenders and trying to match that profile to current targets. A database of tax offenders is built up and patterns are found in that data; these patterns are then searched for in the data space that includes all citizens. It has shown itself to be a useful tool for using an existing budget far more effectively and efficiently. [Micci 04]
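The profile-matching idea can be sketched in a few lines. This is not what Clementine or [Micci 04] does internally: the centroid profile, the Euclidean ranking, the two features (imagined as normalised deduction ratio and cash income share) and all values are assumptions chosen for illustration.

```python
import math

def centroid(rows):
    """Average feature vector of the known offenders (the 'profile')."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def rank_audit_targets(offenders, population, budget):
    """Return the `budget` taxpayers whose features lie closest to the profile."""
    profile = centroid(offenders)
    scored = sorted(population.items(),
                    key=lambda item: math.dist(item[1], profile))
    return [taxpayer_id for taxpayer_id, _ in scored[:budget]]

# Invented, normalised features: [deduction_ratio, cash_income_share]
offenders = [[0.8, 0.9], [0.6, 0.7]]
population = {
    "T1": [0.10, 0.20],
    "T2": [0.70, 0.75],
    "T3": [0.90, 0.30],
    "T4": [0.65, 0.85],
}

targets = rank_audit_targets(offenders, population, budget=2)
```

The `budget` parameter stands in for the limited enforcement resources: only the closest matches to the offender profile are selected for audit.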
2.3 Data Mining in Insurance Risk Assessment
In insurance risk assessment it can be beneficial to find previously unknown patterns in claim data. For example, if it is generally assumed that young (under 25) male drivers are a very high risk, then it would be very profitable to discover that members of this demographic who also happen to drive classic cars are a very low risk. This is found to be the case, often because of the amount of care taken over such cars. It would be highly profitable for an insurance company to discover this before its competitors, as it would enable the company to offer a specialist package for this demographic. It is often the case that the profit made from this previously unknown sector can easily cover the cost of the IT infrastructure needed to unearth it, with the additional benefit that the company has increased its market share.
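A minimal sketch of how such a low-risk niche might surface from claim data is given below. The policy records are invented, and the rule used (a sub-segment counts as a niche when its claim rate is below half that of its parent age band) is an assumption for illustration, not a technique taken from the literature surveyed.

```python
from collections import defaultdict

def claim_rates(policies, key):
    """Claim rate (claims / policies) for each segment returned by `key`."""
    totals = defaultdict(lambda: [0, 0])  # segment -> [claims, policy count]
    for p in policies:
        seg = key(p)
        totals[seg][0] += p["claimed"]
        totals[seg][1] += 1
    return {seg: claims / count for seg, (claims, count) in totals.items()}

def low_risk_niches(policies, ratio=0.5):
    """Sub-segments whose claim rate is below `ratio` times their age band's."""
    by_age = claim_rates(policies, lambda p: p["age_band"])
    by_age_car = claim_rates(policies, lambda p: (p["age_band"], p["car"]))
    return sorted(seg for seg, rate in by_age_car.items()
                  if rate < ratio * by_age[seg[0]])

def make(age_band, car, outcomes):
    """Build invented policy records; 1 means a claim was made."""
    return [{"age_band": age_band, "car": car, "claimed": c} for c in outcomes]

policies = (make("under25", "standard", [1, 1, 1, 1, 0])
            + make("under25", "classic", [0, 0, 1, 0, 0])
            + make("25plus", "standard", [0, 1, 0, 0, 0]))

niches = low_risk_niches(policies)
```

With these made-up figures the under-25 band as a whole has a claim rate of 0.5, while its classic car sub-segment has a rate of 0.2 and is therefore reported.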
2.4 Data Mining in Aviation Safety
Mining data in aviation safety records is an important exercise for discovering trends in accidents. This information may be used to ensure that the same type of event does not happen in the future. One paper discusses the problems inherent in mining both structured and unstructured data. [Bloedorn] Queries could be made on both types, but they would only be useful if exactly the same string was found in both areas, which would obviously not flag all relevant data. The strategy proposed in this paper is a hybrid of strict boolean, ordinal and vector-based matching. [Bloedorn] The approach also makes use of stemming, whereby variations of the same word (e.g. engineering, engineered and engineer) are treated as the same in order to increase the number of relevant results. Stop words are also taken into account, as in search engines like Google, meaning that words such as ‘in’, ‘of’ and ‘and’ are omitted due to their frequency in normal text and their irrelevance to the query.
Another technique, MAIDS, which uses clustering on streams of data, could also be very well suited to this field, although it is only really appropriate in real time because it processes streams of data. It certainly could not work on unstructured text records, but it is easy to see it being used on black box data from aircraft, either in real time or in simulated real time after the event. It has proved itself on network intrusion problems and could conceivably extract anomalous data from the enormous number of metrics that are regularly recorded on aircraft. [Dora et al]
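The stemming and stop-word steps described for text matching can be sketched as below. The suffix-stripping stemmer is a deliberately crude stand-in (real systems typically use something like the Porter stemmer), and the stop-word list, the suffix list and the sample report text are assumptions for illustration only.

```python
import re

STOP_WORDS = {"in", "of", "and", "the", "a", "to"}  # illustrative subset
SUFFIXES = ("ing", "ed", "s")                        # toy stemmer rules

def stem(word):
    """Strip one known suffix, keeping at least four characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

def normalise(text):
    """Lower-case, tokenise, drop stop words, then stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]

def matches(query, report):
    """True when every normalised query term occurs in the report."""
    return set(normalise(query)) <= set(normalise(report))

# Invented free-text safety report
report = "Failure of the engineered part in flight"

terms = normalise(report)
hit = matches("engineering failure", report)
miss = matches("engine fire", report)
```

An exact string search for "engineering failure" would miss this report, whereas the normalised match finds it because both forms reduce to the same stem.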
2.5 Data Mining in Retail
The use of data mining in retail is becoming far more extensive. In recent years many large retail companies have increased the amount of data they store about customers, and with that comes a greater need for the ability to analyse that data. By offering store discount cards such as the Tesco Clubcard, it is possible to gather data about an individual and link that to what they buy. By mining this data many useful connections can be made between products, e.g. the customer buying Product A will often buy Product B in the same trip. These are said to be complementary products and may include pasta and pasta sauce, for example. These connections can be used in designing the layout of the store: placing these two products amongst impulse buy items and in separate aisles may greatly increase the average spend of a customer. This kind of analysis is often called market basket analysis. [Apte 97]
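A minimal sketch of market basket analysis on item pairs follows, using the standard support and confidence measures. The baskets and the two thresholds are invented for illustration, and a production system would use a full frequent-itemset algorithm rather than this pairwise count.

```python
from collections import Counter
from itertools import combinations

def pair_rules(baskets, min_support=0.5, min_confidence=0.7):
    """Find rules 'lhs -> rhs' between item pairs that are bought together.

    support    = share of baskets containing both items
    confidence = share of baskets with lhs that also contain rhs
    """
    n = len(baskets)
    item_counts = Counter(item for b in baskets for item in set(b))
    pair_counts = Counter(pair for b in baskets
                          for pair in combinations(sorted(set(b)), 2))
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)

# Invented shopping baskets, one list per trip
baskets = [
    ["pasta", "pasta sauce", "bread"],
    ["pasta", "pasta sauce"],
    ["pasta", "pasta sauce", "milk"],
    ["bread", "milk"],
    ["milk", "pasta sauce"],
]

rules = pair_rules(baskets)
```

Each rule is a tuple of (lhs, rhs, support, confidence); with these baskets only the pasta and pasta sauce pair clears both thresholds, in both directions.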
2.6 Data Mining in Targeted Marketing
As previously mentioned, retail companies retain data about their customers, linking purchase information with customer information by using store loyalty cards. This data is also used to decide to whom certain advertising material should be sent, in order to maximise the likelihood that the advert will be well received. For example, if a customer regularly buys petrol on a store card, it is likely that they own a car, so it would be relevant to send information about car loans to that customer. [Apte 97] This type of data represents a powerful marketing tool and enables the company both to cut the cost of marketing and to increase the positive response to that marketing. Before data mining this would have been a paradox.
3 Conclusion
Data mining has found many different applications through necessity, because almost every organisation in the world is undergoing a data storage crisis. The amount of data in the world doubles every year and our capacity to analyse that data is simply not keeping up. Data warehousing provides a solution to the storage problem, but it is useless without analysis. Data mining can solve this issue, making the data warehouse not only justifiable but indispensable to the large modern corporation.
Data mining is still a very young technology. There are three main types of data mining (clustering, predictive modelling and frequent pattern extraction), and these models are basically the same whatever use they are put to. The innovation and advancement is coming from programmers and analysts who are finding ever more intelligent ways to use them to extract meaningful information from data. [Apte 97]
Companies are making use of this new technology and its use is increasing. Worldwide spending on data mining is currently estimated at $539 million and, if current growth continues, it will reach $1.85 billion by 2006, which indicates that there is a demand for fast and efficient analysis of vast amounts of data. Companies and organisations have been collecting information for decades and now it seems that they can finally start to use it productively rather than file it all away for occasional reference. [Leavitt 02]
It may, however, have negative connotations for the consumer. It raises privacy issues surrounding the maintenance of records of customer and citizen activity, and companies must ensure that, in trying to become more efficient, they do not cross ethical and moral boundaries in what data they collect and how they use that information. [Nascio 04]
References
[Yurcik et al] “Two Visual Computer Network Security Monitoring Tools Incorporating Operator Interface Requirements”, William Yurcik, James Barlow, Kiran Lakkaraju, Mike Haberman, National Center for Supercomputing Applications (NCSA)
[Clare] “Data mining the yeast genome in a lazy functional language”, Amanda Clare, Ross D. King, Computational Biology Group, Department of Computer Science, University of Wales Aberystwyth
[Micci 04] “Improving Tax Administration with Data Mining”, Daniele Micci-Barreca, PhD, Satheesh Ramachandran, PhD, White Paper, Elite Analytics, LLC, 2004
[Bloedorn] “Mining Aviation Safety Data: A Hybrid Approach”, Eric Bloedorn, The MITRE Corporation
[Nascio 04] “Think Before You Dig: Privacy Implications of Data Mining Aggregation”, National Association of State Chief Information Officers, USA, September 2004
[Geradts] “Data Mining in Forensic Image Databases”, Zeno Geradts, Jurrien Bijhold, Netherlands Forensic Institute
[Leavitt 02] “Data Mining for the Corporate Masses?”, Neal Leavitt, Computer (Magazine), May 2002
[Apte 97] “Data Mining - An Industrial Research Perspective”, C. Apte, IEEE Computational Science and Engineering, April-June 1997
[Chan 99] “Distributed Data Mining in Credit Card Fraud Detection”, Philip K. Chan (Florida Institute of Technology), Wei Fan, Andreas L. Prodromidis and Salvatore J. Stolfo (Columbia University), IEEE Intelligent Systems, November/December 1999
[Hauck 02] “Using Coplink to Analyze Criminal-Justice Data”, Roslin V. Hauck, Homa Atabakhsh, Pichai Ongvasith, Harsh Gupta, Hsinchun Chen, University of Arizona, IEEE Computing Practices, 2002
[Dora et al] “MAIDS: Mining Alarming Incidents from Data Streams”, Y. Dora Cai, David Clutter, Greg Pape, Jiawei Han, Michael Welge, Loretta Auvil, University of Illinois at Urbana-Champaign, U.S.A., Demonstration Proposal