Datamining



'Datamining is necessary for survival. If you don't use it to predict a trend before your competitors, you're dead.' Eric Brethenoux (research director for advanced technologies at the Gartner Group) is not the only person who believes that datamining is a technology which cannot be ignored. But what is it, and how does it differ from more 'traditional' data analysis techniques? [1]

Having a database is one thing; making sense of it is quite another. Datamining techniques are used to spot trends in data which many companies already have archived away somewhere, but until now have been unable to fully exploit. The power of datamining lies in the way it does not rely on narrow human queries to produce results, but instead uses technology developed in fields associated with Artificial Intelligence. [4] In this way it can search for and identify relationships which humans may not have had the perception to see. A good way to envisage this is to think of the way a master chess player can distinguish between a computer and a human opponent. A computer will often make moves which a human wouldn't execute because they don't 'look' right. Humans have to minimise their 'search tree' because we don't have the power to explore a large number of moves and outcomes in a sensible period of time. Thus a number of avenues will be eliminated because they don't fit in with what we preconceive to be 'right'.

In the same way, when humans try to find causal relationships within data, we will often make preconceptions about what is, and what isn't, going to be there. Computers, in contrast, have the power to search without pruning to the same extent, and can thus find relationships which a human analyst would never have considered. Datamining uses this fact to produce useful inferences that human analysts would never see. The datamining user doesn't so much pose a question as ask the system to use a model to discover past patterns that predict future behaviour, which can reveal valuable, previously unknown facts. Traditional methods, by contrast, rely on a human to feed in a question or hypothesis. A typical OLAP or DSS question might be 'Did students at Imperial College drink more beer than students at other colleges last term?', whereas a datamining equivalent would be more open-ended, such as 'Give me a model that identifies the most predictive characteristics of students' beer drinking habits'.

Datamining is often confused with on-line analytical processing (OLAP) and decision support systems (DSS), which use deductive reasoning. In contrast to these, datamining uses inductive reasoning. The best results are achieved when great oceans of data are available, in a data warehouse for example. It can be done with less data; it's just that you're more likely to discover interesting, previously unthought-of relationships if you've got more to play with. Part of the power of datamining is that most of the available systems use more than one type of algorithm to search for patterns in the data: a combination of neural networks, induction, association, fuzzy logic, statistical analysis and visualisation. The idea is that the more ways you have of looking for something, the more likely you are to find it. These algorithms are then used in one or more of the following ways to attack the data with which they are presented.

[1]

Predictive modeling: Where OLAP relies on deductive reasoning, predictive modeling infers a model from the data inductively. It can be implemented in a variety of ways, including via neural networks or induction algorithms.
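To make the induction idea concrete, here is a minimal sketch in Python of a '1R'-style rule learner (not any particular product's algorithm): for each attribute it builds a rule mapping each value to the majority outcome, then keeps the attribute whose rule makes the fewest errors on the training records. The records and attribute names are invented for illustration.

```python
from collections import Counter, defaultdict

# Invented training records: (attribute values, outcome).
records = [
    ({"college": "Imperial", "year": 1}, "high"),
    ({"college": "Imperial", "year": 2}, "high"),
    ({"college": "Kings",    "year": 1}, "low"),
    ({"college": "Kings",    "year": 2}, "low"),
    ({"college": "UCL",      "year": 1}, "low"),
]

def one_r(records):
    best = None
    for attr in records[0][0]:
        # Count outcomes seen for each value of this attribute.
        counts = defaultdict(Counter)
        for attrs, outcome in records:
            counts[attrs[attr]][outcome] += 1
        # Rule: map each value to its majority outcome.
        rule = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(1 for attrs, outcome in records
                     if rule[attrs[attr]] != outcome)
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best

attr, rule, errors = one_r(records)
print(attr, rule, errors)  # 'college' separates the outcomes with 0 errors
```

The induced rule ('college' predicts drinking habits) is exactly the kind of model the open-ended question above asks for, discovered without the analyst nominating the attribute in advance.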

Database segmentation: The automatic partitioning of a database into clusters. It generally uses statistical clustering in its implementation.

Link analysis: Identifying connections between records, based on association discovery and sequence discovery.
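A minimal sketch of association discovery, counting how often pairs of items co-occur in the same transaction (the classic market-basket idea); the transactions and support threshold are invented:

```python
from collections import Counter
from itertools import combinations

transactions = [
    {"beer", "crisps", "peanuts"},
    {"beer", "crisps"},
    {"beer", "peanuts"},
    {"milk", "bread"},
]

# Count every unordered pair of items appearing in the same basket.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

min_support = 2  # a pair must appear in at least 2 transactions
frequent = {pair: n for pair, n in pair_counts.items() if n >= min_support}
print(frequent)  # {('beer', 'crisps'): 2, ('beer', 'peanuts'): 2}
```

Sequence discovery extends the same counting to ordered events across time, but the co-occurrence count above is the core of the link.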

Deviation detection: The detection of records that cannot be put into any of the identified segments, together with an explanation of why. This can be implemented via various kinds of statistics.
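A minimal sketch of deviation detection using a simple z-score test, one of the standard statistical approaches (the claim amounts and threshold are invented):

```python
import statistics

def deviations(values, threshold=2.0):
    """Flag values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) > threshold * sd]

claims = [100, 105, 98, 102, 99, 101, 400]
print(deviations(claims))  # the 400 claim stands out: [400]
```

This is the shape of the fraud-detection applications mentioned below: the interesting records are precisely the ones that refuse to fit the segments.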

Systems involving these methods have already been used successfully in areas of great significance such as credit card fraud detection, cancer research and target marketing. [2] Clearly developers are taking the area very seriously, and it's not just specialist companies that are developing the technology. IBM are key players in the datamining market and already have a number of products available, such as their new comprehensive datamining package, the Intelligent Miner toolkit, which uses predictive modeling, database segmentation and link analysis. This system, along with many others, uses preprocessing before the mining takes place. Preprocessing involves applying more standard techniques, such as statistical analysis, before letting the dataminer rip on the information; this helps to speed things up. Another interesting concept is the product recently released by a small company called DataMind, which uses a local search rover that crawls across databases much as intelligent agents crawl the internet.
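As a rough illustration of the kind of preprocessing described above (not any vendor's actual pipeline), the sketch below drops incomplete records and rescales a numeric field to [0, 1] before the data would be handed to a miner; the records and field name are invented:

```python
def preprocess(records, field):
    """Drop records missing `field`, then min-max scale that field to [0, 1]."""
    clean = [r for r in records if r.get(field) is not None]
    lo = min(r[field] for r in clean)
    hi = max(r[field] for r in clean)
    span = hi - lo or 1  # avoid dividing by zero when all values are equal
    for r in clean:
        r[field] = (r[field] - lo) / span
    return clean

rows = [{"spend": 20}, {"spend": None}, {"spend": 120}, {"spend": 70}]
print(preprocess(rows, "spend"))  # [{'spend': 0.0}, {'spend': 1.0}, {'spend': 0.5}]
```

Cheap passes like this shrink and regularise the data, so the expensive mining algorithms spend their time on records that are actually comparable.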

Oh dear! Here comes my P45

The results produced by datamining are usually more general, and hence more powerful, than those produced by traditional techniques. For example, the Knowledge Seeker system produced by Angoss delivers its results in the form of easy-to-grasp decision trees. These results can then be used to create a knowledge base of rules that predict the outcome of data-based decisions. In essence the dataminer is not only exposing contextual cause and effect relationships, it's also delivering models which can be used to predict trends within the data, making the previous role of data analysts redundant. But does this mean that they're all going to be walking out of the door clutching their P45s? It doesn't look that way at the moment. Despite the claims made by many manufacturers about the ease of use of their systems, it still requires a pretty big chunk of data nous to get the whole system running. The results may already be in a form that requires little interpretation to make them usable by the people making the decisions, but many people have more fundamental problems to tackle before datamining is on the agenda. Companies which have adopted datamining are falling over themselves to sing its praises and brandish statements about how it's drastically improved revenue or sales or profit, so why isn't everyone else jumping on the bandwagon?

Datamining can be done on desktop-size machines, but the results are much better if there is a whole warehouse of data for the mining algorithms to work on. Not everyone has facilities of this type available to them yet, but when they do, it is likely that they will simply no longer be able to afford to ignore datamining. If they want to survive, that is.


References:

Excavate Your Data by Cheryl Gerber. Datamation, May 1996, pp 40-43. An informative article that gives a great introduction to datamining. 9/10.

Data Distilleries produce datamining products, and have a good page explaining what datamining is. Very useful. 8/10.

Angoss have a good Web page giving details of their Knowledge Seeker datamining product. It covers a lot of applications, as well as giving a good description of the system from a user's point of view. 7/10.

HNC manufacture a datamining system called the Marksman. Limited information about datamining can be found on their site. 6/10.

'Some Practical Applications of Neural Networks in Information Retrieval' by Michael Philip Oakes and David Reid (Dept. of Computer Science, University of Liverpool). British Computer Society 13th Information Retrieval Colloquium, edited by Tony McEnery, pp 167-185. Quite a technical paper with some relevant material on neural networks and information retrieval. 5/10.
