InuyashaBlog: 2008

Tuesday, December 16, 2008

Examples for data Mining Application

Examples for data Mining Application

US Treasury Department
Worth particular mention is a system developed by the Financial Crimes Enforcement Network (FINCEN) of the US Treasury Department called "FAIS". FAIS detects potential money-laundering activities from a large number of big cash transactions. The Bank Secrecy Act of 1971 required the reporting of all cash transactions greater than $10,000, and these transactions, of about fourteen million a year, are the basis for detecting suspicious financial activities. By combining user expertise with the system's rule-based reasoner, visualization facilities, and association-analysis module, FIAS uncovers previously unknown and potentially high-value leads for possible investigation. The reports generated by the FIAS application have helped FINCEN uncover more than 400 cases of money-laundering activities, involving more than $1 billion in potentially laundered funds. In addition, FAIS is reported to be able to discover criminal activities that law enforcement in the field would otherwise miss, e.g., connections in cases involving nearly 300 individuals, more than eighty front operations, and thousands of cash transactions.

Mellon Bank, USA
Mellon Bank has used the data on existing credit-card customers to characterize their behavior and they try to predict what they will do next. Using IBM Intelligent Miner, Mellon developed a credit card-attrition model to predict which customers will stop using Mellon's credit card in the next few months. Based on the prediction results, the bank can take marketing actions to retain these customers' loyalty.

Capital One Financial Group
Financial companies are one of the biggest users of data-mining technology. One such user is Capital One Financial Corp., one of the nation's largest credit-card issuers. It offers 3000 financial products, including secured, joint, co-branded, and college-student cards, Using data-mining techniques, the company tries to help market and sell the most appropriate financial product to 150 million potential prospects residing in its over 2-terabyte Oracle-based datawarehouse. Even after a customer has signed up, Capital One continues to use data mining for tracking the ongoing profitability and other characteristics of each of its customers. The use of data mining and other strategies has helped Capital One expand from $1 billion to $12.8 billion in managed loans over eight years. An additional successful data-mining application at Capital One is fraud detection.

American Express
Another example of data mining is at American Express, where dataware-housing and data mining are being used to cut spending. American Express has created a single Microsoft SQL Server database by merging its worldwide purchasing system, corporate purchasing card, and corporate card databases. This allows American Express to find exceptions and patterns to target for cost cutting.

MetLife, Inc.
MetLife's Intelligent Text Analyzer has been developed to help automate the underwriting of 260,000 life insurance applications received by the company every year. Automation is difficult because the applications include many free form text fields. The use of keywords or simple parsing techniques to understand the text fields has proven to be inadequate, while the application of full semantic natural-language processing was perceived to be too complex and unnecessary. As a compromise solution, the "information-extraction" approach was used in which the input text is skimmed for specific information relevant to the particular application. The system currently processes 20,000 life-insurance applications a month and it is reported that 89% of the text fields processed by the system exceed the established confidence-level threshold.

Data Mining Tools

Data Mining Tools

AgentBase/Marketeer
AgentBase/Marketeer is, according to its designers, the industry's first second-generation data-mining product. It is based on emerging intelligent-agent technology. The system comes with a group of wizards to guide a user through different stages of data mining. This makes it easy to use. AgentBase/Marketeer is primarily aimed at marketing applications. It uses several datamining methodologies whose results are combined by intelligent agents. It can access data from all major sources, and it runs on Windows95, Windows NT, and the Solaris operating system.

ANGOSS Knowledge Miner
ANGOSS Knowledge Miner combines ANGOSS Knowledge Studio with proprietary algorithms for clickstream analysis; it interfaces to Web logreporting tools.

Autoclass III
Autoclass is an unsupervised Bayesian classification system for independent data. It seeks a maximum posterior probability to provide a simple approach to problems such as classification, clustering, and general mixture separation. It works on Unix platforms.

BusinessMiner
BusinessMiner is a single-strategy, easy-to-use tool based on decision trees. It can access data from multiple sources including Oracle, Sybase, SQL Server, and Teradata. BusinessMiner runs on all Windows platforms, and it can be used stand-alone or in conjunction with OLAP tools.

CART
CART is a robust data-mining tool that automatically searches for important patterns and relationships in large data sets and quickly uncovers hidden structures even in highly complex data sets. It works on the Windows, Mac, and Unix platforms.

Clementine
Clementine is a comprehensive toolkit for data mining. It uses neural networks and rule-induction methodologies. The toolkit includes data manipulation and visualization capabilities. It runs on Windows and Unix platforms and accepts the data from Oracle, Ingres, Sybase, and Informix databases. A recent version offers sequence association and clustering for Web-data analyses.

Darwin (now part of Oracle)
Darwin is an integrated, multiple-strategy tool that uses neural networks, classification and regression trees, nearest-neighbor rule, and genetic algorithms. These techniques are implemented in an open client/server architecture with a scalable, parallel-computing implementation. The client-side unit can work on Windows and the server on Unix. Darwin can access data from a variety of networked data sources including all major relational databases. It is optimized for parallel servers.

DataEngine
DataEngine is a multiple-strategy data-mining tool for data modeling, combining conventional data-analysis methods with fuzzy technology, neural networks, and advanced statistical techniques. It works on the Windows platform.

Data Mining Suite
Data Mining Suite is a comprehensive and integrated set of data-mining tools. The main tools are IDIS (Information Discovery System) for finding classification rules, IDIS-PM (Predictive Modeler) for prediction and forecasting, and IDIS-Map for finding geographical patterns. Data Mining Suite supports client/server architecture and runs on all major platforms with different database-management systems. It also discovers patterns of users' activities on Web sites.

Data Surveyor
Data Surveyor is a single-strategy (classification) tool. It consists of two components: a front-end a back-end. The front-end is responsible for data mining using the tree-generation methodology. The back-end consists of a fast, parallel, database server where the data are loaded from a user's databases. The back-end runs on parallel Unix servers and the front-end works with Unix and Windows platforms.

DataMind
DataMind's architecture consists of two components: DataCruncher for serverside data mining and DataMind Professional for client-side specification and viewing results. It can implement classification, clustering, and association-rule technologies. DataMind can be set up to mine data locally or on a remote server, where data are organized using any of the major relational databases.

Datasage
Datasage is a comprehensive data-mining product whose architecture incorporates a data mart in its data-mining server. The user accesses Datasage through an interface operating as a thin client, using either a Windows client or a Java-enabled browser client.

DBMiner
DBMiner is a publicly available tool for data mining. It is a multiple-strategy tool and it supports methodologies such as clustering, association rules, summarization, and visualization. DBMiner uses Microsoft SQL Server 7.0 Plato and runs on different Windows platforms.

Decision Series
Decision Series is a multiple-strategy tool that uses artificial neural networks, clustering algorithms, and genetic algorithms to perform data mining. It can operate on scalable, parallel platforms to provide speedy solutions. It runs on standard industry platforms such as HP, SUN, and DEC, and it supports most of the commercial, relational database-management systems.

Decisionhouse
Decisionhouse is a suite of tightly integrated tools that primarily support classification and visualization processes. Various aspects of data preparation and reporting are included. It works on the Unix platform.

Delta Miner
Delta Miner is a multiple-strategy tool supporting clustering, summarization, deviation-detection, and visualization processes. A common application is the analysis of financial controlling data. It runs on Windows platforms and it integrates new search techniques and "business intelligence" methodologies into an OLAP front-end.

Emerald
Emerald is a publicly available tool still used as a research system. It consists of five different machine-learning programs supporting clustering, classification, and summarization tasks.

Evolver
Evolver is a single-strategy tool. It uses genetic-algorithm technology to solve complex optimization problems. This tool runs on all Windows platforms and it is based on data stored in Microsoft Excel tables.

GainSmarts
GainSmarts uses predictive-modeling technology that can analyze past purchases and demographic and lifestyle data to predict the likelihood of response and other characteristics of customers.

IBM Datajoiner
Datajoiner allows the user to view multivendor-relational and nonrelational, local and remote-geographically distributed databases as local databases to access and join tables without knowing the source locations.

IBM Intelligent Miner
Intelligent Miner is an integrated and comprehensive set of data-mining tools. It uses decision trees, neural networks, and clustering. The latest version includes a wide range of text-mining tools. Most of its algorithms have been parallelized for scalability. A user can build models using either a GUI or an API. It works only with DB2 databases.

KATE
KATE is a single, rule-based strategy tool consisting of four components: KATE-editor, KATE-CBR, KATE-Datamining, and KATE-Runtime. It runs on Windows and Unix platforms, and it is applicable to several databases.

Kensington 2000
Kensington 2000 is an internet-based knowledge-discovery and-management platform for the analyses of large and distributed data sets.

Kepler
Kepler is an extensible, multiple-strategy data-mining system. The key element of its architecture is extensibility through a "plug-in" interface for external tools without redeveloping the system core. The tool supports datamining tasks such as classification, clustering, regression, and visualization. It runs on Windows and Unix platforms.

Knowledge Seeker
Knowledge Seeker is a single-strategy desktop or client/server tool relying on a tree-based methodology for data mining. It provides a nice GUI for model building and letting the user explore data. It also allows users to export the discovered data model as text, SQL query, or Prolog program. It runs on Windows and Unix platforms, and accepts data from a variety of sources.

MATLAB NN Toolbox
A MATLAB extension implements an engineering environment (i.e. a computer-based environment for engineers to help them solve their common tasks) for research in neural networks and its design, simulation, and application. It offers various network architectures and different learning strategies. Classification and function approximations are typical data-mining problems that can be solved using this tool. It runs on Windows, Mac, and Unix platforms.

Marksman
Marksman is a single-methodology tool based on artificial neural networks. It provides a number of useful data-manipulation features, which are very important in preprocessing. Its design is optimized for the database-analysis needs of direct-marketing professionals, and it runs on PC/Windows platforms.

MARS
MARS is a logistic-regression tool for binary classification. It automatically handles missing values, detection of interaction between input variables, and transformation of variables.

MineSet
MineSet is comprehensive tool for data mining. Its features include extensive data manipulation and transformation capabilities, varius data-mining approaches, and powerful visualization capabilities. MineSet supports client/server architecture and runs on Silicon Graphics platforms.

NETMAP
NETMAP is a general purpose, information-visualization tool. It is most effective for large, qualitative, text-based data sets. It runs on Unix workstations.

Neuro Net
NeuroNet is a publicly available software for experimentation with different artificial neural-network architectures and types.

NeuroSolutions V3.0
NeuroSolutions V3.0 combines a modular, icon-based artificial neuralnetwork design, and it solves data-mining problems such as classification, prediction, and function approximation. Its implementations are based on advanced learning techniques such as recurrent backpropagation and backpropagation through time. The tool runs on all Windows platforms.

OCI
OCI is publicly available software for data mining. It is specially designed as a decision tree induction system for applications where the samples have continuous feature values.

OMEGA
OMEGA is a system for developing, evaluating, and implementing predictive models using the genetic-programming approach. It is suitable for the classification and visualization of data. It runs on all Windows platforms.

Partek
Partek is a multiple-strategy data-mining product. It is based on several methodologies including statistical techniques, neural networks, fuzzy logic, genetic algorithms, and data visualization. It runs on UNIX platforms.

Pattern Recognition Workbench (PRW)
Pattern Recognition Workbench (PRW) is a comprehensive multiple-strategy tool. It uses neural networks, statistical pattern recognition, and machinelearning methodologies. It runs on all Windows platforms using a spreadsheet-style GUI. PRW automatically generates alternative models and searches for the best solution. It also provides a variety of visualization tools to monitor model building and interpret results.

Readware Information Processor
Readware Information Processor is an integrated text-mining environment for intranets and the Internet. It classifies documents by content, providing a literal and conceptual search. The tool includes a ConceptBase with English, French, and German lexicons.

SAS (Enterprise Miner)
SAS (Enterprise Miner) represents one of the most comprehensive sets of integrated tools for data mining. It also offers a variety of data manipulation and transformation features. In addition to statistical methods, the SAS Data Mining Solution employs neural networks, decision trees, and SAS Webhound that analyzes Web-site traffic. It runs on Windows and Unix platforms and it provides a user-friendly GUI front-end to the SEMMA (Sample, Explore, Modify, Model, Assess).

Scenario
Scenario is a single-strategy tool that uses the tree-based approach to data mining. The GUI relies on wizards to guide a user through different tasks, and it is easy to use. It runs on Windows platforms.

Sipina-W
Sipina-W is publicly available software that includes different traditional data-mining techniques such as CART, Elisee, ID3, C4.5, and some new methods for generating decision trees.

SNNS
SNNS is a publicly available software. It is a simulation environment for research on and application of artificial neural networks. The environment is available on Unix and Windows platforms.

SPIRIT
SPIRIT is a tool for exploration and modeling using Bayesian techniques. The system allows communication with the user in the rich language of conditional events. It works on Windows platforms.

SPSS
SPSS is one of the most comprehensive integrated tools for data mining. It has data-management and data-summarization capabilities and includes tools for both discovery and verification. The complete suite includes statistical methods, neural networks, and visualization techniques. It is available on a variety of commercial platforms.

S-Plus
S-Plus is an interactive, object-oriented programming language for data mining. Its commercial version supports clustering, classification, summarization, visualization, and regression techniques. It works on Windows and Unix platforms.

STATlab
STATlab is a single-strategy tool that relies on interactive visualization to help a user perform exploratory data analysis. It can import data from common relational databases and it runs on Windows, Mac, and Unix platforms.

STATISTICA-Neural Networks
STATISTICA-Neural Networks is a single-strategy tool includes a standard backpropagation-learning algorithm and iterative procedures such as Conjugate Gradient Descent and Levenberg-Marquardt. It runs on all Windows platforms.

Strategist
Strategist is a tool based on Bayesian-network methodology to support different dependency analyses. It provides the methodology for integration of expert judgments and data-mining results, which are based on modeling of uncertainties and decision-making processes. It runs on all Windows platforms.

Syllogic
Syllogic Data Mining Tool is a toolbox that combines many data-mining methodologies and offers a variety of approaches to uncover hidden information. It includes several data-preprocessing and -transformation functions. It is available on Windows NT and Unix platforms and it supports most of the commercial relational databases.

Teradata Warehouse Miner
Teradata Warehouse Miner provides different statistical analyses, decisiontree methods, and regression methodologies for in-place mining on a Teradata database-management system.

TiMBL
TiMBL is a publicly available software. It includes several memory-based learning techniques for discrete data. A representation of the training set is explicitly stored in memory, and new cases are classified by extrapolation from the most similar cases.

TOOLDIAG
TOOLDIAG is a publicly available tool for data mining. It consists of several programs in C for statistical pattern recognition of multivariate numeric data. The tool is primary oriented toward classification problems.

WINROSA
WINROSA is a software tool that complements many other tools available for building fuzzy logic systems. It automatically generates fuzzy rules from the available data set. It works on Windows platforms.

Viscovery©SOMine
This single-strategy data-mining tool is based on self-organizing maps and is uniquely capable of visualizing multidimensional data. Viscovery©SOMine supports clustering, classification, and visualization processes. It works on all Windows platforms.

Weka (2.2)
Weka is a software environment that integrates several machine-learning tools within a common framework and a uniform GUI. Classification and summarization are the main data-mining tasks supported by the Weka system.

WUM
WUM 6.0 is a publicly available integrated environment for Web-log preparation, querying, and visualization of summarized activities on a Web site.

VISUALIZATION SYSTEMS FOR DATA MINING

VISUALIZATION SYSTEMS FOR DATA MINING

Many organizations, particularly within the business community, have made significant investments in collecting, storing, and converting business information into results that can be used. Unfortunately, typical implementations of business "intelligence software" have proven to be too complex for most users except for their core reporting and charting capabilities. Users' demands for multidimensional analysis, finer data granularity, and multiple data sources, simultaneously, all at Internet speed, require too much specialist intervention for broad utilization. The result is a report explosion in which literally hundreds of predefined reports are generated and pushed throughout the organization. Every report produces another. Presentations get more complex. Data is exploding. The best opportunities and the most important decisions are often the hardest to see. This is in direct conflict with the needs of front-line decision-makers and knowledge-workers who are demanding to be included in the analytical process.

Presenting information visually, in an environment that encourages the exploration of linked events, leads to deeper insights and more results that can be acted upon. Over the past decade, research on information visualization has focused on developing specific visualization techniques. An essential task for the next period is to integrate these techniques into a larger system that supports work with information in an interactive way, through the three basic components: foraging the data, thinking about data, and acting on data.

The vision of a visual data-mining system stems from the following principles: simplicity, visibility, user-autonomy, reliability, reusability, availability, and security. A visual data-mining system must be syntactically simple to be useful. Simple doesn't mean trivial or nonpowerful. Simple to learn means use of intuitive and friendly input mechanisms as well as instinctive and easy-to-interpret output knowledge. Simple to apply means an effective discourse between humans and information. Simple to retrieve or recall means a customized data structure that facilitates fast and reliable searches. Simple to execute means a minimum number of steps needed to achieve the results. In short, simple means the smallest, functionally sufficient system possible.

A genuinely visual data-mining system must not impose knowledge on its users, but instead guide them through the mining process to draw conclusions. Users should study the visual abstractions and gain insight instead of accepting an automated decision. A key capability in visual analysis, called visibility, is the ability to focus on particular regions of interest. There are two aspects of visibility: excluding and restoring data. The exclude process eliminates the unwanted data items from the display so that only the selected set is visible. The restore process brings all data back, making them visible again.

A reliable data-mining system must provide estimated error or accuracy of the projected information in each step of the mining process. This error information can compensate for the deficiency that an imprecise analysis of data visualization can cause. A reusable, visual, data-mining system must be adaptable to a variety of environments to reduce the customization effort, provide assured performance, and improve system portability. A practical, visual, data-mining system must be generally and widely available. The quest for new knowledge or deeper insights into existing knowledge cannot be planned. It requires that the knowledge received from one domain adapt to another domain through physical means or electronic connections. A complete, visual, data-mining system must include security measures to protect the data, the newly discovered knowledge, and the user's identity because of various social issues.

Through data visualization we want to understand or get an overview of the whole or a part of the n-dimensional data, analyzing also some specific cases. Visualization of multidimensional data helps decision-makers to

slice information into multiple dimensions and present information at various levels of granularity,

view trends and develop historical tracers to show operations over time,

produce pointers to synergies across multiple dimensions,

provide exception-analysis and identify isolated (needle in the haystack) opportunities,

monitor adversarial capabilities and developments,

create indicators of duplicative efforts,

conduct What-If Analysis and Cross-Analysis of variables in a data set.

Visualization tools transform raw experimental or simulated data into a form suitable for human understanding. Representations can take on many different forms, depending on the nature of the original data and the information that is to be extracted. However, the visualization process that should be supported by modern, visualization-software tools can generally be subdivided into three main stages: data preprocessing, visualization mapping, and rendering. Through these three steps the tool has to answer the questions: What should be shown in a plot? How should one work with individual plots? How should multiple plots be organized?

Data preprocessing involves such diverse operations as interpolating irregular data, filtering and smoothing raw data, and deriving functions for measured or simulated quantities. Visualization mapping is the most crucial stage of the process, involving design and adequate representation of the filtered data, which efficiently conveys the relevant and meaningful information. Finally, the representation is often rendered to communicate information to the human user.

Data visualization is essential for understanding the concept of multidimensional spaces. It allows the user to explore the data in different ways and at different levels of abstraction to find the right level of details. Therefore, techniques are most useful if they are highly interactive, permit direct manipulation, and include a rapid response time. The analyst must be able to navigate the data, change its grain (resolution), and alter its representation (symbols, colors, etc.).

Broadly speaking, the problems addressed by current information-visualization tools and requirements for a new generation fall into the following classes:

Presentation graphics-These generally consist of bars, pies, and line charts that are easily populated with static data and drop into printed reports or presentations. The next generation of presentation graphics enriches the static displays with a 3D or projected n-dimensional information landscape. The user can then navigate through the landscape and animate it to display time-oriented information.

Visual interfaces for information access-They are focused on enabling users to navigate through complex information spaces to locate and retrieve information. Supported user-tasks involve searching, backtracking, and history-logging. User-interface techniques attempt to preserve user-context and support smooth transitions between locations.

Full visual discovery and analysis-These systems combine the insights communicated by presentation graphics with an ability to probe, drill-down, filter, and manipulate the display to answer the "why" question as well as the "what" question. The difference between answering a "what" and a "why" question involves an interactive operation. Therefore, in addition to the visualization technique, effective data exploration requires using some interaction and distortion techniques. The interaction techniques let the user directly interact with the visualization. Examples of interaction techniques include interactive mapping, projection, filtering, zooming, and interactive linking and brushing. These techniques allow dynamic changes in the visualizations according to the exploration objectives, but they also make it possible to relate and combine multiple, independent visualizations. Note that connecting multiple visualizations by linking and brushing, e.g., provides more information than considering the component visualizations independently. The distortion techniques help in the interactive exploration process by providing a means for focusing while preserving an overview of the data. Distortion techniques show portions of the data with a high level of detail while other parts are shown with a much lower level of detail.

Three tasks are fundamental to data exploration with these new visualization tools:

Finding Gestalt-Local and global linearities and nonlinearities, discontinuities, clusters, outliers, unusual groups, and so on are examples of gestalt features that can be of interest. Focusing through individual views is the basic requirement to obtain a qualitative exploration of data using visualization. Focusing determines what gestalt of the data is seen. The meaning of focusing depends very much on the type of visualization technique chosen.

Posing queries-This is a natural task after the initial gestalt features have been found, and the user requires query identification and characterization technique. Queries can concern individual cases as well as subsets of cases. The goal is essentially to find intelligible parts of the data. In-graphical data analysis it is natural to pose queries graphically. For example, familiar brushing techniques such as coloring or otherwise highlighting a subset of data means issuing a query about this subset. It is desirable that the view where the query is posed and the view that present the response are linked. Ideally, responses to queries should be instantaneous.

Making comparisons-Two types of comparisons are frequently made in practice. The first one is a comparison of variables or projections and the second one is a comparison of subsets of data. In the first case, one compares views "from different angles"; in the second, comparison is based on views "of different slices" of the data. In either case, it is likely that a large number of plots are generated, and therefore it is a challenge to organize the plots in such a way that meaningful comparisons are possible.

Visualization has been used routinely in data mining as a presentation tool to generate initial views, navigate data with complicated structures, and convey the results of an analysis. Generally, the analytical methods themselves do not involve visualization. The loosely coupled relationships between visualization and analytical data-mining techniques represent the majority of today's state of the art in visual data mining. The process-sandwich strategy, which interlaces analytical processes with graphical visualization, penalizes both procedures with the other's deficiencies and limitations. For example, because an analytical process can't analyze multimedia data, we have to give up the strength of visualization to study movies and music in a visual data-mining environment. A stronger strategy lies in tightly coupling the visualization and analytical processes into one data-mining tool. Letting human visualization participate in the decision-making in analytical processes remains a major, challenge. Certain mathematical steps within an analytical procedure may be substituted by human decisions based on visualization to allow the same procedure to analyze a broader scope of information. Visualization supports humans in dealing with decisions that can no longer be automated.

Fuzzy Sets and Fuzzy Logic

Fuzzy Sets and Fuzzy Logic

In the previous chapters, a number of different methodologies for the analysis of large data sets have been discussed. Most of the approaches presented, however, assume that the data is precise. That is, they assume that we deal with exact measurements for further analysis. Historically, as reflected in classical mathematics, we commonly seek a precise and crisp description of things or events. This precision is accomplished by expressing phenomena in numeric or categorical values. But in most, if not all, real-world scenarios, we will never have totally precise values. There is always going to be a degree of uncertainty. However, classical mathematics can encounter substantial difficulties because of this fuzziness. In many real-world situations; we may say that fuzziness is reality whereas crispness or precision is simplification and idealization. The polarity between fuzziness and precision is quite a striking contradiction in the development of modern information-processing systems. One effective means of resolving the contradiction is the fuzzy-set theory, a bridge between high precision and the high complexity of fuzziness.

Fuzzy concepts derive from fuzzy phenomena that commonly occur in the real world. For example, rain is a common natural phenomenon that is difficult to describe precisely since it can rain with varying intensity, anywhere from a light shower to a torrential downpour. Since the word rain does net adequately or precisely describe the wide variations in the amount and intensity of any rain event, "rain" is considered a fuzzy phenomenon.

Often, the concepts formed in the human brain for perceiving, recognizing, and categorizing natural phenomena are also fuzzy. The boundaries ef these concepts are vague. Therefore, the judging and reasoning that emerges from them are also fuzzy. For instance, "rain" might be classified as "light rain", "moderate rain", and "heavy rain" in order to describe the degree of raining. Unfortunately, it is difficult to say when the rain is light, moderate, or heavy, because the boundaries are undefined. The concepts of "light", "moderate", and "heavy" are prime examples of fuzzy concepts themselves. To explain the principles of fuzzy sets, we will start with the basics in classical set theory.

Genetic Algorithms

Genetic Algorithms

There is a large class of interesting problems for which no reasonably fast algorithms have been developed. Many of these problems are optimization problems that arise frequently in applications. The fundamental approach to optimization is to formulate a single standard of measurement-a cost function-that summarizes the performance or value of a decision and iteratively improves this performance by selecting from among the available alternatives. Most classical methods of optimization generate a deterministic sequence of trial solutions based on the gradient or higher-order statistics of the cost function. In general, any abstract task to be accomplished can be thought of as solving a problem, which can be perceived as a search through a space of potential solutions. Since we are looking for "the best" solution, we can view this task as an optimization process. For small data spaces, classical, exhaustive search methods usually suffice; for large spaces special techniques must be employed. Under regular conditions, the techniques can be shown to generate sequences that asymptotically converge to optimal solutions, and in certain cases they converge exponentially fast. But the methods often fail to perform adequately when random perturbations are imposed on the function that is optimized. Further, locally optimal solutions often prove insufficient in real-world situations. Despite such problems, which we call hard-optimization problems, it is often possible to find an effective algorithm whose solution is approximately optimal. One of the approaches is based on genetic algorithms, which are developed on the principles of natural evolution.

Natural evolution is a population-based optimization process. Simulating this process on a computer results in stochastic-optimization techniques that can often outperform classical methods of optimization when applied to difficult, real-world problems. The problems that the biological species have solved are typified by chaos, chance, temporality, and nonlinear interactivity. These are the characteristics of the problems that have proved to be especially intractable to classical methods of optimization. Therefore, the main avenue of research in simulated evolution is a genetic algorithm (GA), which is a new, iterative, optimization method that emphasizes some facets of natural evolution. GAs approximate an optimal solution to the problem at hand; they are by nature stochastic algorithms whose search methods model some natural phenomena such as genetic inheritance and the Darwinian strife for survival.

Genetic algorithms (GA) are derivative-free, stochastic-optimization methods based loosely on the concepts of natural selection and evolutionary processes. They were first proposed and investigated by John Holland at the University of Michigan in 1975. The basic idea of genetic algorithms was revealed by a number of biologists when they used computers to perform simulations of natural genetic systems. In these systems, one or more chromosomes combine to form the total genetic prescription for the construction and operation of some organism. The chromosomes are composed of genes, which may take a number of values called allela values. The position of a gene (its locus) is identified separately from the gene's function. Thus, we can talk of a particular gene, e.g., an animal's eye-color gene with its locus at position 10 and its allela value as blue eyes.

Before going into details of the applications of genetic algorithms in the following sections, let us understand its basic principles and components. GAs encode each point in a parameter or solution space into a binary-bit string called a chromosome. These points in an n-dimensional space do not represent samples in the terms that we defined them at the beginning of this book. While samples in other data-mining methodologies are data sets given in advance for training and testing, sets of n-dimensional points in GAs are a part of a GA and they are produced iteratively in the optimization process. Each point or binary string represents a potential solution to the problem that is to be solved. In GAs, the decision variables of an optimization problem are coded by a structure of one or more strings, which are analogous to chromosomes in natural genetic systems. The coding strings are composed of features that are analogous to genes. Features are located in different positions in the string, where each feature has its own position (locus) and a definite allele value, which complies with the proposed coding method. The string structures in the chromosomes go through different operations similar to the natural-evolution process to produce better alternative solutions. The quality of new chromosomes is estimated based on the "fitness" value, which can be considered as the objective function for the optimization problem. The basic relations between concepts in natural evolution. Instead of single a point, GAs usually keep a set of points as a population, which is then evolved repeatedly toward a better overall fitness value. In each generation, the GA constructs a new population using genetic operators such as crossover and mutation. Members with higher fitness values are more likely to survive and participate in mating or crossover operations.

Artificial Neural Networks

Artificial Neural Networks

Work on artificial neural networks (ANNs) has been motivated by the recognition that the human brain computes in an entirely different way from the conventional digital computer. It was a great challenge for many researchers in different disciplines to model the brain's computational processes. The brain is a highly complex, nonlinear, and parallel information-processing system. It has the capability to organize its components so as to perform certain computations with a higher quality and many times faster than the fastest computer in existence today. Examples of these processes are pattern recognition, perception, and motor control. Artificial neural networks have been studied for more than four decades since Rosenblatt first applied the single-layer perceptrons to pattern-classification learning in the late 1950s

An artificial neural network is an abstract computational model of the human brain. The human brain has an estimated 1011 tiny units called neurons. These neurons are interconnected with an estimated 1015 links. Similar to the brain, an ANN is composed of artificial neurons (or processing units) and interconnections. When we view such a network as a graph, neurons can be represented as nodes (or vertices) and interconnections as edges. Although the term artificial neural network is most commonly used, other names include "neural network", parallel distributed-processing system (PDP), connectionist model, and distributed adaptive system. ANNs are also referred to in the literature as neurocomputers

A neural network, as the name indicates, is a network structure consisting of a number of nodes connected through directional links. Each node represents a processing unit, and the links between nodes specify the causal relationship between connected nodes. All nodes are adaptive, which means that the outputs of these nodes depend on modifiable parameters pertaining to these nodes. Although there are several definitions and several approaches to the ANN concept, we may accept the following definition, which views the ANN as a formalized adaptive machine:

DEF: An artificial neural network is a massive parallel distributed processor made up of simple processing units. It has the ability to learn from experiential knowledge expressed through interunit connection strengths, and can make such knowledge available for use.

It is apparent that an ANN derives its computing power through, first, its massive parallel distributed structure and, second, its ability to learn and therefore to generalize. Generalization refers to the ANN producing reasonable outputs for new inputs not encountered during a learning process. The use of artificial neural networks offers several useful properties and capabilities:

Nonlinearity - An artificial neuron as a basic unit can be a linear- or nonlinear-processing element, but the entire ANN is highly nonlinear. It is a special kind of nonlinearity in the sense that it is distributed throughout the network. This characteristic is especially important, for ANN models the inherently nonlinear real-world mechanisms responsible for generating data for learning.

Learning from examples - An ANN modifies its interconnection weights by applying a set of training or learning samples. The final effects of a learning process are tuned parameters of a network (the parameters are distributed through the main components of the established model), and they represent implicitly stored knowledge for the problem at hand.

Adaptivity: An ANN has a built-in capability to adapt its interconnection weights to changes in the surrounding environment. In particular, an ANN trained to operate in a specific environment can be easily retrained to deal with changes in its environmental conditions. Moreover, when it is operating in a nonstationary environment, an ANN can be designed to adopt its parameters in real time.

Evidential Response: In the context of data classification, an. ANN can be designed to provide information not only about which particular class to select for a given sample, but also about confidence in the decision made. This later information may be used to reject ambiguous data, should they arise, and thereby improve the classification performance or performances of the other tasks modeled by the network.

Fault Tolerance: An ANN has the potential to be inherently fault-tolerant, or capable of robust computation. Its performances do not degrade significantly under adverse operating conditions such as disconnection of neurons, and noisy or missing data. There is some empirical evidence for robust computation, but usually it is uncontrolled.

Uniformity of Analysis and Design: Basically, artificial neural networks enjoy universality as information processors. The same principles, notation, and the same steps in methodology are used in all domains)nvolving application of artificial neural networks.

To explain a classification of different types of ANNs and their basic principles it is necessary to introduce an elementary component of every ANN. This simple processing unit is called an artificial neuron.

TEXT MINING

TEXT MINING
Enormous amount of knowledge resides today in text documents that are stored either within the organization or outside of it. Text databases are rapidly growing because of the increasing amounts of information available in electronic form, such as electronic publications, digital libraries, e-mail, and the World Wide Web. Data stored in most text databases are semistructured, and special data-mining techniques, called text mining, have been developed for discovering new information from large collection of textual data.

In general, there are two key technologies that make online text mining possible. One is Internet searching capabilities and the other is the text analysis methodology. Internet searching has been around for a few years. With the explosion of Web sites in the past few years, numerous search engines designed to help users find content appeared practically overnight. Yahoo, Alta Vista, and Excite are three of the earliest. Search engines operate by indexing the content in a particular Web site and allowing users to search these indexes. With the new generation of Internet-searching tools, users can gain relevant information by processing smaller amount of links, pages, and indexes.

Text analysis, as a field, has been around longer than Internet searching. It has been a part of the efforts to make computers understand natural languages and it is commonly thought of as a problem for artificial intelligence. Text analysis can be used anywhere where there is a large amount of text that need to be analyzed. Although automatic processing of documents using different techniques does not allow the depth of analysis that a human can bring to the task, it can be used to extract key points, categorize documents, and generate summaries in a situation when a large number of documents makes manual analysis impossible. Market research, business-intelligence gathering, e-mail management, claim analysis, E-procurement, and automated help desk are only a few of the possible applications where text mining can be deployed successfully.

HITS AND LOGSOM ALGORITHMS

HITS AND LOGSOM ALGORITHMS
To date, index-based search engines for the Web have been the primary tools with which users searched for information. Experienced Web surfers can make effective use of such engines for tasks that can be solved by searching with tightly constrained keywords and phrases. These search engines are, however, unsuited for wide range of less precise tasks. How does one select a subset of documents with the most value from the millions that a search engine has prepared for us? To distill a large Wen-search topic to a size that makes sense to a human user, we need a means of identifying the topic's most authoritative Web pages. The notion of authority adds a crucial dimension to the concept of relevance: We wish to locate not only a set of relevant pages, but also those that are of the highest quality.

It is important that the Web consists not only of pages, but also hyperlinks that connect one page to another. This hyperlink structure contains an enormous amount of information that can help to automatically infer notions of authority. Specifically, the creation of a hyperlink by the author of a Web page represents an implicit endorsement of the page being pointed to. By mining the collective judgment contained in the set of such endorsements, we can gain a richer understanding of the relevance and quality of the Web's contents. It is necessary for this process to uncover two important types of pages: authorities, which provide the best source of information about a given topic and hubs, which provide a collection of links to authorities.

Hub pages appear in a variety of forms, ranging from professionally assembled resource lists on commercial sites to lists of recommended links on individual home pages. These pages need not themselves be prominent, and working with hyperlink information in hubs can cause much difficulty. Although many links represent the some kind of endorsement, some of the links are created for reasons that have nothing to do with conferring authority. Typical examples are navigation and paid advertisement hyperlinks. A hub's distinguishing feature is that they are potent conferrers of authority on a focused topic. We can define a good hub if it is a page that points to many good authorities. At the same time, a good authority page is a page pointed to by many good hubs. This mutually reinforcing relationship between hubs and authorities. At the same time, a good authority page is a page pointed to by many good hubs. This mutually reinforcing relationship between hubs and authorities serves as the central idea applied in the HITS algorithm (Hyperlink-Induced Topic Search) that searches for good hubs and authorities. The two main steps of the HITS algorithm are

Sampling component, which constructs a focused collection of web pages likely to be rich in relevant information, and

Weight-propagation component, which determines the estimates of hubs and authorities by an iterative procedure and obtains the subset of the most relevant and authoritative Web pages.

In the sampling phase, we view the Web as a directed graph of pages. The HITS algorithm starts by constructing the subgraph in which we will search for hubs and authorities. Our goal is a subgraph rich in relevant, authoritative pages. To construct such a subgraph, we first use query terms to collect a root set of pages from an index-based search engine. Since many of these pages are relevant to the search topic, we expect that at least some of them are authorities or that they have links to most of the prominent authorities. We therefore expand the root set into a base set by including all the pages that the root-set pages link to, up to a designated cutoff size. This set typically contains from 1000 to 5000 pages with corresponding links, and it is a final result of the first phase of HITS.

In the weight-propagation phase, we extract good hubs and authorities from the base set V by giving a concrete numeric interpretation to all of them. We associate a non-negative authority weight ap and a non-negative hub weight hp with each page p ∈ V. We are interested only in the relative values of these weights; therefore normalization is applied so that their total sum remains bounded. Since we do not impose any prior estimates, we set all a and h values to a uniform constant initially. The final weights are unaffected by this initialization.

Web Mining

WEB MINING
In a distributed information environment, documents or objects are usually linked together to facilitate interactive access. Examples for such information-providing environments include the World Wide Web (WWW) and on-line services such as America Online, where users, when seeking information of interest, travel from one object to another via facilities such as hyperlinks and URL addresses. The Web is a hypertext body of more than 800 million pages that continues to grow. It exceeds six terabytes of data on about three million servers. Almost a million pages are added daily; typically, pages change every few months and, therefore, several hundred gigabytes are changed every month. As the information offered in the Web grows daily, obtaining that information becomes more and more tedious. Even the largest search engines such as Alta Vista and HotBot index less than 18% of the accessible Web pages as on February 1999, down from 35% in late 1997. The main difficulty lies in the semistructured or unstructured Web content that is not easy to difficulty lies in the semistructured or unstructured Web content that is not easy to regulate and where enforcing a structure or standards is difficult. A set of Web pages lacks a unifying structure and shows far more authoring style and content variation than that seen in traditional print document collections. This level of complexity makes an "off-the-shelf" database-management and information retrieval solution very complex and almost impossible to use. New methods and tools are necessary. Web mining may be defined as the use of data-mining techniques to automatically discover and extract information from Web documents and services. It refers to the overall process of discovery, not just to the application of standard data-mining tools. Some authors suggest decomposing Web-mining task into four subtasks:

Resource finding- This is the process of retrieving data, which is either online or offline, from the multimedia sources on the Web, such as electronic newsletters, electronic newswire, newsgroups, and the text content of HTML documents obtained by removing the HTML tags.

Information selection and preprocessing - This is the process by which different kinds of original data retrieved in the previous subtask is transformed. These transformations could be either a kind of preprocessing such as removing stop words, stemming, etc. or a preprocessing aimed at obtaining the desired representation, such as finding phrases in the training corpus, representing the text in the first order logic form etc

Generalization -Generalization is the process of automatically discovering general patterns within individual Web sites as well as across multiple sites. Different general-purpose machine-learning techniques, data-mining techniques, and specific Web-oriented methods are used.

Analysis- This is a phase in which validation and/or interpretation of the mined patterns is performed.

There are three factors affecting the way a user perceives and evaluates Web sites through the data-mining process: a) Web-page content, b) Web-page design and c) overall site design including its structure. The first factor concerns the goods, services, or data offered by the site. The other factors concern the way in which the site makes content accessible and understandable to its users. We distinguish between the design of individual pages and the overall site design, because a site is not a simply a collection of pages: it is a network of related pages. The users will not engage in exploring it unless they find its structure simple and intuitive. Clearly, understanding user-access patterns in such an environment will not only help improve the system design (e.g., providing efficient access between highly correlated objects, better authoring design for WWW pages, etc.) but also be able to lead to better marketing decisions. Commercial results will be improved by putting advertisements in proper places, better customer/user classification, and understanding user requirements better through behavioral analysis.

No longer are companies interested in Web sites that simply direct traffic and process orders. Now they want to maximize their profits. They want to understand customer preferences and customize sales pitches to individual users. By evaluating a user's purchasing and browsing patterns, e-vendors want to serve up (in real time) customized menus of attractive offers e-buyers can't resist. Gathering and aggregating customer information into e-business intelligence is an important task for any company with Web-based activities. E businesses expect big profits from improved decision-making, and therefore e-vendors line up for data-mining solutions.

Borrowing from marketing theory, we measure the efficiency of a Web page by its contribution to the success of the site. For an on-line shop, it is the ratio of visitors that purchased a product after visiting this page to the total number of visitors that accessed the page. For a promotional site, the efficiency of the page can be measured as the ratio of visitors that clicked on an advertisement after visiting the page. The pages with low efficiency should be redesigned to better serve the purposes of the site. Navigation-pattern discovery should help in restructuring a site by inserting links and redesigning pages, and ultimately accommodating user needs and expectations. One possible categorization of Web mining is based on which part of the Web to mine, and it consists of three areas:

Web-content mining - describes the discovery of useful information from Web documents. Basically, Web content consists of several types of data such as text, image, audio, video, metadata as well as hyperlinks. Research in mining multiple types of data is now termed multimedia-data mining. We could consider multimedia-data mining as an instance of Web-content mining. The Web content data consist of unstructured data such as free text, semi-structured data such as HTML documents, and a more structured data such as tables and database- generated HTML pages. The goal of Web-content mining is mainly to assist or to improve information-finding or filtering the information. Building a new model of data on the Web, more sophisticated queries other than the keywords-based search could be asked.

Web-structure mining - tries to discover the model underlying the link structure on the Web. The model is based on the topology of the hyperlinks with or without a description of the links. The model can be used to categorize Web pages and is useful for generating information such as the similarity relationship between Web sites.

Web-usage mining - tries to make sense of the data generated by the Web surfer's sessions or behaviors. While Web-content mining and Web-structure mining utilize real or primary data on the Web, Web-usage mining mines the secondary data derived from the behaviour of users while interacting with the Web. This includes data from Web server-access logs, proxy-server logs, browser logs, user profiles, registration data, user sessions or transactions, cookies, bookmark data, and any other data that is derived from a person's interaction with the Web.

To deal with problems of Web-page quality, Web-site structure, and their use, two families of Web tools emerge. The first includes tools that accompany the users in their navigation, learn from their behavior, make suggestions as they brows, and, occasionally, customize the user-profile. These tools are usually connected to or built-in into parts of different search engines. The second family of tools analyzes the activities of users off-line. Their goal is to provide an insight in the semantics of a Web site's structure by discovering how this structure is actually utilized. In other words, knowledge of the navigational behavior of users is used to predict future trends. New data-mining techniques are behind these tools, where web log files are analyzed and information uncovered. In the next two sections, we will illustrate Web mining with three techniques that are representative of a large spectrum of Web-mining methodologies developed recently.

Apriori Algorithm

ALGORITHM APRIORI
The algorithm Apriori computes the frequent itemsets in the database through several iterations. Iterations i computes all frequent i-itemsets (itemsets with i elements). Each iteration has two steps: candidate generation and candidate counting and selection.

In the first phase of the first iteration, the generated set of candidate itemsets contains all I-itemsets (i.e, all items in the database). In the counting phase, the algorithm counts their support searching again through the whole database. Finally, only I-itemsets (items) with s above required threshold will be selected as frequent. Thus, after the first iteration, all frequent I-itemsets will be known.

What are the itemsets generated in the second iteration? In other words, how does one generate 2-itemset candidates? Basically, all pairs of items are candidates. Based on knowledge about infrequent itemsets obtained from previous iterations, the Apriori algorithm reduces the set of candidate itemsets by pruning -a priori-those candidate itemsets that cannot be frequent. The pruning is based on the observation that if an itemset is frequent all its subsets could be frequent as well. Therefore, before entering the candidate-counting step, the algorithm discards every candidate itemset that has an infrequent subset.

Assume that the minimum support s= 50% so an itemset is frequent if it is contained in at least 50% of the transactions-in our example, in two out of every four transactions in the database. In each iteration, the Apriori algorithm constructs a candidate set of large itemsets, counts the number of occurrences of each candidate, and then determines large itemsets based on the predetermined minimum support s = 50%.

Sunday, December 14, 2008

Clustering

CLUSTERING CONCEPTS

Cluster analysis is a set of methodologies for automatic classification of samples into a number of groups using a measure of association, so that the samples in one group are similar and samples belonging to different groups are not similar. The input for a system of cluster analysis is a set of samples and a measure of similarity (or dissimilarity) between two samples. The output from cluster analysis is a number of groups (clusters) that form a partition, or a structure of partitions, of the data set. One additional result of cluster analysis is a generalized description of every cluster, and this is especially important for a deeper analysis of the data set's characteristics.

Samples for clustering are represented as a vector of measurements, or more formally, as a point in a multidimensional space. Samples within a valid cluster are more similar to each other than they are to a sample belonging to a different cluster. Clustering methodology is particularly appropriate for the exploration of interrelationships among samples to make a preliminary assessment of the sample structure. Humans perform competitively with automatic-clustering procedures in one, two, or three dimensions, but most real problems involve clustering in higher dimensions. It is very difficult for humans to intuitively interpret data embedded in a high-dimensional space.

Thursday, December 11, 2008

Associative Classfication Medhod

ASSOCIATIVE-CLASSIFICATION METHOD
CMAR (Classification based on Multiple Association Rules) is a classification method adopted from the frequent pattern or FP-growth method for generation of frequent itemsets. Although the basic principles of CMAR are explained in this section, for a better understanding of all details we suggest that a reader start with Section 8.6 as an introduction. The main reason we include CMAR methodology in this chapter is its logic-based approach to classification problems, and the possibility of comparing its accuracy and efficiency with the C4.5 methodology.

Suppose data samples are given with n attributes (A1, A2, …, An). Attributes can be categorical or continuous. For a continuous attribute, we assume that its values are discretized into intervals in the preprocessing phase. A training data set T is a set of samples such that for each sample there exists a class label associated with it. Let C = {c1, c2, …, cm} be a finite set of class labels.

In general, a pattern P = {a1, a2, …, ak} is a set of attribute values for different attributes ( 1 ≤ k ≤ n). A sample is said to match the pattern P if it has all the attribute values given in the pattern. For rule R: P → c, the number of data samples matching pattern P and having class label c is called the support of rule R, denoted sup(R). The ratio of the number of samples matching pattern P and having class label c versus the total number of samples matching pattern P is called the confidence of R, denoted as conf(R). The association-classification method (CMAR) consists of two phases:

Rule generation or training and

Classification or testing.

In the first rule generation phase, CMAR computes the complete set of rules in the form R: P → c, such that sup(R) and conf(R) pass the given threshold. For a given support threshold and confidence threshold, the associative-classification method finds the complete set of class-association rules (CAR) passing the thresholds. In a testing phase, when a new (unclassified) sample comes, the classifier, represented by a set of association rules, selects the rule which matches the sample and has the highest confidence, and uses it to predict the classification of the new sample.

Limitation

LIMITATIONS OF DECISION TREES AND DECISION RULES
Decision rule- and decision tree-based models are relatively simple, readable, and their generation is very fast. Unlike many statistical approaches, a logical approach does not depend on assumptions about distribution of attribute values or independence of attributes. Also, this method tends to be more robust across tasks than most other statistical methods. But there are also some disadvantages and limitations of a logical approach, and a data-mining analyst has to be aware of it because the selection of an appropriate methodology is a key step to the success of a data-mining process.

If data samples are represented graphically in an N-dimensional space, where N is the number of attributes, then a logical classifier (decision trees or decision rules) divides the space into regions. Each region is labeled with a corresponding class. An unseen testing sample is then classified by determining the region into which the given point falls. Decision trees are constructed by successive refinement, splitting existing regions into smaller ones that contain highly concentrated points of one class. The number of training cases needed to construct a good classifier is proportional to the number of regions. More complex classifications require more regions that are described with more rules and a tree with higher complexity. All that will require an additional number of training samples to obtain a successful classification.

A graphical representation of decision rules is given by orthogonal hyperplanes in an N-dimensional space. The regions for classification are hyperrectangles in the same space. If the problem at hand is such that the classification hyperplanes are not orthogonal, but are defined through a linear (or nonlinear) combination of attributes, such as the example in Figure 7.12, then that increases the complexity of a rule-based model. A logical approach based on decision rules tries to approximate nonorthogonal, and sometimes, nonlinear classification with hyperrectangles; classification becomes extremely complex with large number of rules and a still larger error.

Prunning

PRUNING DECISION TREES
Discarding one or more subtrees and replacing them with leaves simplify a decision tree, and that is the main task in decision-tree pruning. In replacing the subtree with a leaf, the algorithm expects to lower the predicted error rate and increase the quality of a classification model. But computation of error rate is not simple. An error rate based only on a training data set does not provide a suitable estimate. One possibility to estimate the predicted error rate is to use a new, additional set of test samples if they are available, or to use the cross-validation techniques explained in Chapter 4. This technique divides initially available samples into equal sized blocks and, for each block, the tree is constructed from all samples except this block and tested with a given block of samples. With the available training and testing samples, the basic idea of decision tree-pruning is to remove parts of the tree (subtrees) that do not contribute to the classification accuracy of unseen testing samples, producing a less complex and thus more comprehensible tree. There are two ways in which the recursive-partitioning method can be modified:

Deciding not to divide a set of samples any further under some conditions. The stopping criterion is usually based on some statistical tests, such as the χ2 test: If there are no significant differences in classification accuracy before and after division, then represent a current node as a leaf. The decision is made in advance, before splitting, and therefore this approach is called prepruning.

Removing restrospectively some of the tree structure using selected accuracy criteria. The decision in this process of postpruning is made after the tree has been built.

C4.5 follows the postpruning approach, but it uses a specific technique to estimate the predicted error rate. This method is called pessimistic pruning. For every node in a tree, the estimation of the upper confidence limit ucf is computed using the statistical tables for binomial distribution (given in most textbooks on statistics). Parameter Ucf is a function of ∣Ti∣ and E for a given node. C4.5 uses the default confidence level of 25%, and compares U25% (∣Ti∣/E) for a given node Ti with a weighted confidence of its leaves. Weights are the total number of cases for every leaf. If the predicted error for a root node in a subtree is less than weighted sum of U25% for the leaves (predicted error for the subtree), then a subtree will be replaced with its root node, which becomes a new leaf in a pruned tree.

Let us illustrate this procedure with one simple example. A subtree of a decision tree is given in Figure 7.9, where the root node is the test x1 on three possible values {1, 2, 3} of the attribute A. The children of the root node are leaves denoted with corresponding classes and (∣Ti∣/E) parameters. The question is to estimate the possibility of pruning the subtree and replacing it with its root node as a new, generalized leaf node.

Wednesday, December 10, 2008

Decision Tree

Decision tree

Decision tree is one technique for classification, it has flowchart structure like tree. (Han., 2001). Decision tree build by two nodes, there is the node and leaf. Nodes represent to attribute test, branch of the node related on probabilities result from node test. Hence, the leaf represent value of the class. (kantardzic, 2003).

Decision tree handles two kind of attribute,

Numeric or continue : Domain have infinite values, it represent in real number. Examples : ages, salary.
Nominal or category : Domain have finite values (finite set). Examples : jobs, status.

Missing Data

Missing data or missing value is a generally problems in data processing. Data had been collected not always have complete value. In huge data, missing value not influence in data processing result. However, if missing values more, it can influence the data processing result.

Generally we can handles missing values with this methods,

1. Deleting all record with missing values.

2. Make new algorithm or modified old algorithm which handles missing data. (Kantardzic, 2003).

C4.5 Algorithm

C4.5 algorithm is an decision tree algorithm, it showed by Quinlan as result in developing ID3 algorithm. The result is:

1. Counting attribute selection measure have more accurate tree. C4.5 algorithms have counting information Gain or Gain-ratio.

2. It can handle training data with missing value. To handle this problem C4.5 algorithm use counting gain-ratio for get test attributes.

It can handles continue attribute (numeric).

Data Mining

Introduction

Datamining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques. Datamining is the analysis of(oftenlarge) observational data sets to unsuspected relationships and to summarize the data

Data mining is an iterative process within which progress is defined by discovery, through either automatic or manual methods. Data mining is most useful in an exploratory analysis scenario in which there are no predetermined notions about what will constitute an "interesting" outcome. Data mining is the search for new, valuable, and nontrivial information in large volumes of data. It is a cooperative effort of humans and computers. Best results are achieved by balancing the knowledge of human experts in describing problems and goals with the search capabilities of computers.

In practice, the two primary goals of data mining tend to be prediction and description. Prediction involves using some variables or fields in the data set to predict unknown or future values of other variables of interest. Description, on the other hand, focuses on finding patterns describing the data that can be interpreted by humans. Therefore, it is possible to put data-mining activities into one of two categories:

1. Predictive data mining, which produces the model of the system described by the given data set, or

2. Descriptive data mining, with produces new, nontrivial information based on the available data set.

On the predictive end of the spectrum, the goal of data mining is to produce a model, expressed as an executable code, which can be used to perform classification, prediction, estimation, or other similar tasks. On the other, descriptive, end of the spectrum, the goal is to gain an understanding of the analyzed system by uncovering patterns and relationships in large data sets. The relative importance of prediction and description for particular data-mining applications can vary considerably. The goals of prediction and description are achieved by using data-mining techniques, explained later in this book, for the following primary data-mining tasks:

1. Classification – discovery of a predictive learning function that classifies a data item into one of several predefined classes.

2. Regression – discovery of a predictive learning function, which maps a data item to a real-value prediction variable.

3. Clustering – a common descriptive task in which one seeks to identify a finite set of categories or clusters to describe the data.

4. Summarization – an additional descriptive task that involves methods for finding a compact description for a set (or subset) of data.

5. Dependency Modeling – finding a local model that describes significant dependencies between variables or between the values of a feature in a data set or in a part of a data set.

6. Change and Deviation Detection – discovering the most significant changes in the data set.

Data mining is one of the fastest growing fields in the computer industry. Once a small interest area within computer science and statistics, it has quickly expanded into a field of its own. One of the greatest strengths of data mining is reflected in its wide range of methodologies and techniques that can be applied to a host of problem sets. Since data mining is the entire data warehousing, data-mart, and decision-support community, encompassing professionals from such industries as retail, manufacturing, telecommunications, healthcare, insurance, and transportation. In the business community, data mining can be used to discover new purchasing trends, plan investment strategies, and detect unauthorized expenditures in the accounting system. It can improve marketing campaigns and the outcomes can be used to provide customers with more focused support and attention. Data-mining techniques can be applied to problems of business process reengineering, in which the goal is to understand interactions and relationships among business practices and organizations.

InuyashaBlog