Forgot Password?

More On DataMining

 

Mining Frequent Patterns, Associations, and Correlations

Market Basket Analysis

Market basket analysis may be performed on the retail data of customer transactions at a store.that can be then used to plan marketing or advertising strategies, or in the design of a new catalog. Market basket analysis can also help retailers plan which items to put on sale at reduced prices. If customers tend to purchase computers and printers together, then having a sale on printers may encourage the sale of printers as well as computers.

Frequent Itemsets, Closed Itemsets, and Association Rules

formulae

Association rule mining can be viewed as a two-step process:

1. Find all frequent itemsets: By definition, each of these itemsets will occur at least as frequently as a predetermined minimum support count, min sup.

2. Generate strong association rules from the frequent itemsets: By definition, these rules must satisfy minimum support and minimum confidence.

Frequent Pattern Mining can be based on:

  • The completeness of patterns to be mined
  • The levels of abstraction involved in the rule set
  • The number of data dimensions involved in the rule:
  • The types of values handled in the rule
  • The kinds of rules to be mined:
  • The kinds of patterns to be mined

Efficient and Scalable Frequent Itemset Mining Methods

The Apriori Algorithm - Frequent Itemsets mining using Candidate Generation

Apriori property: All nonempty subsets of a frequent itemset must also be frequent. Join and prune actions:

  1. The join step
  2. The Prune step.

 


Generating Association Rules from Frequent Itemsets

Once the frequent itemsets from transactions in a database D have been found, it is straightforward to generate strong association rules from them (where strong association rules satisfy both minimum support and minimum confidence).

Example:

Generating association rules. Let’s try an example based on the transactional data for DMT shown in Table 5.1. Suppose the data contain the frequent itemset l = {I1, I2, I5}. What are the association rules that can be generated from l? The nonempty subsets of l are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, and {I5}. The resulting association rules are as shown below, each listed with its confidence:

I1 /\ I2 =>15, confidence = 2/4 = 50%
I1 /\ I5 => I2, confidence = 2/2 = 100%
I2 /\ I5 => I1, confidence = 2/2 = 100%
I1 => I2 /\ I5, confidence = 2/6 = 33%
I2 => I1 /\ I5, confidence = 2/7 = 29%
I5 = I1 /\ I2, confidence = 2/2 = 100%

 

Methods to improve the efficiency of Apriori algorithm:

Hash-based technique (hashing itemsets into corresponding buckets): A hash-based technique can be used to reduce the size of the candidate k-itemsets, Ck , for k > 1.

Transaction reduction (reducing the number of transactions scanned in future iterations): A transaction that does not contain any frequent k-itemsets cannot contain any frequent (k + 1)-itemsets.

Partitioning (partitioning the data to find candidate itemsets): A partitioning technique can be used that requires just two database scans to mine the frequent itemsets as shown below , It has twp phases

partiotining

Sampling (mining on a subset of the given data): The basic idea of the sampling approach is to pick a random sample S of the given data D, and then search for frequent itemsets in S instead of D. In this way, we trade off some degree of accuracy against effeciency

Dynamic itemset counting (adding candidate itemsets at different points during a scan): A dynamic itemset counting technique was proposed in which the database is partitioned into blocks marked by start points.

Mining Frequent Itemsets without Candidate Generation

In many cases the Apriori candidate generate-and-test method significantly reduces the size of candidate sets, leading to good performance gain. However, it can suffer from two nontrivial costs:

  • It may need to generate a huge number of candidate sets
  • It may need to repeatedly scan the database and check a large set of candidates by pattern matching


Mining Frequent Itemsets Using Vertical Data Format

Mining Closed Frequent Itemsets

Pruning strategies include the following:
Item merging: If every transaction containing a frequent itemset X also contains an itemset Y but not any proper superset of Y , then X ∪Y forms a frequent closed itemset and there is no need to search for any itemset containing X but no Y . Sub-itemset pruning: If a frequent itemset X is a proper subset of an already found frequent closed itemset Y and support count(X) = support count(Y ), then X and all of X’s descendants in the set enumeration tree cannot be frequent closed itemsets and thus can
be pruned.

Item skipping: In the depth-first mining of closed itemsets, at each level, there will be a prefix itemset X associated with a header table and a projected database. If a local frequent item p has the same support in several header tables at different levels, we can safely prune p from the header tables at higher levels.

Mining Multilevel Association Rules

Suppose we are given the task-relevant set of transactional data in Table below for sales in an DMT store, showing the items purchased for each transaction. The concept hierarchy for the items is shown in Figure below. A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higherlevel, more general concepts. Data can be generalized by replacing low-level concepts within the data by their higher-level concepts, or ancestors, from a concept hierarchy.

Constraint-Based Association Mining

Fora a Constraint-based mining, The constraints can include the following:

Knowledge type constraints: These specify the type of knowledge to be mined, such as association or correlation.

Data constraints: These specify the set of task-relevant data.

Dimension/level constraints: These specify the desired dimensions (or attributes) of the data, or levels of the concept hierarchies, to be used in mining.

Interestingness constraints: These specify thresholds on statistical measures of rule interestingness, such as support, confidence, and correlation.

Rule constraints: These specify the form of rules to be mined. Such constraints may be expressed as metarules (rule templates), as the maximum or minimum number of predicates that can occur in the rule antecedent or consequent, or as relationships among attributes, attribute values, and/or aggregates.

Metarule-Guided Mining of Association Rules

Metarules allow users to specify the syntactic form of rules that they are interested in mining. The rule forms can be used as constraints to help improve the efficiency of the mining process.

Constraint Pushing or Mining Guided by Rule Constraints

Rule constraints specify expected set/subset relationships of the variables in the mined rules, constant initiation of variables, and aggregate functions.

Example:
A closer look at mining guided by rule constraints. Suppose that DMT has a sales multidimensional database with the following interrelated relations:

sales(customer name, item name, TID)
lives in(customer name, region, city)
item(item name, group, price)
transaction(TID, day, month, year)

Classification and Prediction

The data analysis task is classification, where a model or classifier is constructed to predict categorical labels, such as “safe” or “risky” for the loan application data; “yes” or “no” for the marketing data; or “treatment A,” “treatment B,” or “treatment C” for the medical data. Suppose that the marketing manager would like to predict how much a given customer will spend during a sale at DMT. This data analysis task is an example of numeric prediction, where the model constructed predicts a continuous-valued function, or ordered value, as opposed to a categorical label. This model is a predictor.

Issues Regarding Classification and Prediction

Steps in preparing the Data for Classification and Prediction Data cleaning:

  • Relevance analysis
  • Data transformation and reduction
  • Comparing Classification and Prediction Methods
  • Accuracy
  • speed
  • Robustness
  • scalability
  • Interpretability

Classification by Decision Tree Induction

Decision tree induction is the learning of decision trees from class-labeled training tuples. A decision tree is a flowchart-like tree structure, where each internal node (nonleaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label.

Decision Tree Induction

decision tree

 

Basic algorithm for generating a decision tree from training tuples.


Attribute Selection Measures

An attribute selection measure is a heuristic for selecting the splitting criterion that “best” separates a given data partition, D, of class-labeled training tuples into individual classes.

Information gain

The expected information needed to classify a tuple in a partition D is given by:

partitioning


Gain ratio

gainratio

Gini index

The Gini index is used in CART.

gini

if a binary split on A partitions D into D1 and D2 , the gini index of D given that partitioning is

gini2

 

Tree Pruning

When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise or outliers. Tree pruning methods address this problem of overfitting the data. Scalability and Decision Tree Induction
problem: Most often, the training data will not fit in memory! Decision tree
construction therefore becomes inefficient due to swapping of the training tuples in
and out of main and cache memories., thats why it is necessary to have scalable decision tree.

prune


Bayesian Classification

Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.

Bayes’ Theorem

bayes

Naïve Bayesian Classification

1. Let D be a training set of tuples and their associated class labels. As usual, each tuple is represented by an n-dimensional attribute vector, X = (x1 , x2 , . . . , xn ), depicting n measurements made on the tuple from n attributes, respectively, A1 , A2 , . . . , An .

2. Suppose that there are m classes, C1 , C2 , . . . , Cm . Given a tuple, X, the classifier will predict that X belongs to the class having the highest posterior probability, conditioned on X. That is, the naïve Bayesian classifier predicts that tuple X belongs to the class Ci if and only if

P(Ci |X) > P(C j |X) for 1 ≤ j ≤ m, j = i.

Thus we maximize P(Ci |X). The class Ci for which P(Ci |X) is maximized is called the maximum posteriori hypothesis. By Bayes’ theorem

3. As P(X) is constant for all classes, only P(X|Ci )P(Ci ) need be maximized. If the class prior probabilities are not known, then it is commonly assumed that the classes are equally likely, that is, P(C1 ) = P(C2 ) = · · · = P(Cm ), and we would therefore maximize P(X|Ci ). Otherwise, we maximize P(X|Ci )P(Ci ). Note that the class prior probabilities may be estimated by P(Ci ) = |Ci,D |/|D|, where |Ci,D | is the number of training tuples of class Ci in D.

4. Given data sets with many attributes, it would be extremely computationally expensive to compute P(X|Ci ). In order to reduce computation in evaluating P(X|Ci ), the naive assumption of class conditional independence is made. This presumes that the values of the attributes are conditionally independent of one another, given the class label of the tuple (i.e., that there are no dependence relationships among the attributes).


5. In order to predict the class label of X, P(X|Ci )P(Ci ) is evaluated for each class Ci . The classifier predicts that the class label of tuple X is the class Ci if and only if In other words, the predicted class label is the class Ci for which P(X|Ci )P(Ci ) is the maximum.


Bayesian Belief Networks

 

belief

conditional probability table showing, probability of lung cancer based on family history (FH ) and Alcoholic (A)

  FH,A FH,~A ~FH,A ~FH,~A
RD 0.8 0.4 0.6 0.1
~RD 0.2 0.6 0.4 0.9


Training Bayesian Belief Networks

In the learning or training of a belief network, a number of scenarios are possible. The network topology (or “layout” of nodes and arcs) may be given in advance or inferred from the data. The network variables may be observable or hidden in all or some of the training tuples. The case of hidden data is also referred to as missing values or incomplete data.

 

Rule Induction Using a Sequential Covering Algorithm


Classification by Backpropagation

Backpropagation is a neural network learning algorithm. The field of neural networks was originally kindled by psychologists and neurobiologists who sought to develop and test computational analogues of neurons.

A Multilayer Feed-Forward Neural Network

neural

Defining a Network Topology

Before training can begin, the user must decide on the network topology by specifying the number of units in the input layer, the number of hidden layers (if more than one), the number of units in each hidden layer, and the number of units in the output layer.

Backpropagation

Backpropagation learns by iteratively processing a data set of training tuples, comparing the network’s prediction for each tuple with the actual known target value




Associative Classification: Classification by Association Rule Analysis

Frequent patterns and their corresponding association or correlation rules characterize interesting relationships between attribute conditions and class labels, and thus have been recently used for effective classification. Association rules show strong associations between attribute-value pairs (or items) that occur frequently in a given data set. Association rules are commonly used to analyze the purchasing patterns of customers in a store.

Lazy Learners (or Learning from Your Neighbors)

Eager learners, when given a set of training tuples, will construct a generalization (i.e., classification) model before receiving new (e.g., test) tuples to classify.
lazy approach, in which the learner instead waits until the last minute before doing any model construction in order to classify a given test tuple. That is, when given a training tuple, a lazy learner simply stores it (or does only a little minor processing) and waits until it is given a test tuple. Only when it sees the test tuple does it perform generalization in order to classify the tuple based on its similarity to the stored training tuples.

examples of lazy learners: k-nearest-neighbor classifiers and case-based reasoning
classifiers.

Other classification methods include

Genetic Algorithms Genetic algorithms attempt to incorporate ideas of natural evolution.

Rough Set Approach Rough set theory can be used for classification to discover structural
relationships within imprecise or noisy data.

Fuzzy Set Approaches Rule-based systems for classification have the disadvantage that they involve sharp cutoffs for continuous attributes.

 

Prediction

Linear Regression Straight-line regression analysis involves a response variable, y, and a
single predictor variable, x. It is the simplest form of regression, and models y as a linear
function of x.

formulae

Nonlinear Regression
Transformation of a polynomial regression model to a linear regression model.
Consider a cubic polynomial relationship given by
y = w0 + w1 x + w2 x2 + w3 x3 . ........... (1)
To convert this equation to linear form, we define new variables:
x1 = x
x2 = x 2
x3 = x3


Equation (1) can then be converted to linear form by applying the above assignments,
resulting in the equation y = w0 + w1 x1 + w2 x2 + w3 x3 , which is easily solved by the
method of least squares using software for regression analysis. Note that polynomial
regression is a special case of multiple regression. That is, the addition of high-order
terms like x2 , x3 , and so on, which are simple functions of the single variable, x, can be
considered equivalent to adding new independent variables.

Ensemble Methods—Increasing the Accuracy

Bagging We first take an intuitive look at how bagging works as a method of increasing accuracy. For ease of explanation, we will assume at first that our model is a classifier.


Boosting

In boosting, weights are assigned to each training tuple. A series of k classifiers is iteratively learned. After a classifier Mi is learned, the weights are updated to allow the subsequent classifier, Mi+1 , to “pay more attention” to the training tuples that were misclassified by Mi . The final boosted classifier, M∗, combines the votes of each individual classifier, where the weight of each classifier’s vote is a function of its accuracy. The boosting algorithm can be extended for the prediction of continuous values.