The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. The following are typical requirements of clustering in data mining: Data matrix (or object-by-variable structure): This represents n objects, such as persons, with p variables (also called measurements or attributes), such as age, height, weight, gender, and so on. The structure is in the form of a relational table, or n-by-p matrix (n objects ×p variables) Dissimilarity matrix (or object-by-object structure): This stores a collection of proximities that are available for all pairs of n objects. It is often represented by an n-by-n table: Binary Variables A binary variable has only two states: 0 or 1. A categorical variable is a generalization of the binary variable in that it can take on more than two states. For example, map color is a categorical variable that may have, say, five states: maroon, yellow, orange, pink, and blue. A discrete ordinal variable resembles a categorical variable, except that the M states of the ordinal value are ordered in a meaningful sequence. A ratio-scaled variable makes a positive measurement on a nonlinear scale, such as an exponential scale, approximately following the formula AeBt or Ae−Bt Methods used in clustering are : A Classical Partitioning Method used is k-Means and k-Medoids
In general, there are two types of hierarchical clustering methods: Divisive hierarchical clustering: This top-down strategy does the reverse of agglomerative hierarchical clustering by starting with all objects in one cluster. It subdivides the cluster into smaller and smaller pieces, until each object forms a cluster on its own or until it satisfies certain termination conditions, such as a desired number of clusters is obtained or the diameter of each cluster is within a certain threshold. There are three methods : The grid-based clustering approach uses a multiresolution grid data structure. It quantizes the object space into a finite number of cells that form a grid structure on which all of the operations for clustering are performed WaveCluster: This can be carried out in different approaches CLIQUE: A Dimension-Growth Subspace Clustering Method PROCLUS: A Dimension-Reduction Subspace Clustering Method Constraint-based clustering finds clusters that satisfy user-specified preferences or constraints. Depending on the nature of the constraints, constraint-based clustering may adopt rather different approaches. Here are a few categories of constraints. Very often, there exist data objects that do not comply with the general behavior or model of the data. Such data objects, which are grossly different from or inconsistent with the remaining set of data, are called outliers. Many data mining algorithms try to minimize the influence of outliers or eliminate them all together. This, however, could result in the loss of important hidden information because one person’s noise could be another person’s signal. In other words, the outliers may be of particular interest, such as in the case of fraud detection, where outliers may indicate fraudulent activity. Thus, outlier detection and analysis is an interesting data mining task, referred to as outlier mining or outlier analysis. There are two basic types of procedures for detecting outliers: Several efficient algorithms for mining distance-based outliers have been developed. These are outlined as follows. The technique introduces the following key terms. An OLAP approach to deviation detection uses data cubes to identify regions of anomalies Methodologies for Stream Data Processing and Stream Data Systems 1) Random Sampling Rather than deal with an entire data stream, we can think of sampling the stream at periodic intervals. 2) Sliding Windows Instead of sampling the data stream randomly, we can use the sliding window model to analyze stream data 3) Histograms The histogram is a synopsis data structure that can be used to approximate the frequency distribution of element values in a data stream. 4) Multiresolution Methods A common way to deal with a large amount of data is through the use of data reduction methods, A popular data reduction method is the use of divide-andconquer strategies such as multiresolution data structures 5) Sketches Synopses techniques mainly differ by how exactly they trade off accuracy for storage. Sampling techniques and sliding window models focus on a small part of the data, whereas other synopses try to summarize the entire data, often at multiple levels of detail. Some techniques require multiple passes over the data, such as histograms and wavelets, whereas other methods, such as sketches, can operate in a single pass. frequency moments : Randomized algorithms, in the form of random sampling and sketching, are often used to deal with massive, high-dimensional data streams. In traditional database systems, data are stored in finite and persistent databases. However, stream data are infinite and impossible to store fully in a database. In a Data Stream Management System (DSMS), there may be multiple data streams. They arrive on-line and are continuous, temporally ordered, and potentially infinite. Once an element from a data stream has been processed, it is discarded or archived, and it cannot be easily retrieved unless it is explicitly stored in memory. The special properties of stream data introduce new challenges in query processing. In particular, data streams may grow unboundedly, and it is possible that queries may require unbounded memory to produce an exact answer. Stream data are generated continuously in a dynamic environment, with huge volume, infinite flow, and fast-changing behavior. It is impossible to store such data streams completely in a data warehouse. Multidimensional analysis for power supply stream data. A power supply station generates infinite streams of power usage data. Suppose individual user, street address, and second are the attributes at the lowest level of granularity. Given a large number of users, it is only realistic to analyze the fluctuation of power usage at certain high levels, such as by city or street district and by quarter (of an hour), making timely power supply adjustments and handling unusual situations. Conceptually, for multidimensional analysis, we can view such stream data as a virtual data cube, consisting of one or a few measures and a set of dimensions, including one time dimension, and a few other dimensions, such as location, user-category, and so on. However, in practice, it is impossible to materialize such a data cube, because the materialization requires a huge amount of data to be computed and stored. Some efficient methods must be developed for systematic analysis of such data. Time Dimension with Compressed Time Scale: Tilted Time Frame Types of models: Two critical cuboids (or layers): Materializing a cube at only two critical layers leaves much room for how to compute the cuboids in between. These cuboids can be precomputed fully, partially, or not at all (i.e., leave everything to be computed on the fly) Approximate frequent items. A router is interested in all items whose frequency is at least 1% (min support) of the entire traffic stream seen so far. It is felt that 1/10 of min support (i.e., ε = 0.1%) is an acceptable margin of error. This means that all frequent items with a support of at least min support will be output, but that some items with a support of at least (min support − ε) will also be output. The Hoeffding tree algorithm is a decision tree learning method for stream data classification. The VFDT (Very Fast Decision Tree) algorithm makes several modifications to the Hoeffding tree algorithm to improve both speed and memory utilization. The modifications include breaking near-ties during attribute selection more aggressively, computing the G function after a number of training examples, deactivating the least promising leaves whenever memory is running low, dropping poor splitting attributes, and improving the initialization method. VFDT works well on stream data and also compares extremely well to traditional classifiers in both speed and accuracy To adapt to concept-drifting data streams, the VFDT algorithm was further developed into the Concept-adapting Very Fast Decision Tree algorithm (CVFDT). CVFDT also uses a sliding window approach; however, it does not construct a new model from scratch each time. Rather, it updates statistics at the nodes by incrementing the counts associated with new examples and decrementing the counts associated with old ones. Therefore, if there is a concept drift, some nodes may no longer pass the Hoeffding bound. When this happens, an alternate subtree will be grown, with the new best splitting attribute at the root. As new examples stream in, the alternate subtree will continue to develop, without yet being used for classification. Once the alternate subtree becomes more accurate than the existing one, the old subtree is replaced. The idea is to train an ensemble or group of classifiers (using, say naïve Bayes) from sequential chunks of the data stream. That is, whenever a new chunk arrives, we build a new classifier from it. The individual classifiers are weighted based on their expected classification accuracy in a time-changing environment. Only the top-k classifiers are kept. The decisions are then based on the weighted votes of the classifiers. For effective clustering of stream data, several new methodologies have been developed, as follows: CluStream is an algorithm for the clustering of evolving data streams based on userspecified, on-line clustering queries. It divides the clustering process into on-line and offline components. The on-line component computes and stores summary statistics about the data stream using microclusters, and performs incremental on-line computation and maintenance of the microclusters. The off-line component does macroclustering and answers various user questions using the stored summary statistics, which are based on the tilted time frame model. A time-series database consists of sequences of values or events obtained over repeated measurements of time. Trend Analysis Similarity Search in Time-Series Analysis Sequential pattern mining is the mining of frequently occurring ordered events or subsequences as patterns. An example of a sequential pattern is “Customers who buy a N900 are likely to buy an iTouch notepad within a month.” Sequential pattern mining is computationally challenging because such mining may generate and/or test a combinatorially explosive number of intermediate subsequences. A full periodic pattern is a pattern where every point in time contributes (precisely or approximately) to the cyclic behavior of a time-related sequence. For example, all of the days in the year approximately contribute to the month cycle of the year. A partial periodic pattern specifies the periodic behavior of a time-related sequence at some but not all of the points in time. For example, rama have breakfasefrom 7:45 to 8:15 every weekday morning, but her activities at other times do not have much regularity. Partial periodicity is a looser form of periodicity than full periodicity and occurs more commonly in the real world. Alignment of Biological Sequences : Pairwise Alignment BLAST finds regions of local similarity between biosequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as to help identify members of gene families. Multiple sequence alignment is usually performed on a set of sequences of amino acids that are believed to have similar structures. The goal is to find common patterns that are conserved among all the sequences being considered Markov Chain Example: Markov chain. The Markov chain in Figure above is a probabilistic model for CpG islands. Evaluation: Given a sequence, x, determine the probability, P(x), of obtaining x in the model. What is Graph Mining? Apriori-based Approach
gSpan: A pattern-growth algorithm for frequent substructure mining. Characteristics of Social Networks: Traditional methods of machine learning and data mining, taking, as input, a random sample of homogenous objects from a single relation, may not be appropriate here. The data comprising social networks tend to be heterogeneous, multirelational, and semi-structured. As a result, a new field of research has emerged called link mining. Tasks involved in link mining : Challenges Multirelational data mining (MRDM) methods search for patterns that involve multiple tables (relations) from a relational database The ILP problem is defined as follows: Given background knowledge B, a set of positive examples P, and a set of negative examples N, find a hypothesis H such that: (1) ∀t ∈ P : H ∪B |= t (completeness), and (2) ∀t ∈ N : H ∪ B |= t (consistency), where |= stands for logical implication.
Multirelational clustering is the process of partitioning data objects into a set of clusters based on their similarity, utilizing information in multiple relations. In this section we will introduce CrossClus (Cross-relational Clustering with user guidance), an algorithm for multirelational clustering that explores how to utilize user guidance in clustering and tuple ID propagation to avoid physical joins.Cluster Analysis
Types of Data in Cluster Analysis :


Interval-Scaled Variables Interval-scaled variables are continuous measurements of a roughly linear scale. Typical examples include weight and height, latitude.
Major Clustering Methods
Partitioning Methods
The k-Means method is a Centroid-based technique.
typically, the square-error criterion is used and is defined as
The k-means algorithm for partitioning
Representative Object-Based Technique: The k-Medoids Method
Hierarchical methods in clustering
Agglomerative hierarchical clustering: This bottom-up strategy starts by placing each object in its own cluster and then merges these atomic clusters into larger and larger clusters, until all of the objects are in a single cluster or until certain termination conditions are satisfied. Most hierarchical clustering methods belong to this category. They differ only in their definition of intercluster similarity.
Density-Based methods in clustering
Sufficiently High Density
Grid-Based methods in clustering
STING: Statistical information grid
STING is a grid-based multiresolution clustering technique in which the spatial area is divided into rectangular cells. There are usually several levels of such rectangular cells corresponding to different levels of resolution, and these cells form a hierarchical structure: each cell at a high level is partitioned to form a number of cells at the next lower level. Statistical information regarding the attributes in each grid cell (such as the mean, maximum, and minimum values) is precomputed and stored
Clustering Using Wavelet Transformation
WaveCluster is a multiresolution clustering algorithm that first summarizes the data by imposing a multidimensional grid structure onto the data space. It then uses a wavelet transformation to transform the original feature space, finding dense regions in the transformed space.Model-Based Clustering Methods
Methods of Clustering High-Dimensional Data
CLIQUE (CLustering In QUEst) was the first algorithm proposed for dimension-growth subspace clustering in high-dimensional space.
PROCLUS (PROjected CLUStering) is a typical dimension-reduction subspace clustering method. That is, instead of starting from single-dimensional spaces, it starts by finding an initial approximation of the clusters in the high-dimensional attribute space. Each dimension is then assigned a weight for each cluster, and the updated weights are used in the next iteration to regenerate the clusters. This leads to the exploration of dense regions in all subspaces of some desired dimensionality and avoids
the generation of a large number of overlapped clusters in projected dimensions of lower dimensionality.Constraint-Based Cluster Analysis
What is Outlier Analysis?
Statistical Distribution-Based Outlier Detection
or all of them are accepted as consistent.
procedure.Distance-Based Outlier Detection
objects.

OLAP Data Cube Technique
in large multidimensional dataMining Stream, Time-Series, and Sequence Data

Randomized Algorithms
Chebyshev’s inequality :
Data Stream Management Systems and Stream QueriesStream Query Processing
Stream OLAP and Stream Data Cubes
Example:
The most recent time is registered at the finest granularity; the more distant time is registered at a coarser granularity; and the level of coarseness depends on the application requirements and on how old the time point is (from the current time). Such a time dimension model is called a tilted time frame.
Critical Layers of stream data cube
analyst would like to study
automated system) would like to continuously study the data..
Partial Materialization of a Stream Cube
Example for Frequent-Pattern Mining in Data Streams
Hoeffding Tree Algorithm
It was initially used to track Web clickstreams and construct models to predict
which Web hosts and Web sites a user is likely to access. It typically runs in sublinear
time and produces a nearly identical decision tree to that of traditional batch learners.
It uses Hoeffding trees, which exploit the idea that a small sample can often be enough
to choose an optimal splitting attribute. This idea is supported mathematically by the
Hoeffding bound (or additive Chernoff bound).

Very Fast Decision Tree (VFDT) and Concept-adapting Very Fast Decision Tree (CVFDT)
A Classifier Ensemble Approach to Stream Data Classification
Clustering in evolving data streams
STREAM is a single-pass, constant factor approximation algorithm that was developed for the k-medians problem
Mining Time-Series Data
A time series involving a variable Y , representing, say, the daily closing price of a share in a stock market, can be viewed as a function of time t, that is, Y = F(t). Such a function can be illustrated as a time-series graph,which describes a point moving with the passage of time.
Unlike normal database queries, which find data that match the given query exactly, a similarity search finds data sequences that differ only slightly from the given query sequence.Mining Sequence Patterns in Transactional Databases
Scalable Methods for Mining Sequential Patterns
Periodicity Analysis for Time-Related Sequence DataMining Sequence Patterns in Biological Data

The BLAST Local Alignment Algorithm
The method of Multiple Sequence Alignment
Hidden Markov Model for Biological Sequence Analysis
A Markov chain is a model that generates sequences in which the probability of a symbol depends only on the previous symbol. Figure below is an example Markov chain model. A Markov chain model is defined by (a) a set of states, Q, which emit symbols and (b) a set of transitions between states.
The states are A, C, G, and T. For readability, only some of the transition probabilities are shown. For example, the transition probability from state G to state T is 0.14, that is, P(xi = T|xi−1 = G) = 0.14. Here, the emitted symbols are understood. For example, the symbol C is emitted when transitioning from state C. In speech recognition, the symbols emitted could represent spoken words or phrases.Tasks using hidden Markov models include:
Decoding: Given a sequence, determine the most probable path through the model that produced the sequence.
Learning: Given a model and a set of training sequences, find the model parameters (i.e., the transition and emission probabilities) that explain the training sequences with relatively high probability. The goal is to find a model that generalizes well to sequences we have not seen before.
Forward Algorithm

Viterbi Algorithm

Baum-Welch Algorithm

Introduction to Graph Mining, Social Network Analysis, and Multirelational Data Mining
Graphs become increasingly important in modeling complicated structures, such as circuits, images, chemical compounds, protein structures, biological networks, socialnetworks, the Web, workflows, and XML documents. Many graph search algorithms have been developed in chemical informatics, computer vision, video indexing, and text retrieval. With the increasing demand on the analysis of large amounts of structured data, graph mining has become an active and important theme in data mining.Methods for Mining Frequent Subgraphs
Apriori-based algorithms for frequent substructure mining include AGM, FSG, and a path-join method. AGM shares similar characteristics with Apriori-based itemset mining. FSG and the path-join method explore edges and connections in an Apriori-based fashion. Each of these methods explores various candidate generation strategies.

Pattern-Growth Approach
When building a DFS tree, the visiting sequence of vertices forms a linear order. We use
subscripts to record this order, where i < j means vi is visited before v j when the depth-first
search is performed. A graph G subscripted with a DFS tree T is written as GT . T is called a
DFS subscripting of G.
Social Network Analysis
Link Mining: Tasks and Challenges
What is Multirelational Data Mining?
ILP Approach to Multirelational Classification
Inductive Logic Programming (ILP) is the most widely used category of approaches to multirelational classification. There are many ILP approaches. In general, they aim to find hypotheses of a certain format that can predict the class labels of target tuples, based on background knowledge (i.e., the information stored in all relations).
Tuple ID Propagation
Tuple ID propagation is a technique for performing virtual join, which greatly improves efficiency of multirelational classification. Instead of physically joining relations, they are virtually joined by attaching the IDs of target tuples to tuples in nontarget relations.
Multirelational Classification Using Tuple ID Propagation
CrossMine, an approach that uses tuple ID propagation for multirelational classification. To better integrate the information of ID propagation, CrossMine uses complex predicates as elements of rules. A complex predicate, p, contains two parts:
1. prop-path
2. ConstraintCrossMine a Rule-based classification across multiple relations
Multirelational Clustering with User Guidance
Advanced Data Mining