
Data Mining for Beginners

 

Data Processing

What is the need for Data Processing?

Real-world data sets are often huge, incomplete, noisy, and inconsistent. Data processing is necessary to extract the required information from such data.

What is Data Cleaning?

Data cleaning is a procedure to “clean” the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.

What is Data Integration?

Data integration is the process of integrating multiple databases, data cubes, or files.

What is Data Transformation?

Data transformation operations, such as normalization and aggregation, are additional data preprocessing procedures that contribute toward the success of the mining process.
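As a minimal sketch of one such transformation, the function below applies min-max normalization, which rescales values linearly into a target range. The function name, the default range [0, 1], and the sample values are assumptions made for illustration; the sketch also assumes the values are not all identical.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so that min(values) maps to new_min
    and max(values) maps to new_max. Assumes max(values) > min(values)."""
    lo, hi = min(values), max(values)
    scale = (new_max - new_min) / (hi - lo)
    return [(v - lo) * scale + new_min for v in values]


# Made-up example values.
print(min_max_normalize([10, 20, 30]))
```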

What is Data Reduction?

Data reduction obtains a reduced representation of the data set that is much smaller
in volume, yet produces the same (or almost the same) analytical results.


Data Summarization

Mean, median, and mode


1: The median is the middle value of the sorted data. If the number of values is even, the median is the average of the middle two values.


2: A holistic measure is a measure that must be computed on the entire data set as a whole. It cannot be computed by partitioning the given data into subsets and merging the values obtained for the measure in each subset.
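The difference between a holistic measure and one that can be merged from partitions can be illustrated with a small sketch. The partition values and helper names below are made up for the example; the sum merges correctly across subsets, while a naive "median of medians" generally does not give the true median.

```python
from statistics import median


def partitioned_sum(partitions):
    """Sum can be merged: adding the per-partition sums gives the global sum."""
    return sum(sum(p) for p in partitions)


def median_of_medians(partitions):
    """Naive merge of per-partition medians; in general NOT the true median."""
    return median(median(p) for p in partitions)


# Made-up data split into three subsets.
partitions = [[1, 2, 3], [4, 100], [5, 6, 7, 8]]
all_values = [x for p in partitions for x in p]

print(partitioned_sum(partitions) == sum(all_values))
print(median(all_values), median_of_medians(partitions))
```

For an even number of values, `median([1, 2, 3, 4])` averages the middle two values.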

Ways to measure the Dispersion of Data

The range of the set is the difference between the largest and smallest values.

The most commonly used percentiles other than the median are quartiles.

The distance between the first and third quartiles is a simple measure of spread that gives the range covered by the middle half of the data. This distance is called the interquartile range (IQR) and is defined as
IQR = Q3 − Q1.
A common rule of thumb for identifying suspected outliers is to single out values falling at least 1.5 × IQR above the third quartile or below the first quartile.
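The rule of thumb can be sketched in a few lines. The function name and the sample marks are assumptions for illustration, and quartile conventions differ between tools; this sketch uses the "inclusive" method of Python's `statistics.quantiles`.

```python
from statistics import quantiles


def iqr_outliers(values):
    """Return values lying at least 1.5 * IQR above Q3 or below Q1."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v <= low or v >= high]


# Made-up marks; 120 is far from the rest of the data.
marks = [52, 55, 58, 60, 61, 63, 65, 68, 70, 120]
print(iqr_outliers(marks))
```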

Graphic Displays of Basic Descriptive Data Summaries

methods

Data Cleaning


How to handle missing Values?


How to handle Noisy Data?
Noise is a random error or variance in a measured variable. Given a numerical attribute such as marks, we can “smooth” the data to remove the noise using smoothing techniques such as binning, regression, and clustering.
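As one illustration of smoothing, the sketch below applies smoothing by bin means: the sorted values are split into equal-frequency bins and every value is replaced by the mean of its bin. The function name, the bin size, and the sample values are assumptions chosen for the example.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of bin_size values each,
    and replace every value by the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed


# Made-up marks, three values per bin.
marks = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(marks, 3))
```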

The process of Data Cleaning


The first step in data cleaning as a process is discrepancy detection. Knowledge about the data, or “data about data”, is referred to as metadata and is the starting point. The data should also be examined regarding unique rules, consecutive rules, and null rules.

 


What is Data Integration?

Data integration combines data from multiple sources into a coherent data store, as in data warehousing. These sources may include multiple databases, data cubes, or flat files. Issues that arise during data integration include schema integration and object matching. Redundancy is another important issue.


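Some redundancies can be detected by correlation analysis. For numerical attributes, the correlation coefficient can be used. For categorical attributes A (with c distinct values) and B (with r distinct values), the χ² value (also known as the Pearson χ² statistic) is computed as:

$$\chi^2 = \sum_{i=1}^{c} \sum_{j=1}^{r} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$

where $o_{ij}$ is the observed frequency of the joint event $(A_i, B_j)$ and $e_{ij}$ is the expected frequency of $(A_i, B_j)$, which can be computed as

$$e_{ij} = \frac{count(A = a_i) \times count(B = b_j)}{N}$$

where N is the number of data tuples, $count(A = a_i)$ is the number of tuples having value $a_i$ for A, and $count(B = b_j)$ is the number of tuples having value $b_j$ for B.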
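The χ² computation can be sketched directly from a contingency table of observed counts. The function name and the 2 × 2 table below are illustrative assumptions; the sketch assumes every expected frequency is nonzero.

```python
def chi_square(observed):
    """Pearson chi-square statistic for a contingency table.

    observed[i][j] is the observed count of the joint event (A_i, B_j).
    The expected count is count(A = a_i) * count(B = b_j) / N.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o_ij in enumerate(row):
            e_ij = row_totals[i] * col_totals[j] / n
            chi2 += (o_ij - e_ij) ** 2 / e_ij
    return chi2


# Made-up counts for two attributes with two values each.
table = [[250, 200],
         [50, 1000]]
print(round(chi_square(table), 2))
```

A large χ² value relative to the chosen significance threshold suggests that the two attributes are correlated, so one of them may be redundant.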

Data Transformation

Data transformation can involve the following:


Data Reduction techniques

These techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data.


Strategies for data reduction include the following:

  1. Data cube aggregation, where aggregation operations are applied to the data in the construction of a data cube.
  2. Attribute subset selection, where irrelevant, weakly relevant, or redundant attributes or dimensions may be detected and removed.
  3. Dimensionality reduction, where encoding mechanisms are used to reduce the data set size.
    Examples: wavelet transforms and principal components analysis.
  4. Numerosity reduction, where the data are replaced or estimated by alternative, smaller data representations such as parametric models (which need store only the model parameters instead of the actual data) or nonparametric methods such as clustering, sampling, and the use of histograms.
  5. Discretization and concept hierarchy generation, where raw data values for attributes are replaced by ranges or higher conceptual levels. Data discretization is a form of numerosity reduction that is very useful for the automatic generation of concept hierarchies. Discretization and concept hierarchy generation are powerful tools for data mining, in that they allow the mining of data at multiple levels of abstraction.


Data Discretization and Concept Hierarchy Generation

Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values. Replacing numerous values of a continuous attribute by a small number of interval labels reduces and simplifies the original data. This leads to a concise, easy-to-use, knowledge-level representation of mining results.
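A minimal sketch of one discretization technique, equal-width binning, is shown here. The function name, the number of bins, and the sample ages are assumptions for illustration; the sketch assumes the values are not all identical and places the maximum value in the last interval.

```python
def equal_width_bins(values, num_bins):
    """Replace each value by the (low, high) interval it falls into,
    using num_bins intervals of equal width over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    intervals = [(lo + k * width, lo + (k + 1) * width) for k in range(num_bins)]
    assigned = []
    for v in values:
        # Clamp the index so that the maximum value lands in the last bin.
        k = min(int((v - lo) / width), num_bins - 1)
        assigned.append(intervals[k])
    return assigned


# Made-up ages replaced by four interval labels.
ages = [18, 22, 25, 31, 40, 47, 58]
for age, interval in zip(ages, equal_width_bins(ages, 4)):
    print(age, interval)
```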

Concept Hierarchy Generation for Categorical Data

The following are methods for the generation of concept hierarchies for categorical data:


Example:
Concept hierarchy generation using prespecified semantic connections. Suppose that a data mining administrator has pinned together the five attributes name, age, college, degree, and grade because they are closely linked semantically regarding the notion of student. If a user were to specify only the attribute college for a hierarchy defining student, the system can automatically drag in all of the above five semantically related attributes to form a hierarchy.

An overview of Data Warehouse and OLAP Technology

What is a Data Warehouse?
Data warehousing provides architectures and tools for business executives to systematically organize, understand, and use their data to make strategic decisions.
The four things that distinguish data warehouses from other data repository systems are:

Differences between Operational Database Systems and Data Warehouses

The major task of on-line operational database systems is to perform on-line transaction and query processing. These systems are called on-line transaction processing (OLTP) systems.


Data warehouse systems, on the other hand, serve users or knowledge workers in the role of data analysis and decision making. Such systems can organize and present data in various formats in order to accommodate the diverse needs of the different users. These systems are known as on-line analytical processing (OLAP) systems.

A major reason for separating operational databases from data warehouses is to help promote the high performance of both systems.

Comparison between OLTP and OLAP

Feature          | OLTP                              | OLAP
Characteristic   | Operational processing            | Informational processing
Orientation      | Transaction                       | Analysis
Function         | Day-to-day operations             | Long-term informational requirements
Design           | Application oriented              | Subject oriented
Access           | Read and write                    | Mostly read
Records accessed | Tens                              | Millions
View             | Detailed                          | Summarized
Priority         | High performance and availability | High flexibility
User             | Clerk, DBA                        | Knowledge worker
Size             | 100 MB to GB                      | 100 GB to TB

 

A Multidimensional Data Model

Data cubes generalize tables and spreadsheets to multiple dimensions. Star, snowflake, and fact constellation schemas are examples of schemas for multidimensional databases.

Measures

Measures can be organized into three categories: distributive, algebraic, and holistic.

OLAP Operations in the Multidimensional Data Model

In the multidimensional model, data are organized into multiple dimensions, and each dimension contains multiple levels of abstraction defined by concept hierarchies. This organization provides users with the flexibility to view data from different perspectives. A number of OLAP data cube operations exist to materialize these different views, allowing interactive querying and analysis of the data at hand. Hence, OLAP provides a user-friendly environment for interactive data analysis.

Roll-up: The roll-up operation (also called the drill-up operation by some vendors) performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction.

Drill-down: Drill-down is the reverse of roll-up. It navigates from less detailed data to more detailed data.

Slice and dice: The slice operation performs a selection on one dimension of the given cube, resulting in a subcube. The dice operation defines a subcube by performing a selection on two or more dimensions.

Pivot (rotate): Pivot (also called rotate) is a visualization operation that rotates the data axes in view in order to provide an alternative presentation of the data.
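Roll-up, slice, and dice can be sketched on a tiny in-memory cube. Everything in the example is hypothetical: the cube is a plain dictionary keyed by (city, quarter, item), and the city-to-country mapping stands in for a concept hierarchy on the location dimension. Real OLAP servers implement these operations very differently.

```python
from collections import defaultdict

# Hypothetical base cuboid: (city, quarter, item) -> units sold.
cube = {
    ("Chicago", "Q1", "phone"): 10,
    ("Chicago", "Q2", "phone"): 15,
    ("Toronto", "Q1", "phone"): 7,
    ("Vancouver", "Q1", "phone"): 4,
    ("Toronto", "Q2", "laptop"): 3,
}
# Hypothetical concept hierarchy for the location dimension.
city_to_country = {"Chicago": "USA", "Toronto": "Canada", "Vancouver": "Canada"}


def roll_up_location(cells, hierarchy):
    """Roll-up: climb the location hierarchy from city to country."""
    out = defaultdict(int)
    for (city, quarter, item), units in cells.items():
        out[(hierarchy[city], quarter, item)] += units
    return dict(out)


def slice_cube(cells, dim, value):
    """Slice: select a single value on one dimension (given by index)."""
    return {key: v for key, v in cells.items() if key[dim] == value}


def dice_cube(cells, allowed):
    """Dice: select on two or more dimensions; allowed maps index -> set."""
    return {key: v for key, v in cells.items()
            if all(key[d] in values for d, values in allowed.items())}


print(roll_up_location(cube, city_to_country))
print(slice_cube(cube, 1, "Q1"))
print(dice_cube(cube, {0: {"Toronto", "Vancouver"}, 1: {"Q1"}}))
```

Drill-down corresponds to going back from the rolled-up cells to the more detailed base cells, which is why the detailed data must be kept or be recomputable.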

A Starnet Query Model for Querying Multidimensional Databases

A starnet model consists of radial lines emanating from a central point, where each line represents a concept hierarchy for a dimension. Each abstraction level in the hierarchy is called a footprint.


The Architecture of a Data Warehouse

Steps for the design and construction of data warehouses include:

Indexing in OLAP Data

The bitmap indexing method allows quick searching in data cubes. The bitmap index is an alternative representation of the record ID (RID) list. In the bitmap index for a given attribute, there is a distinct bit vector, Bv, for each value v in the domain of the attribute. If the domain of a given attribute consists of n values, then n bits are needed for each entry in the bitmap index (i.e., there are n bit vectors). If the attribute has the value v for a given row in the data table, then the bit representing that value is set to 1 in the corresponding row of the bitmap index. All other bits for that row are set to 0.
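The idea can be sketched with one bit vector per distinct value. The function names and the two sample columns are assumptions for illustration; bit vectors are stored as Python lists for readability rather than as packed bits.

```python
def build_bitmap_index(column):
    """Return {value: bit vector}, with one bit per row of the column."""
    index = {value: [0] * len(column) for value in set(column)}
    for row, value in enumerate(column):
        index[value][row] = 1
    return index


def bitmap_and(a, b):
    """Combine two bit vectors with a bitwise AND."""
    return [x & y for x, y in zip(a, b)]


# Made-up table with two attributes and five rows.
item = ["home", "computer", "phone", "home", "phone"]
city = ["V", "V", "T", "T", "T"]

item_index = build_bitmap_index(item)
city_index = build_bitmap_index(city)

# Rows where item = "phone" AND city = "T".
match = bitmap_and(item_index["phone"], city_index["T"])
print(match, [row for row, bit in enumerate(match) if bit])
```

Selections that combine conditions reduce to cheap bit operations, which is the main appeal of this representation for low-cardinality attributes.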

Efficient Processing of OLAP Queries

1) Determine which operations should be performed on the available cuboids
2) Determine to which materialized cuboid(s) the relevant operations should be applied:

Different data warehouse applications:

Information processing supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts, or graphs.

Analytical processing supports basic OLAP operations, including slice-and-dice, drill-down, roll-up, and pivoting.

Data mining supports knowledge discovery by finding hidden patterns and associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools.

On-Line Analytical Mining

On-line analytical mining (OLAM) (also called OLAP mining) integrates on-line analytical processing (OLAP) with data mining and mining knowledge in multidimensional databases. Among the many different paradigms and architectures of data mining systems, OLAM is particularly important for the following reasons:

  1. High quality of data in data warehouses
  2. Available information processing infrastructure surrounding data warehouses
  3. OLAP-based exploratory data analysis
  4. On-line selection of data mining functions

Architecture of OLAM

OLAM

Data Cube Computation and Data Generalization

Data generalization is a process that abstracts a large set of task-relevant data in a database from a relatively low conceptual level to higher conceptual levels.


Efficient Methods for Data Cube Computation
Different kinds of data cube materialization include the full cube, iceberg cube, closed cube, and shell cube.


General Strategies for Cube Computation
Technique 1: Sorting, hashing, and grouping.
Technique 2: Simultaneous aggregation and caching intermediate results.
Technique 3: Aggregation from the smallest child, when there exist multiple child cuboids.
Technique 4: The Apriori pruning method can be explored to compute iceberg cubes efficiently.


The Apriori property, in the context of data cubes, states the following: If a given cell does not satisfy minimum support, then no descendant (i.e., more specialized or detailed version) of the cell will satisfy minimum support either. This property can be used to substantially reduce the computation of iceberg cubes.


Multiway Array Aggregation for Full Cube Computation

The Multiway Array Aggregation (or simply MultiWay) method computes a full data cube by using a multidimensional array as its basic data structure:

  1. Partition the array into chunks
  2. Compute aggregates by visiting (i.e., accessing the values at) cube cells

BUC: Computing Iceberg Cubes from the Apex Cuboid Downward

BUC stands for “Bottom-Up Construction”. BUC is an algorithm for the computation of sparse and iceberg cubes. Unlike MultiWay, BUC constructs the cube from the apex cuboid toward the base cuboid. This allows BUC to share data partitioning costs. This order of processing also allows BUC to prune during construction, using the Apriori property.

Example:

BUC construction of an iceberg cube. Consider the iceberg cube expressed in SQL as follows:
compute cube iceberg_cube as
select A, B, C, D, count(*)
from R
cube by A, B, C, D
having count(*) >= 3
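A simplified sketch of the BUC idea for a count-based iceberg condition follows. It is not the published algorithm in full (there is no sorting or counting-sort optimization); it only shows the recursion from the apex cuboid downward and the Apriori pruning step. The function name, the "*" placeholder for an aggregated dimension, and the five sample tuples are assumptions for illustration.

```python
from collections import defaultdict


def buc(rows, num_dims, min_sup):
    """Return {cell: count} for every cell with count >= min_sup.

    A cell is a tuple with one entry per dimension; "*" means the
    dimension is aggregated away.
    """
    result = {}

    def recurse(partition, start_dim, cell):
        # The current partition already satisfies min_sup: output its cell.
        result[tuple(cell)] = len(partition)
        for d in range(start_dim, num_dims):
            groups = defaultdict(list)
            for row in partition:
                groups[row[d]].append(row)
            for value, group in groups.items():
                # Apriori pruning: a partition below min_sup cannot have
                # any descendant cell that reaches min_sup.
                if len(group) >= min_sup:
                    cell[d] = value
                    recurse(group, d + 1, cell)
            cell[d] = "*"

    if len(rows) >= min_sup:
        recurse(rows, 0, ["*"] * num_dims)
    return result


# Made-up relation R(A, B, C, D).
R = [
    ("a1", "b1", "c1", "d1"),
    ("a1", "b1", "c1", "d2"),
    ("a1", "b1", "c2", "d1"),
    ("a1", "b2", "c1", "d1"),
    ("a2", "b1", "c1", "d1"),
]
for cell, count in buc(R, 4, min_sup=3).items():
    print(cell, count)
```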


Star-Cubing: Computing Iceberg Cubes Using a Dynamic Star-tree Structure



Frag-Shells



Computing Cubes with Complex Iceberg Conditions

Consider an iceberg condition such as count ≥ 75. If the condition is violated for some cell c, then every descendant of c will also violate that condition. For example, if the quantity of an item I sold in a region X is less than 75, then the same item I sold in a subregion of X can never satisfy the condition count ≥ 75. Conditions that obey this property are known as antimonotonic.

 

Further Development of Data Cube and OLAP Technology

Discovery-Driven Exploration of Data Cubes: Tools need to be developed to assist users in intelligently exploring the huge aggregated space of a data cube. Discovery-driven exploration is such a cube exploration approach.

Complex Aggregation at Multiple Granularity (Multifeature Cubes): Data cubes facilitate the answering of data mining queries as they allow the computation of aggregate data at multiple levels of granularity.

Example:
Query 1: A simple data cube query. Find the total sales in 2004, broken down by item,
region, and month, with subtotals for each dimension.
To answer Query 1, a data cube is constructed that aggregates the total sales at the
following eight different levels of granularity: {(item, region, month), (item, region),
(item, month), (month, region), (item), (month), (region), ()}, where () represents all.
Query 1 uses a typical data cube like that introduced in the previous chapter. We
call such a data cube a simple data cube because it does not involve any dependent aggregates.
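The eight levels of granularity in Query 1 are simply all subsets of the three dimensions. A minimal sketch that enumerates them and computes the total sales for each is given here; the function names and the three sample sales rows are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations


def cuboids(dimensions):
    """All subsets of the dimensions, from the most detailed to ()."""
    return [combo
            for k in range(len(dimensions), -1, -1)
            for combo in combinations(dimensions, k)]


def simple_cube(rows, dimensions, measure):
    """Total of the measure for every group-by in the cube."""
    cube = {}
    for group in cuboids(dimensions):
        totals = defaultdict(int)
        for row in rows:
            totals[tuple(row[d] for d in group)] += row[measure]
        cube[group] = dict(totals)
    return cube


# Made-up sales rows.
sales = [
    {"item": "phone", "region": "east", "month": "Jan", "amount": 100},
    {"item": "phone", "region": "west", "month": "Jan", "amount": 50},
    {"item": "laptop", "region": "east", "month": "Feb", "amount": 200},
]
dims = ("item", "region", "month")
cube = simple_cube(sales, dims, "amount")
print(cuboids(dims))
print(cube[("item",)], cube[()])
```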

Constrained Gradient Analysis in Data Cubes

Constrained multidimensional gradient analysis reduces the search space and derives interesting results. It incorporates the following types of constraints:

Attribute-Oriented Induction—An Alternative Method for Data Generalization and Concept Description.

Data generalization summarizes data by replacing relatively low-level values (such as numeric values for an attribute age) with higher-level concepts (such as low, middle, and high).

Attribute-Oriented Induction for Data Characterization
The attribute-oriented induction approach is basically a query-oriented, generalization-based, on-line data analysis technique. The general idea of attribute-oriented induction is to first collect the task-relevant data using a database query and then perform generalization based on the examination of the number of distinct values of each attribute in the relevant set of data.

Example:
A data mining query for characterization. Suppose that a user would like to describe the general characteristics of graduate students in the DMT University database, given the attributes name, gender, subject, age, location, phone# (telephone number), and score. A data mining query for this characterization can be expressed in the data mining query language, DMQL, as follows:

use DMT University DB
mine characteristics as “AI Students”
in relevance to name, gender, subject, age, location, phone#, score
from student
where status in “graduate”


Attribute generalization is based on the following rule: If there is a large set of distinct values for an attribute in the initial working relation, and there exists a set of generalization operators on the attribute, then a generalization operator should be selected and applied to the attribute.


Depending on the attributes or application involved, a user may prefer some attributes to remain at a rather low abstraction level while others are generalized to higher levels. The control of how high an attribute should be generalized is typically quite subjective. The control of this process is called attribute generalization control.

Different ways to control a generalization process

  1. Attribute generalization threshold control
  2. Generalized relation threshold control
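Attribute generalization threshold control can be sketched as follows: an attribute is generalized one level at a time until its number of distinct values no longer exceeds the threshold. The function name, the child-to-parent dictionary used as a concept hierarchy, and the sample locations are all assumptions for illustration.

```python
def generalize_attribute(values, hierarchy, threshold):
    """Climb the concept hierarchy until the attribute has at most
    `threshold` distinct values, or no further generalization exists.

    hierarchy maps a value to its parent concept (hypothetical format).
    """
    current = list(values)
    while len(set(current)) > threshold:
        climbed = [hierarchy.get(v, v) for v in current]
        if climbed == current:
            break  # no generalization operator left for this attribute
        current = climbed
    return current


# Made-up location values and a one-level hierarchy (city -> country).
locations = ["Toronto", "Vancouver", "Chicago", "Seattle", "Montreal"]
city_to_country = {
    "Toronto": "Canada", "Vancouver": "Canada", "Montreal": "Canada",
    "Chicago": "USA", "Seattle": "USA",
}
print(generalize_attribute(locations, city_to_country, threshold=2))
```

With a threshold of 5 the same attribute would be left at the city level, which shows how the threshold decides how high an attribute is generalized.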

Efficient Implementation of Attribute-Oriented Induction

Presentation of the Derived Generalization

Attribute-oriented induction generates one or a set of generalized descriptions. These descriptions are most commonly displayed in the form of a generalized relation (or table).


Example: cross tabulation, bar chart, pie chart, cube view.


Mining Class Comparisons: Discriminating between Different Classes

The procedure consists of the following steps:

  1. Data collection
  2. Dimension relevance analysis
  3. Synchronous generalization
  4. Presentation of the derived comparison

Example:

Mining a class comparison. Suppose that you would like to compare the general properties of the graduate students and the undergraduate students at DMT University, given the attributes name, gender, subject, age, location, phone#, and score.
This data mining task can be expressed in DMQL as follows:
use DMT University DB
mine comparison as “grad vs undergrad students”
in relevance to name, gender, subject, age, location, phone#, score
for “graduate students”
where status in “graduate”
versus “undergraduate students”
where status in “undergraduate”
analyze count%
from student