Introduction to SQL Server 2008 Data Mining
Introduction
Industry today faces a number of problems when attempting to analyze their data. There is no lack of data, but sure lack of analyzing and extracting useful information from them. In fact, many businesses feel they are drowning in data; they are unable to make sense of it all and turn it into information.
Instead of developing a vague idea, and then go digging for the data to support it, wouldn't it be helpful for the data mining to determine relationships, predict future events, spot bad data, and allow for the analysis of data in ways that have never been possible? This can be accomplished through the use of data mining. You can use data mining to answer questions such as:
- What will potential customers buy? What products sell together?
- How can a company predict which customers are at risk for company(say insurance)?
- Where has the marketplace been, and where is it heading?
- How can a business best analyze its Web site usage patterns by mining the logs?
- How can an organization determine the success of its marketing campaign?
- How can businesses prevent poor quality data from entering the system?
- What are current industry standards for business intelligence?
- Can intelligent dashbaords help business? What are some current trends followed in industry?
Sql Server 2008
Microsoft has introduced a wealth of new data mining features in Microsoft SQL Server 2008 that allow businesses to answer their concerns with data and mining for information in them. The current version of SQL Server, SQL Server 2008,[4] (code-named "Katmai",) was released (RTM) on August 6, 2008 and aims to make data management self-tuning, self organizing, and self maintaining with the development of SQL Server Always On technologies, to provide near-zero downtime. SQL Server 2008 will also include support for structured and semi-structured data, including digital media formats for pictures, audio, video and other multimedia data. In current versions, such multimedia data can be stored as BLOBs (binary large objects), but they are generic bitstreams. Intrinsic awareness of multimedia data will allow specialized functions to be performed on them. According to Paul Flessner, senior Vice President, Server Applications, Microsoft Corp., SQL Server 2008 can be a data storage backend for different varieties of data: XML, email, time/calendar, file, document, spatial, etc as well as perform search, query, analysis, sharing, and synchronization across all data types.
Other new data types include specialized date and time types and a Spatial data type for location-dependent data.[7] Better support for unstructured and semi-structured data is provided using the new FILESTREAM[8] data type, which can be used to reference any file stored on the file system.[9] Structured data and metadata about the file is stored in SQL Server database, whereas the unstructured component is stored in the file system. Such files can be accessed both via Win32 file handling APIs as well as via SQL Server using T-SQL; doing the latter accesses the file data as a BLOB. Backing up and restoring the database backs up or restores the referenced files as well.[10] SQL Server 2008 also natively supports hierarchical data, and includes T-SQL constructs to directly deal with them, without using recursive queries.
The Full-Text Search functionality has been integrated with the database engine, which simplifies management and improves performance.
Spatial data will be stored in two types. A "Flat Earth" (GEOMETRY or planar) data type represents geospatial data which has been projected from its native, spherical, coordinate system into a plane. A "Round Earth" data type (GEOGRAPHY) uses an ellipsoidal model in which the Earth is defined as a single continuous entity which does not suffer from the singularities such as the international dateline, poles, or map projection zone "edges". Approximately 70 methods are available to represent spatial operations for the Open Geospatial Consortium Simple Features for SQL, Version 1.1
SQL Server includes better compression features, which also helps in improving scalability. It also includes Resource Governor that allows reserving resources for certain users or workflows. It also includes capabilities for transparent encryption of data as well as compression of backups.SQL Server 2008 supports the ADO.NET Entity Framework and the reporting tools, replication, and data definition will be built around the Entity Data Model. SQL Server Reporting Services will gain charting capabilities from the integration of the data visualization products from Dundas Data Visualization Inc., which was acquired by Microsoft. On the management side, SQL Server 2008 includes the Declarative Management Framework which allows configuring policies and constraints, on the entire database or certain tables, declaratively.The version of SQL Server Management Studio included with SQL Server 2008 supports IntelliSense for SQL queries against a SQL Server 2008 Database Engine. SQL Server 2008 also makes the databases available via Windows PowerShell providers and management functionality available as Cmdlets, so that the server and all the running instances can be managed from Windows PowerShell.
What’s new in Sql Server 2008
Cross-Validation (Analysis Services - Data Mining.Cross-validation is an established method of assessing the accuracy of data mining models. In cross-validation, you iteratively partition the mining structure data into subsets, build models on the subsets, and then measure the accuracy of the model for each partition. By reviewing the returned statistics, you can determine how reliable the mining model is, and more easily compare models that are based on the same structure.
Cross-validation is available in the Mining Accuracy Chart view of Data Mining Designer. You can also partition a mining structure, test multiple mining models, and generate an analysis by using Analysis Services stored procedures.
In SQL Server 2008, to generate a cross-validation report, you specify the mining structure and the predictable attribute, and then specify the number of folds into which to segment the case data.
Analysis Services returns a table that reports statistics such as likelihood or root mean square error for individual partitions, and the mean and standard deviation of all measures for the aggregate models.
SQL Server 2008 supports the creation, management, and use of data mining models from Microsoft Excel when you use the SQL Server 2008 Data Mining Add-ins for Office 2007. The newest version of this popular free add-in has been enhanced by adding support for server-side training and test partitions, cross-validation, and several new analytic tools such as shopping basket analysis and a printable prediction calculator.
Enhancements to the Microsoft Time Series Algorithm
To improve the accuracy and stability of some predictions in time series models, a new algorithm has been added to the Microsoft Time Series algorithm. Based on the well-known ARIMA algorithm, the new algorithm provides better long-term predictions than the ARTxp algorithm that Analysis Services has been using. (ARTxp is an auto-regressive tree algorithm that is optimized for either a single time slice or short-term predictions.)
By default, the new implementation of the Microsoft Time Series algorithm uses the ARTxp algorithm to train one version of the model and the ARIMA algorithm to train another version. The algorithm then weights the results of these two models to provide the prediction characteristics that you prefer. If you do not want to use this default implementation, you can specify that the Microsoft Time Series algorithm use only the ARTxp or the ARIMA algorithm. In SQL Server 2008 Enterprise, you can specify a custom weighting of the algorithms to provide the best prediction over a variable time span.
Data Mining : A better future ahead
Microsoft data mining tools go far beyond traditional data mining. In other words, companies ran a data mining tool and then looked at the results. Far from being a tool that does a specific job, Microsoft data mining tools perform their work but immediately apply the results back into the process. This means that data mining models may group customers based on certain criteria, but the analyst can then immediately start slicing data based on these new customer groupings.
This ability to feed the results of the data mining models right back into the analysis process means that organizations can now determine how best to analyze the data. Today, most organizations break customers down by geography and include limited demographic information, such as income or education levels. Data mining might determine that a particular mix of customers makes more sense than any geographic boundaries; for example high-income people who enjoy both classic music and tractor pulls might prefer particular products, regardless of their education levels or geography. Once this customer group has been established, it can be used during the analysis phase to examine data. Analyzing data using groupings that are not inherently obvious is one of the great benefits of data mining.
Being able to analyze data using the output of data mining models is powerful, but these results can also be used during the integration phase. In other words, as an organization consolidates data from throughout the enterprise, data mining can determine if the data makes sense in light of other data. This ability to spot anomalous data during the data integration phase of a business intelligence project means that the resulting data warehouse is cleaner and analysis against it will be more accurate and meaningful.
SQL Server 2008 Data Mining
The Microsoft SQL Server 2008 Data Mining Platform introduces significant capabilities to address data mining across various fields and has new features which improves the process. In traditional terms, data mining can predict future results based on input, or attempt to find relationships among data or cluster data in previously unrecognized yet similar groups.
Microsoft data mining tools are different from traditional data mining applications in significant ways. First, they support the entire development lifecycle of data in the organization, which Microsoft refers to as Integrate, Analyze, and Report. This ability frees the data mining results from the hands of a select few analysts and opens those results up to the entire organization. Second, SQL Server 2008 Data Mining is a platform for developing intelligent applications, not a stand-alone application. You can build custom applications that are intelligent, because the data mining models are easily accessible to the outside world. Further, the model is extensible so that third parties can add custom algorithms to support particular mining needs. Finally, Microsoft data mining algorithms can be run in real time, allowing for the real-time validation of data against a set of mined data.
Benefits of SQL Server 2008 Data Mining Features
SQL Server 2008 data mining features contain a number of benefits over traditional data mining applications. As discussed previously, SQL Server 2008 data mining features are integrated across all the SQL Server products, including SQL Server, SQL Server Integration Services, and Analysis Services. SQL Server data mining tools are not a single application that a company runs to produce output that is then analyzed independent of the rest of the analysis process. Instead, data mining features are embedded throughout the process and are able to run in real time and the results can be fed back into the process of integration, analysis, or reporting. However, these features would be meaningless if they were difficult to use. Fortunately, Microsoft has focused on making the tools easy to use.
The data mining features in SQL Server 2008 make the creation of intelligent applications easy thanks to a very powerful but simple API. This API includes the ability to call predictive models from client applications without having to understand the internals of each model and how they work. This allows developers to call the engine and choose the model that provides the best results based on the data analyzed. Data that is returned is tokenized, meaning that numeric values are returned in a series of attributes. This allows developers to work with simple data rather than some new data format.
Accessing the data mining results is as simple as using an SQL-like language called Data Mining Extensions to SQL, or DMX. The syntax is designed to be approachable to those who already know SQL. For example, a DMX query might appear as follows.
Example 1:
SELECT TOP 100 CustomerID
FROM CustomerBuyerModel
NATURAL PREDICTION JOIN
OPENQUERY('CustomerData', 'SELECT * FROM Customers')
ORDER BY PredictProbability([Buyer],True) DESC
Example 2:
SELECT
[Bike Buyer] AS Buyer,
PredictHistogram([Bike Buyer]) AS Statistics
FROM
[Decision Tree]
NATURAL PREDICTION JOIN
(SELECT 35 AS [Age],
'5-10 Miles' AS [Commute Distance],
'1' AS [House Owner Flag],
2 AS [Number Cars Owned],
2 AS [Total Children]) AS t
Example 3:
SELECT
TOP 10
t.[LastName],
t.[FirstName],
[Decision Tree].[Bike Buyer],
PredictProbability([Bike Buyer])
FROM
[Decision Tree]
PREDICTION JOIN
OPENQUERY([Adventure Works DW],
'SELECT
[LastName],
[FirstName],
[MaritalStatus],
[Gender],
[YearlyIncome],
[TotalChildren],
[NumberChildrenAtHome],
[Education],
[Occupation],
[HouseOwnerFlag],
[NumberCarsOwned]
FROM
[dbo].[ProspectiveBuyer]
') AS t
ON
[Decision Tree].[Marital Status] = t.[MaritalStatus] AND
[Decision Tree].[Gender] = t.[Gender] AND
[Decision Tree].[Yearly Income] = t.[YearlyIncome] AND
[Decision Tree].[Total Children] = t.[TotalChildren] AND
[Decision Tree].[Number Children At Home] = t.[NumberChildrenAtHome] AND
[Decision Tree].[Education] = t.[Education] AND
[Decision Tree].[Occupation] = t.[Occupation] AND
[Decision Tree].[House Owner Flag] = t.[HouseOwnerFlag] AND
[Decision Tree].[Number Cars Owned] = t.[NumberCarsOwned]
WHERE [Decision Tree].[Bike Buyer] =1
ORDER BY PredictProbability([Bike Buyer]) DESC
One of the most important features of data mining in SQL Server 2008 is the ability to handle large data sets. In many data mining tools, the analyst must create a valid random sample of the data and run the data mining application against that random sample.
The SQL Server 2008 Data Mining Algorithms
Analysis Services includes the following algorithm types:
- Classification algorithms predict one or more discrete variables, based on the other attributes in the dataset. An example of a classification algorithm is the Microsoft Decision Trees Algorithm.
- Regression algorithms predict one or more continuous variables, such as profit or loss, based on other attributes in the dataset. An example of a regression algorithm is the Microsoft Time Series Algorithm.
- Segmentation algorithms divide data into groups, or clusters, of items that have similar properties. An example of a segmentation algorithm is the Microsoft Clustering Algorithm.
- Association algorithms find correlations between different attributes in a dataset. The most common application of this kind of algorithm is for creating association rules, which can be used in a market basket analysis. An example of an association algorithm is the Microsoft Association Algorithm.
- Sequence analysis algorithms summarize frequent sequences or episodes in data, such as a Web path flow. An example of a sequence analysis algorithm is the Microsoft Sequence Clustering Algorithm.
Table 1.This table provides suggestions for which algorithms to use for specific tasks.[compiled from MSDN]
Task |
Microsoft algorithms to use |
Predicting a discrete attribute. |
Microsoft Decision Trees Algorithm |
Predicting a continuous attribute. |
Microsoft Decision Trees Algorithm |
Predicting a sequence. |
|
Finding groups of common items in transactions. |
Microsoft Association Algorithm |
Finding groups of similar items. |
Microsoft Clustering Algorithm |
Table 2. The algorithms featured in SQL Server 2008 Data Mining
Model |
Description |
Decision Trees |
The Decision Trees algorithm calculates the odds of an outcome based on values in a training set. For example, a person in the age group 20-30 that makes over $60,000/year and owns a home is more likely to need a lawn service than someone in the age group of 15-19 who doesn't own a home. Based on age, income, and home ownership, the Decision Trees algorithm can calculate the odds of that person needing a lawn service based on historical values. |
Association Rules |
The Association Rules algorithm helps identify relationships between various elements. For example, it is used in cross-selling solutions because it notes relationships between items and can be used to predict what else someone buying a product will also be interested in purchasing. The Association Rules algorithm can handle incredibly large catalogs, having been tested on catalogs of over half a million items. |
Naive Bayes |
The Naive Bayes algorithm is used to clearly show the differences in a particular variable for various data elements. For example, the Household Income variable differs for each customer in the database, and can be used as a predictor of future purchasing. This model excels at showing the differences between certain groups such as customers who churn and those who don't. |
Sequence Clustering |
The Sequence Clustering algorithm is used to group or cluster data based on a sequence of previous events. For example, users of a Web application can often follow a variety of paths through the site. This algorithm can group customers based on their sequence of pages through the site to help analyze users and determine if some paths are more profitable than others. This algorithm can also be used to predict, such as predicting the next page a user may visit. Note that the predictive capability of the Sequence Clustering algorithm is something that many other data mining vendors cannot deliver. |
Time Series |
The Time Series algorithm is used to analyze and forecast time-based data. Sales are the most commonly analyzed and predicted data using the Time Series algorithm. This algorithm looks for patterns across multiple data series so that businesses can determine how different elements affect the analyzed series. |
Neural Nets |
Neural networks are the core of artificial intelligence. They seek to uncover relationships in data that other algorithms miss. While the Neural Nets algorithm tends to be slower than the other algorithms, it finds relationships that may be non-intuitive. |
Text Mining |
The Text Mining algorithm appears in SQL Server Integration Services and analyzes unstructured text data. This allows companies to analyze unstructured data such as a "comments" section on a customer satisfaction survey. |
Table 3. links to the types of information available for each algorithm
While SQL Server 2008 includes a number of algorithms out of the box, the model used by SQL Server 2008 allows any vendor to add new models into the data mining engine. Those models become peers with the models that come with SQL Server 2008. Algorithms from third parties also benefit from all the other features: they are callable using DMX and are easy to integrate into any part of the Integrate, Analyze, and Report process.
SQL Server 2008 Data Mining and End-to-end Business Intelligence
Data Preprocessing
The preprocessing phase covers the data handling where data is cleaned or preprocessed before the datamining phase. Data Preprocessing is an important step in the knowledge discovery process, can be even considered as a fundamental building block of data mining.Data preprocessing is a part of ETL, its nothing but transforming the data. To be more precise modifying the source data in to a different format which enables data mining algorithms to be applied easily, improves the effectiveness and the performance of the mining algorithms, represents the data in easily understandable format for both humans and machines, supports faster data retrieval from databases, and makes the data suitable for a specific analysis to be performed. During the preprocessing phase, it is also possible to have missing values provided by the model if interpolated values are acceptable.
Data Integration
The integration phase covers the capturing of data from disparate sources, and loading it into one or many sources. Data mining algorithms help companies spot outliers that already exist in the data, or outliers that may be brought in during a traditional extraction, transforming, and loading (ETL) process.
These values can be from a prior period or can be forecasts of future activity. The advantage that Microsoft data mining tools offer are that the numbers can be generated on the fly during the integration process rather than being provided only after the integration is completed. Data mining tools are integrated with SQL Server Integration Services. This means that during the data movement and transformation stage, data can be analyzed and modified based on the predictive output of the data mining models
Data Analysis
Typical data mining tools generate results after a data warehouse is built and these results are analyzed independently of the analysis done on the data warehouse. Forecasts are generated or relationships are identified, but the result of the data mining models is generally independent of the data used in the data warehouse.
Microsoft tools are integrated with the entire process. Just as data mining is available in SQL Server Integration Services, the benefits of data mining are visible in Analysis Services and SQL Server as well. Whether a company chooses to use relational or OLAP data, mining benefits can be evident during the analysis phase. Thanks to the Universal Data Model (UDM), analysis can be performed against either relational or OLAP data in a transparent manner, and data mining provides a boost to this analysis.
Reporting
The integration between data mining and reporting in SQL Server 2008, providing predictive results to anyone in the organization can be done in a simple, flexible, and scalable manner.
Industry Problems that can be Solved with Data Mining
When looking at industry problems that can be solved by data mining, most people think of market basket analysis(Say customer A bought Apples, Customer B would buy what?) or finding relationships between data that were previously unknown. In reality, there are a variety of other problems that can be addressed with data mining, but to do this it is important to realize that data mining can fit into any phase of the Integrate, Analyze, and Report process.
Scenario 1: What Will Customers Buy this month at Walmart? Which Products Sell Together, would candy and bread go together ?
Scenario 2: Where has the Marketplace Been, and Where Is it Going? Analyze the market, finance and sales.
Scenario 3: Analyzing the Web Site usage patterns by log mining.
Scenario 4: Text Analysis, idnetifying keywords and more.
Conclusion
Microsoft provides a wide range of data mining options, which includes collaborative solutions and ad hoc analysis (in MS Office Excel). A free plug-in is available in MS Office Excel 2007, which helps the analyst to analyze the data patterns. In addition to this plug-in, the Business Intelligence Development Studio (BIDS) that is free with the SQL Server can also be used for data mining purpose. Data mining is used in various applications such as forecasting business and customer trends, detecting fraud (especially in the banking sector), generating customized advertisements, grouping customers on the basis of their purchasing trends, and risk analysis.Microsoft data mining is extensible. It can be licensed through SQL Server 2008 (or SQL Server 2005) and it is compatible with other technologies, thereby allowing access to data in different formats. Microsoft data mining can also be used for business intelligence solutions and it is scalable unlike other data mining products.
References: (Compiled from following)
- Database Engine XML Enhancementshttp://msdn.microsoft.com/en-us/library/ms170809.aspx. Retrieved 2007-12-03.
- Database Engine Enhancements". http://msdn.microsoft.com/en-us/library/ms170910.aspx. Retrieved 2007-12-03.
- Multiple Active Result Sets (MARS) in SQL Server 2005 (retrieved June 20, 2009)
- Microsoft SQL Server 2008". http://www.microsoft.com/sql/prodinfo/futureversion/default.mspx. Retrieved 2007-04-06.
- ChannelWeb: Next SQL Server stop: Katmai". http://www.crn.com/storage/170702999. Retrieved 2005-11-05.
- Announced to the SQL Server Special Interest Group at the ESRI 2008 User's Conference on August 6, 2008 by Ed Katibah (Spatial Program Manager at Microsoft)
- Microsoft Gives Peek At Next Version Of SQL Server". http://www.informationweek.com/software/showArticle.jhtml?articleID=199500164&subSection=Development. Retrieved 2007-05-11.
- One more test build to go for SQL Server 2008". http://blogs.zdnet.com/microsoft/?p=937. Retrieved 2006-11-13.
- Guest Blogger: Ted Kummert". http://blogs.msdn.com/stevengu/archive/2007/11/13/guest-blogger-ted-kummert.aspx. Retrieved 2007-11-20.
- Christian Kleinerman. "SQL Server 2008 for developers". Channel9. http://channel9.msdn.com/Showpost.aspx?postid=387069. Retrieved 2008-03-07.
- Fernando Azpeitia Lopez. "SQL Server 2008 Full-Text Search: Internals and Enhancements". http://msdn.microsoft.com/en-us/library/cc721269.aspx.
- Microsoft Shares Details on SQL Server 2008 Spatial Support by Directions Staff". http://www.directionsmag.com/editorials.php?article_id=2477&trv=1. Retrieved 2007-09-07.
- SQL Server "Katmai" to Deliver Entity Data Platform and Support LINQ". http://oakleafblog.blogspot.com/2007/05/sql-server-to-deliver-entity-data.html. Retrieved 2007-05-12.
- Microsoft Details Dynamic IT Strategy at Tech-Ed 2007". http://www.earthtimes.org/articles/show/news_press_release,115898.shtml. Retrieved 2007-06-04.
- SQL Server IntelliSense". http://msdn.microsoft.com/en-us/library/ms173434.aspx. Retrieved 2008-08-18.
- SQL Server Support for PowerShell!". http://blogs.msdn.com/powershell/archive/2007/11/13/sql-server-support-for-powershell.aspx. Retrieved 2007-12-03.
- SQL Server Team. "TechEd 2009 – New SQL Server Innovations". MSDN Blogs. http://blogs.technet.com/dataplatforminsider/archive/2009/05/11/teched-2009-new-sql-server-innovations.aspx. Retrieved 2009-05-12.
- http://www.sqlserverdatamining.com/ssdm/
- http://msdn.microsoft.com/en-us/sqlserver/cc511476.aspx
- http://msdn.microsoft.com/en-us/library/bb510513.aspx
- www.wikipedia.org