
Machine Learning: An Introduction

Introduction

Machine learning is the ability of a machine to learn from previous experience or history and so perform better at a given task, on the assumption that the future resembles the past.

Definition

"Machine Learning is simply the ability of the machine to learn from the previous experience or history and perform better for a given task, as the future mimics the past."

Tom Mitchell: A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

Witten and Frank : Things learn when they change their behavior in a way that makes them perform better in the future.

Langley : Machine learning is a science of the artificial. The field's main objects of study are artifacts, specifically algorithms that improve their performance with experience.

Alpaydin: Machine learning is programming computers to optimize a performance criterion using example data or past experience.

Ron Kohavi: In Knowledge Discovery, machine learning is most commonly used to mean the application of induction algorithms, which is one step in the knowledge discovery process.

Answers.com: The process or technique by which a device modifies its own behavior as the result of its past experience and performance

wikipedia.org: As a broad subfield of artificial intelligence, machine learning is concerned with the design and development of algorithms and techniques that allow computers to "learn". At a general level, there are two types of learning: inductive, and deductive. Inductive machine learning methods extract rules and patterns out of massive data sets.

UCI Machine Learning Group : Machine learning investigates the mechanisms by which knowledge is acquired through experience.

 

ML history

1950's & 60's

  • The history of machine learning dates back to the 1950's, the early days of AI and cognitive science.
  • Realization that domain knowledge is needed for intelligence, which led to knowledge systems.
  • Pattern recognition emerged as a new field.
  • Neural networks, perceptron, learning in the limit theory.
  • Neurophysiological: Rosenblatt's perceptron. Biological: simulated evolution. Psychological: symbol processing systems. Statistical: control and pattern recognition. Samuel's checkers program.
  • Theoretical: Gold's identification in the limit; Minsky and Papert's criticism of the perceptron.

1970's

  • Symbolic concept induction, knowledge acquisition systems, Quinlan's ID3, Michalski's AQ and soybean diagnosis results, scientific discovery with BACON, mathematical discovery with AM.
  • Winston's ARCH: learned the concept of a blocks-world arch. Buchanan and Mitchell's Meta-Dendral: learned mass-spectrometry prediction rules. Michalski's AQ11: learned soybean disease diagnosis rules. Quinlan's ID3: learned chess end-game rules. Fikes, Hart and Nilsson's MACROPS: learned macro-operators in blocks-world planning. Lenat's AM: discovered interesting mathematical concepts.

1980's

  • Continued progress on decision-tree and rule learning.
  • Explanation-based learning, speedup learning; utility problem, analogy, resurgence of connectionism (PDP, ANN), PAC learning, experimental evaluation.
  • In 1980, the first workshop on Machine Learning was held at CMU, attended by 30 participants.
  • Extended to domains of planning, diagnostics, design and control.
  • Explosion of research directions.
  • Some new directions included learning theory, symbolic learning algorithms, connectionist (neural network) learning algorithms, clustering and discovery, explanation-based learning, knowledge-guided inductive learning, analogical and case-based reasoning, and genetic algorithms.

1990's

  • Data mining; adaptive software agents & Information Retrieval; reinforcement learning; theory refinement; inductive logic programming; voting, bagging, boosting, and stacking; learning Bayesian networks.
  • Emergence of support vector machines.
  • Maturity of the field was observed.
  • Some new directions included statistical comparisons of algorithms, theoretical analyses of algorithms, successful applications, multi-relational learning, and ensemble and kernel methods.
  • Is Machine learning = Data mining (?)

2000 & Beyond

  • Rise of SVMs: kernel machines, ensembles, and statistical relational learning.
  • Interactions between symbolic machine learning, computational learning theory, neural networks, statistics, pattern recognition.
  • New applications for ML techniques: knowledge discovery in databases, language processing, robot control, combinatorial optimization.
  • Improving accuracy by learning ensembles, scaling up supervised learning algorithms, and learning complex stochastic models (Hierarchical Mixture of Experts, Hidden Markov Model, Dynamic Probabilistic Network).

Methods

    Introduction

    Machine learning is considered a subfield of artificial intelligence and is concerned with the development of techniques and methods that enable a computer to learn. In simple terms, it is the science of developing algorithms that enable a machine to learn and to perform tasks and activities. Machine learning overlaps with statistics in many ways. Over time, many techniques and methodologies have been developed for machine learning tasks. Learning is broadly classified into supervised learning, unsupervised learning and semi-supervised learning.

    This section of the wiki introduces some of the learning algorithms and techniques in use. We focus on a few basic algorithms, bearing in mind that new techniques are continually being formulated and old ones customized to particular problems; we plan to extend the section with more topics in future. For each technique, we try to define it, state which problems it can be applied to successfully, mention its benefits, and identify the problems associated with it.

    Decision-Tree Learning

    A decision tree is a simple inductive learning structure. Given an instance of an object or situation, specified by a set of properties, the tree returns a "yes" or "no" decision about that instance. Each internal node represents a test on one of those properties, and the branches from the node are labeled with the possible outcomes of the test. Each leaf node gives the Boolean classification for the instances that reach it. In these tree structures, leaves represent classifications and branches represent conjunctions of features that lead to those classifications. The machine learning technique for inducing a decision tree from data is called decision tree learning, or (colloquially) decision trees. [4] The major advantage of decision trees is that a trained model is easy to interpret. Decision trees also work with numerical input, since they can pick the split threshold that maximizes information gain, and their ability to mix categorical and numerical data is another advantage. Inductive bias: shorter trees are preferred over larger ones. [5] Occam's Razor: prefer the simplest hypothesis that fits the data. [6]

    Overfitting the training data is an important issue. Because the training examples are only a sample of all possible instances, it is possible to add branches to the tree that improve performance on the training examples while decreasing performance on instances outside this set. Methods for post-pruning the decision tree are therefore important for avoiding overfitting in decision tree learning. [7]
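As a rough illustration of how a tree can choose a split on a numerical attribute, the sketch below computes entropy and information gain and tries the midpoints between consecutive values as candidate thresholds. The function names and the small temperature dataset are assumptions made for this example, not part of any particular decision tree implementation.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting on `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted

def best_threshold(values, labels):
    """Try midpoints between consecutive distinct values; return (threshold, gain)."""
    points = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(points, points[1:])]
    return max(((t, information_gain(values, labels, t)) for t in candidates),
               key=lambda pair: pair[1])

# Made-up data: temperature readings and a yes/no decision.
temperatures = [64, 65, 68, 70, 75, 80, 83, 85]
play = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]
threshold, gain = best_threshold(temperatures, play)
print(threshold, gain)
```

On this toy data the classes are perfectly separated by temperature, so the best split recovers all of the label entropy.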

    Artificial Neural Networks

    An artificial neural network (ANN), often just called a "neural network" (NN), is a mathematical or computational model based on biological neural networks. It consists of an interconnected group of artificial neurons and processes information using a connectionist approach to computation. In most cases an ANN is an adaptive system that changes its structure based on external or internal information that flows through the network during the learning phase. [8] The greatest advantage of ANNs is that they can be used as an arbitrary function approximation mechanism which 'learns' from observed data. However, using them is not straightforward, and a relatively good understanding of the underlying theory is essential. The tasks to which ANNs are applied tend to fall within the following broad categories: function approximation, or regression analysis, including time series prediction and modeling; classification, including pattern and sequence recognition, novelty detection and sequential decision making; and data processing, including filtering, clustering, blind source separation and compression. Application areas include game-playing and decision making (backgammon, chess, racing), pattern recognition (face identification, object recognition and more), sequence recognition (gesture, speech, handwritten text recognition), medical diagnosis, financial applications (automated trading systems), data mining (knowledge discovery), visualization and e-mail spam filtering. [9]

    Artificial neural networks provide a method for learning real-valued and vector-valued functions over continuous and discrete-valued attributes, in a way that is robust to noise in the training data. The backpropagation algorithm is the most popular method and has been successfully applied to tasks such as handwriting recognition and robot control. The hypothesis space considered by this algorithm is the space of all functions that can be represented by assigning weights to the given, fixed network of interconnected units. Backpropagation searches the space of possible hypotheses using gradient descent to iteratively reduce the error in the network's fit to the training examples. More generally, gradient descent is a potentially useful method for searching many continuously parameterized hypothesis spaces where the training error is a differentiable function of the hypothesis parameters. Overfitting the training data is an important issue in ANN training. Cross-validation methods can be used to estimate an appropriate stopping point for the gradient descent search and thus reduce the risk of overfitting. [10]
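The interplay of gradient descent and a validation-based stopping point can be sketched on the simplest possible "network", a single linear unit. This is a simplification under stated assumptions: it uses one fixed hold-out set rather than full cross-validation, and the data, learning rate and `patience` parameter are invented for illustration.

```python
def predict(w, b, x):
    return w * x + b

def mse(w, b, data):
    """Mean squared error of the linear unit on a list of (x, y) pairs."""
    return sum((predict(w, b, x) - y) ** 2 for x, y in data) / len(data)

def train(train_set, val_set, lr=0.01, max_epochs=5000, patience=20):
    """Gradient descent on training error; stop once validation error stalls."""
    w, b = 0.0, 0.0
    best = (mse(w, b, val_set), w, b)
    stale = 0
    n = len(train_set)
    for _ in range(max_epochs):
        # Gradient of the mean squared training error with respect to w and b.
        grad_w = sum(2 * (predict(w, b, x) - y) * x for x, y in train_set) / n
        grad_b = sum(2 * (predict(w, b, x) - y) for x, y in train_set) / n
        w -= lr * grad_w
        b -= lr * grad_b
        val_error = mse(w, b, val_set)
        if val_error < best[0]:
            best, stale = (val_error, w, b), 0
        else:
            stale += 1
            if stale >= patience:  # no improvement for `patience` epochs
                break
    return best

# Made-up noisy samples of roughly y = 2x + 1.
train_set = [(0, 1.1), (1, 2.9), (2, 5.2), (3, 6.8), (4, 9.1)]
val_set = [(0.5, 2.0), (2.5, 6.0), (3.5, 8.0)]
best_err, best_w, best_b = train(train_set, val_set)
print(best_err, best_w, best_b)
```

The weights returned are those with the lowest validation error seen during training, not necessarily the final ones.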

    Bayesian Learning

    Bayesian learning is a probabilistic approach to learning and inference. It is based on the assumption that the quantities of interest are governed by probability distributions. It is attractive because in theory it can arrive at optimal decisions, and it provides a quantitative approach to weighing the evidence supporting alternative hypotheses. [11] Bayesian learning has been successfully applied to data mining, robotics, signal processing, bioinformatics, text analysis (spam filters), and graphics. Bayesian methods can be used to determine the most probable hypothesis given the data, the maximum a posteriori (MAP) hypothesis. This is the optimal hypothesis in the sense that no other hypothesis is more likely.

    The naive Bayes classifier is a Bayesian learning method. It is a simple probabilistic classifier based on applying Bayes' theorem, and it is called naive because it incorporates the strong simplifying assumption that attribute values are conditionally independent given the classification of the instance. [12] When this assumption is met, the naive Bayes classifier outputs the MAP classification. A more descriptive term for the underlying probability model would be "independent feature model". [13] In spite of their naive design and apparently over-simplified assumptions, naive Bayes classifiers often work much better in complex real-world situations than one might expect.

    The framework of Bayesian reasoning can also provide a useful basis for analyzing learning methods that do not directly apply Bayes' theorem. The Minimum Description Length (MDL) principle recommends choosing the hypothesis that minimizes the description length of the hypothesis plus the description length of the data given the hypothesis. It is a formalization of Occam's Razor in which the best hypothesis for a given set of data is the one that leads to the largest compression of the data. [14]

    The EM algorithm provides a quite general approach to learning in the presence of unobservable variables. The algorithm begins with an arbitrary initial hypothesis. An expectation-maximization (EM) algorithm is used in statistics for finding maximum likelihood estimates of parameters in probabilistic models where the model depends on unobserved latent variables. EM alternates between an expectation (E) step, which computes an expectation of the likelihood by including the latent variables as if they were observed, and a maximization (M) step, which computes the maximum likelihood estimates of the parameters by maximizing the expected likelihood found in the E step. [15]
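To make the conditional independence assumption concrete, here is a minimal sketch of a naive Bayes classifier over categorical attributes. The add-one (Laplace) smoothing is an assumption added here to avoid zero probabilities, and the tiny spam/ham dataset and function names are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (feature_tuple, label). Returns the counts needed to predict."""
    class_counts = Counter(label for _, label in examples)
    value_counts = defaultdict(Counter)  # (feature index, label) -> value -> count
    values_seen = defaultdict(set)       # feature index -> set of observed values
    for features, label in examples:
        for i, value in enumerate(features):
            value_counts[(i, label)][value] += 1
            values_seen[i].add(value)
    return class_counts, value_counts, values_seen

def classify(model, features):
    """Return the label maximizing log P(label) + sum_i log P(feature_i | label)."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior
        for i, value in enumerate(features):
            # Add-one smoothing (an assumption of this sketch).
            numerator = value_counts[(i, label)][value] + 1
            denominator = count + len(values_seen[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Features: (contains "offer", contains "meeting"); labels are made up.
examples = [
    (("yes", "no"), "spam"), (("yes", "no"), "spam"), (("no", "no"), "spam"),
    (("no", "yes"), "ham"), (("no", "yes"), "ham"), (("yes", "yes"), "ham"),
]
model = train_naive_bayes(examples)
print(classify(model, ("yes", "no")), classify(model, ("no", "yes")))
```

Working in log space avoids numerical underflow when many attribute probabilities are multiplied together.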

    Support Vector Machines

    In the machine learning domain, NN approaches are considered a baseline technique for data-driven modeling. More advanced machine learning techniques include Support Vector Machines. They are largely independent of the dimensionality of the input feature space and allow the development of robust non-linear models with good generalization abilities. These methods were introduced within the framework of Statistical Learning Theory. According to the inductive learning principles of Statistical Learning Theory, the optimal predictive model for a given data modeling task is built by finding the trade-off between the model's complexity and its fit to the training data. This gives rise to the excellent generalization abilities of Support Vector Machines.

    Support vector machines (SVMs) are a set of related supervised learning methods used for classification and regression. They belong to a family of generalized linear classifiers. A special property of SVMs is that they simultaneously minimize the empirical classification error and maximize the geometric margin; hence they are also known as maximum margin classifiers. Support vector machines map input vectors to a higher dimensional space where a maximal separating hyperplane is constructed. Two parallel hyperplanes are constructed, one on each side of the hyperplane that separates the data. The separating hyperplane is the one that maximizes the distance between the two parallel hyperplanes. The assumption is that the larger the margin, or distance between these parallel hyperplanes, the lower the generalization error of the classifier will be. [16]

    Soft Margin Classifier: In real-world problems it is unlikely that a line will exactly separate the data, and the decision boundary might have to be curved. Even when a hyperplane does exactly separate the data, this may not be desirable if the data is noisy: it is better for a smooth boundary to ignore a few data points than to curve or loop around the outliers. This is handled by introducing slack variables s_k, which relax the constraints to

    $$y_k (w^\top x_k + b) \ge 1 - s_k, \qquad s_k \ge 0.$$

    This allows a point to be a small distance s_k on the wrong side of the hyperplane without violating the constraint. Unchecked, huge slack variables would allow any line to "separate" the data, so the Lagrangian includes a term that penalizes large slacks:

    $$\min L = \tfrac{1}{2} w^\top w - \sum_k \lambda_k \left( y_k (w^\top x_k + b) + s_k - 1 \right) + \alpha \sum_k s_k$$

    Here reducing α allows more data points to lie on the wrong side of the hyperplane and be treated as outliers, which gives a smoother decision boundary. [17]
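The effect of the penalty weight α can be illustrated numerically. The sketch below does not solve the optimization problem; it only evaluates, for a given hyperplane, the smallest slack each point needs, s_k = max(0, 1 - y_k (w'x_k + b)), and the resulting penalized objective ½ w'w + α ∑ s_k. The points, labels and hyperplane are made up for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def slacks(w, b, points, labels):
    """Smallest slack per point: s_k = max(0, 1 - y_k (w'x_k + b))."""
    return [max(0.0, 1.0 - y * (dot(w, x) + b)) for x, y in zip(points, labels)]

def soft_margin_objective(w, b, points, labels, alpha):
    """Margin term plus alpha-weighted slack penalty: 0.5 * w'w + alpha * sum(s_k)."""
    return 0.5 * dot(w, w) + alpha * sum(slacks(w, b, points, labels))

# Four points on a line; the last one lies on the wrong side of the hyperplane x1 = 0.
points = [(2.0, 0.0), (1.5, 0.0), (-2.0, 0.0), (0.5, 0.0)]
labels = [1, 1, -1, -1]
w, b = (1.0, 0.0), 0.0

print(slacks(w, b, points, labels))
print(soft_margin_objective(w, b, points, labels, alpha=1.0))
print(soft_margin_objective(w, b, points, labels, alpha=0.1))
```

With a smaller α the misplaced point costs less, so the optimizer has less incentive to bend the boundary around it.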

    Kernel Trick: Let's first define what a kernel is and what feature space means.

    Kernel: If the data is linearly separable, a separating hyperplane may be used to divide it. However, it is often the case that the data is far from linear and the classes are inseparable. To allow for this, kernels are used to map the input data non-linearly to a high-dimensional space in which the mapped data can become linearly separable. This mapping is defined by the kernel.

    Feature Space: Transforming the data into feature space makes it possible to define a similarity measure on the basis of the dot product. If the feature space is chosen suitably, pattern recognition can be easy.

    Returning to the kernel trick: once w and b are obtained, the problem is solved for the simple linear scenario in which the data is separated by a hyperplane. The kernel trick allows SVMs to form nonlinear boundaries. [a] The algorithm is expressed using only the inner products of the data vectors; this is also called the dual problem. [b] The original data are passed through nonlinear maps to form new data with additional dimensions, for example by adding pairwise products of some of the original dimensions to each data vector. [c] Rather than computing inner products on these new, larger vectors, storing them in tables and looking them up later, we can express the dot product of the nonlinearly mapped data as a function of the original data. This function is the kernel function. More on kernel functions is given below.

    Kernel Trick: Dual Problem: First we convert the optimization problem to its dual form, in which w is eliminated and the Lagrangian L_D is a function of the λ_i only. The mathematical derivation is omitted here to keep equations to a minimum; in outline, to solve the problem we maximize L_D with respect to the λ_i.

    Kernel Trick: Inner Product summarization: The dual problem involves only dot products of the data vectors. Computing the dot product of nonlinearly mapped data explicitly can be expensive; the kernel trick instead picks a suitable function that corresponds to the dot product of some nonlinear mapping. Some of the most commonly chosen kernel functions are given later in this tutorial. In practice a kernel is often chosen by trial and error on held-out data, and choosing the right kernel for the problem or application enhances the SVM's performance.

    Kernel Functions: The idea of the kernel function is to enable operations to be performed in the input space rather than in the potentially high-dimensional feature space, so the inner product does not need to be evaluated in the feature space. The function implicitly performs the mapping of the attributes from the input space to the feature space. The kernel function plays a critical role in an SVM and its performance. It is based on reproducing kernel Hilbert spaces: if K is a symmetric positive definite function that satisfies Mercer's conditions, then the kernel represents a legitimate inner product in feature space. A training set that is not linearly separable in the input space may be linearly separable in the feature space. This is called the "kernel trick". [18]
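A small numerical check may help here. For two-dimensional inputs, the degree-2 polynomial kernel K(x, y) = (x·y)² equals the ordinary dot product after the explicit map φ(x) = (x1², √2·x1·x2, x2²), so the kernel gives the feature-space inner product without ever constructing φ. The function names are chosen for this illustration only.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def feature_map(x):
    """Explicit degree-2 feature map for 2-D input: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly2_kernel(x, y):
    """Same inner product, computed directly in the 2-D input space."""
    return dot(x, y) ** 2

x, y = (1.0, 2.0), (3.0, 4.0)
explicit = dot(feature_map(x), feature_map(y))  # inner product in feature space
implicit = poly2_kernel(x, y)                   # kernel evaluated in input space
print(explicit, implicit)
```

Both values agree up to floating-point rounding; for higher degrees or dimensions the explicit map grows quickly while the kernel stays a single dot product and a power.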

    The different kernel functions are listed below:

    [1] Polynomial: A polynomial mapping is a popular method for non-linear modeling. Of the two common forms, without and with an added constant term, the second is usually preferable as it avoids problems with the Hessian becoming zero.

    [2] Gaussian Radial Basis Function: Radial basis functions are most commonly used with a Gaussian form.

    [3] Exponential Radial Basis Function: An exponential radial basis function produces a piecewise linear solution, which can be attractive when discontinuities are acceptable.

    There are many more, including Fourier, splines, B-splines, additive kernels and tensor products.
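The original formulas for these kernels are not reproduced above, so the sketch below uses commonly quoted forms as an assumption: (x·y + 1)^d for the polynomial kernel, exp(-‖x - y‖² / 2σ²) for the Gaussian RBF, and exp(-‖x - y‖ / 2σ²) for the exponential RBF. Parameter names such as `degree` and `sigma` are illustrative.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def polynomial_kernel(x, y, degree=2):
    """Assumed form: (x.y + 1)^degree."""
    return (dot(x, y) + 1.0) ** degree

def gaussian_rbf_kernel(x, y, sigma=1.0):
    """Assumed form: exp(-||x - y||^2 / (2 sigma^2))."""
    return math.exp(-math.dist(x, y) ** 2 / (2.0 * sigma ** 2))

def exponential_rbf_kernel(x, y, sigma=1.0):
    """Assumed form: exp(-||x - y|| / (2 sigma^2))."""
    return math.exp(-math.dist(x, y) / (2.0 * sigma ** 2))

a, b = (0.0, 0.0), (3.0, 4.0)  # Euclidean distance 5
print(polynomial_kernel((1.0, 2.0), (3.0, 4.0)))
print(gaussian_rbf_kernel(a, b), exponential_rbf_kernel(a, b))
```

Both radial basis kernels equal 1 for identical inputs and decay with distance; the exponential form decays more slowly because it uses the distance rather than its square.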

    SVMs are powerful enough to fit a wide range of training data and often generalize well. The complexity of the kernel affects performance on new datasets [8]. SVMs expose parameters for controlling this complexity, but they do not tell us how to set them; these parameters should be determined by cross-validation on the given datasets.

    Classification with SVMs is an example of supervised learning. Known labels indicate whether the system is performing correctly. This information points to a desired response, and can be used to validate the accuracy of the system or to help the system learn to act correctly. One step in SVM classification involves identifying the features that are intimately connected to the known classes. This is called feature selection or feature extraction. Feature selection and SVM classification together are useful even when prediction of unknown samples is not necessary: they can be used to identify the key feature sets involved in whatever processes distinguish the classes.

    SVMs can also be applied to regression problems by introducing an alternative loss function, which must be modified to include a distance measure. The regression can be linear or non-linear. Linear models mainly use the following loss functions: ε-insensitive, quadratic and Huber. As with classification problems, a non-linear model is usually required to model the data adequately. In the same manner as the non-linear SVC approach, a non-linear mapping can be used to map the data into a high-dimensional feature space where linear regression is performed. The kernel approach is again employed to address the curse of dimensionality. In the regression method there are considerations based on prior knowledge of the problem and the distribution of the noise. In the absence of such information, Huber's robust loss function has been shown to be a good alternative. [19]
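The three regression loss functions named above can be written down directly. The forms below are the standard textbook ones and are offered as a sketch; the parameter names `eps` and `delta` are assumptions for this example.

```python
def epsilon_insensitive_loss(residual, eps):
    """Zero inside a tube of half-width eps, linear outside it."""
    return max(0.0, abs(residual) - eps)

def quadratic_loss(residual):
    """Ordinary squared error."""
    return residual ** 2

def huber_loss(residual, delta):
    """Quadratic for small residuals, linear for large ones (robust to outliers)."""
    r = abs(residual)
    if r <= delta:
        return 0.5 * r * r
    return delta * (r - 0.5 * delta)

for residual in (0.05, 0.5, 3.0):
    print(residual,
          epsilon_insensitive_loss(residual, eps=0.1),
          quadratic_loss(residual),
          huber_loss(residual, delta=1.0))
```

For the large residual the quadratic loss grows much faster than the other two, which is why the ε-insensitive and Huber losses are less sensitive to outliers.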

    SVMs have been found to be successful for pattern classification problems. Applying the support vector approach to a particular practical problem involves resolving a number of questions based on the problem definition and the design involved. One of the major challenges is choosing an appropriate kernel for the given application [4]. There are standard choices, such as a Gaussian or polynomial kernel, that are the default options, but if these prove ineffective or if the inputs are discrete structures, more elaborate kernels will be needed. By implicitly defining a feature space, the kernel provides the description language used by the machine for viewing the data. Once the kernel and the optimization criterion have been chosen, the key components of the system are in place.

    Let's look at some examples. The task of text categorization is the classification of natural text documents into a fixed number of predefined categories based on their content. Since a document can be assigned to more than one category, this is not a multi-class classification problem, but it can be viewed as a series of binary classification problems, one for each category. One of the standard representations of text for the purposes of information retrieval provides an ideal feature mapping for constructing a Mercer kernel. Indeed, kernels incorporate a similarity measure between instances, and it is reasonable to assume that experts working in the specific application domain have already identified valid similarity measures, particularly in areas such as information retrieval and generative models. Traditional classification approaches perform poorly when working directly on such data because of its high dimensionality, but Support Vector Machines can avoid the pitfalls of very high-dimensional representations.

    A very similar approach to the techniques described for text categorization can also be used for image classification, and as in that case linear hard margin machines are frequently able to generalize well. The first real-world task on which Support Vector Machines were tested was the problem of handwritten character recognition. Multi-class SVMs have also been tested on these data. It is interesting not only to compare SVMs with other classifiers, but also to compare different SVMs amongst themselves. They turn out to have approximately the same performance, and furthermore to share most of their support vectors, independently of the chosen kernel. The fact that SVMs can perform as well as other systems without including any detailed prior knowledge is certainly remarkable.

    The major strengths of SVMs are that training is relatively easy and that, unlike in neural networks, there are no local optima. They scale relatively well to high-dimensional data, and the trade-off between classifier complexity and error can be controlled explicitly. The main weakness is the need for a good kernel function. Support Vector Machines are an attractive approach to data modeling: they combine generalization control with a technique to address the curse of dimensionality. The kernel mapping provides a common base for most of the commonly employed model architectures, enabling comparisons to be performed. [20] [21]

    Instance-based Learning

    Instance-based learning methods differ from other approaches to function approximation because they delay processing of training examples until they must label a new query instance. They need not form an explicit hypothesis of the entire target function over the entire instance space; instead, they may form a different local approximation to the target function for each query instance. Advantages of instance-based methods include the ability to model complex target functions by a collection of less complex local approximations, and the fact that information present in the training instances is never lost. Instance-based learning algorithms simply store some or all of the training examples and postpone any generalization effort until a new instance must be classified. They can thus build query-specific local models, which attempt to fit the training examples only in a region around the query point.

    The k-nearest neighbor algorithm (k-NN) is a method for classifying objects based on the closest training examples in the feature space. k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. It can also be used for regression. [22] It is an instance-based algorithm for approximating real-valued or discrete-valued target functions: the target function value for a new query is estimated from the known values of the k nearest training examples.

    Metric distance minimization (MDM) is perhaps the simplest instance-based learning method. In MDM, when a query is presented, we find the training example that is closest to the query point, in terms of the Euclidean distance or some other suitable metric, and provide the parameters of that example as the predicted output parameters.

    Locally weighted regression (LWR) methods are a generalization of k-nearest neighbor in which an explicit local approximation to the target function is constructed for each query instance. LWR uses examples that are close to the query point, weighted according to their distance, and builds a model in the vicinity of that point. This local approximation can be a linear function, a quadratic function or even a multilayer neural net; a local linear model around the query point is a common choice. [23][24]
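A minimal sketch of k-nearest neighbor for both classification (majority vote) and regression (mean of neighbor values) follows. It assumes Euclidean distance and unweighted neighbors; the datasets and function names are made up for illustration.

```python
import math
from collections import Counter

def k_nearest(train, query, k):
    """Return the k training examples (point, target) closest to the query."""
    return sorted(train, key=lambda example: math.dist(example[0], query))[:k]

def knn_classify(train, query, k=3):
    """Majority vote among the labels of the k nearest examples."""
    votes = Counter(label for _, label in k_nearest(train, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train, query, k=3):
    """Mean of the target values of the k nearest examples."""
    neighbors = k_nearest(train, query, k)
    return sum(value for _, value in neighbors) / len(neighbors)

train_cls = [((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
             ((6, 6), "b"), ((7, 6), "b"), ((6, 7), "b")]
train_reg = [((0.0,), 0.0), ((1.0,), 1.0), ((2.0,), 4.0), ((3.0,), 9.0)]

print(knn_classify(train_cls, (1.5, 1.5)), knn_classify(train_cls, (6.5, 6.5)))
print(knn_regress(train_reg, (1.1,), k=2))
```

Note that no work happens at "training" time: all computation is deferred to the query, which is what makes the method lazy. Setting k=1 gives the nearest-example lookup described under MDM.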

    Ensemble Methods

    Ensemble methods are methods that, rather than finding one best hypothesis for the learning problem, construct a set of hypotheses (called a "committee" or "ensemble") that can, in some fashion, "vote" to predict the class of a new data example. Ensemble learning algorithms work by running a base learning algorithm multiple times, generating a hypothesis at each iteration. [25] The hypotheses can be generated in two ways: 1) Independently, resulting in a set of diverse hypotheses. This can be done by running the algorithm several times and providing different training data, or different subsets of the input features, at each iteration. An example is Bagging [26]. 2) Additively, adding one hypothesis at a time to the ensemble, each constructed to minimize the classification error on a weighted training data set. An example is Boosting. [27]

    Bagging: Given training data with m examples, Bagging works by generating at each iteration a new training data set of m examples, sampled uniformly with replacement from the original examples, which means some original examples appear multiple times and others do not appear at all in the resample. If the learning algorithm is unstable (small changes in the training data cause large changes in the hypothesis, as with decision trees), then Bagging will produce a set of diverse hypotheses (H). Given H, a new example is classified by a vote among all the hypotheses h_i belonging to H, and the final classification is the one with the most votes.
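The resampling and voting steps can be sketched as follows. The base learner here is a deliberately simple, hypothetical one-dimensional threshold rule (`learn_stump`), standing in for whatever unstable learner would be used in practice; the data and the fixed random seed are assumptions for reproducibility.

```python
import random
from collections import Counter

def bootstrap_sample(examples, rng):
    """Draw len(examples) items uniformly with replacement."""
    return [rng.choice(examples) for _ in examples]

def learn_stump(sample):
    """Hypothetical base learner: threshold halfway between the two class means."""
    pos = [x for x, y in sample if y == 1]
    neg = [x for x, y in sample if y == 0]
    if not pos or not neg:  # resample happened to contain a single class
        only = 1 if pos else 0
        return lambda x: only
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    threshold = (mean_pos + mean_neg) / 2
    if mean_pos >= mean_neg:
        return lambda x: 1 if x > threshold else 0
    return lambda x: 0 if x > threshold else 1

def bagging(examples, learn, rounds, seed=0):
    """Train one hypothesis per bootstrap resample."""
    rng = random.Random(seed)
    return [learn(bootstrap_sample(examples, rng)) for _ in range(rounds)]

def vote(hypotheses, x):
    """Final classification is the most voted label."""
    return Counter(h(x) for h in hypotheses).most_common(1)[0][0]

examples = [(1.0, 0), (2.0, 0), (3.0, 0), (4.0, 0),
            (6.0, 1), (7.0, 1), (8.0, 1), (9.0, 1)]
ensemble = bagging(examples, learn_stump, rounds=11)
print(vote(ensemble, 1.5), vote(ensemble, 8.5))
```

Using an odd number of rounds avoids ties in a two-class vote.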

    Boosting: Boosting assigns a weight to each training sample and then, at each iteration, generates a model that tries to minimize the sum of the weights of the misclassified examples. The errors at each iteration are used to update the weights of the training samples, increasing the weights of misclassified instances and decreasing the weights of correctly classified ones. [28]
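One concrete way to perform this reweighting is the AdaBoost update rule, used here as an assumption since the text does not name a specific boosting algorithm. The sketch shows a single round: compute the weighted error, derive the hypothesis weight, rescale the sample weights and renormalize. The function name `reweight` is illustrative.

```python
import math

def reweight(weights, correct):
    """One AdaBoost-style round. `correct[i]` says whether sample i was classified correctly.

    Returns (alpha, new_weights); assumes the weighted error is strictly between 0 and 1.
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    alpha = 0.5 * math.log((1.0 - error) / error)  # weight of this round's hypothesis
    updated = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

weights = [0.25, 0.25, 0.25, 0.25]
correct = [True, True, True, False]
alpha, new_weights = reweight(weights, correct)
print(alpha, new_weights)
```

After the update, the misclassified sample carries half of the total weight, so the next hypothesis is pushed to get it right.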

    The reasons we need ensembles [29]: The first is that the training data may not provide sufficient information for choosing a single best classifier from H. Most of our learning algorithms consider very large hypothesis spaces, so even after eliminating hypotheses that misclassify training examples, there are many hypotheses remaining. A second reason is that our learning algorithms may not be able to solve the difficult search problems that we pose. A third reason is that our hypothesis space H may not contain the true function f.

    Genetic Algorithms

    GAs conduct a randomized, parallel, hill-climbing search for hypotheses that optimize a predefined fitness function. A genetic algorithm (GA) is a search technique used in computing to find exact or approximate solutions to optimization and search problems. Genetic algorithms are categorized as global search heuristics. They are a particular class of evolutionary algorithms (also known as evolutionary computation) that use techniques inspired by evolutionary biology, such as inheritance, mutation, selection, and crossover (also called recombination). [30] Applying a genetic algorithm involves the following steps: start with an initial population (e.g. random) of candidate solutions; repeatedly apply a number of genetic operators to generate a new population; and take the best individual of the last generation (population) as the solution. The operators that a genetic algorithm uses are:

    Reproduction: Select individuals with higher fitness than others to reproduce so that their children are found in the next generation. Unfit individuals die with higher probability than fitter ones.

    Crossover: Combine two selected individuals so that their children, which carry material from both parents, appear in the next generation.

    Mutation: Probabilistic change of part of an individual.

    Genetic algorithms are simple to implement, but their behavior is difficult to understand. In particular, it is difficult to understand why they are often successful in generating solutions of high fitness. [31] GAs have most commonly been applied to optimization problems, but in machine learning they are applied to tasks in which hypotheses are complex (for example, a set of rules for robot control) and in which the objective to be optimized may be an indirect function of the hypothesis.
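The three operators can be put together in a few lines. The sketch below evolves bit strings to maximize the number of ones (a toy fitness function chosen for illustration). Tournament selection, single-point crossover, and carrying the best individual over unchanged (elitism) are assumptions of this example, not requirements of genetic algorithms in general.

```python
import random

def fitness(bits):
    """Toy fitness: number of ones in the bit string."""
    return sum(bits)

def select(population, rng):
    """Reproduction: the fitter of two randomly drawn individuals is chosen."""
    a, b = rng.choice(population), rng.choice(population)
    return a if fitness(a) >= fitness(b) else b

def crossover(parent1, parent2, rng):
    """Single-point crossover: head of one parent, tail of the other."""
    point = rng.randrange(1, len(parent1))
    return parent1[:point] + parent2[point:]

def mutate(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - b if rng.random() < rate else b for b in bits]

def evolve(length=20, pop_size=30, generations=40, rate=0.02, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    initial_best = max(fitness(individual) for individual in population)
    for _ in range(generations):
        elite = max(population, key=fitness)  # elitism: best individual survives as is
        children = [
            mutate(crossover(select(population, rng), select(population, rng), rng), rate, rng)
            for _ in range(pop_size - 1)
        ]
        population = [elite] + children
    return initial_best, max(population, key=fitness)

initial_best, best = evolve()
print(initial_best, fitness(best), best)
```

Because the best individual is always kept, the best fitness in the population can never decrease from one generation to the next.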

    Graph-based Learning

    Graph-based relational learning (GBRL) is the task of finding novel, useful, and understandable graph-theoretic patterns in a graph representation of data. [32] Graph-based data representation is becoming increasingly commonplace, as graphs can represent some kinds of data more efficiently than relational tables. Interesting patterns in the form of subgraphs can therefore be discovered by mining these graph-based datasets. Because the learned patterns can be used to predict future occurrences, it is necessary to learn graphical concepts that can optimally classify the data in the presence of uncertainty. [33]

    Many machine learning algorithms build on graphs: clustering algorithms (e.g. spectral clustering); dimensionality reduction algorithms based on manifolds (LLE, Isomap); semi-supervised learning algorithms (e.g. label propagation); ranking algorithms; graph representations for ontology learning and word sense disambiguation; graph algorithms for information retrieval, text mining and understanding; graph matching for information extraction; random walk graph methods and spectral graph clustering; graph labeling and edge labeling for semantic representations; encoding semantic distances in graphs; ranking algorithms based on graphs; small world graphs in natural language; semi-supervised graph-based methods; and statistical network analysis and methods for NLP.

    Reinforcement Learning

    Reinforcement learning (RL) is learning by interacting with an environment. An RL agent learns from the consequences of its actions, rather than from being explicitly taught, and it selects its actions on the basis of its past experiences (exploitation) and also by trying new choices (exploration), which is essentially trial-and-error learning. The reinforcement signal that the RL agent receives is a numerical reward, which encodes the success of an action's outcome, and the agent seeks to learn to select actions that maximize the accumulated reward over time. [34] Using function approximation, RL can be applied to much larger state spaces than classical sequential optimization techniques such as dynamic programming. In addition, using simulations (sampling), RL can be applied to systems that are too large or complicated for the next-state transition probabilities to be enumerated explicitly. The environment is typically formulated as a finite-state Markov decision process (MDP), and reinforcement learning algorithms for this context are closely related to dynamic programming techniques. State transition probabilities and reward probabilities in the MDP are typically stochastic but stationary over the course of the problem. [35]

    Q-learning is a reinforcement learning technique that works by learning an action-value function that gives the expected utility of taking a given action in a given state and following a fixed policy thereafter. A strength of Q-learning is that it is able to compare the expected utility of the available actions without requiring a model of the environment. A recent variation called delayed Q-learning has shown substantial improvements, bringing PAC bounds to Markov decision processes. Reinforcement learning applications have ranged from robotics to industrial manufacturing to combinatorial search problems such as computer game playing. [36][37]
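
    A minimal sketch of tabular Q-learning with epsilon-greedy exploration is given below. The environment (a five-state corridor with a reward of 1 at the right-hand end), the learning rate, the discount factor, and the exploration rate are assumptions invented for this illustration and do not come from the cited sources.

```python
import random

N_STATES = 5          # states 0..4 on a line; state 4 is the terminal goal
ACTIONS = (0, 1)      # 0 = move left, 1 = move right

def step(state, action):
    """Deterministic toy environment: reward 1.0 only on reaching the goal."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # q[state][action]
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            best = max(q[state])
            greedy = [a for a in ACTIONS if q[state][a] == best]
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)    # exploration
            else:
                action = rng.choice(greedy)     # exploitation, ties broken at random
            next_state, reward = step(state, action)
            # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q

q = q_learning()
policy = [row.index(max(row)) for row in q[:-1]]   # greedy action per non-terminal state
print(policy)
```

    The agent never sees the transition function `step` as a model; it only observes the next state and reward after each action, which is the model-free property described above. In this toy corridor the learned greedy policy moves right in every non-terminal state.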

    Pattern Recognition

    Pattern recognition [38] aims to classify data (patterns) based either on a priori knowledge or on statistical information extracted from the patterns. The patterns to be classified are usually groups of measurements or observations, defining points in an appropriate multidimensional space. A complete pattern recognition system consists of a sensor that gathers the observations to be classified or described; a feature extraction mechanism that computes numeric or symbolic information from the observations; and a classification or description scheme that does the actual job of classifying or describing observations, relying on the extracted features. Typical applications are automatic speech recognition, classification of text into several categories (e.g. spam/non-spam email messages), the automatic recognition of handwritten postal codes on postal envelopes, and the automatic recognition of images of human faces. The last two examples belong to image analysis, the subtopic of pattern recognition that deals with digital images as input to pattern recognition systems. [39] Pattern recognition is a very active field of research intimately bound to machine learning. Also known as classification or statistical classification, pattern recognition aims at building a classifier that can determine the class of an input pattern. Building the classifier, a procedure known as training, corresponds to learning an unknown decision function based only on a set of input-output pairs (x_i, y_i) that form the training data (or training set). Nonetheless, in real-world applications such as character recognition, a certain amount of information on the problem is usually known beforehand. [40] [41]
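
    The pipeline described above (observations, feature extraction, and a classifier trained on input-output pairs) can be illustrated with a small Python sketch. The feature extractor, the nearest-centroid decision rule, the class names, and the toy measurements are all hypothetical choices made for this example; real systems use far richer features and classifiers.

```python
import math

def extract_features(observation):
    """Hypothetical feature extractor: map raw measurements to (mean, range)."""
    return (sum(observation) / len(observation),
            max(observation) - min(observation))

def train(training_set):
    """Training: learn one centroid per class from (x_i, y_i) pairs."""
    sums, counts = {}, {}
    for x, y in training_set:
        features = extract_features(x)
        acc = sums.setdefault(y, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: tuple(v / counts[y] for v in acc) for y, acc in sums.items()}

def classify(centroids, observation):
    """Decision function: the class whose centroid is nearest in feature space."""
    features = extract_features(observation)
    return min(centroids, key=lambda y: math.dist(features, centroids[y]))

training_set = [
    ([1, 2, 1], "low"),
    ([2, 1, 2], "low"),
    ([8, 9, 10], "high"),
    ([9, 10, 9], "high"),
]
centroids = train(training_set)
print(classify(centroids, [2, 2, 3]), classify(centroids, [10, 8, 9]))
```

    Here `train` plays the role of learning the decision function from the training set, and `classify` applies it to a new, unseen pattern.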

    Applications

    Introduction

    Machine learning (ML) has proven to have a significant impact on both industry and research. There are numerous successful applications of machine learning, but here we introduce only a few selected applications.

    Financial Applications

    Machine learning techniques have produced some of the financial industry's most successful trading strategies during the past 20 years. With markets, trade execution, and financial decision making becoming more automated and competitive, practitioners increasingly recognize the need for ML. Learning techniques include reinforcement learning, optimization methods, recurrent and state space models, on-line algorithms, evolutionary computing, kernel methods, Bayesian estimation, wavelets, neural nets, SVMs, boosting, and multi-agent simulation. Financial domains where machine learning applies include high frequency data, trading strategies, execution models, forecasting, volatility, extreme events, credit risk, portfolio management, yield curve estimation, option pricing, and so forth. [1]

    Credit Risk Analysis: Machine learning can be used by investors to make investment decisions, for example by building probabilistic models that an investor can use to measure risk [2]. Learning techniques can also be used to learn patterns and apply them to identify credit fraud.

    Loan Application Screening: In the 1980s, American Express (UK) used statistical methods [3] to divide loan applications into three categories: those that should definitely be accepted, those that should definitely be rejected, and those that required a human expert to judge. The human experts could correctly predict whether an applicant would default on the loan in only about 50% of the cases. When ML produced rules that were much more accurate, correctly predicting default in 70% of the cases, the ML-based system was immediately put into use. [4][5]

    Weather forecasting

    In recent years, many solutions for intelligent weather forecasting have been proposed, especially for temperature and rainfall prediction. These solutions include techniques such as neural networks, SVMs, regression, and time series analysis. The results obtained confirm that the proposed solutions have the potential for successful application to the problem of temperature and rainfall estimation, and that the relationships between the factors that contribute to certain weather conditions can be estimated to a certain extent. There are also extended applications of weather prediction, such as avalanche danger prediction. [6] [7]

    Speech recognition

    In recent decades, large collections of spoken and written language data have opened new opportunities for the speech and language communities, but they have also posed a challenge, as the task has shifted from the analysis of a few examples collected in laboratory conditions to the study of how language is used in the real world. Machine learning methods provide tools for coping with the complexity of these data collections. They can be used to develop models that perform reasonably well in speech recognition and synthesis tasks, despite our incomplete understanding of the human speech perception and production mechanisms. Machine learning can also be used as a complement to standard statistics to extract knowledge from multivariate data collections, where the number of variables, the size (number of data points), and the quality of the data (missing data, inaccurate transcriptions) would make standard analysis methods ineffective. Finally, these methods can be used to model and simulate the processes that take place in the human brain during speech perception and production. [8][9]

    Natural Language Processing

    Natural-language-generation systems convert information from computer databases into normal-sounding human language. Natural-language-understanding systems convert samples of human language into more formal representations that are easier for computer programs to manipulate. [10] Applications of machine learning to language processing include document classification, document segmentation, tagging, entity extraction, problems involving parsing, and the induction of representations of linguistic objects. General techniques include probabilistic parsing, reinforcement learning in dialog systems, neural networks, dimensionality reduction methods, non-negative factorizations, finite-state techniques, Bayesian methods, SVMs, and so forth. [11] [12] [13]

    Smart environments

    A smart environment is a technological concept that, according to Mark Weiser, is "a physical world that is richly and invisibly interwoven with sensors, actuators, displays, and computational elements, embedded seamlessly in the everyday objects of our lives, and connected through a continuous network". [14] One major feature of smart environments is their predictive and decision-making capability, which is a direct application of machine learning. In order to meet environment goals such as maximizing comfort, minimizing cost, and adapting to inhabitants, a smart environment must rely upon tools from artificial intelligence such as prediction. First, models of various devices can be learned from observation and used to predict their behavior in the future. Second, predicting an inhabitant's next action may be needed for the environment to automate selected repetitive tasks for the inhabitant, to detect anomalies that could indicate security or health concerns, and to identify ways of improving control of the environment. The results of a prediction algorithm may ultimately be input to a decision-making algorithm that selects actions for the house to execute. [15] [16] [17]

    Games

    Computer games have evolved from the simple graphics and gameplay of early titles like Spacewar! to a wide range of more visually advanced titles. At the same time, gameplay has evolved through the use of AI and machine learning techniques. PC games are usually built around a central piece of software, known as a game engine, [20] that simplifies the development process and enables developers to easily port their projects between platforms. [18] Machine learning techniques in games involve learning by observation, learning by instruction, and learning by experience. [19]

    Robotics

    Robotics is the science and technology of robots, their design, manufacture, and application. Robotics requires a working knowledge of electronics, mechanics, and software, and is usually accompanied by a broad working knowledge of many other subjects. [20] The combination of robotics and machine learning has evolved to cover more than skills involving reaching, grasping, and manipulation. Robot learning in realistic environments requires novel algorithms for learning to identify important events in the stream of sensory inputs, and to temporarily memorize them in adaptive, dynamic, internal states until the memories can help to compute proper control actions. [21] It also focuses on problems involving motion, searching, goal identification, and so forth. [22] The current goal is the dream of robots that work alongside humans in natural environments. [23] [24]

    Oil Industry

    Crude oil [25] is often mixed with natural gas when it is extracted from the ground, and the two must be separated prior to refining. Finding the ideal settings to control the separation process is a complex task. British Petroleum used ML techniques to create a set of rules for setting the control parameters and had good success. [26] Machine learning can also be applied for automatic recognition of risk patterns in oil installations [27].

    Electric load prediction

    The field of Short Term Load Forecasting (STLF) has indicated the potential for developing applications leveraging machine learning techniques which are capable of predicting the hourly power load demand [28]. Some work has been done in the area of predicting electricity distribution feeder failures. [29]

    Chemical Process Control

    Westinghouse's process for manufacturing nuclear fuel pellets is governed by numerous control parameters that interact in complex ways, in a complex controlled environment that is constantly observed [30]. If the parameters are set incorrectly, the process's throughput and yield will be low. ML was used to create a set of rules for controlling the manufacturing process. Its use in 1984 benefited Westinghouse by more than ten million dollars per year. This is another successful application of machine learning. Other applications include chemical process diagnosis [31], discovery of chemical transformations [32], and so forth.

    Medicine and Biology

    Continuous advances in computational intelligence technology have enabled researchers to collect and effectively analyze large amounts of complex clinical and biological data. In recent years, research in the interdisciplinary area of computer-assisted medical decision making has dramatically intensified. The overall objective is to provide physicians with computer tools that can assist them with their clinical decisions and help discover new knowledge that was previously unidentified or is of interest to the biological community. Some work focuses on theoretical aspects of machine learning techniques for data analysis, and some on practical applications of machine learning techniques in medicine and biology. [33] [34]

    Gene Analysis and Application: Recent years have witnessed a rapidly accelerating accumulation of genetic sequence information, including the sequencing of whole genomes. The first step in the analysis of this data is to identify the thousands of genes within each new genome, a task to which machine learning can be applied. [35] [36] [37]

    Predicting Drug Activity: In order to design new drugs with a certain desired biological activity, or to understand the mechanisms underlying the activity (or non-activity, e.g., non-toxicity) of known drugs, it is necessary to discover the relationships between chemical structure and the activity of interest. Relationships discovered from experimental data are called SARs (Structure Activity Relationships). Because of the complex 3-D shapes involved, manual SAR analysis is infeasible except in the simplest cases. A particular type of ML, inductive logic programming (ILP), has proven particularly useful in discovering SARs because it directly reasons about the 2-D or 3-D structure of the drugs in addition to their physico-chemical properties. [www.doc.ic.ac.uk/~shm/Papers/qsarilp.pdf] [38]

    Other applications include the use of supervised and unsupervised learning, which have been successful in discovering lead indicators of carcinogenic activity. [39]

    Web Mining and Learning

    One of the most successful applications is web search and information retrieval. The evolution of web search has spawned a new era of machine learning. Many techniques have been developed for text and web page processing, such as the PageRank algorithm, crawler techniques (e.g. preferential crawlers), and so forth. Many spam classification techniques are also a direct result of applying machine learning. [40]

    Others

    Machine learning can be successfully applied to many fields including science, technology, medicine, art and so forth. These applications include Network intrusion detection [41], Grid Computing [42], Computer graphics and animation [43], and more.

    Software

  • weka 3.0.3 -- Waikato machine learning workbench
  • xgobi -- data visualization and multidimensional scaling http://www.research.att.com/areas/stat/xgobi
  • KNIME -- Data Mining and Machine Learning workbench http://www.knime.org/
  • SNNS -- Stuttgart neural network simulator http://www.nada.kth.se/~orre/snns-manual/
  • Subdue -- Graph Based Knowledge Discovery http://ailab.wsu.edu/subdue/
  • OpenCyc -- general knowledge base and commonsense reasoning engine http://www.opencyc.org/
  • Bayes Net Toolbox for Matlab http://bnt.sourceforge.net/
  • Tree Visualizer -- MLC++ decision trees can be viewed using SGI's MineSet(TM) Tree Visualizer http://www.sgi.com/tech/mlc/trees.html
  • Spider -- General Purpose Machine Learning Toolbox in Matlab http://www.kyb.tuebingen.mpg.de/bs/people/spider/index.html
  • RapidMiner (formerly YALE) -- knowledge discovery and data mining http://rapid-i.com/
  • autoclass 3.3.3 -- MDL-based clustering http://ic.arc.nasa.gov/ic/projects/bayes-group/autoclass
  • C4.5 release 8 -- Quinlan's decision tree builder http://www.rulequest.com/
  • CMU SLM -- CMU-Cambridge Statistical Language Modeling Toolkit v2 http://svr-www.eng.cam.ac.uk/~prc14/toolkit.html
  • datgen 3.1.1 -- rule-based dataset synthesis http://www.datasetgenerator.com
  • delve 1.1p3 -- framework for running machine learning experiments http://www.cs.toronto.edu/~delve
  • gsearch 2.06 -- corpus search toolkit http://www.hcrc.ed.ac.uk/gsearch
  • rainbow -- text classification tool http://www.cs.cmu.edu/~mccallum/bow
  • TIMBL -- Tilburg Memory Based Learner http://ilk.uvt.nl/software.html
  • SAM -- Sequence Alignment and Modeling System using HMM (Hidden Markov Model) http://bioweb.pasteur.fr/seqanal/motif/sam-uk.html
  • LNKnet -- Pattern Classification Software http://www.ll.mit.edu/IST/lnknet/index.html
  • Lemga -- Learning Models and Generic Algorithms http://www.work.caltech.edu/ling/lemga/


    For more software, see KDnuggets: Software [1].

    Tutorials

    This section provides links to various tutorials on machine learning and its sub-fields, which are available online.

  • http://www.autonlab.org/tutorials/list.html
  • http://mlg.eng.cam.ac.uk/tutorials/07/
  • http://mlg.eng.cam.ac.uk/tutorials/06/
  • Semi-Supervised Learning: http://pages.cs.wisc.edu/~jerryzhu/icml07tutorial.html
  • Bayesian Methods for Reinforcement Learning: http://www.cs.uwaterloo.ca/~ppoupart/ICML-07-tutorial-Bayes-RL.html
  • Group Theoretical Methods in Machine Learning: http://www1.cs.columbia.edu/~risi/ICMLtutorial/ICML%20tutorial/Group%20theoretical%20methods%20in%20machine%20learning.html
  • Online Learning for Real World Problems: http://www.cis.upenn.edu/~crammer/icml-tutorial-index.html
  • Relational Data Community Generation: http://www.fortune.binghamton.edu/icml2007tutorial.html
  • Machine Learning for Computer Graphics: http://www.dgp.toronto.edu/~hertzman/mlcg2003/
  • http://web.engr.oregonstate.edu/~tgd/projects/tutorials.html
  • ROC Tutorial http://www.cs.bris.ac.uk/~flach/ICML04tutorial/index.html
  • Matthias Seeger. Gaussian processes for machine learning[3]. International Journal of Neural Systems, 14(2):1-38, 2004.
  • P.H. Chen, C.-J. Lin, and B. Schölkopf. A tutorial on nu-support vector machines [4]. 2003.
  • N. Cristianini. ICML'01 tutorial[5], 2001.
  • K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf. An introduction to kernel-based learning algorithms. IEEE Neural Networks, 12(2):181-201, May 2001. (PDF)
  • B. Schölkopf. SVM and kernel methods [6], 2001. Tutorial given at the NIPS Conference.
  • B. Schölkopf. A short tutorial on kernels[7], 2000. Tutorial given at the NIPS'00 Kernel Workshop.
  • B. Schölkopf. Support vector learning[8], 1999. Tutorial given at DAGM'99.
  • C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Knowledge Discovery and Data Mining [9], 2(2), 1998.
  • A. J. Smola and B. Schölkopf. A Tutorial on Support Vector Regression [10]. NeuroCOLT Technical Report NC-TR-98-030, Royal Holloway College, University of London, UK, 1998.
  • Introduction tutorial on Decision tree.[11]
  • Decision Trees [12]
  • Tutorials on Bayesian learning

  • A Tutorial on Learning With Bayesian Networks [13].
  • What is Bayesian Learning? [14]
  • Bayesian Methods for Machine Learning [15]
  • Bayesian Networks [16]
  • Inference in Bayesian Networks [17]
  • Learning Bayesian Networks [18]
  • A Short Intro to Naive Bayesian Classifiers [19]
  • Bayesian Methods for Reinforcement Learning [20]
  • Tutorials on K-means and Hierarchical Clustering :

  • K-means and Hierarchical Clustering [21]
  • Tutorials on Markov Models

  • Hidden Markov Models [22]
  • Markov Decision Processes [23]
  • A tutorial on hidden Markov models [24]
  • Hidden Markov models [25]
  • Tutorials on Genetic Algorithm

  • A Genetic Algorithm Tutorial [26]
  • Genetic Algorithm [27]
  • Genetic Algorithm Tutorial [28]
  • Illinois Genetic Algorithms Laboratory [29]
  • An overview of genetic algorithms: Part 1, fundamentals [30]
  • The Genetic Programming Tutorial Notebook [31]
  • Tutorials on Neural Networks

  • Neural Networks[32]
  • What is a Neural Net? A Brief Introduction [33]
  • Neural Network FAQ [34].
  • Neural network Course [35]

    References

    1. Mitchell, T. (1997). Machine Learning, McGraw Hill. ISBN 0-07-042807-7.
    2. Witten, I. H. and Frank, E. Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann. ISBN 0-12-088407-0.
    3. Langley, P. (1995). Elements of Machine Learning, Morgan Kaufmann Series in Machine Learning.
    4. http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=10341
    5. http://robotics.stanford.edu/~ronnyk/glossary.html
    6. http://www.answers.com/topic/machine-learning
    7. http://en.wikipedia.org/wiki/Machine_learning
    8. http://mlearn.ics.uci.edu/Machine-Learning.html
    9. Asuncion, A & Newman, D.J. (2007). UCI Machine Learning Repository [1]. Irvine, CA: University of California, Department of Information and Computer Science.
    10. Video Lectures on Machine Learning and Data Mining : www.videolectures.net. [2]
    11. Video of Tom Mitchell's March 1, 2007 seminar talk at the Carnegie Mellon University School of Computer Science's Machine Learning Department. [3]
    12. Brief introduction to machine learning and some good links. [4]
    13. Machine Learning and Statistics. [5]
    14. Support Vector Machines. [6]
    15. Wikipedia: Machine learning. [7]
    16. Wikipedia: Category:Machine learning. [8]
    17. Google Directory on Machine Learning. [9]
    18. Artificial Intelligence Portal[10]
    19. Machine Learning Resources [11]
    20. AI on the web [12]