Machine Learning

Machine learning is the science of getting computers to act without being explicitly programmed. It is an important way to solve data mining problems. This technology has enabled self-driving cars, better web search, and a much improved understanding of the human genome.[1]
Machine learning evolved from the fields of computer science, statistics, engineering, and mathematics, and applying it effectively to problem-solving requires this combination of skills. Effective problem-solving calls for computer programs that learn from data and improve with experience. [2]
Machine learning, a branch of artificial intelligence, is a scientific discipline concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data, such as sensor data or databases. A learner can take advantage of data to capture characteristics of interest of the unknown underlying probability distribution. A major focus of machine learning research is to automatically learn to recognize complex patterns and make intelligent decisions based on data. [3]

Supervised learning. In this approach the input data, also referred to as training data, has a known result, e.g. whether an email is spam or not. A model undergoes training in which it makes predictions and is corrected when they are wrong. This process is repeated until the level of accuracy is acceptable. Classification and regression are problems that can be solved this way. Logistic regression and neural networks are examples of supervised learning algorithms. [4]
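The predict-and-correct loop described above can be sketched in a few lines of plain Python. This is only an illustrative sketch of logistic regression trained by gradient descent; the data set (a hypothetical, zero-centered "spam score" per email), the function names, the learning rate, and the number of epochs are all assumptions made for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, lr=0.1, epochs=1000):
    """Fit weight w and bias b by repeatedly correcting prediction errors."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Error = predicted probability minus the known result.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / n
        b -= lr * sum(errors) / n
    return w, b

# Made-up training data: x is a centered spam score, y = 1 means spam.
xs = [-4, -3, -2, -1, 1, 2, 3, 4]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
w, b = train_logistic(xs, ys)

def predict(x):
    """Classify as spam (1) when the predicted probability is at least 0.5."""
    return 1 if sigmoid(w * x + b) >= 0.5 else 0
```

After training, `predict` reproduces the known labels on this toy data and can be applied to scores it has not seen, such as `predict(2.5)`.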

Unsupervised learning. In this approach there are no known results. A model is developed by drawing upon structures present in the data. Clustering and dimensionality reduction are some problems solved this way.
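As an illustration of clustering without known results, here is a minimal sketch of k-means (Lloyd's algorithm) in one dimension. The points, the starting centers, and the function name are assumptions chosen for the example; practical implementations handle many dimensions and choose starting centers more carefully.

```python
def kmeans_1d(points, centers, iterations=10):
    """Alternate between assigning points to the nearest center and re-averaging."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Made-up unlabeled data with two visible groups.
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
centers, clusters = kmeans_1d(points, centers=[0.0, 10.0])
```

No labels are supplied anywhere; the two groups emerge purely from the structure of the data.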

Semi-supervised learning. In this approach some examples have known results and others do not. The model has to learn structures present in the data and also make predictions.

Reinforcement learning. In this approach the learning system is not given an explicit loss function over labeled examples. This area of machine learning is inspired by behaviorist psychology and is concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward. [5]
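To make the idea of maximizing cumulative reward concrete, the sketch below runs tabular Q-learning on a made-up corridor of five cells in which only the rightmost cell gives a reward. The environment, the constants, and every name here are assumptions for illustration, not a reference implementation.

```python
import random

N_STATES, GOAL = 5, 4            # corridor cells 0..4; reward only at cell 4
ACTIONS = (-1, +1)               # step left or step right
GAMMA, ALPHA = 0.9, 0.5          # discount factor and learning rate (assumed)

def step(state, action):
    """Move within the corridor; reaching the goal cell pays a reward of 1."""
    nxt = min(max(state + action, 0), GOAL)
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward

def q_learning(episodes=500, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        while state != GOAL:
            a = rng.randrange(2)                   # explore uniformly at random
            nxt, reward = step(state, ACTIONS[a])
            target = reward + GAMMA * max(q[nxt])  # goal row is never updated
            q[state][a] += ALPHA * (target - q[state][a])
            state = nxt
    return q

q = q_learning()
# Greedy policy: in each non-goal cell pick the action with the larger value.
policy = [ACTIONS[row.index(max(row))] for row in q[:GOAL]]
```

The agent is never told which action is correct; it only observes rewards, and the greedy policy that results is to step right in every cell.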

Current hot topics

(1). Deep learning seems to be getting the most press right now. It is a form of neural network with many layers of neurons. Articles on deep learning have appeared in the New Yorker and the New York Times.

(2). Combining Support Vector Machines (SVMs) and Stochastic Gradient Descent (SGD) is also interesting. SVMs are useful because you can use the kernel trick to transform your data and solve a non-linear problem using a linear model (the SVM). A consequence of this method is that the training runtime and memory consumption of the SVM scale with the size of the data set, which makes it very hard to train SVMs on large data sets. SGD is a method that uses a random process to allow machine learning algorithms to converge faster. To make a long story short, you can combine SVMs and SGD to train SVMs on larger data sets (theoretically).
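One way the SVM and SGD combination can look in code is a Pegasos-style update on the regularized hinge loss, sketched below for a linear SVM without a bias term. The kernel trick is deliberately left out, and the toy data, the regularization strength `lam`, and the function names are assumptions for the example.

```python
import random

def train_linear_svm_sgd(data, lam=0.01, steps=1000, seed=0):
    """Pegasos-style SGD on the L2-regularized hinge loss (no bias term)."""
    rng = random.Random(seed)
    w = [0.0, 0.0]
    for t in range(1, steps + 1):
        x, y = rng.choice(data)          # one random example per step
        eta = 1.0 / (lam * t)            # decaying step size
        margin = y * (w[0] * x[0] + w[1] * x[1])
        w = [(1.0 - eta * lam) * wi for wi in w]    # shrink (regularizer)
        if margin < 1.0:                            # hinge loss is active
            w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w

def predict(w, x):
    return 1 if w[0] * x[0] + w[1] * x[1] >= 0 else -1

# Made-up, linearly separable 2-D data with labels +1 and -1.
data = [((2, 2), 1), ((3, 3), 1), ((2, 3), 1), ((3, 2), 1),
        ((-2, -2), -1), ((-3, -3), -1), ((-2, -3), -1), ((-3, -2), -1)]
w = train_linear_svm_sgd(data)
```

Each step touches a single example, so memory use does not grow with the size of the data set, which is the property that makes this kind of training attractive for large data.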

(3). Because computers are now fast, cheap, and plentiful, Bayesian statistics is becoming very popular again (this is definitely not "new"). For a long time, it was not feasible to use Bayesian techniques because you would need to perform probabilistic integration by hand (when calculating the evidence). Today, Bayesians use Markov chain Monte Carlo (MCMC), grid approximation, Gibbs sampling, the Metropolis algorithm, etc.
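Of the techniques listed, grid approximation is the simplest to show. The sketch below estimates the posterior over a coin's bias under a uniform prior, replacing the evidence integral with a sum over grid points. The coin example, the grid size, and the names are assumptions for illustration.

```python
def grid_posterior(heads, tails, grid_size=101):
    """Posterior over a coin's bias on an evenly spaced grid, uniform prior."""
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    # With a uniform prior the unnormalized posterior is just the likelihood.
    unnormalized = [t ** heads * (1 - t) ** tails for t in thetas]
    evidence = sum(unnormalized)     # grid stand-in for the evidence integral
    return thetas, [u / evidence for u in unnormalized]

# Made-up observations: 7 heads and 3 tails.
thetas, post = grid_posterior(heads=7, tails=3)
map_theta = thetas[post.index(max(post))]              # most probable bias
post_mean = sum(t * p for t, p in zip(thetas, post))   # posterior mean
```

For these counts the most probable bias on the grid is 0.7, and the posterior mean is close to the exact Beta(8, 4) mean of 2/3.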

(4). Any of the algorithms described in the paper "Map-Reduce for Machine Learning on Multicore". This paper describes how to take a machine learning algorithm or problem and distribute it across multiple computers or cores. It has very important implications because it means that all of the algorithms mentioned in the paper can be translated into a map-reduce format and distributed across a cluster of computers. In principle, a data set would never be too large, because you could just add more computers to the Hadoop cluster. This paper was published a while ago, but not all of the algorithms have been implemented in Mahout yet.
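The map-reduce idea can be illustrated with least-squares line fitting: each shard of data is mapped to a handful of partial sums, and a reducer adds them up. The sketch below runs on one machine with Python's built-in `map` and `functools.reduce`; the shard layout and function names are assumptions, and a real cluster would run the mappers on separate workers.

```python
from functools import reduce

def map_partial_sums(chunk):
    """Mapper: sufficient statistics for 1-D least squares on one data shard."""
    n = len(chunk)
    sx = sum(x for x, _ in chunk)
    sy = sum(y for _, y in chunk)
    sxx = sum(x * x for x, _ in chunk)
    sxy = sum(x * y for x, y in chunk)
    return (n, sx, sy, sxx, sxy)

def reduce_sums(a, b):
    """Reducer: element-wise addition of two partial results."""
    return tuple(u + v for u, v in zip(a, b))

def fit_line(shards):
    n, sx, sy, sxx, sxy = reduce(reduce_sums, map(map_partial_sums, shards))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# Made-up data lying on y = 2x + 1, split across three shards.
shards = [[(0, 1), (1, 3)], [(2, 5), (3, 7)], [(4, 9)]]
slope, intercept = fit_line(shards)
```

Because the mappers only return five numbers per shard, adding more shards increases the mapping work but not the cost of the final fit.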

(5). Anomaly detection. [6]
Anomaly detection (also called outlier detection) is the identification of items, events, or observations that do not conform to an expected pattern or to other items in a dataset. Typically the anomalous items translate to some kind of problem, such as bank fraud, a structural defect, medical problems, or errors in a text. Anomalies are also referred to as outliers, novelties, noise, deviations, and exceptions.
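A very simple form of anomaly detection flags values that lie far from the mean in units of standard deviation (a z-score rule). The sketch below is one such rule under assumed settings: the threshold of 2.0, the sample amounts, and the function name are all illustrative, and real systems typically use more robust methods.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds threshold sigmas."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []                    # all values identical: nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Made-up daily transaction amounts with one suspicious entry.
daily_amounts = [10, 11, 9, 10, 12, 10, 11, 50]
outliers = zscore_outliers(daily_amounts)
```

On this sample only the value 50 is flagged; the remaining amounts all lie well within one standard deviation of the mean.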

Classification of Input Data Type

Data comes in three formats:

Structured data is organized in a way that both computers and humans can read. The most obvious example is a relational database.

Semi-structured data, which includes XML, email, and electronic data interchange (EDI), lacks a formal structure but nonetheless contains tags that separate semantic elements and help create structure-like hierarchies.

Unstructured data refers to data types, including images, audio, and video, that are not part of a database. It has no clear or consistent structure and no formal data model (FDM). Unstructured data is created by new data sources, many of which did not even exist at the dawn of the database. Every time one uses a mobile device to place a call, sends a text message, views (or posts) a video, or interacts with a website, that creates data. Every transaction, in any context, creates data, as does every email. The content that populates every Web venue, from the text, images, and rich media content built by site owners to the social networking content created by potentially any Web user at any time from anywhere around the globe, creates data. Even phone calls, if they are delivered as packets over an IP network, are now data.

Challenges

The foremost challenge is the need to unlock the data and gain access to it so you can store it and use it. This allows the information to stay in its raw format, where it can be analyzed and reported on as it streams in real time into an analytics system. For structured data, this process is fairly straightforward. When working with unstructured data, on the other hand, advanced algorithms and powerful engines are needed to process the incoming data. [7][8]

Challenges (competitions) in machine learning have proven to be efficient and cost-effective ways to quickly bring to industry solutions that might otherwise have remained confined to research. In addition, the playful nature of challenges naturally attracts students, making them a great teaching resource. Challenge participants range from undergraduate students to retirees, joining forces in a rewarding environment that allows them to learn, perform research, and demonstrate excellence. Challenges can therefore be used as a means of directing research, advancing the state of the art, or venturing into completely new domains. [9]

QUESTION 1: What are the limits of deep learning?

ANSWER: Yann LeCun, Director of AI Research at Facebook and Professor at NYU

The “classical” forms of deep learning include various combinations of feed-forward modules (often convolutional nets) and recurrent nets (sometimes with memory units, like LSTM or MemNN).

These models are limited in their ability to “reason”, i.e. to carry out long chains of inferences or optimization procedures to arrive at an answer. The number of steps in a computation is limited by the number of layers in feed-forward nets, and by the length of time a recurrent net will remember things.

To enable deep learning systems to reason, we need to modify them so that they don’t produce a single output (say the interpretation of an image, the translation of a sentence, etc.), but can produce a whole set of alternative outputs (e.g. the various ways a sentence can be translated). This is what energy-based models are designed to do: give you a score for each possible configuration of the variables to be inferred. A particular instance of energy-based models is factor graphs (non-probabilistic graphical models). Combining learning systems with factor graphs is known as “structured prediction” in machine learning. There have been many proposals to combine neural nets and structured prediction in the past, going back to the early 1990s. In fact, the check reading system my colleagues and I built at Bell Labs in the early 1990s used a form of structured prediction on top of convolutional nets that we called “Graph Transformer Networks”. There have been a number of recent works on sticking graphical models on top of ConvNets and training the whole thing end to end (e.g. for human body pose estimation and such).
For a review/tutorial on energy-based models and structured prediction on top of neural nets (or other models) see this paper: [10]

Deep learning is certainly limited in its current form because almost all the successful applications of it use supervised learning with human-annotated data. We need to find ways to train large neural nets from “raw” non-annotated data so they capture the regularities of the real world. As I said in a previous answer, my money is on adversarial training.

QUESTION 2: What are the pros and cons of Generative Adversarial Networks vs Variational Autoencoders?

An advantage for VAEs (Variational AutoEncoders) is that there is a clear and recognized way to evaluate the quality of the model (log-likelihood, either estimated by importance sampling or lower-bounded). Right now it’s not clear how to compare two GANs (Generative Adversarial Networks) or compare a GAN and other generative models except by visualizing samples.

A disadvantage of VAEs is that, because of the injected noise and imperfect reconstruction, and with the standard decoder (with factorized output distribution), the generated samples are much more blurred than those coming from GANs.

The fact that VAEs basically optimize likelihood while GANs optimize something else can be viewed both as an advantage or a disadvantage for either one. Maximizing likelihood yields an estimated density that always bleeds probability mass away from the estimated data manifold. GANs can be happy with a very sharp estimated density function even if it does not perfectly coincide with the data density (i.e. some training examples may come close to the generated images but might still have nearly zero probability under the generator, which would be infinitely bad in terms of likelihood).

GANs tend to be much more finicky to train than VAEs, not to mention that we do not have a clear objective function to optimize, but they tend to yield nicer images.

A Few Useful Things to Know about Machine Learning

1. Learning = Representation + Evaluation + Optimization

Suppose you have an application that you think machine learning might be good for. The first problem facing you is the bewildering variety of learning algorithms available. Which one to use? There are literally thousands available, and hundreds more are published each year. The key to not getting lost in this huge space is to realize that it consists of combinations of just three components. The components are:

Representation. A classifier must be represented in some formal language that the computer can handle. Conversely, choosing a representation for a learner is tantamount to choosing the set of classifiers that it can possibly learn. This set is called the hypothesis space of the learner. If a classifier is not in the hypothesis space, it cannot be learned. A related question, which we will address in a later section, is how to represent the input, i.e., what features to use.

Evaluation. An evaluation function (also called objective function or scoring function) is needed to distinguish good classifiers from bad ones. The evaluation function used internally by the algorithm may differ from the external one that we want the classifier to optimize, for ease of optimization (see below) and due to the issues discussed in the next section.

Optimization. Finally, we need a method to search among the classifiers in the language for the highest-scoring one. The choice of optimization technique is key to the efficiency of the learner, and also helps determine the classifier produced if the evaluation function has more than one optimum. It is common for new learners to start out using off-the-shelf optimizers, which are later replaced by custom-designed ones.
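The three components above can be seen side by side in a deliberately tiny learner. In this sketch the representation is a threshold rule on one number, the evaluation function is accuracy, and the optimizer is an exhaustive search over a few candidate thresholds. The data, candidate list, and function names are assumptions made for the example.

```python
def make_stump(threshold):
    """Representation: the hypothesis space is 'predict 1 if x > threshold'."""
    return lambda x: 1 if x > threshold else 0

def accuracy(classifier, xs, ys):
    """Evaluation: fraction of examples classified correctly."""
    return sum(classifier(x) == y for x, y in zip(xs, ys)) / len(xs)

def best_stump(xs, ys, candidates):
    """Optimization: exhaustive search for the highest-scoring threshold."""
    return max(candidates, key=lambda t: accuracy(make_stump(t), xs, ys))

# Made-up one-feature data set.
xs = [1, 2, 3, 6, 7, 8]
ys = [0, 0, 0, 1, 1, 1]
threshold = best_stump(xs, ys, candidates=[0, 2, 4, 9])
```

Swapping any one component, for example replacing accuracy with a different score or the exhaustive search with gradient descent, yields a different learner built from the same three parts.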

2. Feature Engineering is the Key

At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used. If you have many independent features that each correlate well with the class, learning is easy. On the other hand, if the class is a very complex function of the features, you may not be able to learn it. Often, the raw data is not in a form that is amenable to learning, but you can construct features from it that are. This is typically where most of the effort in a machine learning project goes. It is often also one of the most interesting parts, where intuition, creativity and “black art” are as important as the technical stuff.

First-timers are often surprised by how little time in a machine learning project is spent actually doing machine learning. But it makes sense if you consider how time-consuming it is to gather data, integrate it, clean it and pre-process it, and how much trial and error can go into feature design. Also, machine learning is not a one-shot process of building a data set and running a learner, but rather an iterative process of running the learner, analyzing the results, modifying the data and/or the learner, and repeating. Learning is often the quickest part of this, but that’s because we’ve already mastered it pretty well! Feature engineering is more difficult because it’s domain-specific while learners can be largely general-purpose. However, there is no sharp frontier between the two, and this is another reason the most useful learners are those that facilitate incorporating knowledge.

Of course, one of the holy grails of machine learning is to automate more and more of the feature engineering process. One way this is often done today is by automatically generating large numbers of candidate features and selecting the best by (say) their information gain with respect to the class. But bear in mind that features that look irrelevant in isolation may be relevant in combination. For example, if the class is an XOR of k input features, each of them by itself carries no information about the class. (If you want to annoy machine learners, bring up XOR.) On the other hand, running a learner with a very large number of features to find out which ones are useful in combination may be too time-consuming, or cause overfitting. So there is ultimately no replacement for the smarts you put into feature engineering. [11]
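The XOR remark can be checked directly for k = 2. The sketch below computes information gain from entropy over the four possible inputs; the helper names are assumptions, and the calculation is the standard entropy-reduction definition rather than any particular library's routine.

```python
from collections import Counter
from itertools import product
from math import log2

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, feature_indices):
    """Entropy reduction from splitting on the given subset of features."""
    groups = {}
    for row, label in zip(rows, labels):
        key = tuple(row[i] for i in feature_indices)
        groups.setdefault(key, []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

rows = list(product([0, 1], repeat=2))   # every input combination for k = 2
labels = [a ^ b for a, b in rows]        # class = XOR of the two features

gain_first = information_gain(rows, labels, [0])
gain_second = information_gain(rows, labels, [1])
gain_both = information_gain(rows, labels, [0, 1])
```

Each feature on its own has zero information gain, while the pair together recovers the full one bit of class entropy, so a selector that scores features one at a time would discard both.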

Criticism

1. Criticism by Machine Learning Experts

Machine learning has both major and minor shortcomings. One of the common problems with machine learning is the debugging process. The automated process of debugging in machine learning can be extremely time-consuming, which can make some users uncomfortable. The lack of statistical prediction invention in machine learning can cause the learning to lack in details. [12]
Also, the difficulty lies in the fact that the set of all possible behaviors given all possible inputs is too large to be covered by the set of observed examples. The learner therefore has to generalize from the given data so as to be able to produce a useful output in new cases. [13]

Machine learning is the body of research related to automated large-scale data analysis. Historically, the field was centered around biologically inspired models, and the long-term goals of much of the community are oriented to producing models and algorithms that can process information as well as biological systems.

The field also encompasses many of the traditional areas of statistics with, however, a strong focus on mathematical models and also prediction. Machine learning is now central to many areas of interest in computer science and related large-scale information processing domains.

Many problems in practical machine learning remain unsolved. For example, in anomaly detection, state-of-the-art methods suffer from poor scalability, use-case restrictions, difficulty of use, and a large number of false positives. [14]

Machine learning addresses the question of how to build computers that improve automatically through experience. It is one of today’s most rapidly growing technical fields, lying at the intersection of computer science and statistics, and at the core of artificial intelligence and data science. Recent progress in machine learning has been driven both by the development of new learning algorithms and theory and by the ongoing explosion in the availability of online data and low-cost computation. The adoption of data-intensive machine-learning methods can be found throughout science, technology and commerce, leading to more evidence-based decision-making across many walks of life, including healthcare, manufacturing, education, financial modeling, policing, and marketing.

People learning new concepts can often generalize successfully from just a single example, yet machine learning algorithms typically require tens or hundreds of examples to perform with similar accuracy. People can also use learned concepts in richer ways than conventional algorithms—for action, imagination, and explanation. We present a computational model that captures these human learning abilities for a large class of simple visual concepts: handwritten characters from the world’s alphabets. The model represents concepts as simple programs that best explain observed examples under a Bayesian criterion. On a challenging one-shot classification task, the model achieves human-level performance while outperforming recent deep learning approaches. We also present several “visual Turing tests” probing the model’s creative generalization abilities, which in many cases are indistinguishable from human behavior.

Despite recent breakthroughs in the applications of deep neural networks, one setting that presents a persistent challenge is that of “one-shot learning.” Traditional gradient-based networks require a lot of data to learn, often through extensive iterative training. When new data is encountered, the models must inefficiently relearn their parameters to adequately incorporate the new information without catastrophic interference. Architectures with augmented memory capacities, such as Neural Turing Machines (NTMs), offer the ability to quickly encode and retrieve new information, and hence can potentially obviate the downsides of conventional models. Here, we demonstrate the ability of a memory-augmented neural network to rapidly assimilate new data, and leverage this data to make accurate predictions after only a few samples. We also introduce a new method for accessing an external memory that focuses on memory content, unlike previous methods that additionally use memory location-based focusing mechanisms.

We present a variety of new architectural features and training procedures that we apply to the generative adversarial networks (GANs) framework. We focus on two applications of GANs: semi-supervised learning, and the generation of images that humans find visually realistic. Unlike most work on generative models, our primary goal is not to train a model that assigns high likelihood to test data, nor do we require the model to be able to learn well without using any labels. Using our new techniques, we achieve state-of-the-art results in semi-supervised classification on MNIST, CIFAR10 and SVHN. The generated images are of high quality as confirmed by a visual Turing test: our model generates MNIST samples that humans cannot distinguish from real data and CIFAR-10 samples that yield a human error rate of 21.3%. We also present ImageNet samples with unprecedented resolution and show that our methods enable the model to learn recognizable features of ImageNet classes.

Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers, 8× deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers.

The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
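The phrase "learning residual functions with reference to the layer inputs" can be pictured as a block that computes F(x) + x rather than F(x) alone. The sketch below is a toy version with plain lists and two small weight matrices; convolutions, batch normalization, and training are omitted, and all names are assumptions for illustration.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def linear(weights, v):
    """Matrix-vector product; weights is a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]

def residual_block(x, w1, w2):
    """Return relu(F(x) + x), where F(x) = w2 @ relu(w1 @ x) is the residual."""
    fx = linear(w2, relu(linear(w1, x)))
    return relu([f + a for f, a in zip(fx, x)])

x = [1.0, 2.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

passthrough = residual_block(x, zeros, zeros)       # F(x) = 0, so output is x
doubled = residual_block(x, identity, identity)     # F(x) = x, so output is 2x
```

With all weights at zero the block simply passes its non-negative input through, which gives some intuition for why stacking many such blocks need not make optimization harder.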

Book list

1. The Elements of Statistical Learning (ESL), Trevor Hastie et al.

During the past decade, there has been an explosion in computation and information technology. With it have come vast amounts of data in a variety of fields such as medicine, biology, finance, and marketing. The challenge of understanding these data has led to the development of new tools in the field of statistics and spawned new areas such as data mining, machine learning, and bioinformatics. Many of these tools have common underpinnings but are often expressed with different terminology. This book describes the important ideas in these areas in a common conceptual framework. While the approach is statistical, the emphasis is on concepts rather than mathematics. Many examples are given, with a liberal use of color graphics. It is a valuable resource for statisticians and anyone interested in data mining in science or industry. The book's coverage is broad, from supervised learning (prediction) to unsupervised learning. The many topics include neural networks, support vector machines, classification trees and boosting, the first comprehensive treatment of this topic in any book.

This major new edition features many topics not covered in the original, including graphical models, random forests, ensemble methods, least angle regression and path algorithms for the lasso, non-negative matrix factorization, and spectral clustering. There is also a chapter on methods for "wide" data (p bigger than n), including multiple testing and false discovery rates. [15]

2. Pattern Recognition and Machine Learning (PRML), Christopher M. Bishop

This is the first textbook on pattern recognition to present the Bayesian viewpoint. The book presents approximate inference algorithms that permit fast approximate answers in situations where exact answers are not feasible. It uses graphical models to describe probability distributions, at a time when no other books applied graphical models to machine learning. No previous knowledge of pattern recognition or machine learning concepts is assumed. Familiarity with multivariate calculus and basic linear algebra is required, and some experience in the use of probabilities would be helpful though not essential, as the book includes a self-contained introduction to basic probability theory. [16]

3. Machine Learning: A Probabilistic Perspective (MLAPP), Kevin P. Murphy

Today's Web-enabled deluge of electronic data calls for automated methods of data analysis. Machine learning provides these, developing methods that can automatically detect patterns in data and then use the uncovered patterns to predict future data. This textbook offers a comprehensive and self-contained introduction to the field of machine learning, based on a unified, probabilistic approach. The coverage combines breadth and depth, offering necessary background material on such topics as probability, optimization, and linear algebra as well as discussion of recent developments in the field, including conditional random fields, L1 regularization, and deep learning. The book is written in an informal, accessible style, complete with pseudo-code for the most important algorithms. All topics are copiously illustrated with color images and worked examples drawn from such application domains as biology, text processing, computer vision, and robotics. Rather than providing a cookbook of different heuristic methods, the book stresses a principled model-based approach, often using the language of graphical models to specify models in a concise and intuitive way. Almost all the models described have been implemented in a MATLAB software package, PMTK (probabilistic modeling toolkit), that is freely available online. The book is suitable for upper-level undergraduates with an introductory-level college math background and beginning graduate students. [17]

4. Bayesian Reasoning and Machine Learning (BRML), David Barber

People who know the methods have their choice of rewarding jobs. This hands-on text opens these opportunities to computer science students with modest mathematical backgrounds. It is designed for final-year undergraduates and master's students with limited background in linear algebra and calculus. Comprehensive and coherent, it develops everything from basic reasoning to advanced techniques within the framework of graphical models. Students learn more than a menu of techniques, they develop analytical and problem-solving skills that equip them for the real world. Numerous examples and exercises, both computer based and theoretical, are included in every chapter. Resources for students and instructors, including a MATLAB toolbox, are available online. [18]

5. Foundations of Machine Learning, Mehryar Mohri et al.

This graduate-level textbook introduces fundamental concepts and methods in machine learning. It describes several important modern algorithms, provides the theoretical underpinnings of these algorithms, and illustrates key aspects for their application. The authors aim to present novel theoretical tools and concepts while giving concise proofs even for relatively advanced topics. Foundations of Machine Learning fills the need for a general textbook that also offers theoretical details and an emphasis on proofs. Certain topics that are often treated with insufficient attention are discussed in more detail here; for example, entire chapters are devoted to regression, multi-class classification, and ranking. The first three chapters lay the theoretical foundation for what follows, but each remaining chapter is mostly self-contained. The appendix offers a concise probability review, a short introduction to convex optimization, tools for concentration bounds, and several basic properties of matrices and norms used in the book.

The book is intended for graduate students and researchers in machine learning, statistics, and related areas; it can be used either as a textbook or as a reference text for a research seminar. [19]

6. Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville

The Deep Learning textbook is a resource intended to help students and practitioners enter the field of machine learning in general and deep learning in particular. The online version of the book is now complete and will remain available online for free. Deep learning allows computers to learn from experience and understand the world in terms of a hierarchy of concepts, with each concept defined in terms of its relation to simpler concepts. By gathering knowledge from experience, this approach avoids the need for human operators to formally specify all of the knowledge that the computer needs. The hierarchy of concepts allows the computer to learn complicated concepts by building them out of simpler ones. [20]

7. Probabilistic Graphical Models: Principles and Techniques, Daphne Koller and Nir Friedman

Probabilistic Graphical Models discusses a variety of models, spanning Bayesian networks, undirected Markov networks, discrete and continuous models, and extensions to deal with dynamical systems and relational data. For each class of models, the text describes the three fundamental cornerstones: representation, inference, and learning, presenting both basic concepts and advanced techniques. Finally, the book considers the use of the proposed framework for causal reasoning and decision making under uncertainty. The main text in each chapter provides the detailed technical development of the key ideas. Most chapters also include boxes with additional material: skill boxes, which describe techniques; case study boxes, which discuss empirical cases related to the approach described in the text, including applications in computer vision, robotics, natural language understanding, and computational biology; and concept boxes, which present significant concepts drawn from the material in the chapter. Instructors (and readers) can group chapters in various combinations, from core topics to more technically advanced material, to suit their particular needs. [21]

8. Statistical Learning with Sparsity: The Lasso and Generalizations, Trevor Hastie et al.

This book summarizes the actively developing field of statistical learning with sparsity. A sparse statistical model is one having only a small number of nonzero parameters or weights. It represents a classic case of “less is more”: a sparse model can be much easier to estimate and interpret than a dense model. In this age of big data, the number of features measured on a person or object can be large, and might be larger than the number of observations. The sparsity assumption allows us to tackle such problems and extract useful and reproducible patterns from big datasets. [22]

9. Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies, John D. Kelleher et al.

This introductory textbook offers a detailed and focused treatment of the most important machine learning approaches used in predictive data analytics, covering both theoretical concepts and practical applications. Technical and mathematical material is augmented with explanatory worked examples, and case studies illustrate the application of these models in the broader business context. These models are used in predictive data analytics applications including price prediction, risk assessment, predicting customer behavior, and document classification. The authors also explain how machine learning is often used to build predictive models by extracting patterns from large data sets. [23]

History

1946: ENIAC, widely regarded as the first general-purpose electronic computer, was unveiled.

1950: Alan Turing proposed a test based on the idea that we can only determine whether a machine can actually think by communicating with it and finding that we cannot distinguish it from a human.

1952: Arthur Samuel at IBM wrote the first game-playing program, for checkers, to achieve sufficient skill to challenge a world champion. Samuel’s machine learning programs improved the performance of checkers players.

1957: Frank Rosenblatt showed that by combining a large number of classifiers in a network a powerful model could be created.

1964: The ELIZA system was developed by Joseph Weizenbaum. ELIZA simulated a psychotherapist by using tricks like string substitution and canned responses based on keywords.
