A Primer on Machine Learning


Authors

Audrene S. Edwards, Bruce Kaplan, Tun Jie

Source

Journal: Transplantation [Ovid Technologies (Wolters Kluwer)]
Date: 2020-08-18 Volume (issue): 105 (4): 699-703 Citations: 9

Links

lww.com nih.gov doi.org

Identifiers

DOI: 10.1097/tp.0000000000003316

Abstract

In transplant medicine, large collections of data from patients and various procedures have been stored and organized in registries and databases. With the increase in data volume, there has been a demand for tools that can handle the challenges presented by so-called “big data.” In recent years, mathematical and statistical tools such as machine learning have been used in an increasing number of analyses. In addition, machine learning has been used in various other domains in which a large amount of complex data needs to be interrogated (eg, genomics). Although “machine learning” has become a commonly used term, its techniques, strengths, and limitations are often not fully understood by readers of transplant literature. This commentary covers some of the history and basic concepts of machine learning.

The idea of machine learning came from the father of artificial intelligence, Alan Turing, while giving a talk at the London Mathematical Society. Turing stated, “what we want is a machine that can learn from experience.” [1] At its most elemental, machine learning is a set of tools that can be used to mine, analyze, extract, and provide insight “from a record of the observable world.” [2] For machines to perform such tasks, the machine must “learn from experience.” In this commentary, we will first explore the history of machine learning. Then, we will discuss the concepts of “experience” and “task” as they pertain to machine learning (in a paraphrased version of the formal definition of machine learning provided by Tom Mitchell). This primer is written to help those who have seen or heard the term connect the concept and its tools with the context in which machine learning was chosen as a viable tool. Finally, we will provide an overview of terms, tools, strengths, and limitations within machine learning.
With an understanding that some terms used in this discussion of machine learning may not be intuitively clear, a glossary and a flow diagram have been provided for the reader’s use.

HISTORY OF MACHINE LEARNING

During his talk at the London Mathematical Society in 1947, Alan Turing predicted that the future of machine computing would be “a machine that can learn from experience.” Based on this concept, the paradigm of machine learning was developed in the 1950s. The initial process was an algorithm generated by computers from models derived from a training set of data. To improve machine learning performance, decision trees were generated and “weights” were assigned to these decision trees to achieve the best fit to the training set. The “discovery” algorithm developed in this way was then validated with a validation set. Pioneering work on this subject included Donald Hebb’s neural network model, based on an understanding of brain cell interaction, and Arthur Samuel’s alpha-beta pruning model, developed for playing checkers. Frank Rosenblatt combined Donald Hebb’s and Arthur Samuel’s models and developed a custom-built computer called the Mark I Perceptron in 1957. The machine was built to be applied to image recognition. However, the approach initially failed to achieve its objective in practice. The early breakthrough in machine learning was the conception of the nearest neighbor algorithm, which was successfully applied to route mapping in the late 1960s. With the expansion of the multilayer concept for neural networks and of boosting algorithms, deep neural networks, or deep learning, dominate the field of machine learning today. Successful machine learning applications developed in the last 2 decades include playing board games (chess and Go), speech recognition, facial recognition, and self-driving vehicles.
It is worth mentioning that, in parallel with the development of machine learning, artificial intelligence research initially focused on knowledge-based approaches rather than empirically derived algorithms.

UNDERSTANDING MACHINE LEARNING

According to Tom Mitchell, a computer scientist at Carnegie Mellon University: a computer program is said to learn from experience E with respect to some tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. [3] In other words, machine learning is a discipline focused on how a computer program, when given certain tasks, can improve with experience: its performance on a given task can be measured, and that performance improves over time as more data are introduced to the program. In machine learning, the experiences referred to explicitly are the following 3 learning experiences: supervised learning, unsupervised learning, [4] and semisupervised learning (this article focuses on the 2 more common learning experiences, supervised and unsupervised learning). For each experience within machine learning, there are tasks that can be completed using algorithms specific to each task. It is important to recognize that machine learning is not synonymous with the statistical and mathematical methodologies that are used to interpret its output. The statistics and mathematics used can depend on multiple factors, including the nature of the data and the question being asked. Figure 1 provides a layout of each experience, each task, and some examples of the many tools, such as techniques and algorithms, used within machine learning for each task.

FIGURE 1. A roadmap to effective machine learning.
AP, affinity propagation; GDA, generalized discriminant analysis; LASSO, least absolute shrinkage and selection operator; PCA, principal component analysis.

SUPERVISED LEARNING

When describing the supervised learning experience, one must start with the characteristics of the data used for supervised learning. In supervised learning, the domain of the data is restricted because of prior knowledge. For example, suppose you have a dataset from which you want to predict the cost of a house in a certain neighborhood. The response variable, cost, is numerical, so this is a regression task. Before choosing an algorithm to build a model that predicts the cost of a house, you could go through the dataset and restrict the domain by deciding, based on your prior knowledge or experience, which variables would not be important for predicting the cost. In the data, all variables are labeled, and the response (output) variables are clearly distinguished from the explanatory (input) variables. From the original data, 2 datasets are made, known as a training set and a test set. The training set is used to train the program.

A task must be established to choose the proper algorithm for analysis. Within supervised learning, there are 2 tasks: regression and classification. Regression is the task in which an algorithm of choice is used to build a predictive model in which the response variable is continuous. For the classification task, an algorithm of choice is used to build a predictive model in which the response variable is categorical.
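As a minimal sketch of the regression task described above, the following Python example splits a labeled dataset into a training set and a test set, fits a one-variable linear regression on the training set, and assesses it on the test set. The data are synthetic and hypothetical (floor area as the single explanatory variable, with an invented cost relationship), and the function names, the 80/20 split, and the choice of mean absolute error as the performance measure are assumptions made for illustration, not details taken from the article.

```python
import random


def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares with one explanatory variable.

    Returns (intercept, slope) for the line y = intercept + slope * x.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def mean_absolute_error(ys_true, ys_pred):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(ys_true, ys_pred)) / len(ys_true)


# Hypothetical, synthetic labeled data: floor area (explanatory variable)
# and house cost (continuous response variable).
rng = random.Random(0)
areas = [rng.uniform(50, 250) for _ in range(200)]
costs = [50_000 + 1_200 * a + rng.gauss(0, 10_000) for a in areas]

# Split the original data into a training set (80%) and a test set (20%).
indices = list(range(len(areas)))
rng.shuffle(indices)
cut = int(0.8 * len(indices))
train_idx, test_idx = indices[:cut], indices[cut:]

train_x = [areas[i] for i in train_idx]
train_y = [costs[i] for i in train_idx]
test_x = [areas[i] for i in test_idx]
test_y = [costs[i] for i in test_idx]

# Training: infer the relationship between explanatory and response variables.
intercept, slope = fit_simple_linear_regression(train_x, train_y)

# Testing: assess the fitted model on data it has not seen.
test_pred = [intercept + slope * x for x in test_x]
test_mae = mean_absolute_error(test_y, test_pred)

print(f"slope={slope:.1f}, intercept={intercept:.0f}, test MAE={test_mae:.0f}")
```

In practice, an established package in R or Python (for example, scikit-learn) would typically replace the hand-written fitting function; the pure-Python version is used here only to keep each step visible.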
An example of using supervised learning techniques in transplantation can be found in the paper “Prediction of Perioperative Mortality of Cadaveric Liver Transplant Recipients During Their Evaluations.” [5] In this study, 3 supervised learning techniques were used to develop a scoring system that identifies, before operation, patients who have an increased risk of poor outcomes after a liver transplant. The 3 supervised learning techniques used were classification trees, neural networks, and logistic regression. Classification trees were used to find predictors associated with 90-day postoperative mortality. Neural networks were used to determine the probabilities of 90-day postoperative mortality and the weight of each independent variable relevant to the probability that a patient lives or dies after liver transplant. Finally, logistic regression was chosen to estimate the odds ratio of 90-day mortality and the logarithms of the adjusted odds ratios.

For each task, an algorithm is chosen to establish the relationship between the explanatory variables and the response variables for prediction purposes. Once a relationship has been inferred between the explanatory and response variables, the test set is used. The test set contains data used to assess the performance of the algorithm after a relationship has been established. The test (validation) set helps to assess prediction accuracy, as well as prevalence, positive predictive value, and negative predictive value. For regression and classification tasks, some possible algorithms of choice are linear regression, logistic regression, Least Absolute Shrinkage and Selection Operator regression, Linear Discriminant Analysis, Naïve Bayes, decision trees, and k-nearest neighbors. A description and definition of the algorithms for each task can be found in Table 1.

TABLE 1.
Glossary of terms for machine learning

Term: Definition

Introduction
Machine learning: A set of tools (encompassing mathematics, statistics, and computer science) that can be used to mine, analyze, and extract important insight from data.
Experience: The type of data to collect to assign the proper task to solve the problem or answer the question of interest.
Task: A group of tools assigned to a problem or question of interest based on the availability of the data and the type of prediction or inference of interest.

History of machine learning
Deep learning: Also known as neural networks, deep learning is a set of algorithms used to recognize patterns within data.

Understanding machine learning
Semisupervised learning: An experience within machine learning that uses the data and algorithms of both supervised and unsupervised learning.

Supervised machine learning
Supervised learning (experience): An experience within machine learning, in which an algorithm is used to infer a function from labeled training data.
Domain: Complete set of possible input variables included for analysis.
Response variable: Dependent variable, also referred to as the target variable or outcome, which is the variable of interest for prediction.
Explanatory variable: Independent variables, also known as input variables.
Training set: Data within the supervised learning experience, in which algorithms are used to build predictive models, from a function found from the given explanatory and response variables.
Test set: Data that are used to assess the performance of the algorithm within the supervised learning experience.
Algorithm: A mathematically and logically based program that adjusts and adapts to data, which enhances the performance of the program as it is introduced to more data over time.
Regression: A task within supervised learning, in which an algorithm is used to define the relationship between data wherein the input variables and output variables are distinguishable, and the input variables are used to predict a numerical target variable.
Classification: A task within supervised learning, in which an algorithm is used to define the relationship between data wherein the input variables and output variables are distinguishable, and the input variables are used to predict a categorical response variable.
Continuous (response variable): A response variable that is numerical.
Categorical (response variable): A response variable that is binary or represents a variable that has >1 classification (eg, gender).
LASSO regression: A regression method used to eliminate multicollinearity and aid in variable selection through shrinkage, which enhances the prediction accuracy and interpretability of results for linear regression models.
Linear regression: A predictive modeling technique used to define and investigate the relationship between the chosen explanatory variables and a continuous response variable. For this technique the regression line is linear in nature.
Logistic regression: A predictive modeling technique used to find the probability of an event. This technique establishes the relationship between the chosen explanatory variables and a binary response variable.
Naïve Bayes: A predictive algorithm used for classification tasks within machine learning.
Decision trees: A predictive machine learning technique that can be used for both regression and classification tasks, in which a decision tree is used to represent decisions made for a particular outcome.
Random forests: A predictive machine learning technique that can be used for both regression and classification tasks, which uses a multitude of randomly constructed, uncorrelated decision trees to provide a prediction from the average vote among the trees in the forest.
k-nearest neighbors: An algorithm within supervised learning that is used for predictive purposes and can be used for regression and classification tasks. When using k-nearest neighbors to make predictions, the algorithm arrives at the desired outcome by searching through the data set for neighbors (data with similar instances) and summarizes the response variable for those neighbors.
LDA: A dimensionality reduction technique used for supervised learning, in which the data set dimension is reduced by eliminating redundant explanatory variables.

Unsupervised machine learning
Unsupervised learning: An experience within machine learning, in which an algorithm is used to infer a function from training data wherein the explanatory variables are labeled, but the response variable is not labeled.
Clustering analysis: A task within machine learning, in which an algorithm is used to group data that share similar features.
Dimensionality reduction: A task within machine learning used to reduce the number of explanatory variables in a chosen data set. This feature is used to assess the possibility of overfitting. The more variables a dataset has, the more complex a model can become. This task assists with the elimination of variables that explain the same “concept” and noise.
Affinity propagation: A clustering algorithm that creates clusters of data points with similar features, by sending messages between the data points until convergence.
Expectation maximization clustering: An iterative clustering algorithm used in unsupervised learning that can be used to predict the values from the probability distribution of latent variables (variables that are not directly observable but are inferred from the variables that are observable).
K-means clustering: A clustering algorithm used in unsupervised learning for the task of clustering analysis, which takes data points and puts them into a defined number of clusters (“k” clusters).
PCA: A statistical technique used in dimensionality reduction to eliminate intercorrelations among the chosen explanatory variables within a data set (multicollinearity).
GDA: A dimensionality reduction technique used to extract nonlinear features in such a way that nonuseful information is removed from the data, and class separability is increased through maximizing between-class distinction and minimizing within-class distinction.

GDA, generalized discriminant analysis; LASSO, least absolute shrinkage and selection operator; LDA, Linear Discriminant Analysis; PCA, principal component analysis.

UNSUPERVISED LEARNING

Unlike supervised learning, in which there is a more defined relationship between the response and explanatory variables, the domain of the data used for unsupervised learning is not as restricted. Restricting the domain is not important because the goal of unsupervised learning is to look for patterns within the data. For unsupervised learning, therefore, the response variable is not explicitly defined. Within unsupervised learning, there are 2 tasks, known as clustering analysis and dimensionality reduction. Clustering analysis discovers groups of observations that are related. These groups are known as clusters, and within each cluster the observations are alike according to some defined measure of similarity. Dimensionality reduction is used for cases of multicollinearity. Multicollinearity is a phenomenon that occurs when a data set has a plethora of explanatory variables and intercorrelations exist among them.
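The clustering analysis task described above can be sketched with a small K-means example in Python. This is an illustrative sketch under stated assumptions rather than a reference implementation: the data are synthetic and hypothetical (2 well-separated groups of 2-dimensional points), similarity is assumed to be squared Euclidean distance, and the initial cluster centers are chosen by a simple deterministic farthest-point rule, a simplification adopted here in place of the random initialization that K-means implementations commonly use. All function names are invented for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance, used here as the similarity measure."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_point_init(points, k):
    """Assumed initialization: start from the first point, then repeatedly
    add the point farthest from the centers chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        next_point = max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids),
        )
        centroids.append(next_point)
    return centroids


def k_means(points, k, max_iterations=100):
    """Group points into k clusters; returns (centroids, assignments)."""
    centroids = farthest_point_init(points, k)
    assignments = []
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each center moves to the mean of its cluster members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(coords) / len(members) for coords in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's center
        if new_centroids == centroids:
            break  # converged: centers no longer move
        centroids = new_centroids
    return centroids, assignments


# Hypothetical, unlabeled data: 20 points near (0, 0) and 20 points near (10, 10).
rng = random.Random(1)
group_a = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(20)]
group_b = [(10 + rng.uniform(-1, 1), 10 + rng.uniform(-1, 1)) for _ in range(20)]
points = group_a + group_b

centroids, assignments = k_means(points, k=2)
print("cluster centers:", centroids)
print("cluster sizes:", [assignments.count(c) for c in range(2)])
```

No response variable is supplied at any point: the algorithm receives only the explanatory values and the number of clusters, and the grouping emerges from the similarity measure alone.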
An example of using unsupervised learning techniques in transplantation can be found in the paper “Intragraft Antiviral-Specific Gene Expression as a Distinctive Transcriptional Signature for Studies in Polyomavirus-Associated Nephropathy.” [6] In this study, unsupervised hierarchical clustering analysis was used to determine similarities between polyomavirus nephropathy and T cell–mediated rejection gene expression. Principal component analysis was used to confirm separation between the 3 different clinical phenotypes for polyomavirus nephropathy–specific genes. For cluster analysis and dimensionality reduction, some possible algorithms of choice are affinity propagation, expectation maximization clustering, K-means clustering, and principal component analysis. A detailed description of the possible choices of algorithm for each task within unsupervised learning can be found in Table 1.

ADVANTAGES AND DISADVANTAGES OF SUPERVISED AND UNSUPERVISED LEARNING

The question a researcher wants to answer, and the type of data that are available, determine whether supervised learning or unsupervised learning is the best approach. There are advantages and disadvantages to both. One major advantage of supervised learning is that its results tend to be more accurate because the data are labeled before analysis. The key to labeling data is to understand the objectives and properties of the data collected in order to assign variable names. Because the data are labeled before analysis, the results are also easier to interpret. Another advantage of supervised learning is the ability to understand how the algorithm or technique of choice learns as the relationship between the inputs and the output variable is established. Because the data are labeled, you can also determine before analysis how many classes you want.
A disadvantage of supervised learning is the chance of misclassifying new inputs, because the algorithms are trained only on the given data. Another disadvantage is that computation can require considerable time for very large data sets. Also, the data cannot be clustered or classified freely, because the features are predefined rather than left to the machine to define. Finally, because results are based on data with predefined domains and structures, the amount of information and insight that can be gained from supervised learning techniques is limited.

For unsupervised learning, an advantage is that the data do not have to be labeled or structured beforehand. This is a major advantage because labeling data can take time, and most available data do not have labels or structure. Another advantage is that the techniques of unsupervised learning can be used to establish relationships and patterns unrecognized by humans. A disadvantage of unsupervised learning is that the accuracy of its results may not be as reliable as that of supervised learning, because the data are not labeled or structured. For the same reason, the outputs of unsupervised learning are not clearly known in advance.

STRENGTHS AND LIMITATIONS OF MACHINE LEARNING

The tools within machine learning make it very versatile and applicable to many different types of datasets. However, machine learning, as with any other analytic tool, has strengths and limitations.
The strengths of machine learning are the ability to identify patterns and trends that cannot be detected by the human eye or by more classical statistical techniques, and the ability to handle multidimensional and multivariety data. As machine learning algorithms gain experience, their accuracy and efficiency increase, making predictions faster and more accurate as the volume of data increases. The limitations of machine learning, however, lie in the interpretation of the results. Although machine learning can be used to build powerful models for pattern recognition or prediction, sometimes these patterns may not have biologic context or plausibility. Although robust patterns can be generated, the results may at times be overinterpreted or may have no practical context or application. When building training models for the tool of choice in machine learning, large amounts of data are needed to train the algorithm or model accurately and to decrease the chance of overfitting. Moreover, without robust validation, the nature of many of these techniques may lead to associations that cannot be reproduced. In the end, results generated by machine learning need to be interpreted like those of any other analysis and are not correct per se simply because the method is powerful.

CONCLUSION

Machine learning is a term that is often stated as a tool for analysis and prediction, especially in research, but is not explained in detail. The overall understanding of machine learning is that there are 2 common learning experiences within it: supervised and unsupervised learning. The experience is chosen depending on the data available, as well as the question of interest or the type of problem the investigator would like to solve. Within supervised and unsupervised learning, there are tasks with tools that can address the problem or question of interest with considerable accuracy and precision.
What determines which learning experience to choose is the data available and the question of interest that the researchers or investigators would like to answer. For solving a variety of problems using machine learning, both free and paid software are available. The statistical package R, Python, and WEKA, to name a few, are examples of software used for machine learning. The goal of this primer is to introduce the reader to the basic structure and concept of machine learning, while giving examples of how machine learning is used in transplantation. The techniques and algorithms included in this article are not an exhaustive list, and the reader is encouraged to do individual research on the topic for a more in-depth exploration. Machine learning is used in many different disciplines and, since its debut, has continued to grow in techniques as well as applications. Links are provided for more information about machine learning, such as the strengths and weaknesses of each machine learning experience and of each technique.
