Introduction to Machine Learning
Recordings from STAT 451: Introduction to Machine Learning (FS 2020) at the University of Wisconsin-Madison
- Part I: Introduction
- Part II: Computational Foundations
- Part III: Tree-Based Methods
- Part IV: Evaluation
- L08: Model Evaluation Part 1 – Basics: Underfitting & Overfitting
- L09: Model Evaluation Part 2 – Resampling Methods
- L10: Model Evaluation Part 3 – Cross-Validation
- L11: Model Evaluation Part 4 – Statistical Tests and Algorithm Selection
- L12: Model Evaluation Part 5 – Evaluation Metrics
- Student Presentations
Part I: Introduction
L01: What is Machine Learning
1.1 Course overview
Course overview and introduction to the course “Stat 451: Introduction to Machine Learning (FS 2020).”
1.2 What is Machine Learning
The definition of machine learning and how machine learning is related to programming.
1.3 Categories of Machine Learning
Discussion of the three broad categories of machine learning: supervised learning, unsupervised learning, and reinforcement learning.
1.4 Notation
Machine learning formalities and notation that we will be using in this course.
1.5 ML application
The main steps for approaching a machine learning application, along with a categorization of the different components of a machine learning system.
1.6 ML motivation
Different perspectives and motivations regarding studying machine learning.
L02: Nearest Neighbor Methods
2.1 Introduction to NN
Introducing nearest neighbor methods, going over some applications of nearest neighbors and covering the 1-nearest neighbor algorithm.
2.2 Nearest neighbor decision boundary
Covering the intuition behind the 1-nearest neighbor’s decision boundary and listing some of the common distance measures.
2.3 K-nearest neighbors
Extending the 1-nearest neighbor concepts to the k-nearest neighbors method for classification and regression.
2.4 Big O of K-nearest neighbors
Looking at the Big-O runtime complexity of a naive implementation of k-nearest neighbors.
2.5 Improving k-nearest neighbors
Summarizing some of the common tricks for making k-nearest neighbors more efficient in terms of computational performance and predictive performance.
2.6 K-nearest neighbors in Python
Using k-nearest neighbors in Python using scikit-learn. The Jupyter Notebook referenced in this video is available at https://github.com/rasbt/stat451-machine-learning-fs20/blob/master/L02/code/02_knn_demo.ipynb.
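For orientation, here is a minimal sketch of the kind of scikit-learn usage covered in this video; the Iris dataset and k=5 are illustrative choices, not necessarily those used in the linked notebook.

```python
# Minimal sketch (not the notebook's exact code): k-NN classification with scikit-learn
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y)

knn = KNeighborsClassifier(n_neighbors=5)  # k=5; Euclidean (Minkowski, p=2) distance by default
knn.fit(X_train, y_train)
print(f"Test accuracy: {knn.score(X_test, y_test):.3f}")
```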
Part II: Computational Foundations
L03: (Optional) Python Programming
3.1 Python overview
Talking about the use of Python in this course. I will also show a quick demo comparing C (a statically typed language) and Python. It’s probably not the most exciting lecture :).
3.2 Python setup
Demonstrating how to install Python using Miniconda on macOS. Also, I provide a brief demo of the conda package manager.
3.3 Running Python code
Showing the different ways of running Python code: the REPL, IPython, .py scripts, and Visual Studio Code.
L04: Scientific Computing in Python
4.1 Intro to NumPy
Introducing NumPy on a basic level before diving into more details in the following videos.
4.2 NumPy Array Construction and Indexing
4.3 NumPy Array Math and Universal Functions
4.4 NumPy Broadcasting
4.5 NumPy Advanced Indexing – Memory Views and Copies
4.6 NumPy Random Number Generators
4.7 Reshaping NumPy Arrays
4.8 NumPy Comparison Operators and Masks
4.9 NumPy Linear Algebra Basics
4.10 Matplotlib
L05: Machine Learning with Scikit-Learn
5.1 Reading a Dataset from a Tabular Text File
5.2 Basic data handling
5.3 Object Oriented Programming & Python Classes
5.4 Intro to Scikit-learn
5.5 Scikit-learn Transformer API
5.6 Scikit-learn Pipelines
Part III: Tree-Based Methods
L06: Decision Trees
6.1 Intro to Decision Trees
6.2 Recursive algorithms & Big-O
6.3 Types of decision trees
6.4 Splitting criteria
6.5 Gini & Entropy versus misclassification error
Explaining why we use entropy (or Gini) instead of the misclassification error as impurity metric in the information gain equation of CART decision trees.
6.6 Improvements & dealing with overfitting
Covering some issues with decision trees (like overfitting) and discussing some improvements such as the gain ratio, pre-pruning, and post-pruning.
6.7 Code Example
Showing a quick demo of how to train and visualize a decision tree with scikit-learn.
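As a reference for this demo, a minimal sketch of training and plotting a tree with scikit-learn (the dataset and depth limit are illustrative, not necessarily what the video uses):

```python
# Minimal sketch: training and visualizing a decision tree with scikit-learn
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=1)
tree.fit(X, y)

plot_tree(tree, filled=True)  # renders the fitted tree's splits and per-leaf class counts
plt.show()
```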
L07: Ensemble Methods
7.1 Intro to ensemble methods
Discussing ensemble methods, including majority voting, bagging, random forests, stacking, and gradient boosting – those are some of the most popular and widely used applied ML methods of all time! :)
7.2 Majority Voting
Going over one of the most basic cases of model ensembles, majority voting. Using a toy example (making certain assumptions), we see why majority voting can be better than using a single classifier alone.
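The toy argument can be reproduced numerically: assuming an odd number of base classifiers that err independently with the same probability, the majority vote is wrong only if more than half of them are wrong. A hedged sketch (the function name and the numbers are illustrative):

```python
# Toy illustration: error rate of a majority-vote ensemble of n independent
# base classifiers, each with error rate epsilon (independence is the key assumption)
from math import comb

def ensemble_error(n, epsilon):
    k_min = n // 2 + 1  # minimum number of wrong votes for the majority to be wrong (odd n)
    return sum(comb(n, k) * epsilon**k * (1 - epsilon)**(n - k) for k in range(k_min, n + 1))

print(ensemble_error(n=11, epsilon=0.25))  # roughly 0.034, well below the base error of 0.25
```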
7.3 Bagging
Looking at bagging (bootstrap aggregating) and also introducing the bias-variance trade-off and decomposition in order to understand why bagging is useful.
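A minimal usage sketch of bagging in scikit-learn, assuming unpruned decision trees as the high-variance base learners (an illustrative setup, not the lecture's exact code):

```python
# Minimal sketch: bagging with scikit-learn
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# each tree is fit on a bootstrap sample drawn with replacement from the training set
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=1)
print(cross_val_score(bag, X, y, cv=5).mean())
```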
7.4 Boosting and AdaBoost
Discussing the general concept behind boosting – one of the model ensembling approaches in machine learning. Then, it goes over an early boosting algorithm called adaptive boosting (AdaBoost), which boosts weak learners (i.e., decision tree stumps) to strong classifiers.
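For reference, a minimal sketch of AdaBoost with decision-tree stumps as the weak learners (dataset and hyperparameters are illustrative):

```python
# Minimal sketch: AdaBoost with decision-tree stumps (max_depth=1) as weak learners
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=100, random_state=1)
print(cross_val_score(ada, X, y, cv=5).mean())
```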
7.5 Gradient Boosting
In this video, we will take the concept of boosting a step further and talk about gradient boosting. Where AdaBoost uses weights for training examples to boost the trees in the next round, gradient boosting uses the gradients of the loss to compute residuals on which the next tree in the sequence is fit.
XGBoost paper mentioned in the video: https://dl.acm.org/doi/pdf/10.1145/2939672.2939785
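A minimal usage sketch of gradient boosting in scikit-learn (XGBoost and similar libraries implement the same core idea with additional engineering); the dataset and hyperparameters are illustrative:

```python
# Minimal sketch: gradient boosting, where each new tree is fit to the
# residuals/gradients of the loss of the current ensemble
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=1)
print(cross_val_score(gbm, X, y, cv=5).mean())
```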
7.6 Random Forests
Discussing random forests, how random forests are related to bagging, and why random forests might perform better than bagging in practice.
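A minimal sketch of the relationship to bagging: a random forest is bagging of trees plus random feature subsets at each split (the dataset here is illustrative):

```python
# Minimal sketch: random forest with scikit-learn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(
    n_estimators=100,
    max_features="sqrt",  # per-split feature subsampling, which distinguishes it from plain bagging
    random_state=1)
print(cross_val_score(forest, X, y, cv=5).mean())
```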
7.7 Stacking
Explaining Wolpert’s stacking algorithm (stacked generalization) and showing how to use stacking classifiers in mlxtend and scikit-learn.
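A minimal sketch using scikit-learn's StackingClassifier (mlxtend provides similar stacking estimators); the base learners and meta-classifier below are illustrative choices:

```python
# Minimal sketch: stacking with scikit-learn's StackingClassifier
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
stack = StackingClassifier(
    estimators=[("knn", KNeighborsClassifier()), ("rf", RandomForestClassifier(random_state=1))],
    final_estimator=LogisticRegression(),  # meta-classifier fit on cross-validated level-1 predictions
    cv=5)
print(cross_val_score(stack, X, y, cv=5).mean())
```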
Part IV: Evaluation
L08: Model Evaluation Part 1 – Basics: Underfitting & Overfitting
8.1 Intro to overfitting and underfitting
A brief overview of the topics to be covered in the model evaluation lectures, followed by an introduction to overfitting and underfitting.
8.2 Intuition behind bias and variance
Providing some intuition behind the terms bias and variance in the context of bias-variance decomposition and machine learning.
8.3 Bias-Variance Decomposition of the Squared Error
Decomposing the squared error loss into its bias and variance components.
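For reference, the decomposition can be written compactly. Assuming a target $y = f(x) + \epsilon$ with $\mathbb{E}[\epsilon] = 0$ and $\mathrm{Var}(\epsilon) = \sigma^2$, and a prediction $\hat{y}$ whose randomness comes from the training set, the expected squared error at a point $x$ is

$$
\mathbb{E}\big[(y - \hat{y})^2\big]
= \underbrace{\big(f(x) - \mathbb{E}[\hat{y}]\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{y} - \mathbb{E}[\hat{y}])^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{noise}}
$$

If the targets are assumed noise-free, the $\sigma^2$ term drops and the loss reduces to squared bias plus variance.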
8.4 Bias and Variance vs Overfitting and Underfitting
Discussing the connection between bias & variance and overfitting & underfitting.
8.5 Bias-Variance Decomposition of the 0/1 Loss
Discussing the tricky topic of decomposing the 0/1 loss into bias and variance terms.
8.6 Different Uses of the Term “Bias”
Discussing the different uses of the term “bias” in machine learning and introducing the concepts of machine learning bias and fairness bias.
L09: Model Evaluation Part 2 – Resampling Methods
9.1 Introduction
Going over the contents covered in L09 (issues with the holdout method, resampling methods, and confidence intervals). Then, it introduces some of the motivations behind model evaluation.
9.2 Holdout Evaluation
Using a test set for estimating the generalization performance of a model. Technically, an independent test set can provide an unbiased estimate, but we can see that in practice it can actually be pessimistically or optimistically biased.
9.3 Holdout Model Selection
After discussing the holdout method for model evaluation in the previous video, this video covers the holdout method for model selection (aka hyperparameter tuning).
9.4 ML Confidence Intervals via Normal Approximation
The simplest way of making confidence intervals for machine learning classifiers using the test set performance: normal approximation intervals.
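As a hedged sketch of the formula discussed here: with n test examples and observed accuracy ACC, the normal-approximation interval is ACC ± z · sqrt(ACC(1 − ACC)/n), with z ≈ 1.96 for 95% confidence. The numbers below are illustrative:

```python
# Minimal sketch: 95% normal-approximation confidence interval for a test-set accuracy
import math

acc = 0.86   # observed test accuracy (illustrative value)
n = 500      # number of test examples (illustrative value)
z = 1.96     # standard normal quantile for a 95% interval

half_width = z * math.sqrt(acc * (1 - acc) / n)
print(f"{acc:.3f} +/- {half_width:.3f}")
```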
9.5 Resampling and Repeated Holdout
Covering learning curves and how to assess whether a model can benefit from more data, then introducing the repeated holdout method.
9.6 Bootstrap Confidence Intervals
The Leave One Out Bootstrap (i.e., computing the model performances on out-of-bag samples) for constructing confidence intervals.
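A hedged sketch of the procedure (an illustration of the idea, not the lecture's exact code): repeatedly draw a bootstrap sample of the training indices, fit the model on it, score it on the examples that were never drawn (the out-of-bag examples), and take percentiles of the resulting score distribution.

```python
# Out-of-bag bootstrap: percentile confidence interval for a classifier's accuracy
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(1)
n = X.shape[0]
scores = []

for _ in range(200):                                # number of bootstrap rounds (illustrative)
    boot_idx = rng.integers(0, n, size=n)           # sample n indices with replacement
    oob_mask = np.ones(n, dtype=bool)
    oob_mask[boot_idx] = False                      # examples never drawn = out-of-bag
    if not oob_mask.any():
        continue
    model = LogisticRegression(max_iter=5000).fit(X[boot_idx], y[boot_idx])
    scores.append(model.score(X[oob_mask], y[oob_mask]))

lower, upper = np.percentile(scores, [2.5, 97.5])   # 95% percentile interval
print(f"95% CI: [{lower:.3f}, {upper:.3f}]")
```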
9.7 The .632 and .632+ Bootstrap methods
The .632 bootstrap, which addresses the pessimistic bias of the OOB bootstrap covered in the previous video. Then, we discuss the .632+ Bootstrap, which addresses the optimism bias introduced by the .632 method.
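As a reference for the weighting discussed here, the .632 estimate combines the (pessimistically biased) out-of-bag accuracy with the (optimistically biased) resubstitution accuracy on the training set,

$$\text{ACC}_{.632} = 0.632 \cdot \text{ACC}_{\text{OOB}} + 0.368 \cdot \text{ACC}_{\text{resub}},$$

while the .632+ variant replaces the fixed 0.632 weight with a data-dependent weight based on the estimated degree of overfitting.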
L10: Model Evaluation Part 3 – Cross-Validation
10.1 Cross-validation lecture overview
Going over the topics we are going to cover in this lecture: cross-validation and model selection. Also, it gives a big-picture overview discussing recommended techniques for model evaluation and model selection.
10.2 Hyperparameters
Recapping the concept of hyperparameters.
10.3 k-fold CV for model evaluation
Introducing the concept of k-fold cross-validation and explaining how it can be used for evaluating models. Also, it discusses why 10-fold cross-validation is a good choice (compared to 2-fold and 5-fold CV as well as leave-one-out cross-validation).
10.4 k-fold CV for model eval. code examples
Explaining how we can evaluate models via k-fold cross-validation in Python using scikit-learn. A later video will show how we can use k-fold cross-validation for hyperparameter tuning and model selection.
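For reference, a minimal sketch of k-fold cross-validation for model evaluation in scikit-learn (the dataset and model are illustrative):

```python
# Minimal sketch: 10-fold cross-validation for model evaluation with scikit-learn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)   # stratified folds for classification
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores.mean(), scores.std())
```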
10.5 k-fold CV for model selection
After talking about k-fold cross-validation for model evaluation in the last two videos, we are now going to talk about k-fold cross-validation for model selection, including hyperparameter tuning techniques such as grid search and randomized search.
10.6 k-fold CV for model selection code examples
Looking at code examples for using k-fold cross-validation for model selection. In particular, we are looking at GridSearchCV and RandomizedSearchCV in scikit-learn. Jupyter notebook link: https://github.com/rasbt/stat451-machine-learning-fs20/blob/master/L10/code/10_06_kfold-sele.ipynb.
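A minimal sketch of the GridSearchCV pattern shown here (RandomizedSearchCV is analogous but samples a fixed number of configurations instead of an exhaustive grid); the model, grid, and dataset are illustrative, not the notebook's exact setup:

```python
# Minimal sketch: hyperparameter tuning via GridSearchCV
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, stratify=y)

grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7, 9], "weights": ["uniform", "distance"]},
    cv=10)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
print("Test accuracy:", grid.score(X_test, y_test))  # best model is refit on the full training set by default
```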
10.7 k-fold CV 1-standard error method
Suggesting the 1-standard error method as a tie breaker for selecting one model from a set of similarly well performing models.
10.8 k-fold CV 1-standard error method code example
Going over a code example for applying the 1-standard error method, which can be used as a tie breaker for selecting one model from a set of similarly well performing models. Jupyter notebook link: https://github.com/rasbt/stat451-machine-learning-fs20/blob/master/L10/code/10_08_1stderr.ipynb
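A hedged sketch of the idea (not the notebook's code): among models whose mean CV score lies within one standard error of the best mean score, prefer the simplest one; here "simplest" is taken to be the most strongly regularized logistic regression.

```python
# 1-standard-error method: pick the simplest model within one SE of the best CV score
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
C_values = [0.001, 0.01, 0.1, 1.0, 10.0]  # smaller C = stronger regularization = "simpler" model

results = []
for C in C_values:
    model = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=5000))
    scores = cross_val_score(model, X, y, cv=10)
    results.append((C, scores.mean(), scores.std(ddof=1) / np.sqrt(len(scores))))

best_mean = max(mean for _, mean, _ in results)
best_se = next(se for _, mean, se in results if mean == best_mean)
# simplest (smallest C) model whose mean CV score is within 1 SE of the best mean
chosen = min(C for C, mean, _ in results if mean >= best_mean - best_se)
print("Selected C:", chosen)
```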
L11: Model Evaluation Part 4 – Statistical Tests and Algorithm Selection
11.1 Lecture Overview
Going over the model and algorithm comparison-related topics that are covered in Lecture 11.
11.2 McNemar’s Test for Pairwise Classifier Comparison
Introducing McNemar’s test, which is a nonparametric statistical test for comparing the performance of two models with each other on a given test set.
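A hedged sketch of the test statistic with the common continuity correction: McNemar's test only uses the two cells of the 2x2 table where the models disagree (the counts below are illustrative).

```python
# McNemar's test from the disagreement counts of two classifiers on the same test set:
# b = cases model 1 got right and model 2 got wrong, c = the reverse
from scipy.stats import chi2

b, c = 11, 25  # illustrative disagreement counts

stat = (abs(b - c) - 1) ** 2 / (b + c)   # chi-squared statistic with continuity correction
p_value = chi2.sf(stat, df=1)            # survival function = 1 - CDF, one degree of freedom
print(f"chi2 = {stat:.3f}, p = {p_value:.4f}")
```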
11.3 Multiple Pairwise Comparisons
Extending McNemar’s test, which is a pairwise procedure, to multiple pairwise comparisons, and recommending Cochran’s Q test (a generalization of McNemar’s test) as an omnibus test.
11.4 Statistical Tests for Algorithm Comparison
Giving a brief overview of different statistical tests that exist for model and algorithm comparisons.
11.5 Nested CV for Algorithm Selection
Introducing the main concept behind nested cross-validation for algorithm selection.
11.6 Nested CV for Algorithm Selection Code Example
Picking up where the previous video left off, this video goes over nested cross-validation by looking at a scikit-learn code example.
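A minimal sketch of the nested structure (an illustration, not the video's exact code): an inner GridSearchCV loop tunes hyperparameters, while an outer cross-validation loop estimates the performance of the whole tuning procedure.

```python
# Minimal sketch: nested cross-validation with scikit-learn
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1.0, 10.0]}, cv=5)  # inner 5-fold tuning loop
outer_scores = cross_val_score(inner, X, y, cv=5)                      # outer 5-fold evaluation loop
print(outer_scores.mean(), outer_scores.std())
```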
L12: Model Evaluation Part 5 – Evaluation Metrics
12.0 Lecture Overview
This first video in L12 gives an overview of what’s going to be covered in L12.
12.1 Confusion Matrix
Going over the concept of a confusion matrix and how it relates to the true positive and false positive rates, among others.
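For reference, a minimal sketch of computing a binary confusion matrix with scikit-learn (the labels are illustrative):

```python
# Minimal sketch: a binary confusion matrix with scikit-learn
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 1, 0]  # illustrative labels
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

# scikit-learn's layout: rows = true class, columns = predicted class,
# so for labels [0, 1] the matrix reads [[TN, FP], [FN, TP]]
print(confusion_matrix(y_true, y_pred))
```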
12.2 Precision, Recall, and F1 Score
Looking at binary performance metrics such as precision, recall, and the F1 score.
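A minimal usage sketch with scikit-learn (same illustrative labels as above):

```python
# Minimal sketch: precision, recall, and F1 score with scikit-learn
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

print(precision_score(y_true, y_pred))  # TP / (TP + FP)
print(recall_score(y_true, y_pred))     # TP / (TP + FN)
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall
```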
12.3 Balanced Accuracy
Discussing the balanced accuracy (also known as the average-per-class accuracy), which is an alternative to the standard accuracy and can be useful in the context of class imbalance.
12.4 Receiver Operating Characteristic
Explaining the concept behind receiver operating characteristic curves, relating it back to the concept of true and false positive rates.
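For reference, a minimal sketch of plotting an ROC curve from predicted probabilities (the dataset and model are illustrative):

```python
# Minimal sketch: ROC curve and area under the curve from predicted probabilities
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, stratify=y)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]        # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, proba)  # false/true positive rates across thresholds
print("AUC:", roc_auc_score(y_test, proba))
plt.plot(fpr, tpr)
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()
```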
12.5 Extending Binary Metric to Multiclass Problems
This last video discusses how binary classifiers can be extended to multiclass settings. Then, it discusses how binary evaluation metrics can be extended to multiclass problems, e.g., via micro- and macro-averaging.
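A minimal sketch of the averaging options in scikit-learn, using the F1 score as an example (the labels are illustrative):

```python
# Minimal sketch: extending a binary metric (F1) to a multiclass setting via averaging
from sklearn.metrics import f1_score

y_true = [0, 1, 2, 2, 1, 0, 2, 1]  # illustrative 3-class labels
y_pred = [0, 2, 2, 2, 1, 0, 1, 1]

print(f1_score(y_true, y_pred, average="macro"))  # unweighted mean of the per-class F1 scores
print(f1_score(y_true, y_pred, average="micro"))  # pools TP/FP/FN over all classes first
```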
Student Presentations
These presentations are shared with the students’ permission.
Modeling COVID Positivity Rates at U.S. College Campuses (Student Presentation, Group 16)
Using News to Predict Stock Movement (Student Presentation, Group 12)
Using Machine Learning to Predict NBA Games (Student Presentation, Group 22)
Machine Learning-Based Authorship Identification in Web Fictions (Student Presentation, Group 17)
Unemployment Rate Forecasting using Machine Learning (Student Presentation, Group 3)
Machine Learning for Characterizing Climate-related Disasters (Student Presentation, Group 20)
Predicting Pitch Outcomes in Major League Baseball (Student Presentation, Group 11)
Twitter Posts Political Ideology Classification (Student Presentation, Group 15)