Introduction to Machine Learning
Recordings from STAT 451: Introduction to Machine Learning (FS 2020) at the University of Wisconsin-Madison
- Part I: Introduction
- Part II: Computational Foundations
- Part III: Tree-Based Methods
- Part IV: Evaluation
- L08: Model Evaluation Part 1 – Basics: Underfitting & Overfitting
- L09: Model Evaluation Part 2 – Resampling Methods
- L10: Model Evaluation Part 3 – Cross-Validation
- L11: Model Evaluation Part 4 – Statistical Tests and Algorithm Selection
- L12: Model Evaluation Part 5 – Evaluation Metrics
- Student Presentations
Part I: Introduction
L01: What is Machine Learning
1.1 Course overview
Course overview and introduction to the course “Stat 451: Introduction to Machine Learning (FS 2020).”
1.2 What is Machine Learning
The definition of machine learning and how machine learning is related to programming.
1.3 Categories of Machine Learning
Discussion of the three broad categories of machine learning: supervised learning, unsupervised learning, and reinforcement learning.
1.4 Notation
Machine learning formalities and notation that we will be using in this course.
1.5 ML application
The main steps for approaching a machine learning application, along with a categorization of the different components of a machine learning system.
1.6 ML motivation
Different perspectives and motivations regarding studying machine learning.
L02: Nearest Neighbor Methods
2.1 Introduction to NN
Introducing nearest neighbor methods, going over some applications of nearest neighbors and covering the 1-nearest neighbor algorithm.
2.2 Nearest neighbor decision boundary
Covering the intuition behind the 1-nearest neighbor’s decision boundary and listing some of the common distance measures.
2.3 K-nearest neighbors
Extending the 1-nearest neighbor concepts to the k-nearest neighbors method for classification and regression.
2.4 Big O of K-nearest neighbors
Looking at the Big-O runtime complexity of a naive implementation of k-nearest neighbors.
2.5 Improving k-nearest neighbors
Summarizing some of the common tricks for making k-nearest neighbors more efficient in terms of computational performance and predictive performance.
2.6 K-nearest neighbors in Python
Using k-nearest neighbors in Python using scikit-learn. The Jupyter Notebook referenced in this video is available at https://github.com/rasbt/stat451-machine-learning-fs20/blob/master/L02/code/02_knn_demo.ipynb.
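For orientation, here is a minimal sketch of the kind of scikit-learn usage covered in this video; the Iris dataset and k=5 are illustrative choices, not necessarily those used in the linked notebook.

```python
# Minimal sketch (not the notebook's exact code): k-NN classification with scikit-learn
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y)

knn = KNeighborsClassifier(n_neighbors=5)  # k=5; Euclidean (Minkowski, p=2) distance by default
knn.fit(X_train, y_train)
print(f"Test accuracy: {knn.score(X_test, y_test):.3f}")
```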
Part II: Computational Foundations
L03: (Optional) Python Programming
3.1 Python overview
Talking about the use of Python in this course. I will also show a quick demo comparing C (a statically typed language) and Python. It’s probably not the most exciting lecture :).
3.2 Python setup
Demonstrating how to install Python using Miniconda on macOS. Also, I provide a brief demo of the conda package manager.
3.3 Running Python code
Showing the different ways of running Python code: the REPL, IPython, .py scripts, and Visual Studio Code.
L04: Scientific Computing in Python
4.1 Intro to NumPy
Introducing NumPy on a basic level before diving into more details in the following videos.
4.2 NumPy Array Construction and Indexing
4.3 NumPy Array Math and Universal Functions
4.4 NumPy Broadcasting
4.5 NumPy Advanced Indexing – Memory Views and Copies
4.6 NumPy Random Number Generators
4.7 Reshaping NumPy Arrays
4.8 NumPy Comparison Operators and Masks
4.9 NumPy Linear Algebra Basics
4.10 Matplotlib
L05: Machine Learning with Scikit-Learn
5.1 Reading a Dataset from a Tabular Text File
5.2 Basic data handling
5.3 Object Oriented Programming & Python Classes
5.4 Intro to Scikit-learn
5.5 Scikit-learn Transformer API
5.6 Scikit-learn Pipelines
Part III: Tree-Based Methods
L06: Decision Trees
6.1 Intro to Decision Trees
6.2 Recursive algorithms & Big-O
6.3 Types of decision trees
6.4 Splitting criteria
6.5 Gini & Entropy versus misclassification error
Explaining why we use entropy (or Gini) instead of the misclassification error as impurity metric in the information gain equation of CART decision trees.
6.6 Improvements & dealing with overfitting
Covering some issues with decision trees (like overfitting) and discussing some improvements such as the gain ratio, pre-pruning, and post-pruning.
6.7 Code Example
Showing a quick demo of how to train and visualize a decision tree with scikit-learn.
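As a reference for this demo, a minimal sketch of training and plotting a tree with scikit-learn (the dataset and depth limit are illustrative, not necessarily what the video uses):

```python
# Minimal sketch: training and visualizing a decision tree with scikit-learn
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=1)
tree.fit(X, y)

plot_tree(tree, filled=True)  # renders the fitted tree's splits and per-leaf class counts
plt.show()
```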
L07: Ensemble Methods
7.1 Intro to ensemble methods
Discussing ensemble methods, including majority voting, bagging, random forests, stacking, and gradient boosting – those are some of the most popular and widely used applied ML methods of all time! :)
7.2 Majority Voting
Going over one of the most basic cases of model ensembles, majority voting. Using a toy example (making certain assumptions), we see why majority voting can be better than using a single classifier alone.
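The toy argument can be reproduced numerically: assuming an odd number of base classifiers that err independently with the same probability, the majority vote is wrong only if more than half of them are wrong. A hedged sketch (the function name and the numbers are illustrative):

```python
# Toy illustration: error rate of a majority-vote ensemble of n independent
# base classifiers, each with error rate epsilon (independence is the key assumption)
from math import comb

def ensemble_error(n, epsilon):
    k_min = n // 2 + 1  # minimum number of wrong votes for the majority to be wrong (odd n)
    return sum(comb(n, k) * epsilon**k * (1 - epsilon)**(n - k) for k in range(k_min, n + 1))

print(ensemble_error(n=11, epsilon=0.25))  # roughly 0.034, well below the base error of 0.25
```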
7.3 Bagging
Looking at bagging (bootstrap aggregating) and also introducing the bias-variance trade-off and decomposition in order to understand why bagging is useful.
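A minimal usage sketch of bagging in scikit-learn, assuming unpruned decision trees as the high-variance base learners (an illustrative setup, not the lecture's exact code):

```python
# Minimal sketch: bagging with scikit-learn
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# each tree is fit on a bootstrap sample drawn with replacement from the training set
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=1)
print(cross_val_score(bag, X, y, cv=5).mean())
```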
7.4 Boosting and AdaBoost
Discussing the general concept behind boosting – one of the model ensembling approaches in machine learning. Then, it goes over an early boosting algorithm called adaptive boosting (AdaBoost), which boosts weak learners (i.e., decision tree stumps) to strong classifiers.
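For reference, a minimal sketch of AdaBoost with decision-tree stumps as the weak learners (dataset and hyperparameters are illustrative):

```python
# Minimal sketch: AdaBoost with decision-tree stumps (max_depth=1) as weak learners
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=100, random_state=1)
print(cross_val_score(ada, X, y, cv=5).mean())
```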
7.5 Gradient Boosting
In this video, we will take the concept of boosting a step further and talk about gradient boosting. Where AdaBoost uses weights for training examples to boost the trees in the next round, gradient boosting uses the gradients of the loss to compute residuals on which the next tree in the sequence is fit.
XGBoost paper mentioned in the video: https://dl.acm.org/doi/pdf/10.1145/2939672.2939785
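A minimal usage sketch of gradient boosting in scikit-learn (XGBoost and similar libraries implement the same core idea with additional engineering); the dataset and hyperparameters are illustrative:

```python
# Minimal sketch: gradient boosting, where each new tree is fit to the
# residuals/gradients of the loss of the current ensemble
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=1)
print(cross_val_score(gbm, X, y, cv=5).mean())
```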
7.6 Random Forests
Discussing random forests, how random forests are related to bagging, and why random forests might perform better than bagging in practice.
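A minimal sketch of the relationship to bagging: a random forest is bagging of trees plus random feature subsets at each split (the dataset here is illustrative):

```python
# Minimal sketch: random forest with scikit-learn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(
    n_estimators=100,
    max_features="sqrt",  # per-split feature subsampling, which distinguishes it from plain bagging
    random_state=1)
print(cross_val_score(forest, X, y, cv=5).mean())
```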
7.7 Stacking
Explaining Wolpert’s stacking algorithm (stacked generalization) and showing how to use stacking classifiers in mlxtend and scikit-learn.
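A minimal sketch using scikit-learn's StackingClassifier (mlxtend provides similar stacking estimators); the base learners and meta-classifier below are illustrative choices:

```python
# Minimal sketch: stacking with scikit-learn's StackingClassifier
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
stack = StackingClassifier(
    estimators=[("knn", KNeighborsClassifier()), ("rf", RandomForestClassifier(random_state=1))],
    final_estimator=LogisticRegression(),  # meta-classifier fit on cross-validated level-1 predictions
    cv=5)
print(cross_val_score(stack, X, y, cv=5).mean())
```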
Part IV: Evaluation
L08: Model Evaluation Part 1 – Basics: Underfitting & Overfitting
8.1 Intro to overfitting and underfitting
A brief overview of the topics to be covered in the model evaluation lectures, followed by an introduction to overfitting and underfitting.
8.2 Intuition behind bias and variance
Providing some intuition behind the terms bias and variance in the context of bias-variance decomposition and machine learning.
8.3 Bias-Variance Decomposition of the Squared Error
Decomposing the squared error loss into its bias and variance components.
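For reference, the decomposition can be written compactly. Assuming a target $y = f(x) + \epsilon$ with $\mathbb{E}[\epsilon] = 0$ and $\mathrm{Var}(\epsilon) = \sigma^2$, and a prediction $\hat{y}$ whose randomness comes from the training set, the expected squared error at a point $x$ is

$$
\mathbb{E}\big[(y - \hat{y})^2\big]
= \underbrace{\big(f(x) - \mathbb{E}[\hat{y}]\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{y} - \mathbb{E}[\hat{y}])^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{noise}}
$$

If the targets are assumed noise-free, the $\sigma^2$ term drops and the loss reduces to squared bias plus variance.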
8.4 Bias and Variance vs Overfitting and Underfitting
Discussing the connection between bias & variance and overfitting & underfitting.
8.5 Bias-Variance Decomposition of the 0/1 Loss
Discussing the tricky topic of decomposing the 0/1 loss into bias and variance terms.
8.6 Different Uses of the Term “Bias”
Discussing the different uses of the term “bias” in machine learning and introducing the concepts of machine learning bias and fairness bias.
L09: Model Evaluation Part 2 – Resampling Methods
9.1 Introduction
Going over the contents covered in L09 (issues with the holdout method, resampling methods, and confidence intervals). Then, it introduces some of the motivations behind model evaluation.
9.2 Holdout Evaluation
Using a test set for estimating the generalization performance of a model. Technically, an independent test set can provide an unbiased estimate, but we can see that in practice it can actually be pessimistically or optimistically biased.
9.3 Holdout Model Selection
After discussing the holdout method for model evaluation in the previous video, this video covers the holdout method for model selection (aka hyperparameter tuning).
9.4 ML Confidence Intervals via Normal Approximation
The simplest way of making confidence intervals for machine learning classifiers using the test set performance: normal approximation intervals.
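As a hedged sketch of the formula discussed here: with n test examples and observed accuracy ACC, the normal-approximation interval is ACC ± z · sqrt(ACC(1 − ACC)/n), with z ≈ 1.96 for 95% confidence. The numbers below are illustrative:

```python
# Minimal sketch: 95% normal-approximation confidence interval for a test-set accuracy
import math

acc = 0.86   # observed test accuracy (illustrative value)
n = 500      # number of test examples (illustrative value)
z = 1.96     # standard normal quantile for a 95% interval

half_width = z * math.sqrt(acc * (1 - acc) / n)
print(f"{acc:.3f} +/- {half_width:.3f}")
```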
9.5 Resampling and Repeated Holdout
Covering learning curves and how to assess whether a model can benefit from more data, then introducing the repeated holdout method.
9.6 Bootstrap Confidence Intervals
The Leave One Out Bootstrap (i.e., computing the model performances on out-of-bag samples) for constructing confidence intervals.
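A hedged sketch of the procedure (an illustration of the idea, not the lecture's exact code): repeatedly draw a bootstrap sample of the training indices, fit the model on it, score it on the examples that were never drawn (the out-of-bag examples), and take percentiles of the resulting score distribution.

```python
# Out-of-bag bootstrap: percentile confidence interval for a classifier's accuracy
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(1)
n = X.shape[0]
scores = []

for _ in range(200):                                # number of bootstrap rounds (illustrative)
    boot_idx = rng.integers(0, n, size=n)           # sample n indices with replacement
    oob_mask = np.ones(n, dtype=bool)
    oob_mask[boot_idx] = False                      # examples never drawn = out-of-bag
    if not oob_mask.any():
        continue
    model = LogisticRegression(max_iter=5000).fit(X[boot_idx], y[boot_idx])
    scores.append(model.score(X[oob_mask], y[oob_mask]))

lower, upper = np.percentile(scores, [2.5, 97.5])   # 95% percentile interval
print(f"95% CI: [{lower:.3f}, {upper:.3f}]")
```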
9.7 The .632 and .632+ Bootstrap methods
The .632 bootstrap, which addresses the pessimistic bias of the OOB bootstrap covered in the previous video. Then, we discuss the .632+ Bootstrap, which addresses the optimism bias introduced by the .632 method.
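As a reference for the weighting discussed here, the .632 estimate combines the (pessimistically biased) out-of-bag accuracy with the (optimistically biased) resubstitution accuracy on the training set,

$$\text{ACC}_{.632} = 0.632 \cdot \text{ACC}_{\text{OOB}} + 0.368 \cdot \text{ACC}_{\text{resub}},$$

while the .632+ variant replaces the fixed 0.632 weight with a data-dependent weight based on the estimated degree of overfitting.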
L10: Model Evaluation Part 3 – Cross-Validation
10.1 Cross-validation lecture overview
Going over the topics we are going to cover in this lecture: cross-validation and model selection. Also, it gives a big-picture overview discussing recommended techniques for model evaluation and model selection.
10.2 Hyperparameters
Recapping the concept of hyperparameters.
10.3 k-fold CV for model evaluation
Introducing the concept of k-fold cross-validation and explaining how it can be used for evaluating models. Also, it discusses why 10-fold cross-validation is a good choice (compared to 2-fold and 5-fold CV as well as leave-one-out cross-validation).
10.4 k-fold CV for model eval. code examples
Explaining how we can evaluate models via k-fold cross-validation in Python using scikit-learn. A later video will show how we can use k-fold cross-validation for hyperparameter tuning and model selection.
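For reference, a minimal sketch of k-fold cross-validation for model evaluation in scikit-learn (the dataset and model are illustrative):

```python
# Minimal sketch: 10-fold cross-validation for model evaluation with scikit-learn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)   # stratified folds for classification
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores.mean(), scores.std())
```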
10.5 k-fold CV for model selection
After talking about k-fold cross-validation for model evaluation in the last two videos, we are now going to talk about k-fold cross-validation for model selection, including hyperparameter tuning techniques such as grid search and randomized search.
10.6 k-fold CV for model selection code examples
Looking at code examples for using k-fold cross-validation for model selection. In particular, we are looking at GridSearchCV and RandomizedSearchCV in scikit-learn. Jupyter notebook link: https://github.com/rasbt/stat451-machine-learning-fs20/blob/master/L10/code/10_06_kfold-sele.ipynb.
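A minimal sketch of the GridSearchCV pattern shown here (RandomizedSearchCV is analogous but samples a fixed number of configurations instead of an exhaustive grid); the model, grid, and dataset are illustrative, not the notebook's exact setup:

```python
# Minimal sketch: hyperparameter tuning via GridSearchCV
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, stratify=y)

grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7, 9], "weights": ["uniform", "distance"]},
    cv=10)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
print("Test accuracy:", grid.score(X_test, y_test))  # best model is refit on the full training set by default
```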
10.7 k-fold CV 1-standard error method
Suggesting the 1-standard error method as a tie breaker for selecting one model from a set of similarly well performing models.
10.8 k-fold CV 1-standard error method code example
Going over a code example for applying the 1-standard error method, which can be used as a tie breaker for selecting one model from a set of similarly well performing models. Jupyter notebook link: https://github.com/rasbt/stat451-machine-learning-fs20/blob/master/L10/code/10_08_1stderr.ipynb
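A hedged sketch of the idea (not the notebook's code): among models whose mean CV score lies within one standard error of the best mean score, prefer the simplest one; here "simplest" is taken to be the most strongly regularized logistic regression.

```python
# 1-standard-error method: pick the simplest model within one SE of the best CV score
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
C_values = [0.001, 0.01, 0.1, 1.0, 10.0]  # smaller C = stronger regularization = "simpler" model

results = []
for C in C_values:
    model = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=5000))
    scores = cross_val_score(model, X, y, cv=10)
    results.append((C, scores.mean(), scores.std(ddof=1) / np.sqrt(len(scores))))

best_mean = max(mean for _, mean, _ in results)
best_se = next(se for _, mean, se in results if mean == best_mean)
# simplest (smallest C) model whose mean CV score is within 1 SE of the best mean
chosen = min(C for C, mean, _ in results if mean >= best_mean - best_se)
print("Selected C:", chosen)
```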
L11: Model Evaluation Part 4 – Statistical Tests and Algorithm Selection
11.1 Lecture Overview
Going over the model and algorithm comparison-related topics that are covered in Lecture 11.
11.2 McNemar’s Test for Pairwise Classifier Comparison
Introducing McNemar’s test, which is a nonparametric statistical test for comparing the performance of two models with each other on a given test set.
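A hedged sketch of the test statistic with the common continuity correction: McNemar's test only uses the two cells of the 2x2 table where the models disagree (the counts below are illustrative).

```python
# McNemar's test from the disagreement counts of two classifiers on the same test set:
# b = cases model 1 got right and model 2 got wrong, c = the reverse
from scipy.stats import chi2

b, c = 11, 25  # illustrative disagreement counts

stat = (abs(b - c) - 1) ** 2 / (b + c)   # chi-squared statistic with continuity correction
p_value = chi2.sf(stat, df=1)            # survival function = 1 - CDF, one degree of freedom
print(f"chi2 = {stat:.3f}, p = {p_value:.4f}")
```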
11.3 Multiple Pairwise Comparisons
Extending McNemar’s test, which is a pairwise procedure, to multiple pairwise comparisons, and recommending Cochran’s Q test (a generalization of McNemar’s test) as an omnibus test.
11.4 Statistical Tests for Algorithm Comparison
Giving a brief overview of different statistical tests that exist for model and algorithm comparisons.
11.5 Nested CV for Algorithm Selection
Introducing the main concept behind nested cross-validation for algorithm selection.
11.6 Nested CV for Algorithm Selection Code Example
Picking up where the previous video left off, this video goes over nested cross-validation by looking at a scikit-learn code example.
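A minimal sketch of the nested structure (an illustration, not the video's exact code): an inner GridSearchCV loop tunes hyperparameters, while an outer cross-validation loop estimates the performance of the whole tuning procedure.

```python
# Minimal sketch: nested cross-validation with scikit-learn
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1.0, 10.0]}, cv=5)  # inner 5-fold tuning loop
outer_scores = cross_val_score(inner, X, y, cv=5)                      # outer 5-fold evaluation loop
print(outer_scores.mean(), outer_scores.std())
```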
L12: Model Evaluation Part 5 – Evaluation Metrics
12.0 Lecture Overview
This first video in L12 gives an overview of what’s going to be covered in L12.
12.1 Confusion Matrix
Going over the concept of a confusion matrix and how it relates to the true positive and false positive rates, among others.
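For reference, a minimal sketch of computing a binary confusion matrix with scikit-learn (the labels are illustrative):

```python
# Minimal sketch: a binary confusion matrix with scikit-learn
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 1, 0]  # illustrative labels
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

# scikit-learn's layout: rows = true class, columns = predicted class,
# so for labels [0, 1] the matrix reads [[TN, FP], [FN, TP]]
print(confusion_matrix(y_true, y_pred))
```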
12.2 Precision, Recall, and F1 Score
Looking at binary performance metrics such as precision, recall, and the F1 score.
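A minimal usage sketch with scikit-learn (same illustrative labels as above):

```python
# Minimal sketch: precision, recall, and F1 score with scikit-learn
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

print(precision_score(y_true, y_pred))  # TP / (TP + FP)
print(recall_score(y_true, y_pred))     # TP / (TP + FN)
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall
```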
12.3 Balanced Accuracy
Discussing the balanced accuracy (also known as the average-per-class accuracy), which is an alternative to the standard accuracy and can be useful in the context of class imbalance.
12.4 Receiver Operating Characteristic
Explaining the concept behind receiver operating characteristic curves, relating it back to the concept of true and false positive rates.
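For reference, a minimal sketch of plotting an ROC curve from predicted probabilities (the dataset and model are illustrative):

```python
# Minimal sketch: ROC curve and area under the curve from predicted probabilities
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, stratify=y)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]        # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, proba)  # false/true positive rates across thresholds
print("AUC:", roc_auc_score(y_test, proba))
plt.plot(fpr, tpr)
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()
```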
12.5 Extending Binary Metric to Multiclass Problems
This last video discusses how binary classifiers can be extended to multiclass settings. Then, it discusses how binary evaluation metrics can be extended to multiclass problems, e.g., via micro- and macro-averaging.
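A minimal sketch of the averaging options in scikit-learn, using the F1 score as an example (the labels are illustrative):

```python
# Minimal sketch: extending a binary metric (F1) to a multiclass setting via averaging
from sklearn.metrics import f1_score

y_true = [0, 1, 2, 2, 1, 0, 2, 1]  # illustrative 3-class labels
y_pred = [0, 2, 2, 2, 1, 0, 1, 1]

print(f1_score(y_true, y_pred, average="macro"))  # unweighted mean of the per-class F1 scores
print(f1_score(y_true, y_pred, average="micro"))  # pools TP/FP/FN over all classes first
```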
Student Presentations
These presentations are shared with the students’ permission.
Modeling COVID Positivity Rates at U.S. College Campuses (Student Presentation, Group 16)
Using News to Predict Stock Movement (Student Presentation, Group 12)
Using Machine Learning to Predict NBA Games (Student Presentation, Group 22)
Machine Learning-Based Authorship Identification in Web Fictions (Student Presentation, Group 17)
Unemployment Rate Forecasting using Machine Learning (Student Presentation, Group 3)
Machine Learning for Characterizing Climate-related Disasters (Student Presentation, Group 20)
Predicting Pitch Outcomes in Major League Baseball (Student Presentation, Group 11)
Twitter Posts Political Ideology Classification (Student Presentation, Group 15)