End-of-the-semester Content

Presentations

After the successful conclusion of this semester, here are some project presentations from students who volunteered to share their work.

  • Modeling COVID Positivity Rates at U.S. College Campuses (Student Presentation, Group 16)
  • Twitter Posts Political Ideology Classification (Student Presentation, Group 15)
  • Using News to Predict Stock Movement (Student Presentation, Group 12)
  • Using Machine Learning to Predict NBA Games (Student Presentation, Group 22)
  • Machine Learning-Based Authorship Identification in Web Fictions (Student Presentation, Group 17)
  • Unemployment Rate Forecasting using Machine Learning (Student Presentation, Group 3)
  • Machine Learning for Characterizing Climate-related Disasters (Student Presentation, Group 20)
  • Predicting Pitch Outcomes in Major League Baseball (Student Presentation, Group 11)

Reports and GitHub Repositories

Project Awards

Best Oral Presentation

Modeling COVID Positivity Rates at U.S. College Campuses (Group 16)

by Christopher Kardatzke, Sebastian Khattabi, Abby Kisicki, and Andrew Tenjum

Most Creative Project

Authorship Identification in Web Fictions (Group 17)

by Fangying Zhan, Weijia Cao, and Yuan Tian

Best Visualizations

San Francisco Crime Rate Classification (Group 2)

by Xiu Xie, Evangeline Lim, and Lynette Gao

Course Topics and Calendar

Below is a list of the topics I am planning to cover in this course. Since course topics are among the most often requested information about this course, I am placing this on top of this website. More information about this course can be found in the sections that follow the course content below.

Part 1: Introduction

  • L01 - Course overview, introduction to machine learning
  • L02 - Introduction to Supervised Learning and k-Nearest Neighbors Classifiers

Part 2: Computational foundations

  • L03 - Using Python
  • L04 - Introduction to Python’s scientific computing stack
  • L05 - Data preprocessing and machine learning with scikit-learn

Part 3: Tree-based methods

  • L06 - Decision trees
  • L07 - Ensemble methods

Part 4: Model evaluation

  • Midterm exam
  • L08 - Model evaluation 1 – overfitting
  • L09 - Model evaluation 2 – confidence intervals
  • L10 - Model evaluation 3 – cross-validation and model selection
  • L11 - Model evaluation 4 – algorithm selection
  • L12 - Model evaluation 5 – evaluation and performance metrics

Part 5: Dimensionality reduction and unsupervised learning

  • L13 - Feature selection
  • L14 - Feature extraction
  • L15 - Clustering

Part 6: Bayesian learning

  • L16 - Introduction to Bayesian methods
  • L17 - Bayes optimal classifiers
  • L18 - Naive Bayes classifiers
  • L19 - Bayesian networks

Part 7: Class projects and final exam

  • Course summary
  • Student project presentations
  • Final exam

Course Description

Credits: 3

Course Description:

Introduction to machine learning for pattern classification, regression analysis, clustering, and dimensionality reduction. For each category, fundamental algorithms, as well as a selection of contemporary, state-of-the-art algorithms, are discussed. The evaluation of machine learning models using statistical methods is a particular focus of this course. Statistical pattern classification approaches, including maximum likelihood estimation and Bayesian decision theory, are compared and contrasted with algorithmic and nonparametric approaches. While the fundamental mathematical concepts underlying machine learning and pattern classification algorithms are taught, the practical use of machine learning algorithms with open source libraries from the Python programming ecosystem will be of equal focus in this course.

Course Requisites:

MATH 340, 341, Graduate Student Standing, or member of the Statistics Visiting International Scholars program.

Along with introducing the concepts of machine learning and pattern classification, the in-class lectures will provide a refresher on relevant concepts from calculus and linear algebra; however, a calculus background (e.g., Math 221) and a linear algebra background (e.g., Math 340) are recommended. While this course will also provide an introduction to the basics of the Python programming language for machine learning, it is highly recommended that students are familiar with basic programming and have completed an introductory programming class.

Learning Outcomes:

  • Understanding the different subfields of machine learning, such as supervised and unsupervised learning and being familiar with essential algorithms from each subfield.
  • Being able to identify whether machine learning is appropriate for solving a given problem task and which class of algorithms is best suited for real-world problem solving.
  • Using statistical learning theory to combine multiple machine learning models via ensemble methods.
  • Learning about best practices for statistical model evaluation, model selection, and algorithm comparisons, including suitable statistical hypothesis tests.
  • Using contemporary programming languages and machine learning libraries for implementing machine learning algorithms such that they can be readily applied for practical problem solving.
  • Connecting concepts from probability theory with supervised learning by implementing models based on Bayes’ theorem.

Course Information, Resources, and Communication

For this course, we will be using the Canvas platform, which you should have access to, given that you are enrolled in this class, via the following link: https://canvas.wisc.edu/courses/220884/.

  • General information and schedule: General information about this course will be provided through this website. This is so that students who are currently on the waitlist can view this material. However, throughout the course, you will only have to check the STAT451 Canvas Class for updates and material: All course content, information, and resources will be shared or linked on Canvas.

  • Course material: For each week, I will create a new “page” on Canvas containing the lecture material for the given week. This will include:

    • Lecture videos
    • Download links for the lecture slides
    • Download links for the lecture notes (if applicable)
    • Download links for additional material (if applicable)

Some of the course material (PDF files and code files) will be served through a GitHub repository. The reason for this is that it permits updates with transparent date stamps and the tracking of changes. Also, the machine learning research community relies heavily on GitHub for sharing code and research results, which is why it is beneficial for you to become familiar with it. You can obtain the course material (slides, code examples, etc.) directly from the GitHub repository. However, note that links to these materials will always be shared on the Canvas Pages, so you do not have to check different websites separately.

If there are problems with viewing or obtaining these files (for example, because of restricted internet access), please let me know – I am happy to find alternative solutions then, such as uploading the material to Google Drive or the internal Canvas storage if possible.

  • Important information and announcements: Important course information and deadlines (as well as updates or changes) will be shared via the Announcements on Canvas.

You should get an automated email each time I upload a new announcement there, but it does not hurt to check the Announcements page manually every day to make sure you did not miss any important information.

  • Submissions: Homework assignments and project submissions are to be submitted via the Canvas Assignments function. I will provide more information and instructions regarding submissions throughout the semester.

  • Questions: The best place for asking questions is the Piazza forum I set up for this course. Asking questions via Piazza (instead of using email) is most efficient in case multiple students have the same or similar questions. Students are also encouraged to help other students on Piazza. However, for personal questions (missed assignments etc.), please contact me or the TA via email directly (please use the prefix “STAT451:” as the email subject header to ensure we do not miss it).

Course Logistics

When

  • There are no specific times when you have to watch the lectures. However, it is highly recommended and very important for your success in this class that you keep up with each week’s lecture content. The lecture material will be shared on Canvas (as described above) at the beginning of each week.

Where

  • Online

Instructors

  • Instructor: Dr. Sebastian Raschka
  • Teaching Assistant: Zhongjie Yu

Office Hours

  • Prof. Sebastian Raschka (Instructor):

    • Thu 4:00 pm to 5:30 pm, virtual video conference via BBCollaborate

  • Zhongjie Yu (Teaching Assistant):

    • Fri 10:30 am to 11:30 am, virtual video conference via BBCollaborate

Resources

I will link resources, including internet articles and research articles that are relevant for the course. The book suggestions are recommendations but not requirements.

Machine Learning Books

Python Machine Learning, 3rd Edition (highly recommended)

  • Raschka, S., & Mirjalili, V. (2019). Python Machine Learning, 3rd Ed. Birmingham, UK: Packt Publishing. ISBN-13: 978-1789955750
  • Many of the hands-on code examples, topics, and figures discussed in class were adopted from this book; hence, it is highly recommended to read through the chapters in this book.
  • Code examples and figures are freely available online under an open source license at https://github.com/rasbt/python-machine-learning-book-3rd-edition.

Python Resources

Regarding Python, we will mainly focus on two libraries: NumPy and Scikit-learn. You can think of NumPy as a linear algebra library that provides utilities similar to MATLAB (if you are familiar with MATLAB). It is used in almost any scientific computing task, and many other Python libraries build on it, so it is generally useful to know. Scikit-learn is the main machine learning library we will be using.

In any case, you don’t need to be an expert Python programmer to use these libraries (and I will teach you about Scikit-learn in this course, so no worries about learning it beforehand). However, some basic familiarity with Python will be necessary in order to use these libraries.
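To give a rough idea of the level of Python involved, here is a minimal sketch that uses both libraries. The dataset (Iris), the choice of a k-nearest neighbors classifier, and the parameter values are assumptions picked for illustration only; they are not taken from the course material.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# NumPy: arrays and basic linear algebra.
A = np.array([[1.0, 2.0], [3.0, 4.0]])
x = np.array([1.0, 1.0])
product = A @ x  # matrix-vector product
print(product)

# scikit-learn: load a small built-in dataset, split it, fit a model,
# and evaluate it on the held-out test set.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y
)
model = KNeighborsClassifier(n_neighbors=3)  # illustrative choice of k
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.2f}")
```

If you can read and lightly modify a script like this, you have enough Python background to get started.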

Python for Beginners (Video Lectures)

A great video series by educators at Microsoft, which was recently made available for free on YouTube: https://www.youtube.com/playlist?list=PLlrxD0HtieHhS8VzuMCfQD4uJ9yne1mE6.

Learn Python (Interactive Tutorials)

On https://www.learnpython.org/, you can find interactive tutorials that help you learn Python through a sequence of coding exercises.

Illustrated Guide to Python (Book)

  • “Illustrated Guide to Python 3: A Complete Walkthrough of Beginning Python with Unique Illustrations Showing how Python Really Works. Now covering Python 3.6 (Treading on Python) (Volume 1)” by Matt Harrison, ISBN-13: 978-1977921758.

Another great book is Allen Downey’s Think Python 2e (free PDF available at https://greenteapress.com/wp/think-python-2e/).

Python Like You Mean It

A short, free intro for getting started with Python and its main scientific computing libraries: https://www.pythonlikeyoumeanit.com.

Grading

The final grade will be computed using the following weighted grading scheme:

  • 20% Problem Sets (Homeworks and Quizzes)
  • 50% Exams:
    • 20% Midterm Exam
    • 30% Final Exam
  • 30% Class Project:
    • 5% Project proposal
    • 10% Project presentation
    • 15% Project report

To make the grading more transparent and provide students with a better intuition of their performance throughout the course, there will be a total of 1000 points in this course. For instance, 200 points can be obtained from homework assignments (20% of the final grade), 500 points from exams (50% of the final grade), and 300 points from the class project (30% of the final grade).

Tentatively, the final letter grade will be based on the total number of points/percent of the total points accumulated in the course:

  • A: >= 920 points or >= 92%
  • AB: >= 880 points or >= 88%
  • B: >= 840 points or >= 84%
  • BC: >= 800 points or >= 80%
  • C: >= 700 points or >= 70%
  • D: >= 500 points or >= 50%
  • F: < 500 points or < 50%

However, due to the COVID-19 context and the change to all-online instruction and exams, grades may be curved to adjust for differences in online teaching compared to previous in-person semesters.
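The tentative mapping from points to letter grades can be sketched in a few lines of Python. This is only an illustration of the cutoffs listed above: it ignores any curve, and the student scores in the example are made-up numbers.

```python
# Point allocation and letter-grade cutoffs as listed in the grading scheme.
MAX_POINTS = {"problem_sets": 200, "exams": 500, "project": 300}

GRADE_CUTOFFS = [
    ("A", 920), ("AB", 880), ("B", 840), ("BC", 800), ("C", 700), ("D", 500),
]


def letter_grade(total_points):
    """Map a point total (0-1000) to the tentative letter grade (no curve)."""
    for letter, cutoff in GRADE_CUTOFFS:
        if total_points >= cutoff:
            return letter
    return "F"


# Hypothetical student scores (made-up numbers for illustration only).
earned = {"problem_sets": 180, "exams": 430, "project": 275}
total = sum(earned.values())
print(total, letter_grade(total))  # 885 AB
```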

Exams

Both the midterm and final exam will be “conceptual,” which means that you will not be asked to write code in the exam. The exams will take place online through Canvas during specific times:

  • Midterm exam: Thursday October 15th, 4:00 pm - 5:15 pm
  • Final exam: Monday December 14th, 10:05 am - 12:05 pm

The final will be cumulative in the sense that some of the earlier topics may be relevant to the final exam; however, the final exam will largely focus on the parts covered after the midterm. In other words, you still should be familiar with all concepts covered in the course, but questions will be centered around the topics after the midterm.

While there will be different types of questions, one question could be as follows:

Q: Does the (computational) time complexity of a k-Nearest Neighbor classifier grow linearly, quadratically, or exponentially with the number of samples in the training dataset? Explain your answer in 1-2 sentences.

Answer: Linearly. Classifying a new data point requires computing its distance to every training example, so each additional training point adds one more distance computation.
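The reasoning behind this sample answer can be checked with a small sketch. It assumes a brute-force implementation (no tree-based index structures) and uses made-up toy data; the function name `knn_predict` is hypothetical and not part of any library.

```python
import heapq
import math
from collections import Counter


def knn_predict(X_train, y_train, query, k=3):
    """Brute-force k-nearest neighbor prediction by majority vote.

    Returns the predicted label and the number of distance computations.
    """
    # One distance computation per training example: O(n) for n examples.
    distances = [
        (math.dist(features, query), label)
        for features, label in zip(X_train, y_train)
    ]
    # Selecting the k smallest distances costs O(n log k), so for a fixed k
    # the total work still grows linearly with n.
    nearest = heapq.nsmallest(k, distances, key=lambda pair: pair[0])
    votes = Counter(label for _, label in nearest)
    predicted_label = votes.most_common(1)[0][0]
    return predicted_label, len(distances)


# Made-up toy data: two well-separated groups in 2D.
X_train = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
y_train = [0, 0, 0, 1, 1, 1]

label, n_small = knn_predict(X_train, y_train, query=(0.5, 0.5))
_, n_large = knn_predict(X_train * 2, y_train * 2, query=(0.5, 0.5))
print(label, n_small, n_large)  # 0 6 12
```

Doubling the training set doubles the number of distance computations per query, which is the linear growth the answer refers to.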

Class Project

Overview

The goal of working on a class project is three-fold. First, it will provide you with the opportunity to apply the concepts learned in this class creatively, which helps you understand the material more deeply. Second, designing and working on a unique project in a team is something that you will encounter, if you haven’t already, sooner rather than later in life, and this course project helps you prepare for that. Third, along with the opportunity to practice and the satisfaction of working creatively, you can use this project to enhance your portfolio or resume (for example, by sharing it publicly on your GitHub account or personal website – this is optional).

Note about grading

There is no “perfect project.” While you are encouraged to be ambitious, the most important aspect of this project is your learning experience. Hence, you don’t want to pick something that is too easy for you, but similarly, you don’t want to choose a project if you are not certain that it is within the scope of this class. (However, note that the more comprehensive and interesting the project is, the easier you’ll find it to write the 6-8-page project report.) The project is not graded by how exciting it is but based on whether you follow the objectives of the project proposal, project presentation, and project report. For instance, if your project ends up being unsuccessful – for example, if you choose to design a classifier and it doesn’t achieve the desired accuracy – it will not negatively affect your grade as long as you are honest, describe the potential issues well, and suggest improvements or further experiments. Again, the objective of this project is to provide you with hands-on practice and an opportunity to learn.

The project consists of 3 parts:

  1. a project proposal,
  2. a short project presentation,
  3. and a project report.

The expectations for each part will be discussed in the following sections.

1) Project Proposal

Please note that you should use the proposal-latex file(s) for writing and submitting your proposal!

The main purpose of the project proposal is to receive feedback from the TAs/the instructor regarding whether your project is feasible and whether it is within the scope of this class. Also, the project proposal offers a chance to receive useful feedback and suggestions on your project.

For this project, you will be working in a team consisting of three students. You are encouraged to form groups by yourself, as discussed in class. If you cannot find group members, the TA and I will randomly assign you to a group. If you have any concerns working with someone in your group, please talk to a TA or the instructor for accommodations.

Proposal Format:

  • The project proposal is a 2-4 page document, excluding references.
  • You will be required to use the LaTeX proposal template. The goal is two-fold: You will likely be using LaTeX to write reports later in your career (it is very common in statistics and machine learning), and it will ensure consistency among submissions to make the grading easier and fairer. The proposal template can be obtained from https://github.com/rasbt/stat451-machine-learning-fs20/tree/master/report-template/proposal-latex
  • You are encouraged (not required) to use 1-2 figures to illustrate technical concepts.
  • The proposal must be formatted and submitted as a PDF document (the submission deadline will be announced later via the calendar & email).

Introduction:

  • Describe what you are planning to do.
  • Briefly describe related work (if applicable).

Motivation:

  • Describe why your project is exciting. E.g., you can describe why your project could have a broader societal impact. Or, you may describe the motivation from a personal learning perspective.

Evaluation:

  • What would the successful outcome of your project look like? In other words, under which circumstances would you consider your project to be “successful?”
  • How do you measure success, specific to this project, from a technical standpoint?

Resources:

  • What resources are you going to use (datasets, computer hardware, computational tools, etc.)?

Contributions:

You are expected to share the workload evenly, and every group member is expected to participate in both the experiments and writing. (As a group, you only need to submit one proposal and one report, though. So you need to work together and coordinate your efforts.)

  • Clearly indicate what computational and writing task each member of your group will be participating in.

It is crucial that you talk to each other regularly!!! Schedule regular meetings and/or use online communication tools (e.g., Gitter, Slack, or email) to stay in touch with your group members throughout the semester regarding the progress of your project.

Modifications to the proposal

After you have received feedback from me and your project proposal has been graded, you are advised to stick to the project outline in the proposal as closely as possible. However, if there is a concept introduced in a later lecture (for instance, a machine learning algorithm that you think is more appropriate than the one you proposed), you have the option to modify your proposal, but you are not penalized if you don’t. If you wish to update your project outline, talk to me or the TA first.

Project Proposal Assessment

The proposal will be graded based on the completeness of each of the 5 sections (Introduction, Motivation, Evaluation, Resources, and Contributions) and will not be based on language, style, or how “exciting” or “interesting” the project is. For each section, you can receive a maximum of 10 points, totaling 50 pts for the proposal overall.

The proposal assessment is summarized at https://github.com/rasbt/stat451-machine-learning-fs20/blob/master/report-template/project-proposal-assessment.md.

Also, it is important to make sure that you acknowledge previous work and use citations properly when referring to other people’s work. Even minor forms of plagiarism (e.g., copying sentences from other texts) will result in a subtraction of at least 10 pts per incident. University guidelines also dictate that severe incidents need to be reported. If you are unsure about what constitutes plagiarism and how to avoid it, please see the helpful guides at https://conduct.students.wisc.edu/plagiarism/.

2) Project Presentation

During the last three lectures, you will be presenting your project to the class. The presentation is “free form” but should cover the following:

  • introduce the topic to a general audience (your class);
  • summarize the main approach or method;
  • highlight the outcomes of your project.

The presentation should be 8-10 minutes long. All members of the group should participate in the presentation.

  • The talks will all be virtual and submitted via Canvas. I will then prepare the videos for upload on Canvas so that the other students can watch them. Here is a video of a student presentation from STAT 453 to give you an idea of what the talks may look like (the students volunteered to share their talks publicly, but please note that this is not required).
  • There will be 3 awards:
  1. Best Oral Presentation
  2. Most Creative Project
  3. Best Visualizations
  • The awards will be determined by voting: each student will fill out an online quiz via Canvas, voting for each presentation (on a scale from 1-10 for each of the 3 categories, where 10 is best).

The voting card should be filled out as follows:

  1. Title of the Presentation, a/10, b/10, c/10
  2. Title of the Presentation, a/10, b/10, c/10 …

where

  • a are the points for 1. Best Oral Presentation
  • b are the points for 2. Most Creative Project
  • c are the points for 3. Best Visualizations

The awards will be computed based on the highest number of points for each category. However, one project can only receive one of the prizes. The points for the grade are considered independently from the 3 prize categories. The rubric for the grades is provided in the subsection Project Presentation Assessment below.
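As a rough sketch of how such a tally could work, the snippet below sums the points per category and gives each prize to a different project. How ties are broken and in which order the categories are resolved is not specified above, so the order used here (the order in which the awards are listed) is an assumption, and the project titles and scores are made up.

```python
CATEGORIES = ["Best Oral Presentation", "Most Creative Project", "Best Visualizations"]


def tally_awards(ballots):
    """Sum the votes per category and assign each award to a different project.

    `ballots` is a list of dicts mapping a presentation title to its
    (a, b, c) scores, each on a scale from 1 to 10.
    """
    totals = {}
    for ballot in ballots:
        for title, scores in ballot.items():
            running = totals.setdefault(title, [0, 0, 0])
            for i, score in enumerate(scores):
                running[i] += score

    winners = {}
    awarded = set()
    # Assumption: categories are resolved in the order listed, and a project
    # that already holds a prize is skipped in later categories.
    for i, category in enumerate(CATEGORIES):
        candidates = [title for title in totals if title not in awarded]
        best = max(candidates, key=lambda title: totals[title][i])
        winners[category] = best
        awarded.add(best)
    return winners


# Two made-up voting cards for three hypothetical projects.
ballots = [
    {"Project X": (9, 9, 9), "Project Y": (7, 8, 6), "Project Z": (6, 5, 8)},
    {"Project X": (10, 9, 8), "Project Y": (8, 9, 7), "Project Z": (5, 6, 9)},
]
winners = tally_awards(ballots)
print(winners)
```

In this example, Project X has the most points in the first two categories but receives only the first prize, so the second prize goes to the runner-up in that category.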

Project Presentation Assessment

The rubric for assigning the points (out of 100) for the presentation is provided below:

  • 10 pts: Is there a motivation for the project given?
  • 40 pts: Is the project described well enough that a general audience, familiar with machine learning, can understand the project?
  • 20 pts: Are all figures legible and explained well?
  • 20 pts: Are the presented results adequately discussed?
  • 10 pts: Did all team members contribute to the presentation?

3) Project Report

The project report is expected to be 6-8 pages long (excluding references) and should contain the following sections:

  1. Introduction
  2. Related Work
  3. Proposed Method
  4. Experiments
  5. Results and Discussion
  6. Conclusions
  7. Contributions

More details are provided in the LaTeX report template at https://github.com/rasbt/stat451-machine-learning-fs20/tree/master/report-template.

Please note that you should use the report-latex file for writing and submitting your report!

Also, you are required to submit all the code, computations, and experiments you developed and conducted for this project. Note that the quality of the code will not have any influence on your grade and will merely serve as a basis to establish that the report contains original and “real” results.

Project Report Assessment

The rubric for grading the project reports is provided below.

Abstract: 15 pts

  • Is enough information provided to get a clear idea about the subject matter?
  • Is the abstract conveying the findings?
  • Are the main points of the report described succinctly?

Introduction: 15 pts

  • Does the introduction cover the required background information to understand the work?
  • Is the introduction well organized: it starts out general and becomes more specific towards the end?
  • Is there a motivation explaining why this project is relevant, important, and/or interesting?

Related Work: 15 pts

  • Is the similar and related work discussed adequately?
  • Are references cited properly (here, but also throughout the whole paper)?
  • Is the discussion or paragraph comparing this project with other people’s work adequate?

Proposed Method: 25 pts

  • Are there any missing descriptions of symbols used in mathematical notations (if applicable)?
  • Are the main algorithms described well enough so that they can be implemented by a knowledgeable reader?

Experiments: 25 pts

  • Is the experimental setup and methodology described well enough so that it can be repeated?
  • If datasets are used, are they referenced appropriately?

Results and Discussion: 30 pts

  • Are the results described clearly?
  • Is the data analyzed well, and are the results logical?
  • Are the figures clear, with no missing labels?
  • Do the figure captions have sufficient information to understand the figure?
  • Is each figure referenced in the text?
  • Is the discussion critical/honest, and are potential weaknesses/shortcomings discussed as well?

Conclusions: 15 pts

  • Do the authors describe whether the initial motivation/task was accomplished or not based on the results?
  • Is it discussed adequately how the results relate to previous work?
  • If applicable, are potential future directions given?

Contributions: 10 pts

  • Are all contributions listed clearly?
  • Did each member contribute approximately equally to the project?

Optional: Sharing your Project

You are encouraged to share your project/final project report online after you have completed the course – for example, via GitHub or on a personal website.

Other Important Course Information

Late Submission Policy

Homework, quizzes, and projects that are submitted late will be penalized as follows:

  • Submitted within 6 hours after the deadline: 10% deduction from the maximum possible points.
  • Submitted between 6 and 24 hours after the deadline: 20% deduction from the maximum possible points.
  • Submitted more than 24 hours late: No points.
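The late policy can be written out as a small function. This is a sketch of one reading of the rules: the deduction is taken as a percentage of the maximum possible points and subtracted from the earned points, and submissions exactly 6 or 24 hours late are placed in the more lenient tier. Both choices are assumptions, and the function name `apply_late_policy` is hypothetical.

```python
def apply_late_policy(raw_points, max_points, hours_late):
    """Return the points that remain after applying the late deduction."""
    if hours_late > 24:
        return 0  # more than 24 hours late: no points
    if hours_late <= 0:
        deduction_percent = 0  # submitted on time
    elif hours_late <= 6:
        deduction_percent = 10
    else:
        deduction_percent = 20
    # The deduction is a percentage of the maximum possible points.
    deduction = max_points * deduction_percent / 100
    return max(0, raw_points - deduction)


# Hypothetical assignment worth 50 points, with 45 points earned.
print(apply_late_policy(45, 50, hours_late=0))   # 45.0
print(apply_late_policy(45, 50, hours_late=3))   # 40.0
print(apply_late_policy(45, 50, hours_late=12))  # 35.0
print(apply_late_policy(45, 50, hours_late=30))  # 0
```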

Rules, Rights & Responsibilities

See the Guide’s Rules, Rights and Responsibilities

Academic Integrity

By enrolling in this course, each student assumes the responsibilities of an active participant in UW-Madison’s community of scholars in which everyone’s academic work and behavior are held to the highest academic integrity standards. Academic misconduct compromises the integrity of the university. Cheating, fabrication, plagiarism, unauthorized collaboration, and helping others commit these acts are examples of academic misconduct, which can result in disciplinary action. This includes but is not limited to failure on the assignment/course, disciplinary probation, or suspension. Substantial or repeated cases of misconduct will be forwarded to the Office of Student Conduct & Community Standards for additional review. For more information, refer to studentconduct.wiscweb.wisc.edu/academic-integrity/.

Accommodations for Students with Disabilities

McBurney Disability Resource Center syllabus statement: “The University of Wisconsin-Madison supports the right of all enrolled students to a full and equal educational opportunity. The Americans with Disabilities Act (ADA), Wisconsin State Statute (36.12), and UW-Madison policy (Faculty Document 1071) require that students with disabilities be reasonably accommodated in instruction and campus life. Reasonable accommodations for students with disabilities is a shared faculty and student responsibility. Students are expected to inform faculty [me] of their need for instructional accommodations by the end of the third week of the semester, or as soon as possible after a disability has been incurred or recognized. Faculty [I], will work either directly with the student [you] or in coordination with the McBurney Center to identify and provide reasonable instructional accommodations. Disability information, including instructional accommodations as part of a student’s educational record, is confidential and protected under FERPA.” http://mcburney.wisc.edu/facstaffother/faculty/syllabus.php

Diversity and Inclusion

Institutional statement on diversity: “Diversity is a source of strength, creativity, and innovation for UW-Madison. We value the contributions of each person and respect the profound ways their identity, culture, background, experience, status, abilities, and opinion enrich the university community. We commit ourselves to the pursuit of excellence in teaching, research, outreach, and diversity as inextricably linked goals.

The University of Wisconsin-Madison fulfills its public mission by creating a welcoming and inclusive community for people from every background – people who as students, faculty, and staff serve Wisconsin and the world.” https://diversity.wisc.edu/

COVID-19 Context

During the global COVID-19 pandemic, we must prioritize our collective health and safety to keep ourselves, our campus, and our community safe. As a university community, we must work together to prevent the spread of the virus and to promote the collective health and welfare of our campus and surrounding community.

Information on COVID-19 is constantly changing. Students should be attentive to University communications regarding COVID-19 that may alter instruction and supersede parts of this syllabus.

UW-Madison Badger Pledge

https://smartrestart.wisc.edu/badgerpledge/

UW-Madison Face Covering Guidelines

While on campus all employees and students are required to wear appropriate and properly fitting face coverings while present in any campus building unless working alone in a laboratory or office space.

Face Coverings During In-person Instruction Statement (COVID-19)

Individuals are expected to wear a face covering while inside any university building. Face coverings must be worn correctly (i.e., covering both your mouth and nose) in the building if you are attending class in person. If any student is unable to wear a face-covering, an accommodation may be provided due to disability, medical condition, or other legitimate reason. Students with disabilities or medical conditions who are unable to wear a face covering should contact the McBurney Disability Resource Center or their Access Consultant if they are already affiliated. Students requesting an accommodation unrelated to disability or medical condition, should contact the Dean of Students Office. Students who choose not to wear a face covering may not attend in-person classes, unless they are approved for an accommodation or exemption. All other students not wearing a face covering will be asked to put one on or leave the classroom. Students who refuse to wear face coverings appropriately or adhere to other stated requirements will be reported to the Office of Student Conduct and Community Standards and will not be allowed to return to the classroom until they agree to comply with the face covering policy. An instructor may cancel or suspend a course in-person meeting if a person is in the classroom without an approved face covering in position over their nose and mouth and refuses to immediately comply.

Quarantine or Isolation Due to COVID-19

Students should continually monitor themselves for COVID-19 symptoms and get tested for the virus if they have symptoms or have been in close contact with someone with COVID-19. Students should reach out to instructors as soon as possible if they become ill or need to isolate or quarantine, in order to make alternate plans for how to proceed with the course. Students are strongly encouraged to communicate with their instructor concerning their illness and the anticipated extent of their absence from the course (either in-person or remote). The instructor will work with the student to provide alternative ways to complete the course work.