Methodological training "Introduction to Machine Learning for the Social Sciences"
The training will be delivered by Dr. Bruno Castanho Silva (University of Cologne, Germany).
The training is conducted under the project "Disparities in school achievement from a person and variable-oriented perspective: A prototype of a learning analytics tool NO-GAP" (Head of the project – Dr. Rasa Erentaitė).
This project has received funding from the European Regional Development Fund (project No 01.2.2-LMT-K-718-03-0059) under a grant agreement with the Research Council of Lithuania (LMTLT).
Machine learning is an analytical approach in which users build statistical models that "learn" from data to make accurate predictions and decisions. From customer-recommendation systems to policy design and implementation, machine learning algorithms are becoming ubiquitous in a big data world, and their potential is now being explored in the social sciences. In this course participants will learn the fundamentals of machine learning as a data analysis approach and get an overview of the most common and versatile classes of ML techniques in use today. The goal is that, by the end, participants will be able to identify which kind of technique is most suitable for their question and data, and how to design, test, and interpret their models. They will also be equipped with sufficient basic knowledge to proceed independently to more advanced algorithms and problems. This is an introductory course, so math and programming technicalities will be kept to a minimum.
Day One
Date: September 22nd
General Introduction to Machine Learning and Classification Problems
The first day will be a general introduction to the logic of machine learning and how ML is related to more familiar statistical methods. We will discuss the ML emphasis on prediction (and how this is simply the other side of the explanation coin) and on substantive effects over statistically significant ones; the problem of overfitting and potential solutions to it; and the bias/variance trade-off. We will also talk about the building blocks of machine learning models and algorithms: the role of hyperparameters, tuning, cross-validation, and supervised vs. unsupervised learning. In our first discussions we will refer to (and apply in R) the K-nearest-neighbor classifier and logistic regression to illustrate those conceptual points. The last part of the class will introduce one of the most common ML techniques for supervised problems: support vector machines (SVMs).
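To give a flavour of what this looks like in practice, the minimal R sketch below (an illustration only, not the official course material) fits a K-nearest-neighbor classifier, a logistic regression, and an SVM to a binary outcome, using a held-out test set to judge predictive performance and cross-validation to tune the SVM's cost hyperparameter. The 'class' and 'e1071' packages and the built-in iris data are assumed here purely for illustration.

# Minimal sketch, assuming the 'class' and 'e1071' packages are installed
library(class)    # K-nearest-neighbor classifier
library(e1071)    # support vector machines and tuning helpers

# A binary classification problem: two species from the built-in iris data
dat <- droplevels(iris[iris$Species != "setosa", ])

# Hold out a test set to assess out-of-sample (predictive) performance
set.seed(42)
test_id <- sample(nrow(dat), 30)
train   <- dat[-test_id, ]
test    <- dat[ test_id, ]

# 1. K-nearest neighbors: k is a hyperparameter chosen by the analyst
knn_pred <- knn(train = train[, 1:4], test = test[, 1:4],
                cl = train$Species, k = 5)
mean(knn_pred == test$Species)          # test-set accuracy

# 2. Logistic regression as a familiar benchmark
logit <- glm(Species ~ ., data = train, family = binomial)
logit_pred <- ifelse(predict(logit, test, type = "response") > 0.5,
                     levels(dat$Species)[2], levels(dat$Species)[1])
mean(logit_pred == test$Species)

# 3. SVM, tuning the cost hyperparameter by 10-fold cross-validation
svm_tune <- tune(svm, Species ~ ., data = train,
                 ranges = list(cost = c(0.1, 1, 10)))
svm_pred <- predict(svm_tune$best.model, test)
mean(svm_pred == test$Species)

Comparing the three test-set accuracies, rather than in-sample fit, is exactly the prediction-oriented logic discussed above.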
Day Two
Date: September 23rd
Classification Problems with Tree-based Methods
Day 2 will discuss classification problems, some of the most common problems to which ML can be applied in the social sciences. It is especially useful, for example, as a substitute for human coding when some raw data are available – think of classifying political parties as left or right, or countries as democratic or not. We will cover decision trees and their extensions, such as random forests and boosting, which are among the most powerful methods in machine learning today. All these methods can also be used for regression problems, which are the focus of day 3.
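As a rough sketch of what such models look like in R (assuming the 'rpart' and 'randomForest' packages and the built-in iris data; the course may use different packages and examples):

library(rpart)          # single classification tree
library(randomForest)   # ensemble of trees

dat <- iris  # stand-in for a real coding task, e.g. classifying parties or regimes

# A single, easily interpretable classification tree
tree_fit <- rpart(Species ~ ., data = dat, method = "class")
print(tree_fit)                          # the splitting rules
predict(tree_fit, dat, type = "class")[1:5]

# A random forest: many trees grown on bootstrap samples with random
# subsets of predictors, usually far more accurate than a single tree
set.seed(42)
rf_fit <- randomForest(Species ~ ., data = dat, ntree = 500, importance = TRUE)
rf_fit                                   # out-of-bag error estimate
importance(rf_fit)                       # which predictors matter most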
Day Three
Date: September 24th
Regression Problems and Wrap-up
On day 3 we move to regression problems – i.e., continuous outcomes. We look again at decision trees and tree-based models. These methods tend to have excellent predictive performance and are also able to generate interpretable models: they allow users to identify which variables are most important for predicting the outcome (substantive importance, not statistical significance), which means they can be applied to any problem in which linear regression could be used. We then cover regularized regression through variable selection models, which are especially useful when there are many more variables than observations (p > N), a situation common in text analysis. The day ends with some tips on workflow and on how to report your results in academic publications.
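For illustration only, the sketch below shows a tree-based regression with variable importance and a lasso (regularized) regression in R, assuming the 'randomForest' and 'glmnet' packages and the small built-in mtcars data; note that the lasso's real advantage appears in p > N settings, which this toy example does not have.

library(randomForest)   # tree-based regression
library(glmnet)         # regularized (lasso/ridge) regression

# Continuous outcome: predicting miles per gallon from the built-in mtcars data
set.seed(42)
rf_reg <- randomForest(mpg ~ ., data = mtcars, ntree = 500, importance = TRUE)
importance(rf_reg)       # variable importance: substantive, not significance-based

# Lasso: glmnet expects a numeric predictor matrix and an outcome vector
x <- as.matrix(mtcars[, -1])
y <- mtcars$mpg
cv_fit <- cv.glmnet(x, y, alpha = 1)     # alpha = 1 selects the lasso penalty;
                                         # lambda is tuned by cross-validation
coef(cv_fit, s = "lambda.min")           # several coefficients shrunk to exactly zero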