Spring 2018 - 61005 - PA 397C – Advanced Empirical Methods for Policy Analysis

Statistical Analysis/Learning

COURSE DESCRIPTION

Large datasets are increasingly becoming available across many sectors such as healthcare, energy, and online markets. This course focuses on methods that allow "learning" from such datasets to uncover underlying relationships and patterns in the data, with a focus on the predictive performance of various models that can be built to represent the underlying function generating the data. The course starts with a review of basic statistical concepts and linear regression, but it will focus mostly on classification and clustering based on non-regression techniques such as tree-based approaches, support vector machines, and unsupervised learning. In the problem sets and tutorials we will examine applications in healthcare, energy, transportation, online markets, and patent systems. This course is intended for first- and second-year Master's students and Ph.D. students.

Topics to be covered: Linear Regression, Classification, Resampling Methods, Linear Model Selection and Regularization, Tree-Based Methods, Support Vector Machines, Unsupervised Learning.

In covering the material from the assigned textbook (see below), this course will emphasize both formulaic and conceptual understanding of the methods discussed. As necessary, the instructor will draw on material from outside the textbook to aid conceptual clarity.

PREREQUISITES

A basic grasp of linear regression would be helpful. However, all relevant concepts will be reviewed in class. Problem sets will include applied problems from the textbook that require use of the popular open-source statistical software package R. Thus, though not required, some prior experience with programming or with statistical packages would be helpful. The instructor and TA(s) will conduct optional lab sessions in R to provide the background and tools needed to solve the problem sets. Timing for the lab sessions will be set after the first class meeting.

REQUIREMENTS

1. Required readings: Students are expected to complete the required readings (when assigned) each week prior to the class meeting for the unit and to contribute to the class discussion.
2. Problem Sets: There will be 8-10 problem sets (PS). The problems will be selected from the textbook. Each problem set will be due within one week of being assigned.
3. Exams: Mid-term and final exams.

REQUIRED READINGS

We will use material from the following textbook: An Introduction to Statistical Learning, by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. A free PDF of the text and all associated datasets are available.

Additional required readings: For a few select units, relevant material from the textbook may be augmented with 2-3 additional readings, typically highly accessible and relevant journal articles selected by the instructor. These readings will be posted on Canvas at least one week prior to the unit in which they will be discussed.