Spring 2020 - 58675 - PA 397C - Advanced Empirical Methods for Policy Analysis

Data Visualization, Statistics and Econometrics for Policy Analysis, using Python

This class will introduce the application of data visualization, statistical, econometric, and machine learning methods to real-world data, using the Python software platform and its data science-oriented visualization and analysis packages. The learning goal is for each student to finish the class with a working knowledge of basic econometric concepts and data science software tools, and of their application to real-world data, to better inform real-world policymaking.

Are there any concrete, practical advantages to this approach to a master’s-level statistics and econometrics class? Go to indeed.com (an online job recruitment site) and enter the keywords “statistics” and “python” in the “what” box (leave the “where” box blank for the moment). If you actually page through these listings, you will be served additional sponsored ads from would-be employers seeking to get you to apply for their jobs on the spot. One benefit of this class is that you will be able to honestly write in both words, “statistics” and “Python,” when asked by a prospective employer to describe what data analysis tools you were introduced to in grad school.

The Python software tools we will use are components of the most widely used, non-proprietary, open data science software platform, and readily allow access to excellent visualization, statistical, and econometric analysis tools capable of handling even the largest datasets. The same software platform can also be integrated with, and run, R and Stata statistical analysis code.

This class is intended to be doubly valuable to a student interested in public policy. First, the class will introduce you to cutting-edge computer software tools that can be applied to real data for practical policy purposes (and hopefully both give you some advantages in post-graduation job markets and make it easier to acquire even more advanced skills over the rest of your career).
Second, the class is designed to motivate learning statistics and econometric concepts by showing that they can be simply and practically applied to real-world data, and to give you some first-hand experience in doing so.

Much of the learning will be structured as completion of data analysis exercises. In addition to these exercises, every student will undertake two small-group analysis projects, with in-class presentation, discussion, and critique. Every student will also present and submit an individual final empirical data analysis project, in the form of a Jupyter (interactive Python) notebook.

The class assumes you have previously taken an introductory statistics course. If you have had greater exposure to statistics, you will be encouraged to assist those of your peers who have not.

Lectures will be based on interactive Python notebooks (also known as Jupyter notebooks). Students will follow along with class lectures using open source data science software installed on a personal laptop computer (Windows, Mac, or Linux). All students must complete all assigned reading, since it will be assumed as background to all the Jupyter notebook content on statistical, econometric, and machine learning concepts we go through in class. There are no computer programming prerequisites, but you will need to bring to class a personal computer with the Anaconda distribution of Python installed (more specific instructions are given below).

In addition to other course requirements, this class will require completion of approximately 12 hours of introductory online courses covering basic skills in the Python analysis and visualization software we will be using. The online course modules assume no software experience or other prerequisites. Each class session will be structured as a group walkthrough of a Jupyter notebook containing the key concepts and examples that serve as the foundation for that week’s material.
Time will be reserved for in-class laptop “data analysis lab” exercises, to provide you with real-time feedback on conceptual or programming questions you may have. The Anaconda distribution of Python should work on a computer running Windows, macOS, or Linux. Because access to electrical outlets in the classroom may be limited, your personal computer should probably have at least a 3-hour battery life, and should be fully charged before class.

We will spend the first two weeks of class in a Python “data science boot camp,” then use these Python software tools as the main language for learning statistics, econometrics, and data science, and for applying them to real data over the balance of the semester.

Be warned: this class is likely to be a substantial amount of work if this is the first time you have ever tried to do simple programming on a computer. In addition, working through and understanding statistical and econometric concepts can be demanding. But there should be a real payoff in useful things you know about, and know how to do, by the end of the semester.

Every student will be asked to participate in three project presentations to the class. These will be oral presentations of analyses and solutions to a real-world data analysis problem. Two of these will be group projects, and one an individual project. In addition, each student will be asked to complete problem sets and to submit a final empirical project. The two group presentations will count for 30% of the final grade, the problem sets for 20%, the individual project presentation for 20%, and the final project submission for 30%.
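
As a small first taste of the kind of Python used in this class, the grading weights above can be written out as a weighted average. This is only an illustrative sketch: the weights come from this course description, but the 0-100 scoring scale, the idea that the two group presentations are combined into a single score, the function name, and the example scores are all assumptions made for illustration, not course policy.

```python
# Component weights as stated in the course description
# (fractions of the final grade).
WEIGHTS = {
    "group_presentations": 0.30,      # the two group presentations combined
    "problem_sets": 0.20,
    "individual_presentation": 0.20,
    "final_project": 0.30,
}


def final_grade(scores):
    """Return the weighted average of component scores.

    Assumes each score is on a 0-100 scale (an assumption for this
    sketch; the course description does not specify a scale).
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores, for illustration only.
example_scores = {
    "group_presentations": 90,
    "problem_sets": 80,
    "individual_presentation": 85,
    "final_project": 95,
}

print(round(final_grade(example_scores), 1))
```

Keeping the weights in one dictionary makes it easy to confirm that they sum to 100% and to see how much each component contributes to the final grade.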