Chapter 13 Workbooks

Brian Kim, Christoph Kern, Jonathan Scott Morgan, Clayton Hunter, Avishek Kumar

This final chapter provides an overview of the Python workbooks that accompany this book. The workbooks combine text explanation and code you can run, implemented in Jupyter notebooks107, to explain techniques and approaches selected from each chapter and to provide thorough implementation details, enabling students and interested practitioners to quickly get up to speed on and start using the technologies covered in the book. We hope you have a lot of fun with them.

13.1 Introduction

We provide accompanying Jupyter workbooks for most chapters in this book. The workbooks provide a thorough overview of the work needed to implement the selected technologies. They combine explanation, basic exercises, and substantial additional Python and SQL code to provide a conceptual understanding of each technology, give insight into how key parts of the process are implemented through exercises, and then lay out an end-to-end pattern for implementing each in your own work. The workbooks are implemented as Jupyter notebooks, interactive documents that mix formatted text and Python code samples that can be edited and run in real time in a Jupyter notebook server, allowing you to run and explore the code for each technology as you read about it.

The workbooks are centered around two main substantive examples. The first example, used in the Databases, Dataset Exploration and Visualization, Machine Learning, Bias and Fairness, and Errors and Inference workbooks, focuses on corrections data. The second example, used in the APIs, Record Linkage, Text Analysis, and Networks workbooks, draws primarily on patent data from PatentsView108 and grant data from Federal RePORTER109 to investigate innovation and funding.

The Jupyter notebooks are designed to be run online in a cloud environment using Binder110 and don’t need additional software installed locally. Individual workbooks can be opened by following the corresponding Binder link, and everything can be done in your browser. The full set of workbooks is available at (https://workbooks.coleridgeinitiative.org), and additional workbooks may be added over time and made available in this repository.

The workbooks can also be run locally. In that case, you will need to install Python on your system, along with some additional Python packages needed by the workbooks. The easiest way to get all of this working is to install the free Anaconda Python distribution111. Anaconda includes a Jupyter server and precompiled versions of many packages used in the workbooks. It includes tools for installing and updating both Python itself and the installed packages. It is separate from any OS-level version of Python, and is easy to completely uninstall.

13.2 Notebooks

Below is a list of the workbooks, along with a short summary of the content that each covers.

13.2.1 Databases

The Databases notebook lays the foundation for using SQL to query data; many of the later notebooks build on these tools. This workbook also introduces you to the main data source that is used in the online workbooks, the North Carolina Department of Corrections Data (https://webapps.doc.state.nc.us/opi/downloads.do?method=view). In this notebook, you will do the following (a brief code sketch appears after the list):

  • Build basic queries using SQL,

  • Understand and perform various joins.
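
As a flavor of what the workbook covers, here is a minimal sketch of a basic query and a join, run from Python against a SQLite database. The database file and the table and column names are hypothetical stand-ins for the North Carolina corrections tables.

    import sqlite3

    # Connect to a local SQLite database (the workbook runs against the
    # North Carolina Department of Corrections tables; the file and
    # table/column names below are hypothetical).
    conn = sqlite3.connect("corrections.db")
    cur = conn.cursor()

    # A basic query: select a few columns with a filter and a limit.
    cur.execute("""
        SELECT inmate_id, admission_date
        FROM sentences
        WHERE admission_date >= '2000-01-01'
        LIMIT 10
    """)
    print(cur.fetchall())

    # An inner join: combine sentence records with inmate demographics.
    cur.execute("""
        SELECT s.inmate_id, s.admission_date, i.birth_year
        FROM sentences AS s
        JOIN inmates AS i
          ON s.inmate_id = i.inmate_id
        LIMIT 10
    """)
    print(cur.fetchall())

    conn.close()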

13.2.2 Dataset Exploration and Visualization

The Dataset Exploration and Visualization notebook further explores the North Carolina Department of Corrections data, demonstrating how to work with missing values and date variables and how to join tables using SQL in Python. Though some of the SQL from the Databases notebook is revisited here, the focus is on practicing Python code and using Python for data analysis. The workbook also explains how to pull data from a database into a dataframe in Python, and then explores the imported data using the numpy and pandas packages, as well as matplotlib and seaborn for visualizations. In this workbook, you will learn how to do the following (a short example follows the list):

  • Connect to and query a database through Python,

  • Explore aggregate statistics in Python,

  • Create basic visualizations in Python.
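
The sketch below shows the pattern the workbook follows: pull a query result into a pandas dataframe, compute aggregate statistics, and plot a distribution. The database file and column names are hypothetical.

    import sqlite3
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Pull the result of a SQL query straight into a pandas DataFrame.
    # Database file and table/column names are hypothetical stand-ins
    # for the North Carolina corrections tables used in the workbook.
    conn = sqlite3.connect("corrections.db")
    df = pd.read_sql(
        "SELECT inmate_id, admission_date, sentence_length_days FROM sentences",
        conn,
    )
    conn.close()

    # Basic exploration: shape, missing values, and aggregate statistics.
    print(df.shape)
    print(df.isnull().sum())
    print(df["sentence_length_days"].describe())

    # A simple visualization of the distribution of sentence lengths.
    sns.histplot(df["sentence_length_days"].dropna(), bins=30)
    plt.xlabel("Sentence length (days)")
    plt.show()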

13.2.3 APIs

The APIs notebook introduces you to the use of Internet-based web service APIs for retrieving data from online data stores. This notebook walks through the process of retrieving patent data from the PatentsView API, which is provided by the United States Patent and Trademark Office. The data consist of information about patents, inventors, companies, and geographic locations since 1976. In this workbook, you will learn how to do the following (see the sketch after the list):

  • Construct a URL query,

  • Get a response from the URL,

  • Retrieve the data in JSON form.
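
The following is a minimal sketch of that workflow using the requests package. The endpoint and query-parameter format follow the PatentsView API documentation at the time the workbook was written and may have changed since.

    import json
    import requests

    # Construct a query against the PatentsView API (endpoint and
    # parameter format as documented at the time of writing).
    url = "https://api.patentsview.org/patents/query"
    params = {
        # q: the query criteria, expressed as a JSON string
        "q": json.dumps({"_gte": {"patent_date": "2010-01-01"}}),
        # f: the fields to return for each matching patent
        "f": json.dumps(["patent_number", "patent_title", "patent_date"]),
    }

    # Get a response from the URL and check that the request succeeded.
    response = requests.get(url, params=params)
    response.raise_for_status()

    # Retrieve the data in JSON form and inspect the first few records.
    data = response.json()
    for patent in data["patents"][:5]:
        print(patent["patent_number"], patent["patent_title"])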

13.2.4 Record Linkage

In the Record Linkage workbook, you will use Python to implement the basic concepts behind record linkage using data from PatentsView and Federal RePORTER. This workbook covers how to pre-process data for linkage and then demonstrates several methods of record linkage, including probabilistic record linkage, in which string comparators are used to compare multiple pieces of information between two records and produce a score indicating how likely it is that the records refer to the same underlying entity. In this workbook, you will learn how to do the following (a brief example follows the list):

  • Prepare data for record linkage,

  • Use and evaluate the results of common computational string comparison algorithms, including Levenshtein distance, Damerau–Levenshtein distance, and Jaro–Winkler distance,

  • Understand the Fellegi–Sunter probabilistic record linkage method, with a step-by-step implementation guide.
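
As an illustration of the string comparators listed above, the sketch below uses the jellyfish package (one common choice; the workbook's exact library may differ) to compare two name strings. Scores like these feed into the match weights used by the Fellegi–Sunter method.

    import jellyfish

    # Two name strings that plausibly refer to the same inventor.
    a = "Jonathan Smith"
    b = "Jonathon Smyth"

    # Edit-distance comparators: the number of single-character edits
    # needed to turn one string into the other (the Damerau variant also
    # counts transpositions as a single edit).
    print(jellyfish.levenshtein_distance(a, b))
    print(jellyfish.damerau_levenshtein_distance(a, b))

    # Jaro-Winkler similarity: 1.0 means identical; values near 1 suggest
    # a likely match. (Older jellyfish releases name this function
    # jaro_winkler rather than jaro_winkler_similarity.)
    print(jellyfish.jaro_winkler_similarity(a, b))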

13.2.5 Text Analysis

In the Text Analysis notebook, you will use the data that you pulled from the PatentsView API in the APIs notebook to find topics in patent abstracts. This involves going through every step of the process, from extracting the data, to cleaning and preparing it, to applying topic modeling algorithms. In this workbook, you will learn how to do the following (see the sketch below):

  • Clean and prepare text data,

  • Apply Latent Dirichlet Allocation for topic modeling,

  • Improve and iterate on models to home in on the identified topics.
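
The following minimal sketch shows the core of that pipeline with scikit-learn: vectorize a few toy abstracts and fit a small LDA model. The toy texts are invented for illustration.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    # A handful of toy abstracts standing in for the patent abstracts
    # pulled from the PatentsView API in the APIs workbook.
    abstracts = [
        "A battery electrode material with improved lithium storage capacity.",
        "A neural network method for classifying medical images.",
        "An electrolyte composition for rechargeable lithium batteries.",
        "A deep learning system for detecting tumors in radiology scans.",
    ]

    # Clean and vectorize the text: lowercase, drop English stop words,
    # and build a document-term matrix of word counts.
    vectorizer = CountVectorizer(stop_words="english")
    dtm = vectorizer.fit_transform(abstracts)

    # Fit a two-topic LDA model to the document-term matrix.
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    lda.fit(dtm)

    # Print the top words for each topic
    # (use get_feature_names in older scikit-learn releases).
    terms = vectorizer.get_feature_names_out()
    for k, weights in enumerate(lda.components_):
        top = [terms[i] for i in weights.argsort()[::-1][:5]]
        print(f"Topic {k}: {', '.join(top)}")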

13.2.6 Networks

In the Networks workbook, you will create network data in which the nodes are researchers who have been awarded grants and ties connect researchers named on the same grant. You will use Python to read the grant data and translate them into network data, calculate node- and graph-level network statistics, and create network visualizations. In this workbook, you will learn how to do the following (a short example follows the list):

  • Calculate node- and graph-level network statistics,

  • Create graph visualizations.
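
A minimal sketch of this workflow with the networkx package appears below; the toy grant records are invented stand-ins for the Federal RePORTER data.

    import itertools
    import networkx as nx
    import matplotlib.pyplot as plt

    # Toy grant records: each grant lists the researchers funded on it.
    grants = {
        "grant_1": ["Alice", "Bob", "Carol"],
        "grant_2": ["Bob", "Dave"],
        "grant_3": ["Carol", "Dave", "Erin"],
    }

    # Build the network: researchers are nodes, and a tie connects every
    # pair of researchers named on the same grant.
    G = nx.Graph()
    for researchers in grants.values():
        G.add_edges_from(itertools.combinations(researchers, 2))

    # Node-level statistics: degree and betweenness centrality.
    print(dict(G.degree()))
    print(nx.betweenness_centrality(G))

    # Graph-level statistics: density and number of connected components.
    print(nx.density(G))
    print(nx.number_connected_components(G))

    # A simple network visualization.
    nx.draw_networkx(G, with_labels=True)
    plt.show()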

13.2.7 Machine Learning – Creating Labels

The Machine Learning Creating Labels workbook is the first of a three-part Machine Learning workbook sequence, starting with how to create an outcome variable (label) for a machine learning task by using SQL in Python. It uses the North Carolina Department of Corrections Data to build an outcome that measures recidivism, i.e., whether a former inmate returns to jail within a given period of time. It also shows how to define a Python function to automate programming tasks. In this workbook, you will learn how to do the following (see the sketch after the list):

  • Define and compute a prediction target in the machine learning framework,

  • Use SQL with data that has a temporal structure (multiple records per observation).
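
The sketch below illustrates the idea: a single SQL query, run from Python, that flags whether each person released in a given year was readmitted within three years. The table and column names, the cohort year, and the follow-up window are hypothetical stand-ins for the choices made in the workbook.

    import sqlite3
    import pandas as pd

    # Build a recidivism label with SQL: did a person released in 2008
    # return to prison within three years of release?
    query = """
    SELECT r.inmate_id,
           CASE WHEN EXISTS (
               SELECT 1
               FROM admissions a
               WHERE a.inmate_id = r.inmate_id
                 AND a.admission_date > r.release_date
                 AND a.admission_date <= DATE(r.release_date, '+3 years')
           ) THEN 1 ELSE 0 END AS recidivism
    FROM releases r
    WHERE r.release_date BETWEEN '2008-01-01' AND '2008-12-31'
    """

    conn = sqlite3.connect("corrections.db")
    labels = pd.read_sql(query, conn)
    conn.close()

    # Share of the 2008 release cohort that returned within three years.
    print(labels["recidivism"].mean())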

13.2.8 Machine Learning – Creating Features

The Machine Learning Creating Features workbook prepares predictors (features) for the machine learning task introduced in the Machine Learning Creating Labels workbook. It shows how to use SQL in Python to generate features that are expected to predict recidivism, such as the number of times someone has been admitted to prison prior to a given date. In this workbook, you will learn how to do the following (a brief example follows the list):

  • Generate features with SQL for a given prediction problem,

  • Automate SQL tasks by defining Python functions.
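
Here is a minimal sketch of that pattern: a Python function that wraps a parameterized SQL query so the same feature can be computed as of any date. Table and column names are hypothetical.

    import sqlite3
    import pandas as pd

    def prior_admissions(conn, as_of_date):
        """Count each person's prison admissions before a given date.

        Table and column names are hypothetical stand-ins for the
        corrections data used in the workbook.
        """
        query = """
        SELECT inmate_id,
               COUNT(*) AS n_prior_admissions
        FROM admissions
        WHERE admission_date < ?
        GROUP BY inmate_id
        """
        return pd.read_sql(query, conn, params=(as_of_date,))

    conn = sqlite3.connect("corrections.db")
    # The same function can be reused for different prediction dates,
    # which is the kind of automation the workbook demonstrates.
    features_2008 = prior_admissions(conn, "2008-01-01")
    features_2009 = prior_admissions(conn, "2009-01-01")
    conn.close()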

13.2.9 Machine Learning – Model Training and Evaluation

The Machine Learning Model Training and Evaluation workbook uses the label and features created in the previous workbooks to construct a training and test set for model building and evaluation. It walks through examples of how to train machine learning models using scikit-learn in Python and how to evaluate prediction performance for classification tasks. In addition, it demonstrates how to construct and compare many different machine learning models in Python. In this workbook, you will learn how to do the following (see the sketch below):

  • Pre-process data to provide valid inputs for machine learning models,

  • Properly divide data with a temporal structure into training and test sets,

  • Train and evaluate machine learning models for classification using Python.
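
The sketch below shows the general pattern with scikit-learn: split by time, fit two models, and compare their test-set performance. The file and column names are hypothetical stand-ins for the label and features built in the earlier workbooks.

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_score, recall_score, roc_auc_score

    # One row per release, with the label, features, and release year
    # (all hypothetical column names).
    df = pd.read_csv("recidivism_matrix.csv")

    # Temporal split: train on earlier cohorts, test on a later cohort,
    # so the evaluation mimics predicting the future.
    train = df[df["release_year"] <= 2008]
    test = df[df["release_year"] == 2009]
    feature_cols = ["n_prior_admissions", "age_at_release", "sentence_length_days"]

    X_train, y_train = train[feature_cols], train["recidivism"]
    X_test, y_test = test[feature_cols], test["recidivism"]

    # Train and compare two models on the same train/test split.
    for model in (LogisticRegression(max_iter=1000),
                  RandomForestClassifier(n_estimators=200, random_state=0)):
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        scores = model.predict_proba(X_test)[:, 1]
        print(type(model).__name__,
              "precision:", precision_score(y_test, pred),
              "recall:", recall_score(y_test, pred),
              "AUC:", roc_auc_score(y_test, scores))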

13.2.10 Bias and Fairness

The Bias and Fairness workbook demonstrates the use of the bias and fairness audit toolkit Aequitas in Python. Using an example from criminal justice, it shows how Aequitas can be used to detect and evaluate biases of a machine learning system with respect to multiple (protected) subgroups. You will learn how to do the following (a short example follows the list):

  • Calculate confusion matrices for subgroups and visualize performance metrics by groups,

  • Measure disparities by comparing, e.g., false positive rates between groups,

  • Assess model fairness based on various disparity metrics.
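
To make the underlying calculation concrete, the sketch below computes a subgroup false positive rate and a disparity ratio directly with pandas. This is the kind of group-level audit that Aequitas automates and extends; the toy data, group labels, and reference group are invented for illustration.

    import pandas as pd

    # Toy predictions: a 0/1 model decision, the true label, and a
    # protected attribute (all hypothetical).
    df = pd.DataFrame({
        "group": ["A", "A", "A", "B", "B", "B", "B", "A"],
        "label": [0, 1, 0, 0, 1, 0, 0, 1],
        "pred":  [1, 1, 0, 0, 1, 1, 1, 0],
    })

    def false_positive_rate(d):
        # Among true negatives, the share predicted positive.
        negatives = d[d["label"] == 0]
        return (negatives["pred"] == 1).mean()

    # Confusion-matrix-based metric computed separately for each subgroup.
    fpr = df.groupby("group").apply(false_positive_rate)
    print(fpr)

    # Disparity: ratio of each group's FPR to the reference group's FPR.
    print(fpr / fpr["A"])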

13.2.11 Errors and Inference

The Errors and Inference workbook walks through how to think critically about issues that can arise in an analysis. In this notebook, you will evaluate the machine learning models from the previous notebooks and learn ways to improve the data so that as much information as possible is used when drawing conclusions. Specifically, you will learn how to do the following (see the sketch after the list):

  • Perform sensitivity analysis with machine learning models,

  • Use imputation to fill in missing values.
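
As a small illustration of the imputation step, the sketch below fills in missing feature values with column medians using scikit-learn's SimpleImputer; the column names are hypothetical. A simple sensitivity check is to refit a model on the imputed and on the complete-case data and compare the results.

    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer

    # A small feature table with missing values (hypothetical columns).
    df = pd.DataFrame({
        "age_at_release": [23, 31, np.nan, 45, 29],
        "n_prior_admissions": [1, np.nan, 0, 3, np.nan],
    })

    # Fill in missing values with each column's median, so records with
    # partial information can still be used rather than dropped.
    imputer = SimpleImputer(strategy="median")
    imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
    print(imputed)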

13.2.12 Additional Workbooks

An additional set of workbooks that accompanied the first edition of this book is available at (https://github.com/BigDataSocialScience/Big-Data-Workbooks). This repository provides two different types of workbooks, each needing a different Python setup to run. The first type is intended to be downloaded and run locally by individual users. The second type is designed to be hosted, assigned, worked on, and graded on a single server, using jupyterhub (https://github.com/jupyter/jupyterhub) to host and run the notebooks and nbgrader (https://github.com/jupyter/nbgrader) to assign, collect, and grade them.

13.3 Resources

We noted in the Introduction's Resources section the importance of Python, SQL, and Git/GitHub for the social scientist who intends to work with large data. See that section for pointers to useful online resources, and also see https://github.com/BigDataSocialScience, where we have collected many useful web links, including the following.

For more on getting started with Anaconda, see the Anaconda documentation.112

For more information on IPython and the Jupyter notebook server, see the IPython site (http://ipython.org/), IPython documentation (http://ipython.readthedocs.org/), Jupyter Project site (http://jupyter.org/), and Jupyter Project documentation (http://jupyter.readthedocs.org/).

For more information on using jupyterhub and nbgrader to host, distribute, and grade workbooks using a central server, see the jupyterhub GitHub repository (https://github.com/jupyter/jupyterhub/), jupyterhub documentation (http://jupyterhub.readthedocs.org/), nbgrader GitHub repository (https://github.com/jupyter/nbgrader/), and nbgrader documentation (http://nbgrader.readthedocs.org/).