Data Model Motivation

Module 2: Notebook 2

Author

Roy McKenzie, Ben Feder

3 Introduction

Hi all, and welcome to the second notebook of Module 2! This notebook is intended to be relatively brief, and provide a few motivating examples to show how the data model we learned about in class this week can make our lives much easier when it comes to manipulating and combining the raw data we have available to us in this class.

Our hope is that this notebook can provide a jumping-off point as you begin to explore the data model, which should be the primary data source you use for the rest of your team’s analysis. With that in mind, we also encourage you to examine a few of the data model tables yourselves, and to review the Fact Table Explainers available on the resources page of the class website. These will help provide a broader overview of the decision rules that go into fitting the raw data you began to explore in the EDA notebook into the structure of the data model.

Don’t worry if all of this data model material feels confusing - over the next several notebooks, you will see plenty of examples of how you can use the data model’s tables in practice to answer questions relevant to your projects. In fact, in this notebook, we won’t even dive very far into understanding the structure of the data model or its nuances. Instead, we just want to show a few simple examples that should help clarify why it’s worth going through the trouble of building a data model in the first place, and how we can understand and interpret the results that the data model gives us.

4 Technical setup

This workbook will leverage both SQL and R coding concepts, so we need to set up our environment to connect to the proper database and to load R packages that extend the base R environment. Typically, throughout these workbooks, we use SQL for the majority of data exploration and creation of the analytic frame, and then read that analytic frame into R for the descriptive analysis and visualization.

Note: If you would like to view the material to establish your own environment for running the code displayed in this notebook, you can expand the following “Environment Setup” section by clicking on its heading.

For working with the database directly using SQL, the easiest way is still to copy the commands from this notebook into a script in DBeaver. As a reminder, the steps to do so are as follows:

To create a new .sql script:

  1. Open DBeaver, located on the ADRF Desktop. The icon looks like this:

  2. Establish your connection to the database by logging in. To do this, double-click Redshift11_projects on the left-hand side, and it will ask for your username and password. Your username is adrf\ followed by the name of your U: drive folder - for example, adrf\John.Doe.T00112. Your password is the same as the second password you used to log in to the ADRF - if you forgot it, you can adjust it in the ADRF management portal!

    After you successfully establish your connection, you should see a green check next to the database name, like so:

  3. In the top menu bar, click SQL Editor then New SQL Script:

  4. To test if your connection is working, try pasting the following chunk of code into your script:

    SELECT * 
    FROM tr_e2e.dim_person
    LIMIT 5

    Then run it by clicking the run button next to the script, or by pressing CTRL + Enter.

  5. You should then be able to see the query output in the box below the code.

The easiest way to work with the Redshift data from within an R script is to use a combination of a JDBC connection and the dbplyr package. The necessary setup for this approach is described below.

Load libraries

Just like in the Foundations Module, to run SQL and R code together through R, we need to load the RJDBC package. In addition, we will load the tidyverse suite of packages, as they will help us implement some of our fundamental data operations while maintaining a consistent syntax. Lastly, to enable an option discussed in the coming sections, we will load a new package for working with databases in R, dbplyr.

Every time you create a new R file, you should copy and run the following code snippet. You can easily copy the entire snippet by hovering over the cell and clicking the clipboard icon on the top-right edge.

options(scipen = 999) # avoid scientific notation
library(RJDBC)
library(tidyverse)
library(dbplyr)
library(zoo) # time/date manipulations

New package dbplyr

In the examples that follow, we continue using our new package for interacting with databases without writing raw SQL: dbplyr. The dbplyr package interfaces with a database using standard dplyr and tidyverse workflows, and complementary dbplyr code is shown for all SQL query examples. Just like the SQL code, it should be run from R using the connection we established in the collapsed “Environment Setup” section above.
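If you are ever curious what SQL dbplyr is writing for you, you can preview the translated query before running it. Here is a minimal sketch using show_query(), assuming the con connection established in the “Environment Setup” section:

# Print the SQL that dbplyr generates, instead of executing the query
con %>% 
  tbl(in_schema(schema = "tr_e2e",
                table = "dim_person")) %>%
  head(n = 5) %>%
  show_query()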

Warning

If you receive an error message saying there is no package called ..., then please first run:

install.packages("PACKAGENAME")

where you replace PACKAGENAME with the name of the package that is missing.

Establish Database Connection

To load data from the Redshift server into R, we need to first set up a connection to the database. The following set of commands accomplish this:

dbusr=Sys.getenv("DBUSER")
dbpswd=Sys.getenv("DBPASSWD")

url <- "jdbc:redshift:iam://adrf-redshift11.cdy8ch2udktk.us-gov-west-1.redshift.amazonaws.com:5439/projects;loginToRp=urn:amazon:webservices:govcloud;ssl=true;AutoCreate=true;idp_host=adfs.adrf.net;idp_port=443;ssl_insecure=true;plugin_name=com.amazon.redshift.plugin.AdfsCredentialsProvider"

driver <- JDBC(
  "com.amazon.redshift.jdbc42.Driver",
  classPath = "C:\\drivers\\redshift_withsdk\\redshift-jdbc42-2.1.0.12\\redshift-jdbc42-2.1.0.12.jar",
  identifier.quote="`"
)

con <- dbConnect(driver, url, dbusr, dbpswd)

As a reminder, don’t worry too much about the details of this connection - you can simply copy and paste this code each time you want to connect your R script to the Redshift database.
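A quick way to confirm that the connection is working is to run a trivial query - for example:

# If the connection is good, this returns a one-row, one-column data frame
dbGetQuery(con, "SELECT 1 AS connection_test")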

New .Renviron

For this code to work, you need to create a new .Renviron file in your user folder (i.e. U:\\John.Doe.P00002) that contains the following:

DBUSER='adrf\John.Doe.P00002'
DBPASSWD='xxxxxxxxxxxx'

where John.Doe.P00002 is replaced with your username and xxxxxxxxxxxx is replaced with your password (both still in quotes!). The setup of this code is nearly identical to that required in the Foundations Module workspace - however, DBUSER should now end with .T00113 instead of .T00112.
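Once you have saved the .Renviron file and restarted your R session, you can confirm that the values are being picked up:

# Should print your username; an empty string means the .Renviron
# file was not found or the session has not been restarted yet
Sys.getenv("DBUSER")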

A detailed video from the Foundations Module, “Introduction to RStudio,” which demonstrates how to create an .Renviron file, is available on the Resources page of the class website in the “Quick Links” subsection.

5 A motivating example: the Arkansas PIRL data

To motivate our data model, we are going to use the Arkansas PIRL data. The Arkansas PIRL (Participant Individual Record Layout) data has been provided by the Arkansas Department of Workforce Services and is our main source of information about WIOA (Workforce Innovation and Opportunity Act) program enrollment. The PIRL data are stored on Redshift under ds_ar_dws.pirl.

As you may have heard in class, this is a very large file - with over 500 variables! So, for exploration of this data, it is especially important to narrow in on the data values that are crucial to our research. Luckily, the data documentation for PIRL is also very robust, and is available either in the ADRF or on the class website.
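With that many variables, it is often helpful to list the column names before pulling any rows. One way to do this - a sketch, where the SQL version uses the standard information_schema view that Redshift supports, and the R version pulls zero rows just to read the column structure:

select column_name, data_type
from information_schema.columns
where table_schema = 'ds_ar_dws' 
  and table_name = 'pirl';
con %>% 
  tbl(in_schema(schema = "ds_ar_dws",
                table = "pirl")) %>%
  head(n = 0) %>%   # LIMIT 0: fetch no rows, just the columns
  collect() %>%
  colnames()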

Let’s take a quick look at the PIRL data like so:

select * 
from ds_ar_dws.pirl 
limit 5;
con %>% 
  tbl(in_schema(schema = "ds_ar_dws",
                table = "pirl")) %>%
  head(n = 5) %>%
  collect() 

Just as we might have expected (or feared!) - there are over 500 columns for each of our observations. We can also get the total number of observations like so:

select count(*) 
from ds_ar_dws.pirl 
con %>% 
  tbl(in_schema(schema = "ds_ar_dws",
                table = "pirl")) %>%
  tally() %>%
  collect()

We see that there are approximately 2.5 million records - that is a lot of data.

Where we start to run into problems, though, is when we think about defining a row. Recall from last week’s EDA notebook that we usually want to find the distinct variable or set of variables that uniquely defines each row of the data. How can we do this for the PIRL data?

In class, we learned that, for the PIRL data:

One row: 1 person, 1 participation spell

This seems straightforward enough. To map this onto variables in the dataset, we can look through the data dictionary. As usual, we might guess that social_security_number (PIRL element number 2700) defines the “1 person,” while a combination of date_of_program_entry_wioa and date_of_program_exit_wioa defines the spell. Let’s see how many unique combinations of those variables there are:

select count(
  distinct social_security_number || 
  coalesce(date_of_program_entry_wioa::text, 'null') ||
  coalesce(date_of_program_exit_wioa::text, 'null')
)
from ds_ar_dws.pirl 
con %>% 
  tbl(in_schema(schema = "ds_ar_dws",
                table = "pirl")) %>%
  select(social_security_number, date_of_program_entry_wioa, date_of_program_exit_wioa) %>%
  distinct() %>%
  tally() %>%
  collect()

This is a much smaller number of records! We can see that there are, on average, almost three records per unique individual-start-end combination.
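If you want to see how that duplication is spread out - how many combinations appear once, twice, three times, and so on - here is a quick sketch using the dbplyr workflow:

# Count rows per person/entry/exit combination, then tabulate those counts
con %>% 
  tbl(in_schema(schema = "ds_ar_dws",
                table = "pirl")) %>%
  count(social_security_number, date_of_program_entry_wioa, date_of_program_exit_wioa) %>%
  count(n, name = "n_combinations") %>%
  arrange(n) %>%
  collect()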

Now, we could continue adding variables to our row definition and get closer and closer to uniquely defining each row. But here we have to ask ourselves - if an individual is reported to be enrolled in the same program, with the same start date, and with the same end date, how much do we really care about the other information available on the PIRL file at this point in our analysis? For example, if an individual has two rows reporting that they began WIOA Adult Program Benefits in Q2 of 2020, how much should we worry if one observation gives their birthday as 11/1/78 and another as 11/3/78? These kinds of discrepancies might be important, but perhaps not so much just for establishing our basic counts of program participants.

6 A solution: using the data model

As we learned in class, one useful way to tackle this issue is through the use of a data model. Not only can a data model make record linkage much easier, it can also handle a lot of duplication issues like the ones we saw above, through mastering and de-duplication.

We’re going to slowly introduce the main tables of our data model over the next few notebooks. In this notebook, we don’t want to dive too deep into the actual usage of the data model, but instead provide some understanding of how it helps us deal with the kinds of duplication issues we saw in the PIRL data above.

fact_person_program_start_end

There are two primary data model fact tables that we will use in this class. The first one we will look at is the fact_person_program_start_end table. As the name suggests, this table gives individuals’ start and end dates for participation in each of our main programs. Let’s take a look at this table here:

select * 
from tr_e2e.fact_person_program_start_end
limit 5
con %>% 
  tbl(in_schema(schema = "tr_e2e",
                table = "fact_person_program_start_end")) %>%
  head(n = 5) %>%
  collect()

We have the following columns:

  • person_program_start_end_id: a unique ID for each row of the fact table
  • person_key: A unique person identifier that links to social security number
  • program_key: A unique identifier for each program (e.g. SNAP = 9, Adult WIOA = 1)
  • entry_year_quarter_key and exit_year_quarter_key: Unique identifier for the quarters the individual entered and exited the program

You might be wondering why we have all of these keys instead of, for example, just the actual social security number or the name of the program. These keys actually hold all the power of the dimensional model. With them, we are able to store lots and lots of observations in our main fact table very efficiently (numbers are quick to store), and then link those to further information in our dimension tables that tells us what each key corresponds to.

If that still feels confusing, then feel free to refer back to the lecture slides for the data model introduction available on the class website for a little bit more of an explanation of fact and dimension tables. But don’t worry about it too much - you’ll get plenty of practice using and interpreting the data model over the next few notebooks.

Turning back to the problem at hand, we can deduce that this table shows program participation at the person-program-spell level - i.e., there is one row for each unique person/program/start-date combination. This is just what we learned in class, but it’s good to see it here in practice.
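If you would like to verify that claim yourself, one sketch is to count rows per person/program/entry-quarter combination (note the fact table stores entry quarters rather than exact dates, so this checks uniqueness at the quarter level) and confirm that none appears more than once:

# Number of person/program/entry-quarter combinations with more than
# one row; if the table is unique at this level, the result is zero
con %>% 
  tbl(in_schema(schema = "tr_e2e",
                table = "fact_person_program_start_end")) %>%
  count(person_key, program_key, entry_year_quarter_key) %>%
  filter(n > 1) %>%
  tally() %>%
  collect()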

Let’s attempt to use this table to answer a simple question: how many individuals started an Adult WIOA benefit spell in 2019-Q4? Doing so is actually pretty simple with the data model - we can limit the above table to observations with program_key = 1 (Adult WIOA) and entry_year_quarter_key = 40 (2019-Q4). Remember that to determine the actual key values we need, we would consult the dim_program and dim_year_quarter tables - but since this is just an example, we’re not too worried about that.
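For reference, decoding those keys is just a matter of joining the fact table to its dimension tables. A sketch follows - the join column names in dim_program and dim_year_quarter are assumptions here, so check those tables for the exact names:

-- illustrative only: dimension column names may differ in practice
select f.person_key, dp.*, dyq.*
from tr_e2e.fact_person_program_start_end f
join tr_e2e.dim_program dp 
  on f.program_key = dp.program_key
join tr_e2e.dim_year_quarter dyq 
  on f.entry_year_quarter_key = dyq.year_quarter_key
limit 5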

Let’s use these limitations now, and count the number of rows:

select count(*) 
from tr_e2e.fact_person_program_start_end
where program_key = 1 and entry_year_quarter_key = 40
con %>% 
  tbl(in_schema(schema = "tr_e2e",
                table = "fact_person_program_start_end")) %>%
  filter(program_key == 1 & entry_year_quarter_key == 40) %>%
  tally() %>%
  collect()

We can also count the number of unique individuals, instead of the number of rows, but because our observations are already unique at the person/program/start-date level, we should get the same number. Let’s verify that here:

select count(distinct person_key) 
from tr_e2e.fact_person_program_start_end
where program_key = 1 and entry_year_quarter_key = 40
con %>% 
  tbl(in_schema(schema = "tr_e2e",
                table = "fact_person_program_start_end")) %>%
  filter(program_key == 1 & entry_year_quarter_key == 40) %>%
  select(person_key) %>% 
  distinct() %>%
  tally() %>%
  collect()

Now, let’s turn back to the PIRL data to get a better understanding of where these numbers come from. As expected, if we just count all observations in the PIRL with a program start falling in Q4 of 2019 we get a much larger number:

select count(*) 
from ds_ar_dws.pirl 
where 
  adult_wioa != 0 and 
  date_of_program_entry_wioa >= '2019-10-01' and 
  date_of_program_entry_wioa < '2020-01-01'
con %>% 
  tbl(in_schema(schema = "ds_ar_dws",
                table = "pirl")) %>%
  filter(
    adult_wioa != '0' &
    date_of_program_entry_wioa >= '2019-10-01' &
    date_of_program_entry_wioa < '2020-01-01'
  ) %>%
  tally() %>%
  collect()

But, again, here we’re not really concerned with the variation among these duplicated rows. We just want a count of the number of individuals who started on the WIOA Adult program in this quarter. Let’s therefore count only the unique social security numbers from the query above:

select count(distinct social_security_number) 
from ds_ar_dws.pirl 
where 
  adult_wioa != 0 and 
  date_of_program_entry_wioa >= '2019-10-01' and 
  date_of_program_entry_wioa < '2020-01-01'
con %>% 
  tbl(in_schema(schema = "ds_ar_dws",
                table = "pirl")) %>%
  filter(
    adult_wioa != '0' &
    date_of_program_entry_wioa >= '2019-10-01' &
    date_of_program_entry_wioa < '2020-01-01'
  ) %>%
  select(social_security_number) %>%
  distinct() %>%
  tally() %>%
  collect()

Now we see that the count aligns with the one presented in the data model. Hopefully, this highlights what the data model is intended to do: rather than capturing all of the variation and detail that separate observations contain in the underlying data table, it prepares in advance a mastered, de-duplicated list of individual participation spells, which can provide a starting point for understanding program enrollment.

If you did want to capture additional information from the underlying PIRL data (for example, the job center at which individuals received their services), then you might need to add additional decision rules to your analysis to handle potential issue cases where, for example, multiple job centers are reported for the same person-program-spell combination. Even in this case, however, the data model can still provide a valuable jumping-off point: it gives you the underlying list of program participants from which to build that mastered data.
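As a sketch of what such a decision rule could look like - note that job_center below is a purely hypothetical column name used for illustration, not necessarily an actual PIRL field - you might keep one deterministic value per spell:

-- hypothetical example: keep the alphabetically-first job_center per spell;
-- 'job_center' is an illustrative column name, not a confirmed PIRL field
select 
  social_security_number, 
  date_of_program_entry_wioa, 
  min(job_center) as job_center
from ds_ar_dws.pirl
group by social_security_number, date_of_program_entry_wioa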

fact_person_quarterly_program_enrollment

The second fact table that we want to look at is the fact_person_quarterly_program_enrollment table. The purpose of this table is similar to that of the start and end date table we saw above: we want to reduce the variation and duplication in our different underlying data sources to a simplified table that helps us parse program participation over time. The difference is that the table above had one row per person-program-spell, whereas this fact table is at the person-program-quarter level.

To understand this difference, consider an individual who has an Adult WIOA benefit spell beginning in April 2019 and ending in December of the same year. In the start and end date table above, this would represent a single observation, with the entry quarter marked as 2019-Q2 and the exit quarter marked as 2019-Q4. In this second fact table, however, this same spell would translate into three observations - one each for 2019-Q2, 2019-Q3, and 2019-Q4. This allows us to explicitly and easily see all the individuals enrolled in a program in a given quarter.
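To make that expansion concrete, here is a tiny sketch in R using the zoo package we loaded earlier (this is just local date arithmetic to illustrate the idea, not a query against the data model):

# Expand a spell running from 2019-Q2 through 2019-Q4 into one entry per quarter
entry_quarter <- as.yearqtr("2019 Q2")
exit_quarter  <- as.yearqtr("2019 Q4")
as.yearqtr(seq(as.numeric(entry_quarter), as.numeric(exit_quarter), by = 0.25))
# "2019 Q2" "2019 Q3" "2019 Q4"

Now let’s see what the fact table itself looks like: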

select * 
from tr_e2e.fact_person_quarterly_program_enrollment
order by person_key, program_key, enrollment_quarter_key
limit 5
con %>% 
  tbl(in_schema(schema = "tr_e2e",
                table = "fact_person_quarterly_program_enrollment")) %>%
  arrange(person_key, program_key, enrollment_quarter_key) %>%
  head(n = 5) %>%
  collect()

In the output above, we see the person_key and program_key variables just as they were in the prior fact table. Now, however, instead of having quarter keys for the start and end quarters, we have just one variable: enrollment_quarter_key. This captures each quarter that the individual described by person_key was enrolled in the program described by program_key. The remaining variables provide additional context, such as whether the individual was enrolled for the entire quarter.

To see the power of this, let’s first try to identify all the Adult WIOA participants (rather than just entrants) in 2019-Q4. To change things up, let’s start this time with the underlying PIRL data. This seems like it should be pretty easy - we just need to count the distinct social security numbers reported where:

  • The adult_wioa field is not equal to zero (Adult WIOA participants)
  • The date of program entry is before January of 2020 (entered before or during 2019-Q4)
  • The date of program exit is after September of 2019 (exited during or after 2019-Q4)

That logic alone is a little complicated, but things actually get even worse. One tricky aspect of working with the PIRL data is that an individual first shows up in the PIRL file when they enroll in a WIOA program. The problem is that at this point, that individual hasn’t exited the program yet, so their program exit date is going to be missing!

Let’s see an example of this:

select
  social_security_number,
  adult_wioa,
  date_of_program_entry_wioa,
  date_of_program_exit_wioa
from ds_ar_dws.pirl 
where 
  social_security_number = 'REDACTED'
con %>% 
  tbl(in_schema(schema = "ds_ar_dws",
                table = "pirl")) %>%
  filter(social_security_number == 'REDACTED') %>%
  select(
    social_security_number, 
    adult_wioa, 
    date_of_program_entry_wioa, 
    date_of_program_exit_wioa
  ) %>%
  collect()

We see that this individual has three records in the PIRL file, all corresponding to an Adult WIOA program enrollment. On the first observation, we see that the date_of_program_exit_wioa variable is missing - they haven’t yet exited the program! Then, later on, when the data is updated, another record is generated for this individual, showing their true exit date.

The problem here arises because sometimes this missing value is valid - the individual actually is still on the program! For example, if the individual from the example above had only the first row, then we would want to consider them a valid participant for 2019-Q4, since, as far as we would know, they had not yet exited the program.

Thus, to truly get the count of all participants for 2019-Q4, we need to take all individuals where:

  • The adult_wioa field is not equal to zero (Adult WIOA participants)
  • The date of program entry is before January of 2020 (entered before or during 2019-Q4)
  • The date of program exit is after September of 2019 (exited during or after 2019-Q4)
    • OR the only reported date of program exit for a given start date is missing

We can implement that in code here, but it’s a bit of a mess:

with correctEndDates as (
  select 
    social_security_number, 
    date_of_program_entry_wioa, 
    max(date_of_program_exit_wioa) as date_of_program_exit_wioa
  from ds_ar_dws.pirl 
  where adult_wioa != 0 
  group by social_security_number,  date_of_program_entry_wioa 
)
select count(distinct social_security_number)
from correctEndDates
where 
  date_of_program_entry_wioa <= '2019-12-31' and 
  (date_of_program_exit_wioa >= '2019-10-01' or date_of_program_exit_wioa is NULL)
con %>% 
  tbl(in_schema(schema = "ds_ar_dws",
                table = "pirl")) %>% 
  filter(adult_wioa != '0') %>%
  group_by(social_security_number, date_of_program_entry_wioa) %>%
  summarize(
    date_of_program_exit_wioa = max(date_of_program_exit_wioa, na.rm = TRUE)
  ) %>%
  ungroup() %>%
  filter(
    date_of_program_entry_wioa <= '2019-12-31' &
    (date_of_program_exit_wioa >= '2019-10-01' | is.na(date_of_program_exit_wioa))
  ) %>%
  select(social_security_number) %>%
  distinct() %>%
  tally() %>%
  collect()

Now, in contrast, let’s try to find the same count using the data model. Here we just need to count the number of rows where program_key = 1 (Adult WIOA) and enrollment_quarter_key = 40 (2019-Q4).

select count(*)
from tr_e2e.fact_person_quarterly_program_enrollment
where 
  program_key = 1 and
  enrollment_quarter_key = 40 
con %>% 
  tbl(in_schema(schema = "tr_e2e",
                table = "fact_person_quarterly_program_enrollment")) %>%
  filter(program_key == 1 & enrollment_quarter_key == 40) %>%
  tally() %>%
  collect()

So, so much simpler.

And so much more…

All of the examples we’ve seen above have used the PIRL data and focused on Adult WIOA, and that provides a great motivating example for the power of the data model. But as a final piece of motivation, we should at least preview how much easier the data model makes our lives when it comes to record linkage.

Let’s say we wanted to modify our example above to look at how many individuals participated in either Adult WIOA or SNAP in 2019-Q4. We won’t even begin to present here how you would handle this question without the data model - roughly, you would have to repeat all the calculations we did above on the raw SNAP file (dealing with any weird cases that arose there as well) and then try to combine that with our PIRL calculations to find the individuals that were in either. Using the data model, things are much simpler:

select count(distinct person_key)
from tr_e2e.fact_person_quarterly_program_enrollment
where 
  (program_key = 1 or program_key = 9)  and
  enrollment_quarter_key = 40 
con %>% 
  tbl(in_schema(schema = "tr_e2e",
                table = "fact_person_quarterly_program_enrollment")) %>%
  filter((program_key == 1 | program_key == 9) & enrollment_quarter_key == 40) %>%
  select(person_key) %>% 
  distinct() %>%
  tally() %>%
  collect()
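And because everything now lives in a single enrollment table, breakdowns across programs are nearly one-liners as well. For example, a sketch counting enrolled individuals per program in 2019-Q4:

# One row per person-program-quarter, so row counts per program_key
# equal the number of enrolled individuals in each program
con %>% 
  tbl(in_schema(schema = "tr_e2e",
                table = "fact_person_quarterly_program_enrollment")) %>%
  filter(enrollment_quarter_key == 40) %>%
  count(program_key) %>%
  collect()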

7 Conclusion

Hopefully, this notebook has helped you understand the value of the data model in helping you work with some of the messy underlying data we have to use in this course. In future notebooks, you’ll see plenty more examples of how to use the data model in your analysis and, in particular, how it simplifies the process of record linkage. In the meantime, we strongly encourage you again to review the Fact Table Explainers available on the Resources page of your class website to begin building up your intuition for the decision rules that the data model uses to combine so many disparate sources of data.