Data Model Motivation
Module 2: Notebook 2
3 Introduction
Hi all, and welcome to the second notebook of Module 2! This notebook is intended to be relatively brief, providing a few motivating examples to show how the data model we learned about in class this week can make our lives much easier when it comes to manipulating and combining the raw data available to us in this class.
Our hope is that this notebook can provide a jumping-off point as you begin to explore the data model, which should be the primary data source you use for the rest of your team’s analysis. With that in mind, we also encourage you to examine a few of the data model tables yourselves, and to review the Fact Table Explainers available on the resources page of the class website. These will help provide a broader overview of the decision rules that go into fitting the raw data you began to explore in the EDA notebook into the structure of the data model.
Don’t worry if all of this data model material feels confusing - over the next several notebooks, you will see plenty of examples of how you can use the data model’s tables in practice to answer questions relevant to your projects. In fact, in this notebook, we won’t even dive very far into understanding the structure of the data model or its nuances. Instead, we just want to show a few simple examples that should help clarify why it’s worth going through the trouble of building a data model in the first place, and how we can understand and interpret the results that the data model gives us.
4 Technical setup
This workbook will leverage both SQL and R coding concepts, so we need to set up our environment to connect to the proper database and to load the R packages we need beyond those included in the base R environment. Typically, throughout these workbooks, we use SQL for the majority of data exploration and creation of the analytic frame, and then read that analytic frame into R for the descriptive analysis and visualization.
Note: If you would like to view the material to establish your own environment for running the code displayed in this notebook, you can expand the following “Environment Setup” section by clicking on its heading.
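Since that collapsed section isn’t reproduced here, below is a minimal sketch of the kind of setup it performs - the actual packages, driver, and connection details used in the ADRF will differ, so treat every name here as a placeholder.
library(DBI)      # generic database interface
library(odbc)     # one possible driver for connecting to Redshift
library(dplyr)    # data manipulation verbs
library(dbplyr)   # translates dplyr pipelines to SQL and provides in_schema()

# "con" is the connection object used by every tbl() call in this notebook;
# the DSN name below is purely illustrative
con <- dbConnect(odbc(), dsn = "Redshift")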
5 A motivating example: the Arkansas PIRL data
To motivate our data model, we are going to use the Arkansas PIRL data. The Arkansas PIRL (Participant Individual Record Layout) data has been provided by the Arkansas Department of Workforce Services and is our main source of information about WIOA (Workforce Innovation and Opportunity Act) program enrollment. The PIRL data are stored on Redshift under ds_ar_dws.pirl.
As you may have heard in class, this is a very large file - with over 500 variables! So, for exploration of this data, it is especially important to narrow in on the data values that are crucial to our research. Luckily, the data documentation for PIRL is also very robust, and is available either in the ADRF or on the class website.
Let’s take a quick look at the PIRL data like so:
select *
from ds_ar_dws.pirl
limit 5;
con %>%
  tbl(in_schema(schema = "ds_ar_dws", table = "pirl")) %>%
  head(n = 5) %>%
  collect()
Just as we might have expected (or feared!) - there are over 500 columns for each of our observations. We can also get the total number of observations like so:
select count(*)
from ds_ar_dws.pirl
con %>%
  tbl(in_schema(schema = "ds_ar_dws", table = "pirl")) %>%
  tally() %>%
  collect()
We see that there are approximately 2.5 million records - that is a lot of data.
Where we start to run into problems, though, is when we think about defining a row. Recall from last week’s EDA notebook that we usually want to find the distinct variable or set of variables which uniquely defines each row of the data. How can we do this for the PIRL data?
In class, we learned that, for the PIRL data:
One row: 1 person, 1 participation spell
This seems straightforward enough. To map this onto variables in the dataset, we can look through the data dictionary. As usual, we might guess that social_security_number (PIRL element number 2700) defines the “1 person”, while the combination of date_of_program_entry_wioa and date_of_program_exit_wioa defines the spell. Let’s see how many unique combinations of those variables there are:
select count(
  distinct social_security_number ||
  coalesce(date_of_program_entry_wioa::text, 'null') ||
  coalesce(date_of_program_exit_wioa::text, 'null')
)
from ds_ar_dws.pirl
con %>%
  tbl(in_schema(schema = "ds_ar_dws", table = "pirl")) %>%
  select(social_security_number, date_of_program_entry_wioa, date_of_program_exit_wioa) %>%
  distinct() %>%
  tally() %>%
  collect()
This is a much smaller number of records! We can see that there are, on average, almost three records per unique individual-start-end combination.
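If you’d like to verify that ratio yourself, a quick sketch like the following should do it - it simply re-runs the two counts from above and divides them in R:
# average number of PIRL rows per unique person-entry-exit combination (a sketch)
n_rows <- con %>%
  tbl(in_schema(schema = "ds_ar_dws", table = "pirl")) %>%
  tally() %>%
  collect()

n_combos <- con %>%
  tbl(in_schema(schema = "ds_ar_dws", table = "pirl")) %>%
  select(social_security_number, date_of_program_entry_wioa, date_of_program_exit_wioa) %>%
  distinct() %>%
  tally() %>%
  collect()

# ratio of total rows to distinct combinations
n_rows$n / n_combos$n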
Now, we could continue adding variables to our row definition and get closer and closer to uniquely defining each row. But here we have to ask ourselves - if an individual is reported to be enrolled in the same program, with the same start date, and with the same end date, how much do we really care about the other information available on the PIRL file at this point in our analysis? For example, if an individual has two rows reporting that they began WIOA Adult Program Benefits in Q2 of 2020, how much should we worry if one observation gives their birthday as 11/1/78 and another as 11/3/78? These kinds of discrepancies might be important, but perhaps not so much just for establishing our basic counts of program participants.
6 A solution: using the data model
As we learned in class, one useful way to tackle this issue is through the use of a data model. Not only can a data model make record linkage much easier, it can also handle a lot of the duplication issues like we saw above, through mastering and de-duplication.
We’re going to slowly introduce the main tables of our data model over the next few notebooks. In this notebook, we don’t want to dive too deep into the actual usage of the data model, but instead provide some understanding of how it helps us deal with the kinds of duplication issues we saw in the PIRL data above.
fact_person_program_start_end
There are two primary data model fact tables that we will use in this class. The first one we will look at is the fact_person_program_start_end table. As the name suggests, this table gives individuals’ start and end dates for participation in each of our main programs. Let’s take a look at this table here:
select *
from tr_e2e.fact_person_program_start_end
limit 5
con %>%
  tbl(in_schema(schema = "tr_e2e", table = "fact_person_program_start_end")) %>%
  head(n = 5) %>%
  collect()
We have the following columns:
- person_program_start_end_id: a unique ID for each row of the fact table
- person_key: a unique person identifier that links to social security number
- program_key: a unique identifier for each program (e.g. SNAP = 9, Adult WIOA = 1)
- entry_year_quarter_key and exit_year_quarter_key: unique identifiers for the quarters the individual entered and exited the program
You might be wondering why we have all of these keys instead of, for example, the actual social security number or the name of the program. These keys actually hold all the power of the dimensional model. With them, we can store lots and lots of observations in our main fact table very efficiently (numbers are quick to store), and then link those keys to the dimension tables that tell us what each key corresponds to.
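To make that linking concrete, here is a rough sketch of what a join from the fact table out to a dimension table looks like. Note the assumptions: we’re guessing that dim_program lives in the same tr_e2e schema as the fact tables and that it shares the program_key column - check the actual tables before relying on those details.
# a sketch: attach program descriptions to the fact table by joining on its key
con %>%
  tbl(in_schema(schema = "tr_e2e", table = "fact_person_program_start_end")) %>%
  left_join(
    con %>% tbl(in_schema(schema = "tr_e2e", table = "dim_program")),
    by = "program_key"
  ) %>%
  head(n = 5) %>%
  collect()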
If that still feels confusing, then feel free to refer back to the lecture slides for the data model introduction available on the class website for a little bit more of an explanation of fact and dimension tables. But don’t worry about it too much - you’ll get plenty of practice using and interpreting the data model over the next few notebooks.
Turning back to the problem at hand, we can deduce that this table shows program participation at the person-program-spell level - i.e., there is one row for each unique person/program/start-date combination. This is just what we learned in class, but it’s good to see it here in practice.
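If you want to convince yourself of that, one quick check (just a sketch) is to compare the total number of rows against the number of distinct person/program/entry-quarter combinations - if the table really is at the person-program-spell level, the two counts should match:
# total rows in the fact table
n_rows <- con %>%
  tbl(in_schema(schema = "tr_e2e", table = "fact_person_program_start_end")) %>%
  tally() %>%
  collect()

# distinct person/program/entry-quarter combinations
n_spells <- con %>%
  tbl(in_schema(schema = "tr_e2e", table = "fact_person_program_start_end")) %>%
  select(person_key, program_key, entry_year_quarter_key) %>%
  distinct() %>%
  tally() %>%
  collect()

n_rows$n == n_spells$n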
Let’s attempt to use this table to answer a simple question: how many individuals started an Adult WIOA benefit spell in 2019-Q4? Doing so is actually pretty simple with the data model - we can limit the above table to observations with program_key = 1 (Adult WIOA) and entry_year_quarter_key = 40 (2019-Q4). Remember that to determine the actual key values we want, we need to consult the dim_program and dim_year_quarter tables - but since this is just an example, we’re not too worried about that here.
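If you do want to see where those key values come from, a quick peek at the dimension tables does the trick. The sketch below assumes dim_program and dim_year_quarter sit in the same tr_e2e schema as the fact tables:
# look up the programs behind each program_key (a small lookup table)
con %>%
  tbl(in_schema(schema = "tr_e2e", table = "dim_program")) %>%
  collect()

# and the calendar quarters behind each year_quarter_key
con %>%
  tbl(in_schema(schema = "tr_e2e", table = "dim_year_quarter")) %>%
  head(n = 5) %>%
  collect()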
Let’s use these limitations now, and count the number of rows:
select count(*)
from tr_e2e.fact_person_program_start_end
where program_key = 1 and entry_year_quarter_key = 40
con %>%
  tbl(in_schema(schema = "tr_e2e", table = "fact_person_program_start_end")) %>%
  filter(program_key == 1 & entry_year_quarter_key == 40) %>%
  tally() %>%
  collect()
We can also count the number of unique individuals, instead of the number of rows, but because our observations are already unique at the person/program/start-date level, we should get the same number. Let’s verify that here:
select count(distinct person_key)
from tr_e2e.fact_person_program_start_end
where program_key = 1 and entry_year_quarter_key = 40
con %>%
  tbl(in_schema(schema = "tr_e2e", table = "fact_person_program_start_end")) %>%
  filter(program_key == 1 & entry_year_quarter_key == 40) %>%
  select(person_key) %>%
  distinct() %>%
  tally() %>%
  collect()
Now, let’s turn back to the PIRL data to get a better understanding of where these numbers come from. As expected, if we just count all observations in the PIRL with a program start falling in Q4 of 2019 we get a much larger number:
select count(*)
from ds_ar_dws.pirl
where
  adult_wioa != 0 and
  date_of_program_entry_wioa >= '2019-10-01' and
  date_of_program_entry_wioa <= '2020-01-01'
con %>%
  tbl(in_schema(schema = "ds_ar_dws", table = "pirl")) %>%
  filter(
    adult_wioa != '0' &
    date_of_program_entry_wioa >= '2019-10-01' &
    date_of_program_entry_wioa <= '2020-01-01'
  ) %>%
  tally() %>%
  collect()
But, again, here we’re not really concerned about the variation between these duplicated rows. We just want a count of the number of individuals who started on the WIOA Adult program in this quarter. Let’s therefore count only the unique social security numbers from the query above:
select count(distinct social_security_number)
from ds_ar_dws.pirl
where
  adult_wioa != 0 and
  date_of_program_entry_wioa >= '2019-10-01' and
  date_of_program_entry_wioa <= '2020-01-01'
con %>%
  tbl(in_schema(schema = "ds_ar_dws", table = "pirl")) %>%
  filter(
    adult_wioa != '0' &
    date_of_program_entry_wioa >= '2019-10-01' &
    date_of_program_entry_wioa <= '2020-01-01'
  ) %>%
  select(social_security_number) %>%
  distinct() %>%
  tally() %>%
  collect()
Now we see that the count aligns with that presented in the data model. Hopefully, this highlights what the data model is intended to do: rather than capturing all of the variation and detail that separate observations contain in the underlying data table, it prepares for us, in advance, a mastered and de-duplicated list of individual participation spells, which can provide a starting point for understanding program enrollment.
If you did want to capture additional information from the underlying PIRL data (for example, the job center at which individuals received their services), then you might need to add additional decision rules to your analysis to handle cases where, for example, multiple job centers are reported for the same person-program-spell combination. Even then, however, the data model can still provide a valuable jumping-off point by supplying the underlying list of program participants from which you build that mastered data.
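As a sketch of what such a decision rule might look like: suppose the PIRL had a column recording the job center (the job_center name below is made up - substitute the real PIRL element), and you decided to collapse each person-program-spell down to a single reported value, here just taking the maximum as an arbitrary tie-breaker:
# hypothetical decision rule: keep one job center value per person-program-spell
con %>%
  tbl(in_schema(schema = "ds_ar_dws", table = "pirl")) %>%
  filter(adult_wioa != '0') %>%
  group_by(social_security_number, date_of_program_entry_wioa, date_of_program_exit_wioa) %>%
  summarize(job_center = max(job_center, na.rm = TRUE)) %>%   # job_center is hypothetical
  ungroup() %>%
  collect()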
fact_person_quarterly_program_enrollment
The second fact table that we want to look at is the fact_person_quarterly_program_enrollment table. The purpose of this table is similar to the start and end date table we saw above: we want to reduce the variation and duplication in our different underlying data sources to a simplified table that helps us parse program participation over time. The difference is that the table above had one row per person-program-spell, whereas this fact table is at the person-program-quarter level.
To understand this difference, consider an individual who has an Adult WIOA benefit spell beginning in April 2019 and ending in December of the same year. In the start and end date table above, this would represent a single observation, with the entry quarter marked as 2019-Q2 and the exit quarter marked as 2019-Q4. In this second fact table, however, this same spell would translate into three observations - one each for 2019-Q2, 2019-Q3, and 2019-Q4. This allows us to explicitly and easily see all the individuals enrolled in a program in a given quarter. Let’s see what it looks like now:
select *
from tr_e2e.fact_person_quarterly_program_enrollment
order by person_key, program_key, enrollment_quarter_key
limit 5
con %>%
  tbl(in_schema(schema = "tr_e2e", table = "fact_person_quarterly_program_enrollment")) %>%
  arrange(person_key, program_key, enrollment_quarter_key) %>%
  head(n = 5) %>%
  collect()
In the output above, we see the person_key and program_key variables just as they were in the prior fact table. Now, however, instead of having quarter keys for the start and end quarters, we have just one variable: enrollment_quarter_key. This captures each quarter that the individual described by person_key was enrolled in the program described by program_key. The remaining variables provide additional context, such as whether the individual was enrolled for the entire quarter.
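To make the relationship between the two fact tables concrete, here is a purely illustrative sketch (plain R, no database involved) of how the single spell from the example above - entered in 2019-Q2, exited in 2019-Q4 - expands into one row per enrolled quarter. This is just to build intuition, not how the data model itself is constructed:
# the spell as it appears in the start/end fact table (one row)
spell <- data.frame(person = "A", program = "Adult WIOA",
                    entry_quarter = "2019-Q2", exit_quarter = "2019-Q4")

# all quarters in 2019 (these labels sort correctly as strings within a year)
quarters <- c("2019-Q1", "2019-Q2", "2019-Q3", "2019-Q4")

# the same spell as it appears in the quarterly fact table (one row per enrolled quarter)
data.frame(
  person = spell$person,
  program = spell$program,
  enrollment_quarter = quarters[quarters >= spell$entry_quarter &
                                  quarters <= spell$exit_quarter]
)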
To see the power of this, let’s first try to identify all the Adult WIOA participants (rather than just entrants) in 2019-Q4. To change things up, let’s start this time with the underlying PIRL data. This seems like it should be pretty easy - we just need to select the distinct count of social security numbers reported where (as sketched in code right after this list):
- The adult_wioa field is not equal to zero (Adult WIOA participants)
- The date of program entry is before January of 2020 (entered before or during 2019-Q4)
- The date of program exit is after September of 2019 (exited during or after 2019-Q4)
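A direct translation of those three conditions might look like the sketch below (every column here appears elsewhere in this notebook) - but hold off on trusting its answer until you read the next paragraph:
# a naive sketch of the three conditions above - this can undercount, for the
# reason described below (missing exit dates)
con %>%
  tbl(in_schema(schema = "ds_ar_dws", table = "pirl")) %>%
  filter(
    adult_wioa != '0' &
    date_of_program_entry_wioa <= '2019-12-31' &
    date_of_program_exit_wioa >= '2019-10-01'
  ) %>%
  select(social_security_number) %>%
  distinct() %>%
  tally() %>%
  collect()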
That logic alone is a little complicated, but things actually get even worse. One tricky aspect of working with the PIRL data is that an individual’s data will first show up in the PIRL file when they first enroll in a WIOA program. The problem is that at this point, that individual hasn’t exited the program yet, so their program exit date is going to be missing!
Let’s see an example of this:
select
  social_security_number,
  adult_wioa,
  date_of_program_entry_wioa,
  date_of_program_exit_wioa
from ds_ar_dws.pirl
where
  social_security_number = 'REDACTED'
con %>%
  tbl(in_schema(schema = "ds_ar_dws", table = "pirl")) %>%
  filter(social_security_number == 'REDACTED') %>%
  select(
    social_security_number,
    adult_wioa,
    date_of_program_entry_wioa,
    date_of_program_exit_wioa
  ) %>%
  collect()
We see that this individual has three records in the PIRL file, all corresponding to an Adult WIOA program enrollment. On the first observation, we see that the date_of_program_exit_wioa variable is missing - they haven’t yet exited the program! Then, later on, when the data is updated, another record is generated for this individual, showing their true exit date.
The problem here arises because sometimes this missing value is valid - an individual still actually is on the program! For example, if the individual from the example above had only the first row, then we would want to consider them as a valid participant for 2019-Q4, since as far as we would know, they would not have exited the program.
Thus, to truly get the count of all participants for 2019-Q4, we need to take all individuals where:
- The adult_wioa field is not equal to zero (Adult WIOA participants)
- The date of program entry is before January of 2020 (entered before or during 2019-Q4)
- The date of program exit is after September of 2019 (exited during or after 2019-Q4)
- OR the only reported date of program exit for a given start date is missing
We can implement that in code here, but it’s a bit of a mess:
with correctEndDates as (
  select distinct
    social_security_number,
    date_of_program_entry_wioa,
    max(date_of_program_exit_wioa) as date_of_program_exit_wioa
  from ds_ar_dws.pirl
  where adult_wioa != 0
  group by social_security_number, date_of_program_entry_wioa
)
select count(distinct social_security_number)
from correctEndDates
where
  date_of_program_entry_wioa <= '2019-12-31' and
  (date_of_program_exit_wioa >= '2019-10-01' or date_of_program_exit_wioa is NULL)
con %>%
  tbl(in_schema(schema = "ds_ar_dws", table = "pirl")) %>%
  filter(adult_wioa != '0') %>%
  group_by(social_security_number, date_of_program_entry_wioa) %>%
  summarize(
    date_of_program_exit_wioa = max(date_of_program_exit_wioa, na.rm = TRUE)
  ) %>%
  ungroup() %>%
  filter(
    date_of_program_entry_wioa <= '2019-12-31' &
    (date_of_program_exit_wioa >= '2019-10-01' | is.na(date_of_program_exit_wioa))
  ) %>%
  select(social_security_number) %>%
  distinct() %>%
  tally() %>%
  collect()
Now, in contrast, let’s try to find the same count using the data model. Here we just need to count the number of rows where program_key = 1 (Adult WIOA) and enrollment_quarter_key = 40 (2019-Q4).
select count(*)
from tr_e2e.fact_person_quarterly_program_enrollment
where
  program_key = 1 and
  enrollment_quarter_key = 40
con %>%
  tbl(in_schema(schema = "tr_e2e", table = "fact_person_quarterly_program_enrollment")) %>%
  filter(program_key == 1 & enrollment_quarter_key == 40) %>%
  tally() %>%
  collect()
So, so much simpler.
And so much more…
All of the examples we’ve seen above have used the PIRL data and focused on Adult WIOA, and that provides a great motivating example for the power of the data model. But as a final piece of our motivation, we should also at least preview how much easier this data model makes our lives when it comes to record linkage.
Let’s say we wanted to modify our example above to look at how many individuals participated in either Adult WIOA or SNAP in 2019-Q4. We won’t even begin to present here how you would handle this question without the data model - roughly, you would have to repeat all the calculations we did above on the raw SNAP file (dealing with any weird cases that arose there as well) and then try to combine that with our PIRL calculations to find the individuals that were in either. Using the data model, things are much simpler:
select count(distinct person_key)
from tr_e2e.fact_person_quarterly_program_enrollment
where
  (program_key = 1 or program_key = 9) and
  enrollment_quarter_key = 40
con %>%
  tbl(in_schema(schema = "tr_e2e", table = "fact_person_quarterly_program_enrollment")) %>%
  filter((program_key == 1 | program_key == 9) & enrollment_quarter_key == 40) %>%
  select(person_key) %>%
  distinct() %>%
  tally() %>%
  collect()
7 Conclusion
Hopefully, this notebook has helped you understand the value of the data model in helping you work with some of the messy underlying data that we have to use in this course. In future notebooks, you’ll see plenty more examples of how to use this data model for your analysis, and in particular, how it simplifies the process of record linkage. In the meantime, we strongly encourage you again to review the Fact Table Explainers available on the Resources page of your class website to begin building up your intuition for the decision rules that the data model uses to combine so many disparate sources of data.