Cross-Sectional Analysis
Module 2: Notebook 3
4 Introduction
Welcome to the third notebook for Module 2 of this course. This notebook will build on the data coverage and structure discussion in the exploratory data analysis notebook to a method rooted in a specific moment in time, a cross-section analysis, while presenting a component of the data model. A cross-section allows us to look at a slice of our data at a specific moment in time so we can evaluate the stock of observations, just at that particular snapshot. Through the remainder of the class notebooks, we will apply each topic to the same focused research topic, all aimed at identifying promising pathways for a specific set of TANF-enrolled individuals before COVID-imposed restrictions were enforced in the state.
Composing a cross-section enables for broad understandings of volume and in this context, TANF recipient compositions. In addition, within this point in time, we can evaluate the prevalence of participation in multiple benefit programs and drill down into specific combinations. Especially from a benefit provider standpoint, it can be useful to understand common characteristics and situations of those receiving TANF benefits, regardless of benefit duration, particularly in evaluating scenarios that might lead to employment and self-sufficiency.
5 Technical setup
Here, we will reintroduce the code required to set up our R environment to connect to the proper database and load certain packages. If you would rather plan on running the SQL code separately, instructions for creating a .sql
script and running the code directly in DBeaver are also available allow.
Code chunks in this notebook will be given in both SQL and R, so follow the setup that applies to your preferred language below.
6 Data Model: Quarterly Enrollment
In the previous notebook, we began introducing the data model for the class, briefly motivating the program participation fact tables using WIOA data. A data model allows us to create the “modeled data” we use for our analysis - the set of physical, easy-to-combine tables that adhere to the same rule-based framework, and does so by taking the underlying “raw” data tables and converting them into events and observations so they can be analyzed easier.
Here, we will focus on the quarterly program participation fact table, fact_person_quarterly_program_enrollment
, located in the tr_e2e
schema. As indicated by its name, this table tracks participation by quarter across the different WIOA, SNAP, and TANF programs. Since it contains participation information at consistent points in time, the table can facilitate the construction of a cross-sectional analytic frame. We will link this fact table, which stores the actual observations (facts) of interest, to three dimension tables:
- Person dimension, storing information on the unique collection of persons available in the data, merging person-level attributes from a variety of sources, resulting in a “golden record” with a higher completeness than is available in individual sources
- Time dimension, storing all possible values for a period of time (day, week, quarter, month, year) across a long period and allows for easy cross-referencing across different periods
- Program Dimension, a reference table mapping a numeric surrogate identifier in the fact table to a more descriptive program name
The following diagram (available by selecting the “ER Diagram” after clicking on the specific table in the Database Navigator of DBeaver) visualizes the relationships between this fact table and these three dimension tables in the data model:
Let’s dive a bit deeper into each of these tables first before we proceed with our cross-sectional analysis.
Person Dimension: DIM Person
The person table, located at tr_e2e.dim_person
, stores information on the characteristics of people, including date of birth, gender, and race/ethnicity. Although it has two separate individual identifiers, person_key
and person_uid
, person_key
is the only one to exist in other tables in the data model. person_key
is a simplified version of person_uid
, which refers to the hashed SSN of the individual and can be used to link to the various raw data tables.
We can take a look at a small subset of the table:
SELECT *
FROM tr_e2e.dim_person
ORDER BY person_key
LIMIT 5
%>%
con tbl(in_schema(schema = "tr_e2e",
table = "dim_person")) %>%
arrange(person_key) %>%
head(5)
NULL
values refer to missing values or information not captured across the various data sources.
Time Dimension: DIM Year Quarter
The time dimension table is located at tr_e2e.dim_year_quarter
and stores information about specific time periods. The surrogate key is the year_quarter_key
, which is a simplified version of a specific year/quarter combination. The following query shows the fields in the table:
SELECT *
FROM tr_e2e.dim_year_quarter
ORDER BY year_quarter_key
LIMIT 5
%>%
con tbl(in_schema(schema = "tr_e2e",
table = "dim_year_quarter")) %>%
arrange(year_quarter_key) %>%
head(5)
We see here that year_quarter_key
0 is associated with a nonsensical date in the year 9999, which suggests this is a missing data code.
Program Dimension: DIM Program
The program dimension table, tr_e2e.dim_program
, is a reference table mapping the surrogate identifier in the fact table, program_key
, to its longer program name. We can see the complete table with the following query:
SELECT *
FROM tr_e2e.dim_program
ORDER BY program_key
%>%
con tbl(in_schema(schema = "tr_e2e",
table = "dim_program")) %>%
arrange(program_key)
Fact: Quarterly Enrollment by Person and Program
As mentioned above, the quarterly enrollment information by person and program is available in tr_e2e.fact_person_quarterly_program_enrollment
. Each row is a unique combination of person, program, and enrollment, meaning that not only may a person in a program have potentially multiple rows in the data (one for each quarter they are enrolled), but they may also have multiple rows in a given quarter if they are enrolled in multiple programs at the same time. This table actually makes it very easy to examine concurrent participation across programs, as we will demonstrate later in this notebook.
SELECT *
FROM tr_e2e.fact_person_quarterly_program_enrollment
LIMIT 5
%>%
con tbl(in_schema(schema = "tr_e2e",
table = "fact_person_quarterly_program_enrollment")) %>%
head(5)
Note the presence of the three surrogate columns linking to each of the dimension tables: person_key
, program_key
, and enrollment_quarter_key
. Beyond this information, the table contains indicators denoting the enrollment patterns within the quarter, as most program information is tracked at least at the monthly level.
7 Cross-section
Cross-sectional analyses allow us to examine the characteristics of people at a selected point in time, permitting us to observe patterns and differences across subpopulations. In general, though, they are limited in gaining a deep understanding of experiences over time because they consist of observations at certain points in time, rather than observations consistently over a time period.
In the figure below, we illustrate that a cross-section actually captures multiple cohorts of individuals at a single point in time, compared to a longitudinal sample, which would follow a single cohort, or multiple cohorts over time. If we take repeated cross-sections of data, we can compare the composition and characteristics of participants across those points in time. This is like comparing information from two census years. Every vertical colored line in the diagram represents a cross-sectional analysis. By developing our own cross-section, we are going to explore one of those vertical lines.
A cohort is a defined group based on common time-based characteristics. In the diagram, you can think of each cohort as a separate group of enterers based on the date of their initial benefit reception.
A different analysis method is more appropriate for a longitudinal study, one that we will introduce in the next notebook. Even though we will eventually build out a longitudinal study for recipients starting to receive TANF benefits before COVID-related restrictions were imposed in the state, starting with a cross-sectional analysis will help us better understand the dynamics of the entire set of individuals receiving TANF benefits at this time. Here, we aim to evaluate this stock of claimants in a variety of ways:
- Race/Ethnicity
- Age
- Concurrent participation in other programs at the same time
Do you think a cross-sectional analysis would be helpful for your group project? Either way, jot down some ideas and provide a supporting argument for doing so (or not) in your team’s project template. Identify specific date ranges that may be valuable to consider in developing a cross section if you are arguing in favor of the analysis.
Creating a Cross-section
The specific pre-COVID time frame we will focus on in this notebook is the first quarter of 2019. For context, in the EDA notebook, we graphed the number of observations over time for the person-level file, ds_ar_dhs.tanf_member
. The blue area in the line plot below highlights the observations that will define our cross-section.
Note: We are using
tanf_member
instead oftanf_case
because our population of interest is individuals, not cases.
Our cross-section comes roughly a year and a half after the peak in observations in the person-level TANF file. Looking forward, the number of rows continues to decrease over time, losing roughly half of the total number of observations by the start of the 2022 calendar year.
Using the original data source
We can define and derive our cross-section solely from the original person-level TANF data source. Before we do so, let’s remind ourselves of the basic layout of the file:
SELECT *
FROM ds_ar_dhs.tanf_member
LIMIT 5
%>%
con tbl(in_schema(schema = "ds_ar_dhs",
table = "tanf_member")) %>%
head(5)
Since the file contains observations at the social_security_number
/reporting_month
level, we should have enough information just using this table to define our cross-section by limiting reporting_month
:
SELECT COUNT(*) as num_rows, COUNT(DISTINCT(social_security_number)) as num_ppl
FROM ds_ar_dhs.tanf_member
WHERE reporting_month in ('201901', '201902', '201903')
%>%
con tbl(in_schema(schema = "ds_ar_dhs",
table = "tanf_member")) %>%
filter(reporting_month %in% c("201901", "201902", "201903")) %>%
summarize(
num_rows = n(),
num_ppl = n_distinct(social_security_number)
)
For consistency with future work, where we will link observations across data sources by their hashed social security numbers, we will only consider observations with a validated hashed social security number as part of the cross-section. You may have noticed the column valid_ssn_format
. It contains two values, either Y
or N
. We can add this condition to our pre-existing query:
SELECT COUNT(*) as num_rows, COUNT(DISTINCT(social_security_number)) as num_ppl
FROM ds_ar_dhs.tanf_member
WHERE reporting_month in ('201901', '201902', '201903') AND
= 'Y' valid_ssn_format
%>%
con tbl(in_schema(schema = "ds_ar_dhs",
table = "tanf_member")) %>%
filter(
%in% c("201901", "201902", "201903"),
reporting_month == 'Y'
valid_ssn_format %>%
) summarize(
num_rows = n(),
num_ppl = n_distinct(social_security_number)
)
Note that even though we have attempted to remove the time dimension by limiting our focus to TANF recipients in 2019Q1, since the TANF data is recorded at the monthly grain, and not quarterly, there may be some individuals appearing in the data during multiple months within the quarter. This is indicated by comparing the difference in the number of rows and unique individuals.
We can then read this into R for further analysis:
The code in the SQL query can be run in DBeaver by pasting just the code inside the quotations after
qry <-
<- "
qry SELECT *
FROM ds_ar_dhs.tanf_member
WHERE reporting_month in ('201901', '201902', '201903') AND
valid_ssn_format = 'Y'
"
<- dbGetQuery(con, qry)
cross_section_ods
head(cross_section_ods)
<- con %>%
cross_section_ods tbl(in_schema(schema = "ds_ar_dhs",
table = "tanf_member")) %>%
filter(
%in% c("201901", "201902", "201903"),
reporting_month == 'Y'
valid_ssn_format %>%
) # bring into R
collect()
head(cross_section_ods)
Using the fact table
Since the fact_person_quarterly_program_enrollment
table already separates out information by person, program, and quarter, our query for identifying TANF recipients in 2019Q1 only requires two inner joins to the dim_program
and dim_year_quarter
dimension tables:
SELECT COUNT(*) as num_rows, COUNT(DISTINCT(fact.person_key)) as num_ppl
FROM tr_e2e.fact_person_quarterly_program_enrollment fact
--INNER JOIN WORKS BECAUSE DIM TABLES CONTAIN ALL POSSIBLE VALUES IN FACT
JOIN tr_e2e.dim_program dp ON (dp.program_key = fact.program_key)
JOIN tr_e2e.dim_year_quarter dyq ON
= dyq.year_quarter_key)
(fact.enrollment_quarter_key --WHERE IS OKAY TO USE BECAUSE AN INNER JOIN
--BE CAREFUL USING 'WHERE' WITH A LEFT OR OUTER JOIN
WHERE dp.program_name = 'Temporary Assistance for Needy Families (TANF)' AND
= '2019' AND dyq.calendar_quarter = '1' dyq.calendar_year
<- con %>%
fact tbl(in_schema(schema = "tr_e2e",
table = "fact_person_quarterly_program_enrollment"))
<- con %>%
dp tbl(in_schema(schema = "tr_e2e",
table = "dim_program")) %>%
filter(program_name == "Temporary Assistance for Needy Families (TANF)")
<- con %>%
dyq tbl(in_schema(schema = "tr_e2e",
table = "dim_year_quarter")) %>%
filter(calendar_year == '2019', calendar_quarter == '1')
%>%
fact inner_join(dp, by = "program_key") %>%
inner_join(dyq, by = c("enrollment_quarter_key" = "year_quarter_key")) %>%
summarize(
num_rows = n(),
num_ppl = n_distinct(person_key)
)
In fact, if you compare the number of unique people retrieved in this query to that using the original data source, you will see we have the same number of people! Additionally, since the fact table stores observations quarterly, and not monthly, we only have one row per person. If you refer to the “Fact Table Explainers” guide available on the Class Resources page of the website, you will see that we adhered to the same rules in working with the tanf_member
file that were used to populate this fact table. Let’s take a look at the resulting structure of the merged table:
SELECT fact.*, dp.program_name, dyq.quarter_start_date, dyq.quarter_end_date
FROM tr_e2e.fact_person_quarterly_program_enrollment fact
JOIN tr_e2e.dim_program dp ON (dp.program_key = fact.program_key)
JOIN tr_e2e.dim_year_quarter dyq ON
= dyq.year_quarter_key)
(fact.enrollment_quarter_key WHERE dp.program_name = 'Temporary Assistance for Needy Families (TANF)' AND
= '2019' AND dyq.calendar_quarter = '1'
dyq.calendar_year LIMIT 5
<- con %>%
fact tbl(in_schema(schema = "tr_e2e",
table = "fact_person_quarterly_program_enrollment"))
<- con %>%
dp tbl(in_schema(schema = "tr_e2e",
table = "dim_program")) %>%
filter(program_name == "Temporary Assistance for Needy Families (TANF)")
<- con %>%
dyq tbl(in_schema(schema = "tr_e2e",
table = "dim_year_quarter")) %>%
filter(calendar_year == '2019', calendar_quarter == '1') %>%
# don't need these variables
select(-c(calendar_year, calendar_quarter))
%>%
fact inner_join(dp, by = "program_key") %>%
inner_join(dyq, by = c("enrollment_quarter_key" = "year_quarter_key")) %>%
head(5)
With a join to the person dimension table, little bit of date manipulation, and logic for aggregating the different race/ethnicity columns into one larger variable, we can pull in a data frame to facilitate our next goal, evaluating the race/ethnicity and age compositions of the cross-section:
Note: We are calculating
age
relative to the start of the quarter.
<- "
qry SELECT fact.*
, dp.program_name
, dyq.quarter_start_date
, dyq.quarter_end_date
, dper.date_of_birth
--DATEDIFF FINDS DIFFERENCE BETWEEN TWO DATES, AND REDSHIFT ROUNDS DOWN
, DATEDIFF(quarter, dper.date_of_birth, dyq.quarter_start_date)/4 as age
, CASE
WHEN dper.ethnicity_american_indian_or_alaska_native = 'Yes' THEN 'AIAN'
WHEN dper.ethnicity_black_or_african_american = 'Yes' THEN 'AfAm'
WHEN dper.ethnicity_hispanic_or_latino= 'Yes' THEN 'Hisp'
WHEN dper.ethnicity_asian = 'Yes' THEN 'Asian'
WHEN dper.ethnicity_white = 'Yes' THEN 'Wht'
WHEN dper.ethnicity_other ='Yes' THEN 'Other'
ELSE 'Missing'
END AS eth_recode
FROM tr_e2e.fact_person_quarterly_program_enrollment fact
JOIN tr_e2e.dim_program dp ON (dp.program_key = fact.program_key)
JOIN tr_e2e.dim_year_quarter dyq
ON (fact.enrollment_quarter_key = dyq.year_quarter_key)
JOIN tr_e2e.dim_person dper ON (dper.person_key = fact.person_key)
WHERE dp.program_name = 'Temporary Assistance for Needy Families (TANF)' AND
dyq.calendar_year = '2019' AND dyq.calendar_quarter = '1'
ORDER BY dper.person_key
"
<- dbGetQuery(con, qry) %>%
cross_section_fact # redshift using SQL code converts to character automatically, so need to recast as numeric
mutate(
age = as.numeric(age)
)
head(cross_section_fact)
<- con %>%
fact tbl(in_schema(schema = "tr_e2e",
table = "fact_person_quarterly_program_enrollment"))
<- con %>%
dp tbl(in_schema(schema = "tr_e2e",
table = "dim_program")) %>%
filter(program_name == "Temporary Assistance for Needy Families (TANF)")
<- con %>%
dyq tbl(in_schema(schema = "tr_e2e",
table = "dim_year_quarter")) %>%
filter(calendar_year == '2019', calendar_quarter == '1') %>%
# don't need these variables
select(-c(calendar_year, calendar_quarter))
<- con %>%
dper tbl(in_schema(schema = "tr_e2e",
table = "dim_person")) %>%
mutate(
eth_recode = case_when(
== 'Yes' ~ 'AIAN',
ethnicity_american_indian_or_alaska_native == 'Yes' ~ 'AfAm',
ethnicity_black_or_african_american == 'Yes' ~ 'Hisp',
ethnicity_hispanic_or_latino == 'Yes' ~ 'Asian',
ethnicity_asian == 'Yes' ~ 'Wht',
ethnicity_white == 'Yes' ~ 'Other',
ethnicity_other TRUE ~ 'Missing'
)%>%
) # keep only relevant columns
select(person_key, date_of_birth, eth_recode)
<- fact %>%
cross_section_fact inner_join(dp, by = "program_key") %>%
inner_join(dyq, by = c("enrollment_quarter_key" = "year_quarter_key")) %>%
inner_join(dper, by = "person_key") %>%
arrange(person_key) %>%
# take floor to round down to integer like redshift using SQL code
mutate(age = floor(datediff(sql("QUARTER"), date_of_birth, quarter_start_date)/4)) %>%
collect()
head(cross_section_fact)
While we could proceed with our cross_section_ods
file, using the data model allows us to leverage the dim_person
table containing our golden records of time invariant variables, as opposed to working with the separate TANF microdata, which may contain some discrepancies across months for the same person. We read the results of the resulting query into R since moving forward, the methods we will apply require manipulation and visualization either not easily executable or not accessible in SQL relative to R.
Analysis
With our cross-section identified, we can proceed with our analysis. Keep in mind that the variables we explore in the following subsections represent only a subset of possible comparison groups.
Race/Ethnicity
We can start by looking at our combined race/ethnicity variable, eth_recode
.
Note:
CASE WHEN
in SQL (case_when()
in R) will stop evaluating the row after the firstWHEN
clause to return TRUE. In the code, for example, ifethnicity_asian
andethnicity_white
were both'yes'
, the individual will be assigned aneth_recode
value ofAsian
.
%>%
cross_section_fact group_by(eth_recode) %>%
summarise(npersons = n_distinct( person_key )) %>%
# ungroup so we can take the percentage with denominator as all in next step
ungroup() %>%
mutate(pct = 100 * (npersons / sum( npersons )))
Instead of visualizing this distribution using a common bar plot, we can opt for the more adventurous lollipop plot, which uses the geom_point
and geom_segment
geometries.
%>%
cross_section_fact group_by(eth_recode) %>%
summarise(npersons = n_distinct( person_key )) %>%
ungroup() %>%
mutate(pct = 100 * (npersons / sum( npersons ))) %>%
ggplot(aes(y =pct,
x=eth_recode))+
geom_point() +
# line should go from 0 to the percentage for each group
geom_segment(aes(y = 0, yend = pct, x = eth_recode, xend = eth_recode)) +
# x- and y-axes are flipped
coord_flip() +
labs(title = "Percentage of TANF Recipients by Race/Ethnicity",
subtitle = "Cross-section of participants in Q1 2019")
We can see that the vast majority of TANF recipients in the first quarter of 2019 were categorized as either African American or White.
Age
We can treat age as a continuous variable, visualizing the distribution using a histogram:
%>%
cross_section_fact ggplot(aes(x=age)) +
geom_histogram()
While this distribution plot is somewhat useful, we can get a more exact understanding of the distribution by grouping age into separate 5-year intervals:
%>%
cross_section_fact # chose range greater than overall distribution to ensure all are captured
mutate(age_group = cut(age, breaks = seq(0, 110, 5), include.lowest = T)) %>%
group_by(age_group) %>%
summarise(npersons = n_distinct( person_key )) %>%
ungroup() %>%
mutate(pct = 100 * (npersons / sum( npersons )))
From this we see the highest percentage of recipients in this quarter is between 25 and 30 years of age. Since we plan on eventually analyzing workforce outcomes, let’s get a sense of the youngest and oldest individuals in the cross-section:
%>%
cross_section_fact summarize(
youngest = min(age),
oldest = max(age)
)
Just from an age standpoint, the cross-section seems to be satisfactory from a workforce-eligibility standpoint on the lower end. However, moving forward, we might want to add an additional condition to narrow in on only those deemed workforce-eligible.
Concurrent Participation
Above, we developed a cross-section just focusing on TANF recipients. But what if we are interested in looking at concurrent participation across multiple programs within our cross-section of TANF recipients?
We can do so leveraging the fact_person_quarterly_program_enrollment
table. In the following query, we restrict all potential observations those in the 2019Q1 TANF cross-section, adding additional information if they were enrolled in SNAP or Adult Education at the same time. We could expand this to include all other programs, but for the ease of interpretation, we will just focus on these programs in this example. This is where we really get to witness the power of the data model. Imagine if we had to do all of this linkage across the original SNAP and PIRL data sources!
There are two
program_name
values pertaining to Adult Education in the fact table. We will only focus onprogram_name = Adult Education (JOINT)
because the Joint PIRL table is more complete than the Adult Education information tracked in the regular PIRL file. The Adult Education information from the PIRL file is accessible in the data model by settingprogram_name = Adult Education (WIOA)
.
<- "
qry WITH tanf AS (
--CROSS SECTION OF 2019Q1 TANF PARTICIPANTS
SELECT dper.person_key
FROM tr_e2e.fact_person_quarterly_program_enrollment fact
JOIN tr_e2e.dim_program dp ON (dp.program_key = fact.program_key)
JOIN tr_e2e.dim_year_quarter dyq
ON (fact.enrollment_quarter_key = dyq.year_quarter_key)
JOIN tr_e2e.dim_person dper ON (dper.person_key = fact.person_key)
WHERE dp.program_name = 'Temporary Assistance for Needy Families (TANF)' AND
dyq.calendar_year = '2019' AND dyq.calendar_quarter = '1'
),
adult_snap AS (
--CROSS SECTION OF 2019Q1 SNAP AND ADULT EDUCATION PARTICIPANTS
SELECT dper.person_key,
fact.program_key
FROM tr_e2e.fact_person_quarterly_program_enrollment fact
JOIN tr_e2e.dim_program dp ON (dp.program_key = fact.program_key)
JOIN tr_e2e.dim_year_quarter dyq
ON (fact.enrollment_quarter_key = dyq.year_quarter_key)
JOIN tr_e2e.dim_person dper ON (dper.person_key = fact.person_key)
WHERE dp.program_name IN ('Supplemental Nutrition Assistance Program (SNAP)', 'Adult Education (JOINT)') AND
dyq.calendar_year = '2019' AND dyq.calendar_quarter = '1'
)
SELECT tanf.*, adult_snap.program_key
FROM tanf
--LEFT JOIN KEEPS EVERYONE FROM TANF. IF THEY DON'T JOIN THEY HAVE NA PROGRAM_KEY
LEFT JOIN adult_snap
ON (tanf.person_key = adult_snap.person_key)
ORDER BY tanf.person_key, program_key
"
<- dbGetQuery(con, qry)
co_xsection
head(co_xsection)
<- con %>%
fact tbl(in_schema(schema = "tr_e2e",
table = "fact_person_quarterly_program_enrollment"))
<- con %>%
dp tbl(in_schema(schema = "tr_e2e",
table = "dim_program")) %>%
filter(program_name == "Temporary Assistance for Needy Families (TANF)")
<- con %>%
dyq tbl(in_schema(schema = "tr_e2e",
table = "dim_year_quarter")) %>%
filter(calendar_year == '2019', calendar_quarter == '1') %>%
# don't need these variables
select(-c(calendar_year, calendar_quarter))
<- con %>%
dper tbl(in_schema(schema = "tr_e2e",
table = "dim_person"))
# cross-section of 2019Q1 tanf participants
<- fact %>%
tanf inner_join(dp, by = "program_key") %>%
inner_join(dyq, by = c("enrollment_quarter_key" = "year_quarter_key")) %>%
inner_join(dper, by = "person_key") %>%
select(person_key)
<- con %>%
dp_as tbl(in_schema(schema = "tr_e2e",
table = "dim_program")) %>%
filter(program_name %in% c("Supplemental Nutrition Assistance Program (SNAP)", "Adult Education (JOINT)"))
# cross-section of 2019Q1 snap and adult education participants
<- fact %>%
adult_snap inner_join(dp_as, by = "program_key") %>%
inner_join(dyq, by = c("enrollment_quarter_key" = "year_quarter_key")) %>%
inner_join(dper, by = "person_key") %>%
select(person_key, program_key)
<- tanf %>%
co_xsection # left_join keeps everyone from tanf. if they don't join they have NA program_key
left_join(adult_snap, by = "person_key") %>%
arrange(person_key, program_key) %>%
collect()
head(co_xsection)
We can confirm that our join worked as intended by comparing the number of unique person_key
values in co_xsection
to that of cross_section_fact
.
%>%
co_xsection summarize(
n_rows = n(),
n_people = n_distinct(person_key)
)
We can see that we have the same number of people as before, which aligns with the logic of our left join. Additionally, we see that we do not have exactly one row per person - and that is okay! We may have multiple rows per person_key
if the individual is concurrently participating in all three programs in 2019Q1. In fact, the structure of co_xsection
may not be as clear as we would like, but it does adhere to the rules of the joining mechanism. Let’s take a look at the possible values of program_key
:
%>%
co_xsection distinct(program_key)
Given that TANF corresponds to program_key = 10
, it may be surprising to see that there are not any observations with this value. But again, this is due to the nature of the join - every individual in co_xsection
is enrolled in TANF! This means that if they have a NULL program_key
value, the individual only received TANF benefits in 2019Q1. Otherwise, if they have a non-null program_key
, then in addition to receiving TANF benefits, they also participated in a separate benefit program.
With some manipulation, we can aggregate this table to categorize the nature of concurrent participation, or lack thereof:
# identify those in all three programs
# will have two rows in data frame
<- co_xsection %>%
all_three group_by(person_key) %>%
summarize(
n = n()
%>%
) filter(n > 1) %>%
ungroup()
# of all either in just TANF or TANF + either SNAP or Adult Ed
<- co_xsection %>%
not_three # anti_join ignores all observations in all_three
anti_join(all_three, by = 'person_key') %>%
group_by(program_key) %>%
summarize(
n = n_distinct(person_key)
%>%
) ungroup() %>%
mutate(
program_name = case_when(
== 9 ~ "TANF + SNAP",
program_key == 12 ~ "TANF + Adult ED Joint",
program_key is.na(program_key) ~ "TANF",
TRUE ~ "missing"
)%>%
) select(-program_key)
%>%
not_three add_row(
# each person is a row in all_three
n = nrow(all_three),
program_name = "TANF + SNAP + Adult ED Joint"
%>%
) mutate(perc = 100*n/sum(n)) %>%
arrange(desc(perc)) %>%
# rearrange columns to bring program name to front
select(program_name, everything())
We can visualize this output using a treemap. Unfortunately, a treemap is not a default geometry available in ggplot2
(loaded with library(tidyverse)
), but there are thankfully other libraries available in R to create them. In fact, besides the treemap
package we will be using below, there is a separate treemapify
package that integrates with the overall ggplot2
framework. We will be using treemap
because the default image is a tad cleaner. That being said, the treemapify
approach may be a bit more flexible.
#load treemap package
library(treemap)
# if using treemapify
#library(treemapify)
%>%
not_three add_row(
# each person is a row in all_three
n = nrow(all_three),
program_name = "TANF + SNAP + Adult ED Joint"
%>%
) treemap(index = "program_name", vSize = "n")
Interesting! Between the output table and treemap, we can see that the majority of individuals receiving TANF benefits in 2019Q1 were also participating in SNAP at the same time as well.
Now that we have demonstrated a use case of the quarterly program participation fact table, think about how your team may use this table in your project. Are there particular programs that you are interested in? How could you modify the code in this notebook to suit the time period, demographic group, and set of participants in which you are interested?
8 Next Steps: Applying the notebook to your project
This workbook applies the concepts of a cross-sectional analysis to the Arkansas data and covers some of the considerations and potential of such a investigation. Through the motivation of building a cross-sectional analytic frame, we leveraged a relatively new facet of the data model, a fact table covering quarterly program participation. We also showed how you can apply the same decision rules populating the data model to the original data source. Beyond limiting a cross-section to a singular program, we demonstrated the full power of the data model structure in using this fact table to identify concurrent participation in related programs. If you want to apply a different set of decision rules to identify a cross-section that are not consistent with the data model, you can still use the model by limiting observations to just those satisfying your team’s constraints.
In upcoming notebooks, we will expand the scope of the information covered and focus on carrying out a longitudinal analysis. Even if your team’s ultimate plan expands beyond it, a cross-sectional approach may be useful. If your team deems it appropriate to develop a cross-section or use the quarterly program participation fact table, you are encouraged to repurpose as much code as possible in developing your initial snapshot and subsequent descriptive analysis. As you work through your project, remember to add your thoughts and findings to your team’s project template in the ADRF.
9 Citations
Arkansas Applied Data Analytics Training Program - Notebook 3A: Cross-Sectional Analysis (citation to be updated)
Wisconsin Applied Data Analytics Training Program - Notebook 1B: Cross-Sectional Analysis (citation to be updated)