Working with NAICS Crosswalk
Module 2: Supplemental Notebook
Introduction
This supplemental notebook focuses on linking the NAICS-employer crosswalk with the data model, particularly the fact table. A similar procedure can be followed for matching the crosswalk with the UI wage records table.
We will start by loading necessary packages not readily available in the base R setup.
As a reminder, every time you create a new R file, you should copy and run the following code snippet.
options(scipen = 999) # avoid scientific notation
library(RJDBC)
library(tidyverse)
And then set up a connection to the Redshift database:
=Sys.getenv("DBUSER")
dbusr=Sys.getenv("DBPASSWD")
dbpswd
<- "jdbc:redshift:iam://adrf-redshift11.cdy8ch2udktk.us-gov-west-1.redshift.amazonaws.com:5439/projects;
url loginToRp=urn:amazon:webservices:govcloud;ssl=true;
AutoCreate=true;idp_host=adfs.adrf.net;idp_port=443;
ssl_insecure=true;
plugin_name=com.amazon.redshift.plugin.AdfsCredentialsProvider"
<- JDBC(
driver "com.amazon.redshift.jdbc42.Driver",
classPath = "C:\\drivers\\redshift_withsdk\\redshift-jdbc42-2.1.0.12\\redshift-jdbc42-2.1.0.12.jar",
identifier.quote="`"
)
<- dbConnect(driver, url, dbusr, dbpswd) con
For this code to work, you need to have an .Renviron
file in your user folder (i.e. U:\\John.Doe.P00002
) containing your username and password.
Exploration
Before linking the crosswalk with additional employer information available in other tables, it is helpful to identify any potential discrepancies that may affect the quality of the linkage. Here, we will investigate the columns we plan to use in our join to ensure consistency between the sources.
Employer information is available in all three data sources - PROMIS, UI Wage Records, and of course, the NAICS crosswalk. We’ll start with the data we’re already using.
Fact table
The fact table contains two columns with employer information:
last_employer
: Formerlyui_number
in the PROMIS data, tracks the individual’s most recent employer while claiming UI benefits.primary_employer_id
: Corresponds toui_account
in the wage records table, tracks the individual’s primary employer in a given quarter.
Let’s take a look at how these variables are currently stored within the fact table:
# filter for non-null observations
<- "
qry select last_employer, primary_employer_id
from tr_wi_2023.wi_fact_weekly_observation
where last_employer is not null and primary_employer_id is not null
limit 5
"
dbGetQuery(con, qry)
When developing the fact table, we decided to keep the variable types consistent with their original source - here, that results in the employer UI account numbers stored as integers and characters within the PROMIS and wage record data, respectively. In comparing these two columns, there are processing issues in coercing primary_employer_id
to an integer, so we recommend instead transforming last_employer
into a character variable.
We can also find the possible lengths of the UI account number variables:
# last employer (from PROMIS)
<- "
qry select length(last_employer), count(*)
from tr_wi_2023.wi_fact_weekly_observation
where last_employer is not null
group by length(last_employer)
order by length(last_employer)
"
dbGetQuery(con, qry)
# current employer (from UI wages)
<- "
qry select length(primary_employer_id), count(*)
from tr_wi_2023.wi_fact_weekly_observation
where primary_employer_id is not null
group by length(primary_employer_id)
order by length(primary_employer_id)
"
dbGetQuery(con, qry)
We can see that the potential lengths of both variables span from 1-6.
Crosswalk
The crosswalk table has been ingested into the ds_wi_dwd
schema as employer_naics_xwalk
. As we did in the exploratory data analysis, or EDA, process, we can start by looking at a few rows from the table.
<- "
qry select *
from ds_wi_dwd.employer_naics_xwalk
limit 5
"
dbGetQuery(con, qry)
Thankfully, the table is quite simple, just containing the two columns required to link industry-level information to employers existing in other tables. Note the leading zeroes here - if you recall, the other variables tracking employer UI account numbers both do not have leading zeroes. Before looking at account number lengths, let’s verify that each row contains a unique UI account number:
<- "
qry select count(*) as total_rows, count(distinct(ui_account_root_number)) as total_emps
from ds_wi_dwd.employer_naics_xwalk
"
dbGetQuery(con, qry)
There appear to be instances of an employer UI account number appearing multiple times in the data. We can check to see if there are complete duplicate rows or potentially more than one NAICS code per employer with the following query:
# find instances of more than one unique naics per ui account
# ignore "None" and null values
<- "
qry select ui_account_root_number, count(distinct(employer_naics_code))
from ds_wi_dwd.employer_naics_xwalk
where employer_naics_code != 'None' and employer_naics_code is not null
group by ui_account_root_number
having count(distinct(employer_naics_code)) > 1
limit 5
"
dbGetQuery(con, qry)
When using the crosswalk table, we need to be careful of cases where an employer may have multiple NAICS codes, as these values cannot be prioritized with the available data. We will introduce an approach in the following section.
Lastly, we can compare lengths by converting the variable with leading zeroes to an integer using cast()
in SQL.
# ignoring ui account numbers with processing issues in WHERE clause
<- "
qry with int_conv as (
--CONVERT UI ACCOUNT TO INTEGER, IGNORE PROBLEMATIC VALUES IN WHERE CLAUSE (CONTAIN COMMAS)
select cast(ui_account_root_number as int) as num
from ds_wi_dwd.employer_naics_xwalk enx
where ui_account_root_number not like '%,%'
)
--FIND DISTRIBUTION OF UI ACCOUNT NUMBER LENGTHS IGNORING LEADING ZEROES
select length(num), count(*)
from int_conv
group by length(num)
order by length(num)
"
dbGetQuery(con, qry)
Note that there are some UI account numbers in the crosswalk with more than 6 digits, excluding non-leading zeroes. Although it would be theoretically possible to simply link on the last six digits of the UI account numbers, there may be different employers with the same last six digits, resulting in inaccurate joins. Therefore, the UI account numbers with more than six digits will not join to any of the employer-level information in the other files. That being said, the employer information in the PROMIS datA already contains NAICS codes, so we’re really just focused on joining to the employers in the wage records.
Merging Crosswalk with our Analytic Frame
Given the information we’ve learned about the various columns storing employer UI account numbers in the different tables, the recommended process for joining the information in the crosswalk table to the fact (and UI wage records too) is as follows:
- Get rid of leading zeroes in the crosswalk table by converting
ui_account_root_number
to an integer. - Convert this new variable (can be done in the same step) back to a character to align with
primary_employer_wages
. - Left join the updated crosswalk to the fact table, ensuring the preservation of all rows in the fact table.
We will demonstrate this on our analytic frame below, which can be accessed with the following SQL code:
<- "
qry select f.*
from tr_wi_2023.nb_cohort c
join tr_wi_2023.wi_mdim_person p on (c.ssn = p.ssn)
join tr_wi_2023.wi_fact_weekly_observation f on (p.person_id = f.person_id)
limit 5
"
dbGetQuery(con, qry)
Although you can read this table, along with the crosswalk one, separately into R and then join, for speed and compiling efficiency reasons, we recommend performing the join in SQL.
The most general join between the analytic frame and the crosswalk table can be achieved with the following code:
<- "
qry with xwalk_conv as (
--CONVERT UI ACCOUNT TO INTEGER, IGNORE PROBLEMATIC VALUES IN WHERE
select cast(cast(ui_account_root_number as int) as varchar) as ui_account, employer_naics_code
from ds_wi_dwd.employer_naics_xwalk enx
where ui_account_root_number not like '%,%'
)
select f.*, xwalk_conv.employer_naics_code as primary_employer_naics
from tr_wi_2023.nb_cohort c
join tr_wi_2023.wi_mdim_person p on (c.ssn = p.ssn)
join tr_wi_2023.wi_fact_weekly_observation f on (p.person_id = f.person_id)
left join xwalk_conv on (f.primary_employer_id = xwalk_conv.ui_account)
limit 5
"
dbGetQuery(con, qry)
This code will provide lots of duplication in cases where the primary employer has multiple NAICS codes, since the quarterly wage information is on the quarterly grain, whereas the UI benefit information is stored weekly. For our more specific example, built purely for pedagogical purposes, we keep the information we read into R at the quarterly grain, only pursuing employment observations in the same quarter as a corresponding UI benefit claim. This will allow us to easily display an approach for working with multiple NAICS codes corresponding to the same employer.
<- "
qry with xwalk_conv as (
--CONVERT UI ACCOUNT TO INTEGER, IGNORE PROBLEMATIC VALUES IN WHERE
select cast(cast(ui_account_root_number as int) as varchar) as ui_account, employer_naics_code
from ds_wi_dwd.employer_naics_xwalk enx
where ui_account_root_number not like '%,%'
)
--ONLY TAKE DISTINCT QUARTERLY ENTRIES FOR RELEVANT COLUMNS
select distinct f.person_id, f.last_employer_naics, f.calendar_year, f.calendar_quarter,
f.primary_employer_id, xwalk_conv.employer_naics_code as primary_employer_naics
from tr_wi_2023.nb_cohort c
join tr_wi_2023.wi_mdim_person p on (c.ssn = p.ssn)
join tr_wi_2023.wi_fact_weekly_observation f on (p.person_id = f.person_id)
left join xwalk_conv on (f.primary_employer_id = xwalk_conv.ui_account)
where f.last_employer_naics is not null and f.primary_employer_id is not null
"
<- dbGetQuery(con, qry) cohort_subset
If, for example, we wanted to identify individuals who remained in the same industry pre- and post-benefit reception, our data frame - not the process for creating the data frame - may look similar to cohort_subset
.
Reminder: This data frame does not contain post-benefit reception, but a similar dataframe could be created for tracking pre- and post-benefit reception employment.
head(cohort_subset)
With all of the information now existing in the data frame, to account for employers matching to more than one NAICS code in the industry crosswalk, we can develop an indicator that returns TRUE
when last_employer_naics
is the same as any of the primary_employer_naics
values for a given individual/quarter combination. To avoid over-counting the number of entries that satisfy the condition, we can take the distinct combination of person/year/quarter and our indicator, as shown below:
%>%
cohort_subset group_by(person_id, calendar_year, calendar_quarter) %>%
mutate(
naics_test = any(primary_employer_naics == last_employer_naics),
# account for any employers with NULL naics values in xwalk
naics_test = ifelse(is.na(naics_test), FALSE, naics_test)
%>%
) ungroup() %>%
distinct(person_id, calendar_year, calendar_quarter, naics_test) %>%
head()
There are certainly other approaches to working with multiple industry values per employer - for example, a more strict interpretation may employ the complement to any()
, all()
.