Foundations Module: Unit 1
1 Introduction
Welcome to Unit 1 of the Foundations module! This workbook walks you through the process of data discovery and provides SQL code examples for exploratory tasks to help you understand the coverage, availability, and key features of the data.
As a reminder, this unit is the first of four in our Foundations of Data and Technical Literacy module. During this module, we are going to work through each step of carrying out a miniature research project to help you gain familiarity with the full suite of technical and data literacy tools you will need for the rest of the class.
In doing so, we will recreate some key components of the Workforce Overview Report for Kentucky Regions (WORKR) built and maintained by the Kentucky Center for Statistics. The dashboard relies on the publicly available Local Area Unemployment Statistics (LAUS) data from the U.S. Bureau of Labor Statistics (BLS), which contains information on employment and unemployment rates calculated by BLS. This dashboard presents and compares unemployment rates for each of Kentucky’s four Workforce Planning Regions. Since the LAUS data is available nationally, we can adapt aspects of the WORKR product to the state of your choosing.
If you have not done so already, please watch the Unit 1 introduction videos available on the class website under the Foundations Module page. Otherwise, let’s begin!
Throughout these notebooks, we will store important notes in callout blocks like this.
Optional technical checkpoints will be stored in blocks like this.
2 Technical setup
In this section, we will walk you through the technical steps to configure a coding file and connect to the data. If you are not interested in doing this coding work yourself, no worries - you are welcome to skip ahead to the next section, Data Discovery.
For the first half of this module, we will be using SQL to access and manipulate our data. SQL is a language designed for a specific purpose: to interact with relational databases. We have defined some key terms below:
- Database: A database is a structured collection of data. Think of an Excel sheet with multiple tabs.
- Relational Database Management System (RDBMS): In an RDBMS, data records are stored in tables. Each table has a predefined set of columns, the pieces of information captured for each record in a table, and rows, which store the values pertaining to each column in the table.
- Query: A query is a request for data from the database. In this module, our queries will be written in a specific type of SQL called “PostgreSQL.” We can store our queries in .sql files.
2.1 Create a new SQL script
A SQL script contains a set of SQL commands that you can store and easily rerun on the same database. You can think of it as a SQL recipe that you can return to and continue modifying throughout your analysis.
To create a new .sql script:
Open DBeaver, located on the ADRF Desktop. The icon looks like this:
Establish your connection to the database by logging in. To do this, double-click Redshift11_projects on the left-hand side, and it will ask for your username and password. Your username is adrf\ followed by the name of your U: drive folder - for example, adrf\John.Doe.T00112. Your password is the same as the second password you used to log in to the ADRF - the one sent to you by Coleridge's tech team.
After you successfully establish your connection, you should see a green check next to the database name, like so:
In the top menu bar, click SQL Editor then New SQL Script:
To test if your connection is working, try pasting the following chunk of code into your script:
SELECT * FROM ds_public_1.laus LIMIT 5
Then run it by clicking the run button next to the script, or by pressing CTRL + Enter:
You should then be able to see the query output in the box below the code.
If you have any issues completing this setup, please reach out to any member of the Coleridge staff for assistance.
2.2 Writing SQL code
We will present and explain several examples of SQL code in this notebook, introducing some foundational SQL commands and their syntax. We will include the results of these commands in this notebook, but if you are looking to practice your SQL skills, we recommend either copying and pasting or rewriting these code chunks in a separate SQL script, and then running them to verify that you obtain the same output displayed in the file.
Additionally, we encourage you to modify these commands to further explore any aspects of the data that interest you.
3 Data Discovery
At its core, data discovery consists of answering two questions:
- What does the data look like?
- What is in our data?
We will devote the rest of the notebook to answering these two questions, through looking at the data, reading the data dictionary, and evaluating broad trends. We will start this process with a description of the data from the BLS website (https://www.bls.gov/lau/):
“The Local Area Unemployment Statistics (LAUS) program produces monthly and annual employment, unemployment, and labor force data for Census regions and divisions, States, counties, metropolitan areas, and many cities, by place of residence.”
As we work through the data discovery phase, you should also begin to understand how the data can be used to populate the outputs presented in the WORKR dashboard.
Broadly, before we start looking at the data, let’s cover the difference between categorical and continuous variables:
- Categorical data will take on a limited set of values that represent different categories. In the LAUS data, area_text and areatype_text represent categorical variables.
- Continuous data are numeric and can take on any value in a given range. In the LAUS data, unemployment_rate and labor_force are examples of continuous variables.
A general rule of thumb is if you can perform mathematical operations on a variable and it makes sense to do so, then it is usually continuous. For instance, we might calculate the average unemployment rate using the unemployment_rate variable. But, if we had a variable for ZIP code, then we might not derive much value from finding the average ZIP code.
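To make this concrete, the short query below computes that average directly. This is just a quick sketch using the AVG aggregation function on the unemployment_rate column in ds_public_1.laus; we will walk through this kind of aggregation properly later in the notebook.
SELECT AVG(unemployment_rate) -- averaging only makes sense for a continuous variable
FROM ds_public_1.laus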
3.1 What does our data look like?
In other words, we want to make sure we understand the structure of our dataset. The easiest way to do so is, of course, just to look at the data. We can preview ten rows of the data like so:
The SQL code to preview the data and an explanation is available below. If you are interested in improving your coding skills, you should work to understand these code snippets and replicate their output in your own SQL file. Otherwise, you can skim through the SQL explanations and focus on the rest of the text and output.
SELECT * -- the "*" means get all variables
FROM ds_public_1.laus -- this is the name of the LAUS dataset
LIMIT 10 -- we only want 10 observations
We see that there are 15 variables in the LAUS dataset. The data dictionary (Right-click on link to open) on the P: drive provides more information about these variables.
The data dictionary descriptions, while helpful, do not provide the entire context for these variables. We also need a clear definition of what each observation - or row - in the LAUS data represents - that is, what person/place/thing/point in time the data in that row describes. Only when we know this will we be able to carry out an informed analysis of our data.
To find out what each row represents, we need to understand which variable or combination of variables uniquely define each row of the dataset. That is, we want the smallest set of variables such that if you kept only that set of variables in the dataset, and counted all unique rows, it would return the same number of rows as the total dataset.
In that case, we would know that those variables (and only those variables) are essential in defining that observation, or data point. If we try any smaller set of variables, then we will have multiple rows that look identical - not helping us uniquely identify each observation.
To know what we are aiming for, we can find the total number of rows in the LAUS data. The aggregation fundamental data operation can help us with this task. Remember, aggregation operations combine all the observations in our full dataset into a single number using a formula or operation. The easiest of these operations is COUNT, which aggregates our full dataset into a single number representing the total number of rows. Let’s apply this to the LAUS data:
SELECT COUNT(*) -- now we add the COUNT aggregation operator
FROM ds_public_1.laus

count
1 1461462
That’s a lot of data - almost a million and a half observations. It’s a good thing we didn’t try to look at all of that at once!
With the total row count available, let’s think about the three categories of variables that might appear in our dataset to uniquely define a row.
3.1.1 Unit of observation
The first category to consider is variables that describe the unit of observation. The unit of observation refers to the type of entity or object about which data is collected. This could be a person, an organization, a state, or any other entity that is the focus of data collection.
Based on our initial data description provided by the BLS website, we know that our unit of observation is a geographical area - the unit for which employment, unemployment, and labor force data is reported. In addition to the dataset description, we can leverage the data dictionary to see that the unit of observation is likely stored in just one column - area_text, which contains the “name of the geographic data to which the data refer.”
To verify this completely, we can look at the set of values area_text takes on - in other words, the distinct values of the area_text variable. This is a special case of our filtering data operation, one that keeps only unique combinations of the variable or variables we select from our dataset. It is very useful for exploring the types of values a categorical variable assumes in a dataset:
SELECT DISTINCT area_text -- DISTINCT returns only unique values of area_text
FROM ds_public_1.laus
That’s a lot of different areas, with representation at the state, county, parish, and census area levels!
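If you would like to see those geography levels summarized directly, an optional follow-up is to list the distinct values of areatype_text as well - a small sketch reusing the same DISTINCT operation (we will return to this column later in the notebook):
SELECT DISTINCT areatype_text -- the level of geography each area represents
FROM ds_public_1.laus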
Importantly, we need to note that the unit of observation alone does not necessarily define a row for our dataset. While we might guess that each row of the LAUS data defines a single geographical area, and therefore that area_text alone defines a row, we need to test this to know for sure. We can do so by counting only the distinct values of area_text to find the total number of unique values area_text takes on:
SELECT COUNT(DISTINCT area_text) -- returns the number of unique values of area_text
FROM ds_public_1.laus
We see that there are far fewer unique area_text values than total rows in the table. This indicates that area_text alone does not define a row of our data - some geographical areas must appear in the data more than once. Why might there be multiple observations for the same unit of observation? The most common reason is the one we will cover next: time.
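As an optional check, you can also verify this repetition directly for a single area. The sketch below counts the rows for one state - Arkansas is used purely as an example here, and the WHERE clause is a filtering operation we will introduce more formally later in the notebook:
SELECT COUNT(*) -- number of rows for a single geographical area
FROM ds_public_1.laus
WHERE area_text = 'Arkansas'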
3.1.2 Period of observation
As we just saw, a given dataset might have multiple rows for each unit of observation. There are many reasons why this could be, but the most common one is that data about each unit of observation is observed at multiple points in time.
This introduces a separate category of variables in our dataset: those that describe the period of observation, or the frequency and time interval over which data is collected about each unit of observation. This might also be called the time dimension of the data.
Before we go on, note that not every dataset will have variables representing the period of observation. For example, consider a dataset containing a list of all U.S. states - each state should only appear in the dataset once. But, if any specific unit is observed at more than one point in time, then your data does have a time dimension, and you need to figure out what that dimension looks like!
Three variables in the LAUS data seem to specifically denote time elements - year, period_name, and seasonal_text. If you look back at our first query previewing 10 rows of the table, you might recall that all rows had a year value of 1976. But how can we know the full range of time covered by our data?
Well, first, let’s use another aggregation function, MIN, to find the earliest (or minimum) year of our data:
SELECT MIN(year)
FROM ds_public_1.laus

min
1 1976
Another aggregation function - MAX - tells us the most recent year of our data:
SELECT MAX(year)
FROM ds_public_1.laus

max
1 2023
Great - the LAUS data is available for 1976 at the earliest and 2023 at the latest. But we don’t know (yet) anything about the years in between - for all we know, we have data for 1976 and 2023, and no other year. This sounds like a job for our distinct operation! Here, we will also use our ordering fundamental data operation to sort our list by year. Remember that this operation allows us to see our data in a more logical order than we might otherwise see.
SELECT DISTINCT year
FROM ds_public_1.laus
ORDER BY year -- we also add ORDER BY to present our output in order
Flipping through this list, we can confirm every year from 1976 to 2023 is covered by our data. We can test to see if a row is defined by the unique combination of area_text and year, again using our distinct and aggregation operations.
SELECT COUNT(DISTINCT area_text || year) -- || combines values of variables
FROM ds_public_1.laus
We are getting closer! Let’s forge ahead with the two other time variables we mentioned earlier that might help us define a row: period_name and seasonal_text.
According to the data dictionary, period_name represents the “full name of period to which the data value refers.” This doesn’t tell us a whole lot, but you might recall from our first query that period_name contained the names of different months of the year.
To confirm that this is the pattern for the entire LAUS dataset, let’s explore the distinct values of period_name:
SELECT DISTINCT period_name
FROM ds_public_1.laus
Aha! As we might have guessed, we see that period_name seems to represent the months of the year - this means the measures in the data must be reported by month.
This is just what we hoped to see - in the WORKR dashboard, unemployment rates for each region were displayed by month. If we only had data by year, we would really be in trouble!
Before we move on, note the one value of period_name that doesn’t fit in: “Annual Average.” We can easily infer that this represents (unsurprisingly) the average for the year listed in year, rather than the value for a single month.
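If you are curious how common this value is, an optional check is to count those rows directly - this sketch assumes the value is stored exactly as 'Annual Average', as it appeared in the distinct list above:
SELECT COUNT(*) -- rows reporting annual averages rather than a single month
FROM ds_public_1.laus
WHERE period_name = 'Annual Average'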
Let’s try our luck again on defining a row, this time adding in period_name:
SELECT COUNT(DISTINCT area_text || year || period_name)
FROM ds_public_1.laus
Even closer! At this point, it seems like our data is reported monthly (plus annual averages) at both the state and county level. To check this inference, let’s use our filtering fundamental data operation to look at the monthly data for Arkansas in 2023. For the sake of clarity, let’s also order our results by month:
SELECT *
FROM ds_public_1.laus
WHERE area_text = 'Arkansas' AND year = 2023
ORDER BY period
Oddly enough, we see that there are two lines reported for each month! It appears as though this is where our third time-based variable, seasonal_text, contributes to the row definition in the LAUS table. Let’s take a look at this column:
We cannot meaningfully use numeric aggregation functions, such as MIN or MAX, on seasonal_text because it is a categorical, not continuous, variable.
SELECT DISTINCT seasonal_text
FROM ds_public_1.laus
We see “Seasonally Adjusted” and “Not Seasonally Adjusted” as possible values. According to the BLS website:
Seasonal adjustment is a statistical technique that attempts to measure and remove the influences of predictable seasonal patterns to reveal how employment and unemployment change from month to month.
Over the course of a year, the size of the labor force and levels of employment and unemployment undergo fluctuations due to seasonal events including changes in weather, harvests, major holidays, and school schedules. Because these seasonal events follow a more or less regular pattern each year, their influence on statistical trends can be eliminated by seasonally adjusting the statistics from month to month. These seasonal adjustments make it easier to observe the cyclical, underlying trend, and other non-seasonal movements in the series.
Perhaps that is the final missing piece of our data puzzle: our employment-related statistics are reported both directly and adjusted for the normal seasonal patterns of our data. By bringing in areatype_text, we can also see that the adjusted numbers are only available statewide; county-level data are available only as unadjusted statistics:
SELECT DISTINCT areatype_text, seasonal_text
FROM ds_public_1.laus
Let’s now count the number of unique combinations of all of our unit and period of observation variables:
SELECT COUNT(DISTINCT area_text || year || period_name || seasonal_text)
FROM ds_public_1.laus
We are almost there! Perhaps if we turn to our final category of variables we will find the complete set of variables that define a row.
3.1.3 Attributes
The remaining variables in the data represent the information about our geographical area-month-year observations. There are several different types of attributes we can see in our data dictionary:
- Geographical attributes: areatype_text and area_type_code describe the geographical area of focus, with area_type_code representing the shorthand version of areatype_text.
- Statistical attributes: unemployment_rate, unemployment, employment, labor_force, employment_population_ratio, labor_force_participation_rate, and civilian_noninstitutional_population, which describe the employment, unemployment, and labor force information BLS collects.
- Miscellaneous attributes: period and seasonal_code, which are the shorthand notations corresponding to period_name and seasonal_text, respectively.
Since our statistical attributes are reported for each row and thus cannot define a row, and our miscellaneous attributes directly reference variables representing our unit and period of observation, we have one option left for an additional variable to define a row: our remaining geographical attribute. Let’s confirm this:
SELECT COUNT(DISTINCT area_text || year || period_name || seasonal_text || areatype_text)
FROM ds_public_1.laus
With some data digging, you will find that the District of Columbia is reported at both areatype_text levels - “Statewide” and “Counties and equivalents.”
Fill in the following SQL command to verify the two values of areatype_text associated with D.C.:
SELECT DISTINCT ____
FROM ds_public_1.laus
WHERE ____
Why do you think D.C. is in this situation? Does our combination of variables that uniquely defines each row make sense to you otherwise? What have you learned about the overall structure of the LAUS data?
3.2 What is in our data?
Now that we finally understand the structure of our dataset, we can start investigating its contents more thoroughly in a process known as exploratory data analysis (EDA). Under the umbrella of data discovery, EDA is the process of beginning to develop simple descriptive statistics that can help identify interesting questions for later analysis, as well as irregularities to address in data cleaning.
Given our eventual goal to recreate parts of the WORKR dashboard, our primary variable of interest for EDA will be unemployment_rate. In the next few sections, we’ll delve into its values and identify any caveats we may need to take into account.
3.2.1 Aggregation at large
A good way to start getting an idea of what is going on with this data is to use some more aggregation functions to explore the basic statistical properties of our data and its subsets. We have already performed a few types of aggregation - counting the number of rows, taking the minimum and the maximum of the year variable - and adding the average to these will help us get a good handle on whether the unemployment_rate variable takes on the type of values we might expect.
SELECT COUNT(*) as count_all,
COUNT(unemployment_rate) as count_unemployment,
MIN(unemployment_rate),
MAX(unemployment_rate),
AVG(unemployment_rate)
FROM ds_public_1.laus
Let’s take a look at the output above. The first column, count_all, is something we’ve seen before - the count of all rows in the dataset. The second column, count_unemployment, is the count of all non-missing values of the unemployment_rate variable in our dataset.
Here we already see something surprising - count_unemployment is less than count_all! We’ll come back to this momentarily, but for now note that this type of checking for missing values is a very important part of EDA and data discovery.
The last three columns show the minimum, maximum, and average of the unemployment rate. We see that there is a very large range, from 0 to 58.7, but the average is about what we’d expect: 6.1.
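As an optional cross-check, we can also count the missing values directly rather than inferring them from the difference between the two counts. This sketch assumes the missing unemployment rates are stored as SQL NULL values, which is exactly what COUNT(unemployment_rate) skips over:
SELECT COUNT(*) -- rows where unemployment_rate is missing
FROM ds_public_1.laus
WHERE unemployment_rate IS NULL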
3.2.2 Aggregation by group
The analysis above is an important first step for our exploration of the unemployment_rate variable, but these overall descriptive statistics are pretty limited in value. This is especially true for our LAUS dataset, where we have already seen that we are combining both statewide and county-level data, as well as adjusted and unadjusted statistics.
We can add some more robustness to our EDA process by evaluating the same descriptive statistics, this time broken down by states and counties. The simplest way is just to filter the data down and then use an aggregation. For example, we can look at our descriptive statistics for just the county-level data:
SELECT COUNT(*) as count_all,
COUNT(unemployment_rate) as count_unemployment,
MIN(unemployment_rate),
MAX(unemployment_rate),
AVG(unemployment_rate)
FROM ds_public_1.laus
WHERE areatype_text = 'Counties and equivalents'
The picture for counties looks pretty similar to the overall picture - there is a wide range of data, a mostly reasonable average, and a few missing observations of unemployment_rate. We could run a few more queries to look at the state level, but this can quickly get pretty tedious, especially if we want to see these types of statistics for many different groups (state, year, etc.).
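For reference, here is what that same query might look like for the statewide data - a sketch assuming the statewide rows are labeled 'Statewide' in areatype_text, as we saw earlier:
SELECT COUNT(*) as count_all,
COUNT(unemployment_rate) as count_unemployment,
MIN(unemployment_rate),
MAX(unemployment_rate),
AVG(unemployment_rate)
FROM ds_public_1.laus
WHERE areatype_text = 'Statewide' -- filter to the statewide observations only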
Instead, we can use the grouping data operation to first decide on a grouping structure for our dataset and then use our aggregation operations to get these descriptive statistics for each group. To see this in action, let’s first group by areatype_text and seasonal_code to see how the summary statistics change for statewide vs county-level geographies and adjusted vs unadjusted statistics:
SELECT areatype_text, seasonal_code,
COUNT(*) as count_all,
COUNT(unemployment_rate) as count_unemployment,
MIN(unemployment_rate),
MAX(unemployment_rate),
AVG(unemployment_rate)
FROM ds_public_1.laus
GROUP BY areatype_text, seasonal_code -- this is our grouping operation!
ORDER BY areatype_text
Now we are getting somewhere! Our grouped output has three rows - the first describes the unadjusted unemployment rates for the county-level geographies, and the second two describe both the adjusted and unadjusted rates for the statewide observations. Perhaps unsurprisingly, we see that both the number of observations and the range of values is smaller for the statewide data than the county-level data. We also see that there are fewer seasonally adjusted statewide observations than there are unadjusted. This is a very powerful data operation to add to our toolkit. For example, what if we want to explore our data over time? We can group by year to do this:
SELECT year,
COUNT(*) as count_all,
COUNT(unemployment_rate) as count_unemployment,
MIN(unemployment_rate),
MAX(unemployment_rate),
AVG(unemployment_rate)
FROM ds_public_1.laus
GROUP BY year -- now we group by year!
ORDER BY year
This is very interesting too! First, note that up until 1990, we consistently have 1300 observations per year. In 1990, the number of observations then jumps to over 40,000, and the range expands greatly.
This seems like it might indicate that we only had state data before that year, but how can we confirm that? Well, once again, grouping can help us out! We don’t really want to print out all the statistics for every geography group for every year, though - we really just want the number of unique values for year for each geography group. Luckily, counting unique observations works just as well with grouping as counting all observations:
SELECT areatype_text,
COUNT(DISTINCT(year)) as count_year -- adding distinct here
FROM ds_public_1.laus
GROUP BY areatype_text
ORDER BY areatype_text
Just as we expected, the statewide observations exist for 48 unique years (how many years is 1976 to 2023?) and the county-level observations for 34 unique years (how many years is 1990 to 2023?).
Write a SQL query to find the maximum and minimum years observed in each state. Do we have data on all states for all years? What other types of aggregation are interesting to you?
3.2.3 Missing data
In one of our earlier commands, we saw a discrepancy between the count of total observations and the count of observations for unemployment_rate. This seems to imply that some observations have missing values for unemployment_rate, which is definitely something we want to note for our analysis. But how can we find out just how many missing values there are, and for what types of observation?
Well, note that we already know how to calculate the number of missing values - it is the count of all observations minus the count of non-missing unemployment_rate observations. One solution (of many) is to use our variable creation data operation to add a new variable to some of our aggregated tables to indicate the number of missing values. This type of table is very useful to help summarize areas where there may be potential data quality issues.
Let’s do this first for the overall dataset:
SELECT COUNT(*) AS count_total,
COUNT(unemployment_rate) AS count_unemployment,
count_total - count_unemployment AS count_missing
FROM ds_public_1.laus
Here we see that there are 323 total observations with missing unemployment_rate values. That isn’t bad for over 40 years of data, but it still is important to know more about which observations are missing. For example, if they are all from the same period of time, or all from the same geography, then that could drastically impact our analysis.
Let’s check that. We can do so by adding a grouping operation to our prior query to count the number of missing by year and geography. We will also filter our output to only include rows with at least one missing observation. This output will give us a better idea of where our missing data is concentrated:
SELECT year,
area_text,
COUNT(*) as count_total,
COUNT(unemployment_rate) as count_unemployment,
count_total - count_unemployment as count_missing
FROM ds_public_1.laus
GROUP BY year, area_text
HAVING count_missing > 0 -- NOTE: we have to use HAVING for aggregates!
ORDER BY year
From this we can see that we are missing data for some parishes (county equivalents) in Louisiana during 2005 and 2006, and for many counties in Puerto Rico during 2020. All of this data is older than we will need for our WORKR dashboard recreation, so there’s no need to worry for now, but it’s good to note this in case it could impact other analyses we might want to run.
Why do you think the data is missing for these geographies and years? How could this affect another analysis, or a more far-reaching version of our WORKR dashboard? What solutions might you propose for handling this missing data if it could impact our analysis?
4 Conclusion
By working through this notebook, you should feel a bit more comfortable working with the LAUS data and have a contextual understanding of some of the variables within the dataset. We also hope that you have begun to develop an appreciation for the art of data discovery.
All of our exploration has resulted from playing with the data and trying to uncover interesting aspects that might inform our analysis moving forward. This type of exploration is not a regimented set of steps, but instead a cyclical question and answer process between you and the data you are working on, where new results raise subsequent questions, which in turn lead to new results again.
The questions we have already answered today should prepare you as we move forward in our project toward recreating parts of the WORKR dashboard. In the next unit, we will work on whittling down the full LAUS dataset to the analytic frame we need for the WORKR-inspired analysis. We will learn how to identify the data we need for analysis and construct a dataset that can meet these needs.
What do the results we’ve seen so far tell us about the LAUS data? What remaining questions do you have? Feel free to try any other queries you think would be of value given the overarching goal of recreating parts of the WORKR dashboard, or that you feel would deepen your understanding of the LAUS data in general.
5 Citations
Foundations Module (LODES Data) - Unit 1 notebook (citation to be added)
Local Area Unemployment Statistics Overview: U.S. Bureau of Labor Statistics. (2023, March 17). https://www.bls.gov/lau/
Wisconsin Applied Data Analytics Training Program - Exploratory Data Analysis notebook (citation to be added)