  1. Record linkage also goes by the names entity disambiguation, entity resolution, co-reference resolution, matching, and data fusion. In each case, records that are linked or co-referent are taken to correspond to the same underlying entity. The variety of names reflects a vast literature spanning social science, statistics, computer science, and information science.

  2. If you have examples from your own research using the methods we describe in this chapter, please submit a link to the paper (and/or code) here: https://textbook.coleridgeinitiative.org/submitexamples

  3. “Administrative data” typically refers to data generated by the administration of a government program, as distinct from deliberate survey collection.

  4. This topic is discussed in more detail in Chapter Data Quality and Inference Errors.

  5. This topic (quality of data, preprocessing issues) is discussed in more detail in Chapter Introduction.

  6. This topic is discussed in more detail in Chapter Data Quality and Inference Errors.

  7. This topic is discussed in more detail in Chapter Machine Learning.

  8. See Chapter Privacy and Confidentiality.

  9. This topic is discussed in more detail in Chapter Privacy and Confidentiality.

  10. See https://workbooks.coleridgeinitiative.org.