Feasibility and Reliability of Automated Coding of Occupation in the Health and Retirement Study

Authors

Abstract

Advances in computing power and the growing coverage of longitudinal datasets in the Health and Retirement Study (HRS) that provide detailed occupation information have increased demand among researchers for improved occupation and industry data. The detailed data are currently hard to use because they were coded at different times, so the codeframes are not consistent over time. In addition, the HRS gathers new occupation and industry information from respondents every two years, and coding the new data at each wave is costly and time-consuming. In this project, we tested whether the NIOSH Industry and Occupation Computerized Coding System (NIOCCS) could improve processes for coding HRS data, comparing NIOCCS results against those of a human coder on multiple datasets. NIOCCS does reasonably well compared with a highly trained, professional occupation and industry coder, with kappa inter-rater reliability on detailed codes of just under 70 percent and agreement rates on broader codes of around 80 percent. However, NIOCCS code rates for the datasets tested ranged from 60 percent to 72 percent, whereas the professional coder was able to code 95 percent to 100 percent of those same datasets. In its current form, NIOCCS might be best used to reduce the number of cases human coders must code, either in coding historical data to a consistent codeframe or in coding data from future HRS waves. It is not yet ready to fully replace human coders.

Key Findings

  • The NIOSH Industry and Occupation Computerized Coding System (NIOCCS) works well only when its inputs are short descriptions, one to three words each, of the job title or job description and of “what a business does or makes.”
  • NIOCCS does reasonably well compared with a highly trained, professional occupation and industry coder, with kappa inter-rater reliability on detailed codes of just under 70 percent and agreement rates on broader codes of around 80 percent.
  • The main weakness of NIOCCS appears to be its failure to produce codes in many cases. NIOCCS code rates for the datasets tested ranged from 60 to 72 percent, whereas the professional coder was able to code 95 to 100 percent of those same datasets.
  • NIOCCS may be a useful tool for reducing the human coder hours needed to code industry and occupation data for the Health and Retirement Study and for other studies and datasets. In its current form, it would be most useful as
    • a way to reduce the number of cases a human coder must code, or
    • a way to reduce the amount of time a human coder must spend on each case, or
    • a first cut for coding historical data that don’t crosswalk cleanly to a newer codeframe.
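
The two measures quoted in the findings above, code rate and kappa inter-rater reliability, can be illustrated with a minimal Python sketch. It is not the authors' procedure. The function names, the use of None to mark an uncoded case, the choice to compute Cohen's kappa only over cases that both coders coded, and the sample codes are all assumptions made for illustration.

```python
from collections import Counter
from typing import Optional, Sequence

Code = Optional[str]  # None marks a case the coder could not code (assumption)


def code_rate(codes: Sequence[Code]) -> float:
    """Share of cases that received any code at all."""
    if not codes:
        raise ValueError("no cases supplied")
    return sum(c is not None for c in codes) / len(codes)


def cohen_kappa(rater_a: Sequence[Code], rater_b: Sequence[Code]) -> float:
    """Cohen's kappa over the cases that both raters coded (assumption)."""
    pairs = [(a, b) for a, b in zip(rater_a, rater_b)
             if a is not None and b is not None]
    n = len(pairs)
    if n == 0:
        raise ValueError("no cases coded by both raters")
    # Observed agreement: share of jointly coded cases with identical codes.
    observed = sum(a == b for a, b in pairs) / n
    # Chance agreement: from each rater's marginal code frequencies.
    freq_a = Counter(a for a, _ in pairs)
    freq_b = Counter(b for _, b in pairs)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical code for every case
    return (observed - expected) / (1.0 - expected)


# Invented example codes, not HRS data.
automated = ["3255", None, "4700", "4700", None, "2310"]
human = ["3255", "4220", "4700", "4760", "2310", "2310"]

automated_rate = code_rate(automated)
human_rate = code_rate(human)
kappa = cohen_kappa(automated, human)
print(f"automated code rate: {automated_rate:.2f}")
print(f"human code rate: {human_rate:.2f}")
print(f"kappa on jointly coded cases: {kappa:.2f}")
```

Restricting kappa to jointly coded cases keeps the two measures separate: the code rate captures how often a coder produces any code, and kappa captures how well the codes agree when both coders produce one.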

Citation

Helppie-McFall, Brooke and Amanda Sonnega. 2018. “Feasibility and Reliability of Automated Coding of Occupation in the Health and Retirement Study.” Ann Arbor MI: University of Michigan Retirement Research Center (MRRC) Working Paper, WP 2018-392. https://mrdrc.isr.umich.edu/publications/papers/pdf/wp392.pdf

Project

Paper ID

WP 2018-392

Publication Type

Working Paper

Publication Year

2018