Data Central Research Center
Interested in using the Current Population Survey (CPS) Computer and Internet Use Supplement in original research? Looking to calculate custom statistics from raw datasets? NTIA Data Central's Research Center is your one-stop shop for downloading datasets and sample code, and for learning best practices in analyzing CPS Supplement data.
- Download Datasets (Statistical analysis software recommended)
- Get Sample Code and Documentation: NTIA maintains a GitHub repository that features the code used to produce NTIA Internet Use Survey analyses and data products, as well as extensive information on how to get started using the full datasets.
- Questions? Contact NTIA's Data Central data team at data@ntia.gov.
New to the NTIA Internet Use Survey? We recommend getting started with the Important Notes for Researchers below.
In making raw datasets available in multiple formats and providing sample code, NTIA seeks to stimulate original research into important policy questions affecting the spread of technology and Internet access. However, it is important for users to understand the complexities of CPS Supplement public use datasets in order to produce accurate analyses.
CPS Supplement dataset files consist of one observation per person in the sample, and several hundred variables per observation. A typical dataset contains approximately 150,000 observations, including 20-30,000 non-respondents. Depending on the format in which the dataset was downloaded, observations and variables may be separated in different ways. In CSV-formatted datasets, each observation is contained on a single line, with the value of each variable separated by a comma. The fixed-format datasets originally provided by the Census Bureau dedicate specific ranges of characters on each line to particular variables; matching variables to their correct locations requires use of the record layout specifications laid out in the Census Bureau’s technical documentation.
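The difference between the two layouts can be sketched in a few lines of Python. This is only an illustration: the column positions in `LAYOUT` are made up (real positions must come from the Census Bureau's record layout), the CSV text is assumed to begin with a header row of variable names, and `PRTAGE` is assumed here to be the age variable.

```python
import csv
import io

# Hypothetical record layout: (variable name, first column, last column),
# 1-indexed and inclusive. These positions are invented for illustration;
# take the real ones from the technical documentation.
LAYOUT = [("QSTNUM", 1, 5), ("HRINTSTA", 6, 7), ("PRTAGE", 8, 9)]


def parse_fixed_line(line, layout=LAYOUT):
    """Slice one fixed-format line into a dict of integer values."""
    return {name: int(line[start - 1:end]) for name, start, end in layout}


def parse_csv_text(text):
    """Read CSV text (header row assumed) into a list of dicts of integers."""
    reader = csv.DictReader(io.StringIO(text))
    return [{name: int(value) for name, value in row.items()} for row in reader]


# The same observation expressed in each format.
fixed_record = parse_fixed_line("00042 134")
csv_records = parse_csv_text("QSTNUM,HRINTSTA,PRTAGE\n42,1,34\n")
```

In both cases the result is one dictionary per observation, keyed by variable name.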
The most common way to open and manipulate a dataset is with statistical software such as Stata, R, SAS, or SPSS. These programs are specifically designed to handle large datasets, and contain built-in functions that enable complex statistical analyses. While it may be possible to open these datasets with more general-purpose software such as spreadsheet applications, those tools are not designed for datasets of this size and may perform poorly.
The CPS is a survey of households that gathers data on the individual or individuals within households (though note that only one member of a household is generally interviewed and answers questions on behalf of every other member). This leads to a few considerations when using the resulting datasets. First, it is sometimes necessary to identify which persons recorded in a dataset live in the same household. Each observation includes one or more variables for identifying unique households within a dataset (for all but the oldest CPS Supplement datasets, the variable QSTNUM contains a number corresponding to the household to which the person belongs). Second, some variables contain data about a household as a whole (e.g., state of residence and family income), while others are specific to each person (e.g., age and educational attainment). It is important to understand which variables are recorded at the person level and which are household-level in order to produce accurate statistics. For example, when tallying a variable that indicates whether a household has a wired Internet connection in the home, a user will likely want to count each household only once despite the existence of multi-person households in the sample (and therefore, multiple observations in the dataset reporting on the existence of an Internet connection in the same household). Alternatively, tallying that variable for all persons in the dataset would yield the number of people living in a household with a wired Internet connection, rather than the number of households with one. That figure is quite possibly a useful metric, but a different one nonetheless. Fortunately, variable names generally indicate whether a particular variable is person-specific (if a variable name begins with a ‘P’), or general to the household (if it begins with an ‘H,’ or a ‘G’ when describing geographic data associated with the household).
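The wired-connection example above can be sketched as follows. The counts are unweighted to keep the focus on the person versus household distinction, and `HWIRED` is a hypothetical variable name invented for this illustration, not an actual CPS variable.

```python
# Five person records from three households. HWIRED is a made-up
# household-level indicator (1 = wired connection at home, 0 = none).
records = [
    {"QSTNUM": 1, "HWIRED": 1},
    {"QSTNUM": 1, "HWIRED": 1},
    {"QSTNUM": 1, "HWIRED": 1},
    {"QSTNUM": 2, "HWIRED": 0},
    {"QSTNUM": 3, "HWIRED": 1},
]


def count_persons(records, var):
    """Count every person record where the indicator equals 1."""
    return sum(1 for r in records if r[var] == 1)


def count_households(records, var, household_id="QSTNUM"):
    """Count each household once, using the set of distinct household IDs."""
    return len({r[household_id] for r in records if r[var] == 1})


persons_in_wired_households = count_persons(records, "HWIRED")  # 4 people
wired_households = count_households(records, "HWIRED")          # 2 households
```

Both numbers are valid, but they answer different questions, so it helps to decide which unit of analysis is intended before tallying.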
Finally, users should understand the circumstances under which some data will be missing. In general, the absence of information is indicated by a value of -1. The Census Bureau uses advanced imputation methods to fill in data missing due to respondents not knowing or refusing to provide an answer, as well as other circumstances in which responses are not recorded. Therefore, when data are marked as missing, it is because the household was not interviewed (indicated by HRINTSTA equaling a value other than 1), or because the person or household is not in the population of interest. There are a number of reasons why a person or household might not be part of the population of interest. The Census Bureau only gathers limited data on persons in a household who are under the age of 3 or who are current active-duty members of the armed forces. Additionally, some questions are only asked about individuals ages 15 or older, including employment status, educational attainment, and disability status. Beyond those issues, some data are missing for certain individuals either because the questions are inapplicable (e.g., Census does not ask Internet-using households why they do not use the Internet), or because the questions were only asked of a subset of the full sample (examples are discussed in the section that follows).
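One way to apply these rules before computing a statistic is to keep only records from interviewed households that have a recorded value for the variable of interest. The sketch below assumes the educational attainment variable is named `PEEDUCA` and uses invented values; check the technical documentation for the actual names and codes in a given dataset.

```python
MISSING = -1  # value that marks the absence of information


def usable_records(records, var):
    """Keep records from interviewed households (HRINTSTA == 1) whose
    value for `var` is not marked as missing."""
    return [r for r in records if r["HRINTSTA"] == 1 and r[var] != MISSING]


records = [
    {"HRINTSTA": 1, "PEEDUCA": 39},  # interviewed, in the population of interest
    {"HRINTSTA": 1, "PEEDUCA": -1},  # interviewed, but question not applicable
    {"HRINTSTA": 2, "PEEDUCA": -1},  # household not interviewed
]

usable = usable_records(records, "PEEDUCA")
```

Only the first record survives the filter; the other two are missing for the two different reasons described above.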
Using Datasets to Calculate Statistics
The CPS is designed to yield results that can be generalized to the civilian, noninstitutionalized population of the United States, as well as to the population of individual states and the District of Columbia. Data users can also break down results by selected demographics, including age group, educational attainment, family income, race or ethnicity, disability status, metropolitan statistical area (MSA) status, and other factors. In order to increase the generalizability of calculated estimates to the greater population, the Census Bureau includes weighting variables in its datasets. Weighting variables are based on the inverse of the probability that an individual or household would be selected as part of the CPS sample, adjusted to compensate for households in the sample that were not successfully interviewed, among other situations. Another way to think about weights is that they indicate the number of persons or households a given individual in the sample represents; adding together the values of the weighting variable used for person-level statistics from every observation would yield the estimated population of the United States at the time of the survey. Weights can be used to estimate, for example, the Hispanic population at the time of the survey, or the proportion of persons with disabilities who use the Internet at work.
The process of calculating weighted estimates is relatively straightforward. Determine whether the population of interest consists of persons or households, and select the weighting variable to use accordingly. For most CPS Supplements, person-level calculations should be tallied using the variable PWSSWGT, while HWHHWGT should be used for household-level calculations. For each relevant observation, tally the desired statistic and multiply it by the weighting value for that observation. If calculating a proportion, users should also total the weight values for all observations to use as the divisor. Most statistics programs allow users to specify a weighting variable to use in calculations, resulting in accurate estimates without manual calculation.
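For readers who want to see the arithmetic, here is a minimal sketch of a weighted total and a weighted proportion at the person level. The weight values are invented, and `USES_INTERNET` is a hypothetical indicator rather than an actual CPS variable name.

```python
def weighted_total(records, weight_var, condition=lambda r: True):
    """Sum the weights of all records that satisfy `condition`."""
    return sum(r[weight_var] for r in records if condition(r))


def weighted_proportion(records, weight_var, condition):
    """Weighted count meeting `condition`, divided by the total weight."""
    return (weighted_total(records, weight_var, condition)
            / weighted_total(records, weight_var))


# Three person records with made-up weights.
persons = [
    {"PWSSWGT": 1500.0, "USES_INTERNET": 1},
    {"PWSSWGT": 2500.0, "USES_INTERNET": 0},
    {"PWSSWGT": 1000.0, "USES_INTERNET": 1},
]


def is_user(r):
    return r["USES_INTERNET"] == 1


estimated_users = weighted_total(persons, "PWSSWGT", is_user)     # 2500.0
share_of_users = weighted_proportion(persons, "PWSSWGT", is_user)  # 0.5
```

A household-level estimate would follow the same pattern, using one record per household and HWHHWGT as the weight.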
Beginning with the July 2011 data collection, NTIA and the Census Bureau added a number of questions to the CPS Supplement, primarily concerning Internet application usage habits, that household respondents only answer on behalf of a subset of the total sample—specifically, one person per household. This was done to balance NTIA’s desire for more detailed data against the need to avoid excessively long interviews that may reduce response rates. In the July 2011 and October 2012 surveys, the primary household respondent—the person in the household who speaks to the interviewer—was asked to answer these questions on his or her own behalf. Beginning in July 2013, NTIA and Census switched to randomly selecting one member of the household to be the subject of these questions, known as the random respondent. The methodology was further refined for the July 2015 survey, so that only Internet users within a household are eligible to be selected as random respondents.
Because random respondents (and their primary respondent forerunners) are chosen as a subsample of all persons in the CPS, data collected in this fashion should be treated differently from other person-level data in analysis. It is important to exclude persons who were not included in the subsample (identified by PUELGFLG having a value of 20 in July 2013 and later, or PRHRESP equaling 1 in July 2011 or October 2012), and to use a special weighting variable, PWPRMWGT. Also note that individuals must be age 15 or older to be eligible for selection as random respondents; while the same was true for primary respondents, they were in practice unlikely to be between the ages of 15 and 24, and as a result the primary respondent weighting variable does not perform well among that age group. NTIA considers random respondent-based estimates reliable for the 15 and older population, but recommends restricting primary respondent-based estimates to ages 25 and older.
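The sketch below shows one way to restrict an analysis to the respondent subsample and apply PWPRMWGT. It rests on several assumptions: it reads the flag values given above as marking membership in the subsample, it assumes age is stored in a variable named `PRTAGE`, and `USES_EMAIL` is a made-up indicator with invented data.

```python
RESPONDENT_WEIGHT = "PWPRMWGT"


def respondent_subsample(records, design):
    """Select the subsample for respondent-only questions.

    design="random":  July 2013 and later, ages 15 and older.
    design="primary": July 2011 / October 2012, restricted to ages 25 and
                      older as NTIA recommends for these estimates.
    """
    if design == "random":
        return [r for r in records if r["PUELGFLG"] == 20 and r["PRTAGE"] >= 15]
    if design == "primary":
        return [r for r in records if r["PRHRESP"] == 1 and r["PRTAGE"] >= 25]
    raise ValueError(f"unknown design: {design!r}")


def weighted_share(records, flag_var, weight_var=RESPONDENT_WEIGHT):
    """Weighted proportion of records where `flag_var` equals 1."""
    total = sum(r[weight_var] for r in records)
    return sum(r[weight_var] for r in records if r[flag_var] == 1) / total


sample = [
    {"PUELGFLG": 20, "PRTAGE": 40, "PWPRMWGT": 3000.0, "USES_EMAIL": 1},
    {"PUELGFLG": 20, "PRTAGE": 17, "PWPRMWGT": 1000.0, "USES_EMAIL": 0},
    # Not selected; the flag value here is illustrative only.
    {"PUELGFLG": -1, "PRTAGE": 42, "PWPRMWGT": 0.0, "USES_EMAIL": -1},
]

selected = respondent_subsample(sample, "random")
email_share = weighted_share(selected, "USES_EMAIL")
```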
Finally, note that the July 2013 CPS Supplement was only administered to approximately 75 percent of households that participated in the basic portion of the CPS. Households being sampled for the first (HRMIS = 1) or fifth (HRMIS = 5) time were not asked about computer and Internet use, and should therefore be excluded from any analyses of the July 2013 data. Furthermore, this situation required the Census Bureau to create special weights for use with July 2013 CPS Supplement data, labeled PWSUPWGT (for person-level analysis) and HWSUPWGT (for household-level analysis). NTIA does not foresee repeating this practice in the future; the July 2015 survey reverted to using the full CPS sample.
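For July 2013 data, the exclusion and the special weights might be combined along these lines. The records and weight values are invented for illustration.

```python
# Special weights created for the July 2013 CPS Supplement.
JULY_2013_WEIGHTS = {"person": "PWSUPWGT", "household": "HWSUPWGT"}


def july_2013_supplement_sample(records):
    """Drop records from households with HRMIS equal to 1 or 5, which were
    not asked about computer and Internet use in July 2013."""
    return [r for r in records if r["HRMIS"] not in (1, 5)]


records = [
    {"HRMIS": 1, "PWSUPWGT": 0.0},
    {"HRMIS": 2, "PWSUPWGT": 2000.0},
    {"HRMIS": 5, "PWSUPWGT": 0.0},
    {"HRMIS": 8, "PWSUPWGT": 3000.0},
]

eligible = july_2013_supplement_sample(records)
estimated_persons = sum(r[JULY_2013_WEIGHTS["person"]] for r in eligible)
```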
Accurately Estimating Variance
Because the CPS uses a complex survey design rather than a simple random sample, any estimates of variance must account for how the sample was selected. The best way to achieve this with the CPS is to use replicate weights supplied by the Census Bureau. Replicate weights for use with person- and household-level data are available for CPS Supplement datasets beginning in July 2011, and are available for use with primary/random respondent data beginning in October 2012. These replicate weights are created by the Census Bureau using a technique known as successive difference replication (SDR). To estimate variance, repeat the calculation of interest using each replicate weight in place of the weighting variable used to find the point estimate, sum the squared differences between each replicate estimate and the point estimate, and multiply the result by 4 divided by the number of replicate weights (160 in the case of the CPS). This can be a computationally intensive process; fortunately, several statistical packages can be set to use this method of variance estimation without the need for manual calculation. For more information about variance estimation with CPS data, see Chapter 14 of the Census Bureau’s Technical Paper 66.
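The replicate-weight calculation described above can be sketched in Python as follows. To keep the example short it uses only 4 made-up replicate weights per person rather than the 160 supplied with the CPS, and the data and names are invented; in practice, a statistical package's built-in SDR support is likely the safer choice.

```python
import math


def weighted_share(values, weights):
    """Weighted proportion of 0/1 values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def sdr_variance(point_estimate, replicate_estimates):
    """Sum of squared differences between each replicate estimate and the
    point estimate, multiplied by 4 over the number of replicates."""
    k = len(replicate_estimates)
    return 4.0 / k * sum((rep - point_estimate) ** 2
                         for rep in replicate_estimates)


# Illustrative data: three persons, one full-sample weight each, and four
# sets of replicate weights (the CPS provides 160).
uses_internet = [1, 0, 1]
full_weights = [1500.0, 2500.0, 1000.0]
replicate_weights = [
    [1600.0, 2400.0, 1000.0],
    [1400.0, 2600.0, 1000.0],
    [1500.0, 2500.0, 1000.0],
    [1500.0, 2400.0, 1100.0],
]

point = weighted_share(uses_internet, full_weights)
replicates = [weighted_share(uses_internet, w) for w in replicate_weights]
variance = sdr_variance(point, replicates)
standard_error = math.sqrt(variance)
```

The same two functions apply to any statistic: compute it once with the full-sample weight, once per replicate weight, and pass the results to `sdr_variance`.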
Variance estimation is more challenging when using older CPS Supplement datasets. Due in part to concerns about protecting the identity of survey respondents, public use CPS datasets do not include variables that could be used to correct for survey design when estimating variance. While NTIA does not have a formal recommendation for variance estimation with older datasets, we use a technique developed by Davern et al. (2006) for person-level analyses, in which we create a synthetic strata variable, use a household identifier as a clustering variable, and estimate variance using the Taylor series linearization technique. For household-level variance estimations, and for primary respondent variance estimations in July 2011, we use the default “robust” method available in the Stata statistical software.