
Working with Survey & Experimental Data

Survey data is generally fairly easy to work with once you realize what you’re looking at. Most professional-grade surveys contain a lot of extraneous information (at least, information that’s extraneous for most of our purposes): the context of the interview, the gender of the interviewer, and other things like that. Each of these characteristics is recorded as a variable in the dataset. Most surveys also use a stratified sample, meaning that subgroups in the sample are drawn in planned proportions, usually to reflect their prevalence in the population. Sometimes these proportions deliberately don’t match the population, especially if the researcher wants to do additional analysis on one of the subgroups: Asian-Americans, for example, might be represented in the sample at twice their population prevalence. This information is recorded as survey weights (another group of variables), which capture the relationship between each sample subgroup and the larger population. A typical professional-grade survey can easily contain upward of 1,000 variables, and some of the major ones (NES, GSS, C(C)ES) can top 2,000.
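The idea behind survey weights can be sketched in a few lines. This is a hypothetical example in Python (not drawn from any real survey): imagine one subgroup was oversampled at twice its population share, so each of its respondents gets a weight below 1, while the undersampled group gets a weight above 1.

```python
# Hypothetical data: 1 = respondent supports some policy, 0 = does not.
# The first four respondents belong to an oversampled subgroup (weight 0.5);
# the last four belong to an undersampled subgroup (weight 1.5).
support = [1, 1, 0, 1, 0, 0, 1, 0]
weight  = [0.5, 0.5, 0.5, 0.5, 1.5, 1.5, 1.5, 1.5]

# Unweighted estimate treats every respondent equally.
unweighted = sum(support) / len(support)            # 0.5

# Weighted estimate down-weights the oversampled group so the
# result reflects the population rather than the sample design.
weighted = sum(s * w for s, w in zip(support, weight)) / sum(weight)  # 0.375
```

The gap between the two numbers is exactly why the codebook’s weighting instructions matter: analyzing a deliberately stratified sample without its weights gives you an estimate of the sample, not the population.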

As these examples suggest, the single most important thing you can do when working with survey data is to download and read through the codebook before you even open the data. It will tell you things like the weights of the various subgroups in the data set, and where you can find that information in the data file. The codebook is also where you’ll find the variables you’re looking for. If possible, make a list of the variables you’re going to use and export just those variables into a new file. (Do this via a *.do file or script in case you need to go back and generate a new copy with additional variables.) If you can’t make a reduced file conveniently, just be sure you have a clear list of the variable names that you’re going to want to use; rename those variables if necessary to be something other than v0223 and the like so that you can find them more easily.
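The text above recommends doing the extraction via a do-file or script so it can be rerun; here is a minimal sketch of the same workflow in Python, with made-up variable names (`v0223`, `v0410`) standing in for a real survey’s opaque codes.

```python
import csv

# Hypothetical raw survey rows: opaque names, most variables unneeded.
raw = [
    {"v0223": 3, "v0410": 1, "v0999": 7},
    {"v0223": 1, "v0410": 0, "v0999": 2},
]

# Our variable list from the codebook, mapped to readable names.
keep = {"v0223": "education", "v0410": "vote_choice"}

# Keep only the listed variables and rename them.
reduced = [{new: row[old] for old, new in keep.items()} for row in raw]

# Export the reduced file so the full download never has to be reopened.
with open("reduced_survey.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(keep.values()))
    writer.writeheader()
    writer.writerows(reduced)
```

Because the variable list lives in the script, adding a forgotten variable later means editing one line and rerunning, rather than redoing the extraction by hand.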

For undergraduate research, you can then treat survey data like any other type of data. Determine the level of measurement of your DV, then use this to choose the appropriate analytical tool (regression, difference in means, etc.). People who take surveys are referred to as “respondents,” so your results discussion might contain sentences like, “Respondents evaluated the Democratic candidate as more likeable at a rate of 3 to 1,” or “Controlling for age and gender, a one-unit increase in education makes respondents 22% more likely to support human rights policies.”

If you conducted your own survey, you may have a couple of other steps of data preparation to do before you can analyze. Categorical (nominal) variables like race and religion will have to be turned into a series of dummy variables. You can do this easily in all three major statistics programs; ask your instructor to show you this shortcut. Likewise, depending on what you are doing with them, you may need to dummy out most ordinal variables; talk with your professor to find out whether this is necessary for your project. Always write a codebook and keep it up to date as you generate more variables.
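The dummying-out step can be seen in a few lines of Python (the statistics packages have one-command shortcuts for this; the hypothetical data here just shows what those commands produce). Each category of the nominal variable becomes its own 0/1 variable:

```python
# Hypothetical nominal variable for four respondents.
religion = ["protestant", "catholic", "none", "catholic"]

# One dummy (0/1) variable per category.
categories = sorted(set(religion))
dummies = {cat: [1 if r == cat else 0 for r in religion] for cat in categories}
# dummies["catholic"]   -> [0, 1, 0, 1]
# dummies["none"]       -> [0, 0, 1, 0]
# dummies["protestant"] -> [1, 0, 0, 0]
```

Remember that in a regression you enter all but one of these dummies; the omitted category serves as the reference group that the others are compared against.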

Experimental data is only slightly more complex. The key thing to identify in looking at experimental data is the set of variables that identify the treatment the respondent or subject (the terms are interchangeable in experimental research) received. Again, this information will be found in the codebook, which is no less essential for analyzing an experiment than for analyzing a survey. If you conducted your own experiment, for the love of all that is chocolate, WRITE YOUR OWN CODEBOOK. Don’t skip this step. Even if it’s a minimalist version that’s little more than a list of variable names and what they mean or how they were generated, it will be an absolutely indispensable document as you go along doing your analysis.
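A minimalist codebook of the kind described above really can be a plain text file generated from your analysis script. All the variable names below are hypothetical, purely to show the shape of the thing:

```python
# Minimal codebook: variable name -> what it means / how it was generated.
codebook = {
    "treat_pos": "1 = saw 'beneficial effects' frame, 0 = otherwise",
    "support":   "DV: support for the policy, 1-7 scale",
    "pid7":      "party ID, 1 = strong Democrat ... 7 = strong Republican",
    "educ_ba":   "generated from raw education item: 1 = college degree",
}

# Keep it next to the data as a plain text file.
with open("codebook.txt", "w") as f:
    for name, meaning in codebook.items():
        f.write(f"{name}: {meaning}\n")
```

Rerunning this after each new generated variable keeps the codebook current with essentially no extra effort.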

Depending on what your experimental design and research question were, you have a couple more choices in analysis. If you had only one treatment and a continuous DV, the usual method of analysis would be ANOVA (analysis of variance), but a difference of means test would also suffice. If you have more than one treatment, things get more complicated. First, you cannot use simple ANOVA; you would have to use MANOVA (multivariate analysis of variance), which is more complex to interpret and not usually taught in an intro poli sci stats class. Second, you could use differences in means (t-tests), but those only work in a pairwise manner; you would need to test each treatment against the control separately. Third, you can use a multivariate approach where each treatment is a dummy variable predicting your DV. Which multivariate tool you use – OLS, probit/logit, etc. – depends on the level of measurement of your DV. The advantage of this approach is that it allows you to control for other variables that may plausibly affect your DV. Randomization into treatment categories ensures that there is no systematic bias in who gets which treatment, but it can’t necessarily purge all the effects of other variables, especially ones that you think may interact with your treatment. Democrats who get randomly assigned to the ‘affirmative action has socially beneficial effects’ treatment are going to produce higher scores than Republicans, so we may want to control for respondent ideology in this case.
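The third option, treatment dummies predicting the DV, can be sketched with simulated data (everything below is made up for illustration; real analysis would use your statistics package’s regression command). With a control group and two treatments, OLS on two dummies recovers each treatment’s effect relative to the control:

```python
import numpy as np

# Simulated experiment: 0 = control, 1 = treatment A, 2 = treatment B.
rng = np.random.default_rng(0)
n = 300
group = rng.integers(0, 3, n)
treat_a = (group == 1).astype(float)
treat_b = (group == 2).astype(float)

# True (simulated) effects: control mean 5.0, A adds 1.5, B adds 3.0.
y = 5.0 + 1.5 * treat_a + 3.0 * treat_b + rng.normal(0, 1, n)

# OLS with an intercept plus one dummy per treatment.
X = np.column_stack([np.ones(n), treat_a, treat_b])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
# coef[0] ~ control-group mean; coef[1] and coef[2] ~ each treatment's
# effect relative to control -- the same comparisons the pairwise
# t-tests would make, but estimated in a single model.
```

Adding further columns to `X` (party ID, age, and so on) is what “controlling for other variables” means in practice, which is the advantage the paragraph above describes.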

Site contents (c) Leanne C. Powner, 2012-2026.