Developing Coding Rules

Coding is the process by which information is converted into data. It involves systematically summarizing descriptive information into a single numeric or qualitative value – a code – that is usable in whatever form of analysis you have chosen. Developing coding rules, then, is the process by which we establish a scheme for classifying diverse bits of information and values into a manageable number of categories, types, or values. The variables that result from coding processes are almost always at the ordinal or nominal level of measurement because of this goal of reducing the number of values. We might convert a continuous interval-ratio variable into an ordinal high-medium-low variable, or we might select a discrete set of characteristics to look for in a decision-making process and code them as present or absent rather than trying to catalog all the different dimensions of that process.

Developing coding rules requires some familiarity with the range of values that your data do (or can plausibly) take. This familiarity allows you to establish a preliminary set of codes. The set of code options for each variable must be both exhaustive and mutually exclusive. Exhaustive means that each case has to have a value for this variable, so you must have a category available that fits every single case. Mutually exclusive means that no case can fit more than one category. This is particularly important for quantitative analysis, which requires variables to take one and only one value for each observation.
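One way to test a preliminary set of codes is to run some plausible values through it and see which ones fit no category or more than one. The sketch below is only an illustration: the income bands, the function names, and the test values are all hypothetical, and the bands are deliberately flawed so that both problems show up.

```python
def matching_categories(value, categories):
    """Return the labels of every category whose inclusive range contains value."""
    return [
        label
        for label, (low, high) in categories.items()
        if low <= value <= high
    ]


def check_coding_scheme(values, categories):
    """Return (uncovered, overlapping) values for a proposed set of categories."""
    # Uncovered values mean the scheme is not exhaustive.
    uncovered = [v for v in values if not matching_categories(v, categories)]
    # Values matching several categories mean it is not mutually exclusive.
    overlapping = [
        v for v in values if len(matching_categories(v, categories)) > 1
    ]
    return uncovered, overlapping


# Hypothetical income bands in thousands of dollars (inclusive bounds).
# "low" and "middle" share 40, and nothing covers 99 < x < 100.
bands = {"low": (0, 40), "middle": (40, 99), "high": (100, 10_000)}

uncovered, overlapping = check_coding_scheme([25, 40, 99.5, 150], bands)
print("No category:", uncovered)
print("More than one category:", overlapping)
```

Any value that lands in either list tells you the category boundaries need another look before you start coding cases.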

Making Ordinal and Nominal Variables from Continuous Data

This is a common form of coding, especially when using a quantitative variable in qualitative analysis. It also allows summarizing of continuous quantitative data when the level of detail provided by the continuous form is unnecessary, as in the case of annual income, or suspect, as in estimates of crime victims.

The key step in coding nominal and ordinal categories from continuous data is to determine the theoretically relevant cut points in the continuous data. This will help you determine the number of categories in your final variable. If you are interested in determining whether a family is “middle class” based on its income, perhaps the key cut point is $100,000. “Six-figure” incomes have a psychological value, so the difference between $98,000 and $102,000 matters far more than the difference between $52,000 and $56,000, even though the two gaps are numerically the same. Likewise, categories of age often differentiate between 18-22, 23-35, 36-49, 50-64, and 65+. These categories span different numbers of years, but they capture key lifespan milestones for most individuals: college, starting a career & family, midcareer, nearing retirement, and retired. Drawing conclusions about individuals in these groups makes more sense for many questions than arbitrary divisions by decades. Other variables take much more general classifications: high, medium, and low; present or absent; low income, middle-low income, middle-high income, high income.
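As a rough illustration of turning cut points into codes, here is a minimal sketch for the age groups. It makes two assumptions that you should adjust for your own project: the top two bands are treated as 50-64 and 65+ so that no age falls into two categories, and an “under 18” category is added so that every age has a code. The names code_age, AGE_LOWER_BOUNDS, and AGE_LABELS are hypothetical.

```python
from bisect import bisect_right

# Lower bound of each band, in ascending order.
AGE_LOWER_BOUNDS = [18, 23, 36, 50, 65]
# One more label than bounds: the first label covers everything below 18.
AGE_LABELS = ["under 18", "18-22", "23-35", "36-49", "50-64", "65+"]


def code_age(age):
    """Return the life-stage category for an age in years."""
    if age < 0:
        raise ValueError("age cannot be negative")
    # bisect_right counts how many lower bounds are <= age,
    # which is the index of the matching label.
    return AGE_LABELS[bisect_right(AGE_LOWER_BOUNDS, age)]


ages = [17, 18, 22, 23, 49, 50, 64, 65, 80]
coded = [code_age(a) for a in ages]
for age, label in zip(ages, coded):
    print(age, label)
```

Keeping the cut points in one list makes them easy to report in your paper and easy to change if your theory calls for different boundaries.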

To stress again, the number of categories you need is a theoretically derived number based on what matters for your particular variable and theory. What works for your theory may not work for someone else’s. Once you have this number, you then need to decide how to partition your data. Common approaches include above/below the mean or median, standard deviations, and division into percentiles (e.g., the 25th, 50th, and 75th). I recommend using the mean if your data do not exhibit any significant skew or outliers (check your pre-analysis plots!); the median is typically better when they do. With qualitative data that can be arranged into a relatively continuous scale, you may want to consider the median or modal (most common) value as a cut point.
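To see why the choice between mean and median matters, consider this small sketch. The income figures are invented for illustration, and split_high_low is a hypothetical helper, not a standard function.

```python
from statistics import mean, median


def split_high_low(values, cut):
    """Code each value as 'high' if it is above the cut point, else 'low'."""
    return ["high" if v > cut else "low" for v in values]


# Hypothetical household incomes in thousands of dollars, with one outlier.
incomes = [32, 41, 45, 52, 58, 63, 950]

mean_cut = mean(incomes)      # pulled far upward by the 950 outlier
median_cut = median(incomes)  # the middle value of the sorted list

by_mean = split_high_low(incomes, mean_cut)
by_median = split_high_low(incomes, median_cut)

print("high by mean:", by_mean.count("high"))
print("high by median:", by_median.count("high"))
```

With these made-up numbers, the mean split puts only the outlier in the “high” group, while the median split divides the cases much more evenly. Note that a value exactly at the cut point is coded “low” here; that is a choice you would need to state in your coding rules.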

Other times, variables may exhibit theoretically or empirically critical values which serve as natural cut points. The POLITY data use a score of 7 as the minimal score to qualify as a ‘democracy’ because obtaining this value requires a score of at least 1 on all of the criteria for democracy. The presence of any children in a household typically has a much bigger effect than the number of children. Sometimes the data exhibit a significant break or jump in the frequency of observed cases, so that this break point in the data serves as an appropriate cut point. Many other variables have characteristics like this, and/or commonly used cutoffs. Consult other research using this variable to see what it does, and go from there if that is theoretically appropriate for you. (And if it’s not, be sure that you explain why in your discussion of the variable.)

As always, you should disclose any transformations you make to published data, including recoding into nominal or ordinal values, in your paper. Always describe the cut points, and if necessary, give a theoretical explanation of that choice. (“I code Family Size as large if the family has five or more children (national mean = 4.7), and small if the family has one to four children; the omitted category is families with no children.”) Sometimes this fits in the methods section; other times, it works better in the discussion of variables and data sources. When conducting quantitative analysis using variables recoded into nominal and ordinal form, remember that they must enter the analysis as a series of dummy variables. You may have a single variable in the data set coded as high (3), medium (2), low (1), and absent (0), but these need to go into your model as three dummy variables and an omitted category. Otherwise your model will attempt to interpret the values so that “high” is three times as much as “low,” even though that may be a meaningless comparison in the context of your particular variable. We typically omit the lowest category (representing absent, none, low, or similar) when using these in quantitative models.
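The expansion from one ordinal variable to a set of dummies might look something like the following. This is a sketch under the 0-3 coding described above, with “absent” as the omitted category; LEVELS, OMITTED, and to_dummies are hypothetical names.

```python
# Coding from the example: 0 = absent, 1 = low, 2 = medium, 3 = high.
LEVELS = {0: "absent", 1: "low", 2: "medium", 3: "high"}
OMITTED = 0  # reference category, left out of the model


def to_dummies(code):
    """Return one 0-1 dummy per non-omitted level for a single case."""
    if code not in LEVELS:
        raise ValueError(f"unknown code: {code}")
    return {
        LEVELS[level]: int(code == level)
        for level in LEVELS
        if level != OMITTED
    }


cases = [3, 0, 2, 1]  # hypothetical coded cases
dummy_rows = [to_dummies(c) for c in cases]
for code, row in zip(cases, dummy_rows):
    print(code, row)
```

A case in the omitted category is coded 0 on all three dummies, so each coefficient is interpreted relative to “absent.”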

Extracting Categories from Qualitative Information

The process for coding from qualitative information is similar but much more extensive. Once you have determined the concepts of interest in your theory, you need to identify observable indicators of them, a step you probably completed at the hypothesis stage. Your next step is to identify possible values of the observable indicators. For some, this will be present or absent; some will require you to make an assessment of high, medium, or low. Others will require categories of discrete characteristics or components. For example, did a ballot proposition campaign involve a legislative vote, petition signing, formation of official political action committees to fund advocacy, involvement of outside actors in advocacy, rallies, television advertising, etc.? For a variable such as this (“components of proposition campaign”), a case can possibly have – and almost certainly has – multiple elements. Before beginning data collection, you should list as many of these as you can and assign each a code. In this type of context, a case can have more than one code, because each actually represents a discrete component. In the early stages of data collection, be sure to note any other theoretically relevant variable values that you identify that are not part of your original list, and if these recur, assign them a new code. (See the next section on “other” categories.) Be sure to update your codebook and to review all previously coded cases to see whether any of them should receive the new code. I cannot stress this enough: Make thorough notes on each case. You will need to review them at various points as you identify patterns across other cases and need to go back and recode.

For variables classified as high, medium, or low, or similar ordinal scales, you should establish clear criteria for each value. Remember that transparency in data collection and replicability of analysis – including coding – are key values of scholarly research. Be as explicit about these as possible in your research notes and paper; if space limitations prevent extensive discussion, put them in an appendix (hard copy papers) or an online appendix (for published papers). Again, use your research notebook to record all unusual cases, difficult decisions, and any other quirks that you feel you might need to readdress later. Remember that you can ignore notes that you have, but you cannot consult notes you didn’t take. The first few cases you code will feel like they take forever, but as you develop familiarity with the codebook and with the range of observed values, you will find that you require fewer notes and can code with more confidence and fewer reviews.

Depending on your particular research design needs, you may need to compile more detailed systematic information about some variables. For example, we might need to know which legislative house(s) voted, and by what margin. We might need to know how many signatures were required and obtained, and across how many districts. Use your codebook to ensure that you capture all this information.

If you intend to conduct qualitative analysis, then your coding process can probably stop after this point. If you are conducting quantitative analysis, you will need to enter your data into electronic form (if you are using a hard copy codebook) and/or transform it into an analyzable format. We cannot use categorical data that are simply numbered sequentially, because the model would treat those numbers as quantities. Each category must be its own nominal variable, a dummy variable indicating whether that characteristic or element is present. Conventionally, we code absent or low as 0 and present as 1. Any other values would need to be transformed into 0-1 dummies for analysis, so save yourself the time and just get used to thinking as much as possible in 0-1 terms now.
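For a variable where a case can carry several codes, such as the components of a proposition campaign, the conversion to 0-1 dummies could be sketched as follows. The short codes, the two campaigns, and the function name are all hypothetical; your own codebook will define the real list.

```python
# Hypothetical codebook for "components of proposition campaign".
CODEBOOK = {
    "LEG": "legislative vote",
    "PET": "petition signing",
    "PAC": "political action committee formed",
    "OUT": "outside actors involved in advocacy",
    "RAL": "rallies",
    "TV": "television advertising",
}


def components_to_dummies(codes):
    """Return a 0-1 dummy for every codebook entry, given one case's codes."""
    unknown = set(codes) - set(CODEBOOK)
    if unknown:
        # A code missing from the codebook signals that the codebook
        # needs updating before this case can be entered.
        raise ValueError(f"codes not in codebook: {sorted(unknown)}")
    return {code: int(code in codes) for code in CODEBOOK}


# Hypothetical cases; each can have more than one component.
campaigns = {
    "Prop A": {"PET", "PAC", "TV"},
    "Prop B": {"LEG"},
}
rows = {name: components_to_dummies(codes) for name, codes in campaigns.items()}
for name, row in rows.items():
    print(name, row)
```

Raising an error on an unknown code is a design choice: it forces you to add the new value to the codebook rather than letting it slip into the data unrecorded.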

In the case of multiple dummies capturing different characteristics or categories, entering all the dummies in the model may be possible if a sufficient number of cases do not take any of the established values (i.e., are coded as 0 on all the dummies, meaning they were part of an “other” category as described below), and/or are coded as 1 on more than one of the dummies. [Why do we have to omit a category? See The Assumptions of Regression] You can try a model specification including all the types, but be prepared to omit one category if the model crashes or can’t be estimated as specified because of collinearity.
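Before estimating anything, you can check for the most common version of this problem: if every case is coded 1 on exactly one dummy, the dummies add up to 1 for every case and duplicate the model’s constant term, so one category has to go. The sketch below checks only that one situation, not every possible source of collinearity, and the function name and data are hypothetical.

```python
def dummies_sum_to_one(rows):
    """Return True if every case is coded 1 on exactly one dummy.

    When this holds, the full set of dummies is perfectly collinear
    with the constant term and one category must be omitted.
    """
    return all(sum(row) == 1 for row in rows)


# Each inner list is one case; each position is one 0-1 dummy.
exclusive = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
# Second case fits none of the categories; third case fits two.
with_other = [[1, 0, 0], [0, 0, 0], [0, 1, 1]]

print(dummies_sum_to_one(exclusive))
print(dummies_sum_to_one(with_other))
```

Even when this check passes, a handful of all-zero or multiply coded cases may not be enough for stable estimates, so treat it as a first screen only.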

“Other” Categories

The category of “other” is a popular but risky catch-all bin for cases that do not fit into any of the common categories. The potential danger is that too many different things will get dropped into that category for it to make sense. Analysis – quantitative especially – assumes that the cases within each value are homogeneous, meaning that they have effectively the same form or value of the underlying characteristic. If the category of “other” has a lot of cases exhibiting very diverse values, then any coefficients or conclusions drawn about that category will make little sense. If you absolutely must use an “other” category, you need to make sure that you record the actual value of the variable in your data collection process. You should then periodically review the observed values to see if any other categories (codes or variable values) need to be created, and go back and recode any formerly “other” cases as needed. My general rule is that if a value appears in more than 5-10% of cases (depending on the data set size), it needs its own code rather than being lumped in with “other.” “Other” should be reserved for truly unusual or dissimilar cases. In most cases, the category of “other” serves as the omitted or reference category when the resultant variable enters a data set as a series of dummies (i.e., the variable captures a series of nominal values). This allows you to describe your results in terms of the systematic categories captured by the other values.
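If you have recorded the actual value for each “other” case, the periodic review can be partly automated. Here is a minimal sketch that assumes a 5% threshold measured against all cases in the data set; the function name, the threshold default, and the recorded values are hypothetical.

```python
from collections import Counter


def values_needing_own_code(other_values, n_cases, threshold=0.05):
    """Return recorded 'other' values whose share of all cases exceeds threshold."""
    counts = Counter(other_values)
    return sorted(
        value for value, n in counts.items() if n / n_cases > threshold
    )


# Hypothetical data set of 40 cases, 5 of which were coded "other".
n_cases = 40
other_values = ["recount", "recount", "recount", "lawsuit", "boycott"]

flagged = values_needing_own_code(other_values, n_cases)
print(flagged)
```

Anything the function returns is a candidate for its own code, at which point you would update the codebook and recode the affected cases.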
