Creating Your Own Dataset

Below I provide some tips and resources for creating your own (observational) dataset from scratch, or for collecting novel data to merge into an existing dataset.

Also check out “Developing Coding Rules” on this site.

Prof Powner’s Dos and Don’ts

  • DO plan your data collection needs and process before you even open a book. Failing to plan is planning to fail. SO MUCH data is out there that if you don’t have a solid plan in place about what you need to answer your research question, you will collect lots of facts, spend a lot of time developing a coding scheme for them, and still probably not have the facts you need to create data to test your claims. Lack of a plan leads to pointless, fruitless wandering through the stacks and hours of reading that get you nowhere. Always start a data collection project with a thorough and careful assessment of what data you need, then take steps to acquire it.
  • DON’T forget about the librarians!! Once you know what you need, set up a talk with them to see what they suggest. They may know about data resources you don’t, like what sites your school has a subscription to or what other kinds of cool data collections are out there on your topic. Every variable you can scrounge from another dataset, as long as its conceptualization of the variable is consistent with yours, is one less you have to collect by hand.
  • DO document EVERY source for every data point. Add a “Source” column to your data so that each row records where its values came from. That column doesn’t get copied into your final dataset for analysis, but you will need it later if you have to post replication files or justify your coding. Where possible, also keep a list of major sources you checked that didn’t have any useful information so you avoid checking them again. If you code a case as not exhibiting something, or code that something didn’t happen, this list tells you where you have already looked and where you might have missed evidence.
  • DON’T read every word. That’s an ineffective strategy for information searching. Use indices, tables of contents, headings, and other paratext to help navigate to the most potentially useful places in the source and focus your attention there.
  • DO collect data case by case, rather than variable by variable. In general, we organize books around discrete events, periods, themes, etc., which means that a given source is more likely to contain multiple variables’ values for one case than multiple cases’ values on a single variable. A few exceptions exist, such as specialized encyclopedias and other reference books that may be organized in such a way that variable-oriented collection is more effective.
  • DON’T forget about the Transpose function in your spreadsheet software, which lets you flip the rows and columns in a table. This allows you to do data entry using the “Enter” key to move down through questions for a given observation, then flip the table to the observations-as-rows orientation we need for analysis.
  • DO pretest your data collection process on one or two cases (usually an easy/best-case-for-finding-information case and a tough case) before committing to a strategy. (Remember, cases with thousands of books written on them, like the world wars, the Cultural Revolution, or many others, are NOT the easiest. There’s too much to wade through to find the right stuff!)
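
If you keep your data in plain CSV files rather than a spreadsheet, the Transpose and “Source” column tips above can be sketched in a few lines of Python. This is only an illustrative sketch, not a required workflow: the case names, variables, and sources below are invented, and the function names (transpose_entry_sheet, drop_source) are hypothetical helpers written for this example.

```python
import csv
import io

# Hypothetical data-entry sheet: one row per question, one column per case.
# All values here are invented for illustration.
entry_sheet = """question,Case A,Case B
year,1950,1962
outcome,1,0
source,"Smith 1999, p. 12","Jones 2004, ch. 3"
"""


def transpose_entry_sheet(text):
    """Flip a questions-as-rows sheet into one dict per case (observation)."""
    rows = list(csv.reader(io.StringIO(text)))
    # zip(*rows) turns each column of the entry sheet into a tuple.
    # Note: zip stops at the shortest row, so every row should be complete.
    columns = list(zip(*rows))
    # The first column holds the variable names; relabel its header as "case".
    header = ("case",) + columns[0][1:]
    return [dict(zip(header, column)) for column in columns[1:]]


def drop_source(records):
    """Return copies of the records without the source column, for analysis."""
    return [{k: v for k, v in r.items() if k != "source"} for r in records]


records = transpose_entry_sheet(entry_sheet)  # full records, sources included
analysis = drop_source(records)               # analysis copy, sources removed

for row in analysis:
    print(row)
```

The full records (with sources) are what you would keep for replication files and coding justifications. The analysis copy is what you would load into your statistics software. All values are read as text, so convert numeric variables before analysis.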

[1] This is a publicly deposited copy that is available without library or other subscription access to what is otherwise a not widely available journal.

Site contents (c) Leanne C. Powner, 2012-2026.
Background graphic: filo / DigitalVision Vectors / Getty Images.
Cover graphic: Cambridge University Press.
