Most statistical software expects data in standard database format. Each row represents an object studied. Each column represents a variable for which we have data on the objects. This page elaborates on why that format is so important. While some software allows some exceptions, standard format almost always works, and failing to use it can easily lead to lost data that can destroy your research. This is one of those issues that is not mentioned in most textbooks but becomes extremely important if you need to work with data in the real world. This importance is best illustrated by examples of common problems.
When data are printed on a page they are often arranged in a table to save space. The rows and columns may have no logical meaning. For example, here are 25 attempts to measure the same quantity. We have one variable and 25 cases. One of our instructors received this in an Excel spreadsheet laid out as follows:
| 23.9 | 24 | 23.9 | 23.9 | 23.9 |
| 24 | 23.9 | 24 | 23.8 | 23.9 |
| 23.9 | 23.9 | 23.9 | 24 | 23.8 |
| 24 | 23.8 | 23.9 | 23.9 | 23.9 |
| 23.9 | 23.8 | 23.9 | 23.9 | 24 |
This layout saves paper, but most statistical software would interpret this as five cases and five variables. If you asked for a boxplot, you would get five (one for each column = variable) rather than one. (For graphing calculator users, the effect is the same as if your students unthinkingly entered the data into five lists instead of one.)
This problem arises most frequently when data come from a book, web page or spreadsheet that has laid out the data in a way that saves space rather than respects the logical structure of the data. For example, the book A Handbook of Small Datasets included a wonderful collection of mostly real datasets for use in the classroom. Although the data was available electronically, the files appeared to have been cut and pasted from the text. Most required considerable rearrangement before they could be analyzed with a standard stats. package.
The standard data format means that some information is implicit in the format. If we break the format, we lose data. Consider the first two columns of the table and a new story line. Now we have measurements of the same thing by Tom (Column 1) and Mary (Column 2) on each of the five days of the week (rows). We want to compare Tom and Mary. We summarize the data for Tom with a frequency table
x f 23.9 3 24.0 2
and do the same for Mary.
x f 23.8 2 23.9 2 24.0 1
Mary's numbers look a little lower. Let's get an average for each. The TI-83 graphing calculator lets us enter the summary tables (into two lists per person) instead of the raw data. That will be fine for getting a mean for each operator, but what if we next want to look at a correlation between Tom's measurements and Mary's? We cannot do that because we have lost the information about which measurement was done on which day.
Suppose you are given only the two frequency tables above. There is only a single 24 in the data for Mary. Which value for Tom goes with this 24?
You can see from the original table that it goes with a 23.9 for Tom (Observation 1) but there is no way to tell this from the frequency tables because the ordered pairs have been broken.
This is one of those shortcuts that seems to work when doing textbook problems, where we usually look at one variable (or do one procedure) at a time, but does not work in real life where we almost always have multiple variables and want to analyze our data in multiple ways. If you have the original data in standard format, you can reenter it correctly. For correlation, you would enter it as it appeared in the first two columns of the original table.
| 23.9 | 24 |
| 24 | 23.9 |
| 23.9 | 23.9 |
| 24 | 23.8 |
| 23.9 | 23.8 |
For two-sample-t you would enter it as
| 23.9 | Tom |
| 24 | Tom |
| 23.9 | Tom |
| 24 | Tom |
| 23.9 | Tom |
| 24 | Mary |
| 23.9 | Mary |
| 23.9 | Mary |
| 23.8 | Mary |
| 23.8 | Mary |
Most software will allow you to flip back and forth between these two layouts but of course if information has been lost no statistical program can recover it.
Enter your own data in standard form. (
If you are a teacher, encourage your students to do the same.) If you have
to work with data entered by others, be prepared to rearrange it before
analyzing it. The TI-83 family of graphing calculators are especially bad
in often requiring non-standard format. The TI-89 family do better.
Probably the most notorious offender is spreadsheet software such as Excel
that allows the entry of data in any whimsical format. Although many
statistical software packages can read Excel files, they usually can make
sense of what they read only if it is in standard format.
©2006-2007 Robert W. Hayden