R for Categorical Data


When categorical data appear in textbooks, they are usually already summarized in tables or graphs, so you rarely need technology to do homework problems with categorical data. However, this leaves you underprepared for dealing with real data, and this page is for those who need to do that. We will use an example dataset small enough that you can do the calculations by hand and compare your results with the computer's. Imagine a survey question with answer choices Agree, Disagree, or Undecided. Suppose 25 people give these responses:

A,A,D,U,D,D,A,U,A,D,A,D,D,A,U,A,U,D,D,A,A,A,U,D,A.

Where's the Mode?

Most software will not report the mode, because the mode is rarely useful for measurements. To find it when you do need it, you have to treat the data as categorical. For categorical data, the modal category is the one with the most observations (if there is such a category). You can see by counting that there are more A's on the list above than D's or U's, so A is the modal category. This is the shortest summary for categorical data, analogous to giving just the mean or median for measurements.

When we find the modal category for a group of measurements, it is called the mode. It is useful only when the measurements resemble categorical data in having values that are repeated over and over. An example might be the number of children in a family, where you might see 0, 1, 2... over and over. For more typical measurements, such as these

1.66597, 1.91566, 2.53406, 2.88043, 2.93449, 3.08816, 1.73520, 3.21908, 3.77892, 3.98208

the mode is not useful because there is none. No value is repeated.

If you need the mode, make a frequency table for the data and find the category with the most observations.
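As a quick check of that recipe on the ten measurements listed above, the following sketch tabulates them in R. The variable names `x` and `counts` are illustrative and do not come from the original example. Every value occurs exactly once, so no category stands out and there is no mode.

```r
# The ten measurements from the example above
x <- c(1.66597, 1.91566, 2.53406, 2.88043, 2.93449,
       3.08816, 1.73520, 3.21908, 3.77892, 3.98208)

# Frequency table: one "category" per distinct value
counts <- table(x)

# Largest count in the table; a value of 1 means nothing is repeated
max(counts)

# TRUE when the number of distinct values equals the number of observations
length(counts) == length(x)
```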

Using R for Categorical Data

Run R. Use quotation marks to enter the data as text.

> survey = c("A","A","D","U","D","D","A","U","A","D","A","D","D","A","U","A","U","D","D","A","A","A","U","D","A")
> survey
[1] "A" "A" "D" "U" "D" "D" "A" "U" "A" "D" "A" "D" "D" "A" "U" "A" "U" "D" "D"
[20] "A" "A" "A" "U" "D" "A"
> table(survey)
survey
 A  D  U
11  9  5

The modal category is "A" (agree).
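For a longer table it may be easier to let R pick out the modal category than to read it off by eye. The following is a minimal sketch of one way to do that, not the only one; the names `counts` and `modal` are chosen here for illustration. It also prints each category's share of the responses so you can compare with the hand calculations 11/25, 9/25 and 5/25.

```r
survey <- c("A","A","D","U","D","D","A","U","A","D","A","D","D",
            "A","U","A","U","D","D","A","A","A","U","D","A")

counts <- table(survey)

# Name(s) of the category with the largest count.
# Comparing against max() keeps every category in the result
# if two or more tie for first place.
modal <- names(counts)[counts == max(counts)]
modal

# Share of all responses that fall in each category
counts / sum(counts)
```

If two categories tied for the largest count, `modal` would contain both names, which signals that there is no single modal category.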

Graphics have to be made from the counts in a table like the one above, not from the letters stored in the variable.

> barplot(table(survey))

[Figure: bar chart of the survey responses]
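The bare `barplot` call labels the bars with the single letters A, D and U. A sketch of a more readable version follows, assuming you want the bars sorted from most to least frequent and labeled with the full answer text. The lookup vector `full_names`, the axis label and the title are illustrative choices, not part of the original example.

```r
survey <- c("A","A","D","U","D","D","A","U","A","D","A","D","D",
            "A","U","A","U","D","D","A","A","A","U","D","A")

counts <- table(survey)

# Put the tallest bar first so the modal category is on the left
sorted_counts <- sort(counts, decreasing = TRUE)

# Hypothetical lookup from the one-letter codes to the full answer text
full_names <- c(A = "Agree", D = "Disagree", U = "Undecided")

# Draw the bars from the counts, with readable labels
barplot(sorted_counts,
        names.arg = full_names[names(sorted_counts)],
        ylab = "Number of responses",
        main = "Survey responses (n = 25)")
```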

> pie(table(survey))

[Figure: pie chart of the survey responses]

Notice that the bar chart makes it obvious that A is the modal category, while it takes sharp eyes to see this in the pie chart. The three summaries above (table, bar chart, pie chart) appear in order of decreasing statistical quality. A table gives the most information, and the most precise, in the least space; a pie chart gives the least.