Interpreting Two-Way Tables (Minitab Version)

This document will give examples of various questions you might ask of bivariate qualitative data and how to answer those questions. We will talk about how to answer the questions with formulae or with simple calculations you can do with a $1 calculator, as well as how you can get software to do all the work for you. All the work except thinking, that is!-) In textbook problems you are usually given a table. In practice you are usually given a data file like the PULSE data file (with 92 cases and 8 variables) and you use software to get the table.

If you are not already familiar with the PULSE data, you should read a description of the data. You can find the entire dataset at our site as a plain text file or as an Excel spreadsheet. You can find the data in Minitab format here. (Normally, clicking on this link will result in the file being downloaded and opened, after a pause, in Minitab.) Once it opens you should work through the examples below. Some of these will use the command line because it is simple and self-documenting. (If it's not working, pull down the Editor (not Edit) menu and enable it.*) You can get the same results from the menu system by selecting Stat > Tables and mousing around.

We will start with a table about the relationship between activity level and whether people ran or not. Recall that whether to run or not was determined by the flip of a coin.

MTB > table c8 c3

Rows: ActivityL   Columns: Ran?

       no  yes  All

0       1    0    1
1       6    3    9
2      36   25   61
3      14    7   21
All    57   35   92

From the totals ("All") at the bottom of the table we can see that only 35 out of 92 actually ran. That's only about 38%, quite a bit less than the 50% we would expect from a fair toss. We might wonder, for example, if the folks with high activity levels might be willing to run while those with low activity levels may have been unwilling and hence faked the results of their coin toss. Before we proceed to try to answer that question and others, there are two minor issues we have to get out of the way.

The table above shows one person gave an activity level of "0" even though the permissible levels were 1, 2 and 3. What should we do about this? One solution might be to change that to a "1", figuring that they meant to choose the lowest level available. On the other hand, they may have been just fooling around. (These were college students;-) Another approach might be to eliminate this person from the data set. However, looking at the remainder of the data on this person

PuBefore PuAfter Ran? Smokes? Sex Height Weight ActivityL
48 54 no yes male 68.00 150 0

we do not see any other ridiculous answers such as a weight of 5 pounds or a height of 700 inches, although the low pulse rates would be more typical of someone who exercises a lot than of someone who does not exercise at all. As a compromise, we will eliminate the 0 for ActivityL and keep the rest of the data. (This is a judgement call and someone else might call it differently.) To do that in Minitab, just find the 0 in the ActivityL column and change it to an asterisk "*". This is Minitab's "missing value" code. A common mistake is to put in a "0" for any missing value, but that is not a good idea. It is very unlikely that the student's true pulse is 0. The fact that we do not know the noon temperature in Miami, Florida for a particular July day does not mean that it was 0! Putting in a "0" would pull down the average temperature for the month and cause all kinds of other problems, so most statistical software has a missing value code. (Excel does not, and is inconsistent about how it handles missing values, though generally it handles them poorly.)
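To see how much damage a "0" can do as a stand-in for a missing value, here is a minimal sketch in Python. Python is not part of the Minitab workflow described here, the five temperatures are invented purely for illustration, and `None` is assumed to play the role of Minitab's "*".

```python
from statistics import mean

# Hypothetical noon temperatures (degrees F) for five July days.
# These numbers are invented for illustration; one reading was lost.
temps_with_zero = [90, 92, 0, 91, 89]        # lost reading wrongly coded as 0
temps_with_missing = [90, 92, None, 91, 89]  # lost reading coded as missing

# Wrong: the 0 is treated as a real reading and pulls the average down.
mean_zero = mean(temps_with_zero)

# Better: leave the missing reading out before averaging.
observed = [t for t in temps_with_missing if t is not None]
mean_missing = mean(observed)

print(f"average with 0 as a stand-in:  {mean_zero:.1f}")
print(f"average ignoring missing day:  {mean_missing:.1f}")
```

The second average uses four readings rather than five, which is what happened to our table when the grand total dropped by one.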

Once you fix the data set, the table does not show the individual for whom we have no real data on ActivityL.

MTB > table c8 c3

Rows: ActivityL   Columns: Ran?

       no  yes  All

1       6    3    9
2      36   25   61
3      14    7   21
All    56   35   91

Cell Contents:      Count

Note that the row for ActivityL=0 is gone and the grand total number of observations is now 91 rather than 92.

What to do about outliers and other problem observations is an important statistical issue. Whole books have been written about it.

The second issue we have to deal with is a very trivial one that we discuss only because it often leads to confusion for beginners. Our table could have been presented as

MTB > table c3 c8

Rows: Ran?   Columns: ActivityL

       1   2   3  All

no     6  36  14   56
yes    3  25   7   35
All    9  61  21   91

Cell Contents:      Count

Note that rows and columns are interchanged (and that you make this happen by changing the order in which you enter the columns after the table command). There is no right (or wrong) way to do this! Often the choice is determined by non-statistical issues like fitting the table on a page or overhead. The only reason it is worth mentioning is to warn you not to memorize any rules for working with tables that include the words "row" or "column", since the same information could be in either a row or a column, depending on how the table is laid out.
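If you ever need to flip a table of counts outside Minitab, the interchange is a one-liner in most languages. This is a small Python sketch, not anything Minitab does internally; the list-of-lists layout is just one assumed way to hold the counts.

```python
# Counts from the cleaned table.
# Rows: ActivityL = 1, 2, 3.  Columns: Ran? = no, yes.
counts = [
    [6, 3],
    [36, 25],
    [14, 7],
]

# zip(*counts) regroups the entries column by column,
# so the old columns become the new rows.
transposed = [list(col) for col in zip(*counts)]

# Rows are now Ran? = no, yes; columns are ActivityL = 1, 2, 3.
print(transposed)

# The layout changed but the information did not.
grand_total = sum(sum(row) for row in counts)
grand_total_transposed = sum(sum(row) for row in transposed)
print(grand_total, grand_total_transposed)
```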

Now let's get to work on interpreting the table. We can think in terms of proportions, such as 35 out of 91 ran. We can also think of this as a probability. If we pick a person at random from this group, the probability that they ran is 35 out of 91. In some cases, this is all we want to know. In other cases, this might be an estimate of some other probability or proportion -- perhaps we have a sample value and want to look at a larger population. In what follows, we will talk mainly in terms of probabilities. We will also try to match up what we can get from the table with some of the terminology and notation used for probabilities.

For our table, the probability that a person selected at random ran is 35/91 = 38.46%. The probability that they had an activity level of 3 is 21 out of 91 or 23.08%. Simple probabilities come from the "All" rows and columns. The probability that a person did not run can be found as 56/91 or by the complement rule as 1-(the probability that they did run) or 1-(35/91). Both approaches should give 61.54%. (In theoretical work and in doing arithmetic we usually use the proportion 0.6154 but when interpreting results most people prefer percentages.) The probability that someone had an activity level of 1 or 2 is (9+61)/91=70/91=76.9%. You would get the same answer if you used the rule for adding probabilities: (9/91)+(61/91). Or, you could figure this with the complement rule. The complement of an activity level of 1 or 2 is an activity level of 3, so we would compute 1-(21/91) which again gives 76.9%. It is important to recognize how probability rules play out in tables, because this sort of data is almost always presented in tables!
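The arithmetic in the paragraph above can be checked with a few lines of Python in place of the $1 calculator. This is only a sketch: the dictionary names are our own, and exact fractions are used so that the complement rule and the addition rule can be compared without rounding.

```python
from fractions import Fraction

# Marginal ("All") counts from the cleaned table.
row_totals = {1: 9, 2: 61, 3: 21}        # ActivityL
col_totals = {"no": 56, "yes": 35}       # Ran?
n = 91                                   # grand total

# Simple probabilities come straight from the "All" row and column.
p_ran = Fraction(col_totals["yes"], n)
p_level3 = Fraction(row_totals[3], n)

# Complement rule: P(did not run) = 1 - P(ran).
p_not_ran = 1 - p_ran

# Addition rule for disjoint categories: P(level 1 or 2) = P(1) + P(2).
p_1_or_2 = Fraction(row_totals[1], n) + Fraction(row_totals[2], n)

# The same thing by the complement rule: 1 - P(level 3).
p_1_or_2_by_complement = 1 - p_level3

print(f"P(ran)          = {float(p_ran):.2%}")
print(f"P(did not run)  = {float(p_not_ran):.2%}")
print(f"P(level 1 or 2) = {float(p_1_or_2):.2%}")
```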

There is also a rule for probabilities with "and", but it works only for independent events. These are important in theory but rare in practice;-) In practice we have to count. From the table, we can see that there were 7 people who ran AND had an activity level of 3. Hence the correct probability for this is 7/91=7.7%. The independent event formula would give (21/91)*(35/91)=8.9% -- close but not real close. Probabilities with "and" are generally found with a total percents table.

MTB > table c8 c3;
SUBC> totpercents.

Rows: ActivityL   Columns: Ran?

          no    yes     All

1       6.59   3.30    9.89
2      39.56  27.47   67.03
3      15.38   7.69   23.08
All    61.54  38.46  100.00

Cell Contents:      % of Total

Here you can see that 7.69% of these folks ran AND had an activity level of 3. The "All" sections contain simple probabilities such as the probability that a randomly chosen person ran (38.46%) or the probability that they had an activity level of 2 (67.03%). In general, the basic table of counts gives us the numbers we need to compute various probabilities with our calculator. The other types of tables provide answers to various questions in a way that does not require a calculator. For example, you can just read off this new table three of the probabilities we computed earlier: the probability that someone ran is 38.46% and that they did not run 61.54%, while the probability that they had an activity level of 3 is 23.08%.
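For readers who want to see where the total percents come from, here is a rough Python sketch that rebuilds them from the counts and compares the counted "and" probability with what the independent-events formula would give. The dictionary layout and variable names are our own choices, not anything produced by Minitab.

```python
# Cell counts from the cleaned table, keyed by (ActivityL, Ran?).
counts = {
    (1, "no"): 6,  (1, "yes"): 3,
    (2, "no"): 36, (2, "yes"): 25,
    (3, "no"): 14, (3, "yes"): 7,
}
n = sum(counts.values())  # grand total, 91

# Total percents: every cell divided by the grand total.
total_pct = {cell: 100 * k / n for cell, k in counts.items()}

# "And" probability by counting: ran AND activity level 3.
p_joint = counts[(3, "yes")] / n

# What the independent-events formula would give instead.
p_level3 = sum(k for (level, ran), k in counts.items() if level == 3) / n
p_ran = sum(k for (level, ran), k in counts.items() if ran == "yes") / n
p_product = p_level3 * p_ran

print(f"counted:  {p_joint:.1%}")
print(f"formula:  {p_product:.1%}")
```

The two numbers differ because running and activity level are not exactly independent in these data.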

Chapter 14 also mentions disjoint (also called "mutually exclusive") events. These connect with tables in two ways. First, when you set up each categorical variable, the categories should be disjoint. People should have just one activity level, and either they ran or they did not. If you open a well-constructed data file, this should already be taken care of. You may have to be more careful if you set up a data file yourself. For example, you may have a survey question that asks people to check a list of hobbies they have. Since people may have more than one hobby, your hobbies may not form disjoint sets. The standard way to deal with this is to represent each hobby choice with a yes-no variable.
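As an illustration of the yes-no approach, here is a small Python sketch. The survey responses and hobby names are entirely made up, and the layout (one dictionary per person) is just one assumed way to hold the result.

```python
# Hypothetical survey responses: each person may check any number of hobbies.
responses = [
    {"reading", "hiking"},
    {"chess"},
    set(),                            # checked nothing
    {"hiking", "chess", "reading"},
]

# Every hobby that anyone checked, in a fixed order.
hobbies = sorted(set().union(*responses))

# One yes-no variable per hobby. Within each variable the two
# categories ("yes" and "no") are disjoint, even though the hobbies are not.
indicators = [
    {hobby: ("yes" if hobby in person else "no") for hobby in hobbies}
    for person in responses
]

for row in indicators:
    print(row)
```

Each yes-no variable can then be used in a two-way table in the usual way.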

We may also see disjointness between certain values of different variables. For example, if we are studying the prevalence of various forms of cancer and comparing males and females, we will find no males with ovarian cancer and no females with prostate cancer. These are disjoint events and we would see 0's in the contingency table. On the other hand, when we see 0's, we always wonder whether there is some reason (biological in our example) why the events are disjoint, or whether the 0 is just a peculiarity of this set of observations.

Conditional probabilities are computed in row percent and column percent tables. In fact, the meaning of conditional probabilities is much clearer in tables than it is in language or mathematical notation.

MTB > table c8 c3;
SUBC> rowpercents.

Rows: ActivityL   Columns: Ran?

          no    yes     All

1      66.67  33.33  100.00
2      59.02  40.98  100.00
3      66.67  33.33  100.00
All    61.54  38.46  100.00

Cell Contents:      % of Row

Because the various activity levels are in the rows of the table, this row percents table gives us separate probabilities for each level of ActivityL. For example, among those with an activity level of 1, 33.33% ran, while among those with an activity level of 2, 40.98% ran. The notation for these conditional probabilities might look something like P(ran | ActivityL=1) and P(ran | ActivityL=2) respectively. This table would help us compare the running behavior of the three activity levels. Level 2 had the highest percentage who ran. What if we want to compare the activity level of folks who ran with folks who did not, that is, we want something like P(? | ran)? We need to get percentages for each column if the columns represent who ran and who did not.

MTB > table c8 c3;
SUBC> colpercents.

Rows: ActivityL   Columns: Ran?

           no     yes     All

1       10.71    8.57    9.89
2       64.29   71.43   67.03
3       25.00   20.00   23.08
All    100.00  100.00  100.00

Cell Contents:      % of Column

Here we can see that P(ActivityL=1 | ran) = 8.57%, quite different from P(ran | ActivityL=1) = 33.33%. Putting these into words may help to see the difference and how these arise in practice. P(ActivityL=1 | ran) = 8.57% is about the 35 people who ran. Of those 35, what proportion had an activity level of 1? From the original table, the answer is 3 out of 35, which is the 8.57% in the column percents table. This table compares the activity levels of those who ran with those who did not.

P(ran | ActivityL=1) = 33.33% is about the 9 people who had an activity level of 1. Of those people, what proportion ran? From the original table, the answer is 3 out of 9. A conditional probability is a probability for a subset of the people in a table (or in our database). Often we are interested in comparing various subsets. For example, in an election poll we might be interested in the proportion of voters who prefer Candidate A, and also be interested in what that proportion is among certain subsets, such as men, women or blacks.
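The difference between the two conditional probabilities comes down to which total you divide by. The following Python sketch makes that explicit; the function names are our own invention and the counts are those of the cleaned table.

```python
# Counts from the cleaned table: counts[ActivityL][Ran?].
counts = {
    1: {"no": 6,  "yes": 3},
    2: {"no": 36, "yes": 25},
    3: {"no": 14, "yes": 7},
}

def p_ran_given_level(level):
    """P(ran | ActivityL=level): divide by the row total (a row percent)."""
    row = counts[level]
    return row["yes"] / sum(row.values())

def p_level_given_ran(level, ran="yes"):
    """P(ActivityL=level | Ran?=ran): divide by the column total (a column percent)."""
    column_total = sum(row[ran] for row in counts.values())
    return counts[level][ran] / column_total

# Same cell (3 people), different subsets: 3 out of 9 versus 3 out of 35.
print(f"P(ran | ActivityL=1) = {p_ran_given_level(1):.2%}")
print(f"P(ActivityL=1 | ran) = {p_level_given_ran(1):.2%}")
```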

Independence is closely related to conditional probabilities. If running and activity level were independent, the last two tables might look something like this:


MTB > table c8 c3;
SUBC> rowpercents.

Rows: ActivityL   Columns: Ran?

          no    yes     All

1      66.67  33.33  100.00
2      66.67  33.33  100.00
3      66.67  33.33  100.00
All    66.67  33.33  100.00


MTB > table c8 c3;
SUBC> colpercents.

Rows: ActivityL   Columns: Ran?


           no     yes     All

1       10.00   10.00   10.00
2       67.00   67.00   67.00
3       23.00   23.00   23.00
All    100.00  100.00  100.00

"Independence" can be a tricky word in ordinary English, and is even more so in statistics. For the first of the two hypothetical tables above, independence means that the proportion who ran is the same for all activity levels. Running and activity level are independent here in the sense that the proportion who ran does not change depending on which activity level a person had. But they are dependent in the sense that if I know the percentage who ran for one activity level, and I know the two are independent, then I know the percentage who ran for all the other activity levels. Ironically, statistical independence puts very tight restrictions on what a two-way table can look like. Rarely do we see complete independence. Are running and activity level independent for the actual PULSE data? No, not perfectly, but nearly. In the row percents table, for example, two rows are identical, and the other one is close to those. Certainly there is no support for the hypothesis that folks might be more willing to run if they enjoyed a high level of activity.
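One way to see how tightly independence restricts a table is to compute the counts that exact independence would force, given the "All" totals we actually observed. The sketch below does this in Python using (row total)*(column total)/(grand total) for each cell. It is offered only as an illustration; the names are our own, and no formal test of independence is attempted here.

```python
# Observed counts from the cleaned table: counts[ActivityL][Ran?].
counts = {
    1: {"no": 6,  "yes": 3},
    2: {"no": 36, "yes": 25},
    3: {"no": 14, "yes": 7},
}

n = sum(sum(row.values()) for row in counts.values())
row_totals = {level: sum(row.values()) for level, row in counts.items()}
col_totals = {ran: sum(row[ran] for row in counts.values()) for ran in ("no", "yes")}

# Under exact independence every cell is fixed by the margins.
expected = {
    level: {ran: row_totals[level] * col_totals[ran] / n for ran in col_totals}
    for level in counts
}

for level in counts:
    obs = counts[level]["yes"]
    exp = expected[level]["yes"]
    print(f"ActivityL={level}: {obs} ran; exact independence would give {exp:.2f}")
```

The observed counts are within a person or two of the independence values in every row, which matches the "not perfectly, but nearly" verdict above.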


*That just does it for that session. If you want the command line to be active every time you start Minitab, go to Tools > Options, find the Session option and input/output ... and check the Enable box. Then it will stay that way until you decide otherwise.