Chi-square with Excel

For contingency tables, we start with the heart attack data. If you have not worked with this data before you can find a description here. This is a very large data set and so is provided as zip files. (You may need a program such as winzip to unzip them). Available are plain text (with tabs separating entries), Excel versions of the data. If possible, download and unzip the Minitab version of the heart attack data and open it in your version of Minitab. (It may be too big for student versions.) We would like to see if the mortality rate is different between men and women.

To get 2x2 contingency table Select Data > Pivote Table and Pivote charts next select two columns of Sex and Died, click on next and then choose Layout, then drag the variable Sex and place into the ROWS, similarly drag the variable Died and place into the COLUMNS, also drag any of the variables Sex or Died and place into DATA, then double click and select count. Finally click on finish. You will get following table.

SEX 0 1Grand Total
F 4298 767 5065
M 7136 643 7779
Grand Total11434141012844

Now we will find expected value from the above table. Formula for expected value = (Row Total)*(Column Total)/Grand Total. Let's first find the expected value for first cell. For first cell we have row total is 5065 and Column total is 11434 and grand total is 12844. Therefore Expected value = 5065*11434/12844 = 4508.97 like this find for all cell. Then we get the matrix as follows:

SEX		    0	      1	   Grand Total
F		 4508.97   556.03     5065
M		 6925.03   853.97     7779
Grand Total	 11434	   1410	      12844

Select Insert > Function> Statistical CHITEST then in actual range select the 2x2 observed value matrix and in expected range select the 2x2 expected value matrix, finally click Ok. We get the p-value for the Chi-square test = 3.75555E-34 = 3.75*10-34. Here p-value is less than 0.05 (level of Significance). Therefore we will reject H0 at 5% of level of significance.

With 12,844 observations, getting the table is a lot more work than computing chi-square, and it is best to let the computer do it. If you have an existing table, then no need to go pivot table just find the expected value and follow the steps as given above. For example, here are some data from the University of Texas Southwestern Medical Center reported in De Veaux, Velleman and Bock, Stats.: Data and Models 2nd ed., 2008, Addison Wesley, Boston. (It's the last example in Chapter 26, p.645.) The disease hepatitis C can be transmitted through needle pricks including those involved in tattoos. The goal here was to compare infection rates between people with differing tattoo status. A summary of their data is

Hepatitus C No Hepatitus C
Tattoo from parlor 17 35
Tattoo from elsewhere 8 53
No tattoo 22 491

We get expected value matrix is as follow:

Hepatitus CNo Hepatitus C
Tattoo from parlor3.9048.10
Tattoo from elsewhere4.5856.42
No tattoo38.52474.48

When you go this route you have no control over what gets printed and have to put up with this congested table. In this case, the information in that crowded table is useful. A rule of thumb is that for the approximations in using the continuous chi-squared distribution for discrete counts to work well we should have expected counts above 5 in each cell. Here we have two violations out of six cells. One reason we check this is that because expected counts go into the denominator in computing the sample chi-squared, they could inflate the result. One common after-the-fact remedy is see if there are some rows in the table we can combine is a rational way. In this case we could concede that we do not have enough data to answer questions about where people got their tattoos and pool the two sources together to get a new table with these entries.

25 88
22 491

The analysis is left as an exercise but the conclusion is that the hepatitis rates are quite different for the tattoo vs. no tattoo groups.