Chi-square with R

We will use the heart attack data as an example. If you have not worked with this data before you can find a description here. This is a very large data set and so is provided as zip files. (You may need a program such as winzip to unzip them). Available are plain text (with tabs separating entries) and Excel versions of the data. If possible, download and unzip the text version of the heart attack data and open it in R. A separate page gives details on reading text files into R using this data as an example.

R prefers to work with the counts rather than the raw data. If you do not have the counts, but you have the data in variables in R, you can use the R table command to get the counts. This also checks for some types of gross errors, such as an 11 in a column that is supposed to be 0-1. You have to reattach the heart attack data each time you open R. We start with the heart attack data and the Goodness-of-Fit test.

> names(heartatk)
[1] "Patient"   "DIAGNOSIS" "SEX"       "DRG"       "DIED"      "CHARGES"  
[7] "LOS"       "AGE"   
   
> attach(heartatk)

> table(SEX)
SEX
   F    M 
5065 7779 

> chisq.test(table(SEX))

        Chi-squared test for given probabilities

data:  table(SEX) 
X-squared = 573.4815, df = 1, p-value < 2.2e-16

The command chisq.test(table(SEX)) does a chi-square goodness of fit test on the table for the SEX variable. The default is to test for equal expected counts in every cell. That is hardly the case here. 2.2e-16 means 2.2X10-16 = 0.00000000000000022, which is small.

If you do not want equal proportions, you need to give a set of proportions for each cell. Here are more reasonable proportions (60/40) for the heart attack data.

You can make a cross-tabulation of two categorical variables with table and do a test of independence or homogeneity with chisq.test. (We return to the heart attack data.)

> table(SEX,DIED)
   DIED
SEX    0    1
  F 4298  767
  M 7136  643
> chisq.test(table(SEX,DIED))

        Pearson's Chi-squared test with Yates' continuity correction

data:  table(SEX, DIED) 
X-squared = 147.7612, df = 1, p-value < 2.2e-16

We have seen this p-value before! It is probably the smallest non-zero number R can handle and hence not very accurate. However, p is definitely small!

With 12,844 observations, getting the table is a lot more work than computing chi-square, and it is best to let the computer do it. If you have an existing table, R will analyze it -- but not without putting up a fight. You need to enter the table one row at a time and use rbind to combine the rows into a table. Here we enter a table that looks like.

17 35
8 53
22 491

without totals.

> hep=rbind(c(17,35),c(8,53),c(22,491))
> hep
     [,1] [,2]
[1,]   17   35
[2,]    8   53
[3,]   22  491
> chisq.test(hep)

        Pearson's Chi-squared test

data:  hep 
X-squared = 57.9122, df = 2, p-value = 2.658e-13

Warning message:
Chi-squared approximation may be incorrect in: chisq.test(hep) 

The warning message probably refers to a small expected count in one cell. We can check that using a standard R trick. Many R procedures compute far more than they report. You can save all that is computed to a file and then extract what you want. This allows the output of one procedure to be used as the input to another -- a very powerful feature if you program in R.

> results <- chisq.test(hep)
Warning message:
Chi-squared approximation may be incorrect in: chisq.test(hep) 

> results$expected
          [,1]      [,2]
[1,]  3.904153  48.09585
[2,]  4.579872  56.42013
[3,] 38.515974 474.48403

Here is a complete list of things you could type after the $, obtained by typing ?chisq.test.

statistic: the value the chi-squared test statistic.

parameter: the degrees of freedom of the approximate chi-squared
          distribution of the test statistic, 'NA' if the p-value is
          computed by Monte Carlo simulation.

 p.value: the p-value for the test.

  method: a character string indicating the type of test performed, and
          whether Monte Carlo simulation or continuity correction was
          used.

data.name: a character string giving the name(s) of the data.

observed: the observed counts.

expected: the expected counts under the null hypothesis.

residuals: the Pearson residuals, '(observed - expected) /
          sqrt(expected)'.

We see that we have two cells with small expected counts.


©2008 statistics.com, portions ©2006-2007 Robert W. Hayden and used by permission.