This page uses the PULSE data. If you are not already familiar with it, you should read a description of the data. You can find the entire dataset at our site as a plain text file or as an Excel spreadsheet. Download the text version and save it in the directory where you installed R. After that, run R and type
> PULSE = read.table("pulse.txt",header=TRUE)
at the R prompt. The argument header=TRUE tells R that the first row of the file should be interpreted as variable names. (These should not include spaces.) You can now get a table of contents for what you have created in R with
> objects()
This should return PULSE along with any other objects you may have created. You will not see on this list any of the variables that are inside of PULSE because they are hiding. To see them, type
> names(PULSE) [1] "PuBefor" "PuAfter" "Ran." "Smokes." "Sex" "Height" [7] "Weight" "ActivityL"
To bring them out of hiding, you must "attach" them to your R workspace. (This avoids conflicts if several tables include variables with the same name. Attach just one at a time.)
> attach(PULSE)
You can get an assortment of summary statistics for all your variables by using the summary command.
> summary(PULSE)
PuBefore PuAfter Ran. Smokes. Sex Height
Min. : 48.00 Min. : 50 no :57 no :64 female:35 Min. :61.00
1st Qu.: 64.00 1st Qu.: 68 yes:35 yes:28 male :57 1st Qu.:66.00
Median : 71.00 Median : 76 Median :69.00
Mean : 72.87 Mean : 80 Mean :68.72
3rd Qu.: 80.00 3rd Qu.: 85 3rd Qu.:72.00
Max. :100.00 Max. :140 Max. :75.00
Weight ActivityL
Min. : 95.0 Min. :0.000
1st Qu.:125.0 1st Qu.:2.000
Median :145.0 Median :2.000
Mean :145.2 Mean :2.109
3rd Qu.:155.5 3rd Qu.:2.000
Max. :215.0 Max. :3.000
Note that for the summaries reported above, R gives different summaries depending on whether the column contains numbers or text. This seems right except possibly for the ActivityL variable which is an ordered category and so might better be treated as qualitative data. For now, just remember that anything in a data file that looks like a number to R will be treated as such and you may need to take special steps if you have a column of numbers that are actually labels -- medical diagnosis codes, for example.
The one summary you might miss from the above is the standard deviation. You can easily get that for the variable of your choice.
sd(PuAfter) [1] 17.09379
We can compare the before and after pulse rates with parallel boxplots. R wants to see the data in a format that has all the measurements (both before and after) in one column with another column labelling the two sets of data. The R function c (for concatenate) combines a bunch of things into a single thing.
> rate = c(PuBefor,PuAfter) > rate [1] 48 54 54 58 58 58 60 60 60 60 61 62 62 62 62 62 62 62 [19] 62 62 64 64 64 64 66 66 66 66 66 68 68 68 68 68 68 68 [37] 68 68 68 68 70 70 70 70 70 70 72 72 72 72 72 72 74 74 [55] 74 74 74 76 76 76 76 76 78 78 78 78 78 80 80 80 82 82 [73] 82 84 84 84 84 86 87 88 88 88 90 90 90 90 92 92 94 96 [91] 96 100 54 56 50 70 58 56 76 62 70 66 70 76 75 58 100 98 [109] 62 66 68 66 88 80 62 60 78 82 102 72 76 72 76 76 112 66 [127] 68 64 68 66 68 68 72 106 94 62 70 66 80 74 74 70 70 68 [145] 84 76 70 74 76 118 76 76 76 76 104 118 76 78 80 96 128 74 [163] 100 84 80 84 84 84 80 84 84 110 84 74 94 88 90 92 84 94 [181] 92 140 116 115
We used the name rate for the stacked variable. We also need a variable to keep track of which group each measurement came from. The R rep (for repeat) function is useful for generating repetitive data.
> B = rep("Before",92)
> A = rep("After",92)
> BA = c(B,A)
> BA
[1] "Before" "Before" "Before" "Before" "Before" "Before" "Before" "Before"
[9] "Before" "Before" "Before" "Before" "Before" "Before" "Before" "Before"
[17] "Before" "Before" "Before" "Before" "Before" "Before" "Before" "Before"
[25] "Before" "Before" "Before" "Before" "Before" "Before" "Before" "Before"
[33] "Before" "Before" "Before" "Before" "Before" "Before" "Before" "Before"
[41] "Before" "Before" "Before" "Before" "Before" "Before" "Before" "Before"
[49] "Before" "Before" "Before" "Before" "Before" "Before" "Before" "Before"
[57] "Before" "Before" "Before" "Before" "Before" "Before" "Before" "Before"
[65] "Before" "Before" "Before" "Before" "Before" "Before" "Before" "Before"
[73] "Before" "Before" "Before" "Before" "Before" "Before" "Before" "Before"
[81] "Before" "Before" "Before" "Before" "Before" "Before" "Before" "Before"
[89] "Before" "Before" "Before" "Before" "After" "After" "After" "After"
[97] "After" "After" "After" "After" "After" "After" "After" "After"
[105] "After" "After" "After" "After" "After" "After" "After" "After"
[113] "After" "After" "After" "After" "After" "After" "After" "After"
[121] "After" "After" "After" "After" "After" "After" "After" "After"
[129] "After" "After" "After" "After" "After" "After" "After" "After"
[137] "After" "After" "After" "After" "After" "After" "After" "After"
[145] "After" "After" "After" "After" "After" "After" "After" "After"
[153] "After" "After" "After" "After" "After" "After" "After" "After"
[161] "After" "After" "After" "After" "After" "After" "After" "After"
[169] "After" "After" "After" "After" "After" "After" "After" "After"
[177] "After" "After" "After" "After" "After" "After" "After" "After"
>
This creates a variable B that is just 92 instances of the word "Before" and similarly for A. Then the two are concatenated into BA, the categorical variable that keeps track of before and after. Now you can type
> boxplot(rate ~ BA)
to get boxplots.
The tilde "~" is used often in R. Here you can think of it as saying we are going to see how rate depends on BA. You can see that the "after" data show a higher median and more variability along with high outliers not present in the "before" data. The fact that both center and variability have changed makes it hard to give a simple comparison here. If we looked only at the mean, we might say that pulse rates went up by about seven points, and they did on average. However, the lowest rates have not gone up much at all (The minimum went up by 2.) while the highest have changed considerably (The maximum went up by 40!). A simple average change does not describe what happened here very accurately, which is one reason we always need to make a picture!
Because the class contained both men and women, we might expect some bimodality in the heights and weights. Stem and leaf plots are R's default tool for assessing the shape of a distribution. (Boxplots tend to hide bimodality.)
> stem(Height) The decimal point is at the | 60 | 08 62 | 000080000 64 | 0000005 66 | 000000000000000 68 | 000000000000000000005 70 | 0000000000005 72 | 00000000000000055 74 | 00000000 > stem(Weight) The decimal point is 1 digit(s) to the right of the | 8 | 5 10 | 288002556688 12 | 000123555550000013555688 14 | 000025555580000000000355555555557 16 | 000045000055 18 | 000500005 20 | 5
Are the Heights trimodal or is that just random scatter? The weights look unimodal but looking at our data on a single scale is rarely enough.
> stem(Weight, scale=2)
The decimal point is 1 digit(s) to the right of the |
9 | 5
10 | 288
11 | 002556688
12 | 00012355555
13 | 0000013555688
14 | 00002555558
15 | 0000000000355555555557
16 | 000045
17 | 000055
18 | 0005
19 | 00005
20 |
21 | 5
Now make boxplots as you did above but with Sex as the factor and Height as the measurement.
> boxplot(Height ~ Sex)
Both sexes show reasonably compact, symmetric distributions with no outliers but the men are consistently about 5 inches taller than the women. Because the two distributions have similar shapes and variabilities, we can reasonably say that the men as a group are about 5 inches taller than the women. Compare this to the situation with the pulse rates above, where such a simple description was an oversimplification. Note that although the two sexes are clearly different here, there is enough overlap that the stem and leaf does not clearly show the two groups.
We saw in the earlier boxplots of the pulse rates that the after rates were skewed toward high values with several possible outliers. When we see most of the alleged outliers on one side of a boxplot it may well be that we have a skewness problem rather than an outlier problem. Another sign is outliers that start close to the whiskers and gradually thin out. Another is a boxplot that looks skewed in the direction of the outliers even if we erase them. It is important to recognize these two different situations because the remedies are quite different. If we have an outlier problem, we need to investigate the individual points to see if we can find a reason for their unusual behavior. If we have a skewness problem we might use the median rather than the mean to describe the center, or we might re-express (transform) the data to make it more symmetric. Here are a number of common transformations and their effect on the after pulse readings. First, here is a stem and leaf of the original data.
5* | 04
5. | 6688
6* | 022224
6. | 666666888888
7* | 000000022244444
7. | 566666666666688
8* | 000002444444444
8. | 88
9* | 022444
9. | 68
10* | 0024
10. | 6
11* | 02
11. | 5688
12* |
12. | 8
13* |
13. |
14* | 0
Let's see if a square root transformation makes this less skewed.
> SquareRoot = sqrt(PuAfter) > stem(SquareRoot) The decimal point is at the | 7 | 13 7 | 556679999 8 | 01111112222224444444 8 | 5556666677777777777778899999 9 | 122222222244 9 | 56677789 10 | 00123 10 | 567899 11 | 3 11 | 8
This is better. Repeat with
ln = log(PuAfter)
to get natural logarithms (base e).
> ln = log(PuAfter) > stem(ln) The decimal point is 1 digit(s) to the left of the | 38 | 19 40 | 3366933336999999 42 | 22222255555558880000023333333333336688888 44 | 13333333338802244468 46 | 11246024577 48 | 54 > stem(ln, scale=2) The decimal point is 1 digit(s) to the left of the | 39 | 19 40 | 33669 41 | 33336999999 42 | 2222225555555888 43 | 0000023333333333336688888 44 | 133333333388 45 | 02244468 46 | 11246 47 | 024577 48 | 5 49 | 4
Going from the original data to the square roots to the logarithms there is successively less skewness. There is still some skewness present in the logs and so we will try two stronger transformations: negative reciprocal roots and negative reciprocals. The reason for the negatives here is that when you take reciprocals you reverse up and down in the graph. This is a nuisance when comparing several graphs, so we introduce the minus sign to switch things back again. For most purposes, however, we do not do that. For example, if you measure fuel economy in miles per gallon and take the reciprocal you get gallons per mile -- a perfectly good, but different, way to measure fuel consumption. No one would normally put a negative sign in front of either measurement. However, we do have to mentally remember that high MPG is good while high GPM is bad, even though in some sense they measure the same thing.
You can use the techniques described above to get these transformations.
> nrr = -1/sqrt(PuAfter) > stem(nrr) The decimal point is 2 digit(s) to the left of the | -14 | 1 -13 | 64411 -12 | 9777753333331111110000000 -11 | 88866666555555555555533222220 -10 | 999999999775443332100 -9 | 987543322 -8 | 85 > stem(nrr,scale=2) The decimal point is 2 digit(s) to the left of the | -14 | 1 -13 | 6 -13 | 4411 -12 | 977775 -12 | 3333331111110000000 -11 | 888666665555555555555 -11 | 33222220 -10 | 999999999775 -10 | 443332100 -9 | 9875 -9 | 43322 -8 | 85 > nr = -1/PuAfter > stem(nr) The decimal point is 3 digit(s) to the left of the | -20 | 0 -18 | 5 -16 | 992271111 -14 | 62222227777773333333 -12 | 99955555322222222222288555552 -10 | 999999999441996664200 -8 | 864197655 -6 | 81 > stem(nr, scale=2) The decimal point is 3 digit(s) to the left of the | -20 | 0 -19 | -18 | 5 -17 | 9922 -16 | 71111 -15 | 6222222 -14 | 7777773333333 -13 | 999555553222222222222 -12 | 88555552 -11 | 999999999441 -10 | 996664200 -9 | 8641 -8 | 97655 -7 | 81
Here four common transformations were applied in order of increasing strength for reducing skewness toward high vales: square roots, logarithms (here to base e), negative reciprocal roots, and negative reciprocals. The last two transformations seem to have made the distribution most symmetric.
©2008 Robert W. Hayden