A Wilcoxon signed-rank test is performed when an analyst would like to test for differences between two related treatments or conditions, but the assumptions of a paired samples t-test are violated. This can occur when the differences between repeated measurements are not normally distributed, or when outliers exist. A Wilcoxon signed-rank test is considered a “within-subject” or “repeated measures” analysis.
Repeated measures can occur over time or space. For example, you may want to know if students in a class scored better on the final exam than they did on a midterm exam. Suppose, however, that some students scored dramatically better on the final exam than on the midterm. This results in outliers and differences in exam scores that may not be normally distributed. A Wilcoxon signed-rank test would be more appropriate than a paired samples t-test in this situation. Since the same student is measured at two separate time points, the measurements are considered repeated over time.
Repeated measures over space can be a little more difficult to understand. For example, you may want to check for differences in blood pressure between measurements taken on the right arm and left arm. Since the same study subject is measured with both treatment conditions in two locations, this would be considered a repeated measurement over space.
Like a paired samples t-test, a Wilcoxon signed-rank test is performed when each experimental unit (study subject) receives both available treatment conditions. Thus, the treatment groups have overlapping membership and are considered dependent.
Formally, the two-sided null hypothesis is that the differences between pairs follow a symmetric distribution around zero. That is, the differences between repeated measurements are equally likely to be positive or negative. The alternative hypothesis is that the differences between pairs do not follow a symmetric distribution around zero.
Informally, we are testing to see if the median difference between pairs of observations is equal to zero. Many times, analysts will say that we are testing whether medians differ between repeated measurements, even though this is not formally correct. For this reason, descriptive statistics on median values are often reported when the Wilcoxon signed-rank test is performed.
Stated compactly, the two-sided hypotheses are:
H0: Paired differences are symmetrically distributed around zero
Ha: Paired differences are not symmetrically distributed around zero
Wilcoxon Signed-Rank Test Assumptions
The following assumptions must be met in order to run a Wilcoxon signed-rank test:
- Data are considered continuous and measured on an interval or ratio scale, so that the paired differences can be ranked.
- Each pair of observations is independent of other pairs.
- Each pair of measurements is chosen randomly from the same population.
- The distribution of the paired differences should be approximately symmetrical in shape.
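One informal way to look at the symmetry assumption is to compare the mean and median of the paired differences and to inspect the sample skewness. The following is a minimal sketch only: the data set `demo`, its variable names, and its values are made up for illustration and are not the study data used later in this article. With very small samples the skewness estimate is noisy, so treat it as a rough guide rather than a formal test.

```sas
* Hypothetical paired data: made-up values, NOT the cyclist study data;
data demo;
    input Subject Before After;
    Diff = After - Before;   * paired difference used by the signed-rank test;
    datalines;
1 10.2 12.1
2 11.5 11.0
3  9.8 13.4
4 12.0 12.9
5 10.9 10.1
6 11.1 14.0
;
run;

* A mean close to the median and a skewness near zero are consistent with symmetry;
proc means data=demo n mean median skewness maxdec=2;
    var Diff;
run;
```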
Wilcoxon Signed-Rank Test Example
In this example we will test to see if there is a statistically significant difference in endurance times for nine well-trained cyclists under two treatment conditions. Each cyclist was administered a placebo and a 13 mg dosage of caffeine in random order. Cyclists biked until pedaling frequency decreased below 50 rpm under each condition, and the time until exhaustion was recorded for each training session. All cyclists were measured under both treatment conditions. We would like to check whether there was a statistically significant difference in endurance performance time for each subject with and without a 13 mg dosage of caffeine. These data are a subset of a larger experiment.
Dose0 = Endurance performance time under the effects of the placebo.
Dose13 = Endurance performance time under the effects of 13 mg of caffeine.
Diff = Dose13 – Dose0; The difference in performance time between caffeine dosage levels.
The data for this example is available here:
Wilcoxon Signed-Rank Test SAS Code
In SAS, PROC MEANS can be used to produce basic descriptive statistics. PROC UNIVARIATE is used to perform the Shapiro-Wilk Normality test of group differences, QQ plots of group differences, and the official Wilcoxon signed-rank test. PROC UNIVARIATE can also be used to produce nonparametric confidence intervals around the median.
Here is the annotated code for the example. All assumption checks are provided along with the Wilcoxon signed-rank test:
*Import the data;
proc import datafile='C:\Dropbox\Website\Analysis\Wilcoxon Signed Rank\Data\CaffineWSR.csv'
    out=work.caff
    dbms=csv
    replace;
run;

*Compute the difference between dosage levels;
data caff;
    set caff;
    Diff=Dose13-Dose0;
run;

*Produce descriptive statistics;
proc means data=caff n nmiss mean std stderr lclm uclm median min max qrange maxdec=2;
    var Dose0 Dose13;
run;

*Test for normality;
proc univariate data=caff normal cipctldf plots;
    var Diff;
    histogram Diff /normal;
    qqplot /normal (mu=est sigma=est);
run;
Wilcoxon Signed-Rank Test Annotated SAS Output
Many times, analysts forget to take a good look at their data before performing statistical tests. Descriptive statistics are not only used to describe the data but also help determine if any inconsistencies are present. Detailed investigation of descriptive statistics can help answer the following questions (in addition to many others):
- How much missing data do I have?
- Do I have potential outliers?
- Are my standard deviation and standard error values large relative to the mean?
- In what range does most of my data fall for each treatment?
The columns of the PROC MEANS output are as follows:
- Variable – Each treatment level of our independent variable.
- N – The number of observations for each treatment.
- N Miss – The number of missing observations for each treatment.
- Mean – The mean value for each treatment.
- Std Dev – The standard deviation of each treatment.
- Std Error – The standard error of each treatment. That is the standard deviation / sqrt (n).
- Lower and Upper 95% CL for Mean – The lower and upper confidence limits of the mean. That is to say, intervals constructed this way would contain the true mean in about 95% of repeated samples, so we are 95% confident that the true mean falls between the lower and upper values specified for each treatment group.
- Median – The median value for each treatment.
- Minimum, Maximum – The minimum and maximum value for each treatment.
- Quartile Range – The interquartile range of each treatment. That is the 75th percentile – 25th percentile.
PROC UNIVARIATE can create distribution-free 95% confidence intervals on many different percentiles. This can be helpful when describing data that do not follow a normal distribution. The 95% confidence interval for the median difference in endurance performance time between caffeine levels is presented below.
- Level – Indicates the percentile for which the confidence interval is computed.
- Quantile – Designates the quantile corresponding to each percentile. Since 50% of the data falls above and below the 50th percentile, the quantile value on that row of the table is the median of the differences.
- 95% Confidence Limits Distribution Free – The 95% confidence interval for the median of the difference in endurance performance times between 13 mg caffeine and placebo.
Prior to performing the Wilcoxon signed-rank test, it is important to evaluate our assumptions to ensure that we are performing an appropriate and reliable comparison. If normality is present, a paired samples t-test would be a more appropriate test assuming major outliers do not exist.
Testing normality should be performed on the paired differences using a Shapiro-Wilk normality test (or equivalent), and/or a QQ plot for large sample sizes. Many times, histograms can also be helpful. In this example, we will use PROC UNIVARIATE to produce our Shapiro-Wilk normality test for the dosage difference, a histogram, and the corresponding QQ plot.
The Shapiro-Wilk normality test on the difference between dosage levels:
- Test – Four different normality tests are presented.
- Statistic – The test statistic for each test is provided here.
- p Value – The p-value for each test is provided. A p-value < 0.05 would indicate that we should reject the assumption of normality. Since the Shapiro-Wilk test p-value is < 0.05 for the dosage differences, we reject the assumption that the differences are normally distributed.
Histogram of Dosage Differences
The histogram of the differences between the 13 mg dosage of caffeine and our placebo is quite telling. We can observe substantial skew in our data, which suggests a lack of normality.
PROC UNIVARIATE provides a QQ Plot of the differences between dosage levels. The vast majority of points should follow the theoretical normal line.
In this example, the Shapiro-Wilk normality test and the histogram of the differences in endurance performance time between caffeine dosage levels both indicate that our assumptions regarding normality are violated. The QQ plot is less clear-cut.
Since the Shapiro-Wilk test p-value is < 0.05, we reject the assumption of normality and conclude that the difference in endurance time between caffeine dosages is not normally distributed. Thus, a Wilcoxon signed-rank test would be more appropriate than a paired t-test to perform our comparison.
Boxplots to Visually Check for Outliers
PROC UNIVARIATE will provide a vertical box plot of the difference between dosage levels. This can help visually identify outliers. The boxplot below shows one point outside the upper whisker of the plot. As a result, we conclude there is one outlier, of about 35 minutes, in the differences in endurance time between the placebo and 13 mg of caffeine. This helps to reinforce that a Wilcoxon signed-rank test is more appropriate than a paired t-test to check for performance differences between caffeine dosage levels.
Wilcoxon Signed-Rank Test Results in SAS
So far, we have determined that the differences between dosage levels are not normally distributed and we have one outlier present. Our next step is to officially perform a Wilcoxon signed-rank test to determine if there is a statistically significant difference in ‘performance time until exhaustion’ in cyclists with and without a 13mg dosage of caffeine.
WARNING: PROC UNIVARIATE requires that we specify the difference variable explicitly in the VAR statement to perform this comparison accurately. Thus, the following data step should not be skipped prior to performing the actual test on the newly created ‘Diff’ variable:
*Compute the difference between dosage levels;
data caff;
    set caff;
    Diff=Dose13-Dose0;
run;

*Perform the Wilcoxon signed-rank test (reported in the Tests for Location table);
proc univariate data=caff normal cipctldf plots;
    var Diff;
    histogram Diff /normal;
    qqplot /normal (mu=est sigma=est);
run;
The Wilcoxon Signed-Rank test results are as follows:
- Test – This column identifies the test being performed. In SAS, this table contains additional tests, including a t-test on the differences and the regular sign test. However, we are just interested in the results for the ‘Signed Rank’ test.
- Statistic – The test statistic of interest. Here S is the sum of the ranks of the positive differences minus its expected value under the null hypothesis, n*(n+1)/4.
- Pr >= |S| – This is the p-value associated with the Wilcoxon signed-rank test. If the p-value is < 0.05 (assuming alpha = 0.05), then there is a statistically significant difference in endurance performance time between the placebo and a 13 mg dosage of caffeine. In essence, we are testing whether the difference in endurance performance between our two treatment conditions differs from zero. For our example, the p-value = 0.0039. Thus, we reject the null hypothesis that the paired differences are symmetric around zero and conclude that there is a difference in the level of performance when a cyclist is on caffeine versus when they are not.
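The arithmetic behind the statistic can be sketched as follows. This is an illustration under stated assumptions, not a reproduction of the SAS output table: d_i denotes the nonzero paired differences, n their count, and R_i the rank of |d_i|; these symbols are notation introduced here. The worked case assumes no ties or zero differences and considers the most extreme outcome for n = 9, in which every difference is positive.

```latex
S = \sum_{i:\, d_i > 0} R_i \;-\; \frac{n(n+1)}{4}

% Most extreme case for n = 9 (all nine differences positive):
\sum_{i=1}^{9} i = 45, \qquad \frac{9 \cdot 10}{4} = 22.5, \qquad S = 45 - 22.5 = 22.5

% Under H0 each of the 2^9 sign patterns is equally likely, and only two of them
% (all positive, all negative) give |S| >= 22.5:
p = \frac{2}{2^{9}} = \frac{2}{512} \approx 0.0039
```

This value agrees with the p-value of 0.0039 reported above, which is consistent with all nine cyclists lasting longer under caffeine, although that is an inference rather than something stated in the output.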
Wilcoxon Signed-Rank Test Interpretation and Conclusions
A p-value = 0.0039 indicates that we should reject the null hypothesis that the paired differences are symmetric around zero. Practically speaking, we conclude that endurance performance times differ between treatments.
In a Wilcoxon signed-rank test, the challenge can be correctly interpreting the direction of the difference. It is important to note that dose 0 (placebo) was subtracted from dose 13 (13 mg caffeine) as follows: Diff = Dose 13 – Dose 0. That is to say, difference scores above zero mean that performance under caffeine exceeds performance under the placebo. Difference scores below zero mean that performance under the placebo exceeds performance under caffeine.
Looking back at our descriptive statistics, we can see that the median of the endurance performance differences between treatments is 7.86. Thus, we conclude that there is a performance difference between caffeine and placebo treatments, and we estimate that cyclists can pedal a median of approximately 7.86 minutes longer under the effects of caffeine. Furthermore, we are 95% confident that the median difference in performance times between treatments falls between 3.09 and 22.57 minutes.
What to do When Assumptions are Broken or Things Go Wrong
The Wilcoxon signed-rank test is typically used as a last resort. This is because it has lower power than the paired t-test when the differences are normally distributed.
More modern alternatives include permutation/randomization tests, bootstrap confidence intervals, and transforming the data, but each option will have its own stipulations.
If you need to compare more than two dependent groups, a single factor repeated measures ANOVA or Friedman test may be appropriate.
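As a rough sketch of the Friedman option, SAS can produce Friedman's chi-square through the Cochran-Mantel-Haenszel statistics in PROC FREQ when rank scores are requested. The data set `demo_long`, its variable names, and its values are hypothetical and exist only to make the example runnable; the data are assumed to be in long format with one row per subject and treatment.

```sas
* Hypothetical long-format data: made-up values for illustration only;
data demo_long;
    input Subject Treatment $ Response;
    datalines;
1 A 10.2
1 B 12.1
1 C 11.4
2 A 11.5
2 B 11.0
2 C 12.3
3 A  9.8
3 B 13.4
3 C 10.9
4 A 12.0
4 B 12.9
4 C 13.6
;
run;

* Subject is the stratification variable and rank scores are computed within subject;
* The Row Mean Scores Differ statistic in the CMH table is the Friedman chi-square;
proc freq data=demo_long;
    tables Subject*Treatment*Response / cmh2 scores=rank noprint;
run;
```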
Missing values can severely impact a paired analysis such as a Wilcoxon signed-rank test or a paired samples t-test because the entire row of data will generally be excluded. If you have a lot of missing data, one alternative would be to perform a single factor repeated measures mixed model ANOVA. This would allow for the computation of estimated marginal means to compensate for the uneven replication between groups. If mixed model residuals are not normally distributed, a transformation can be attempted to correct for the lack of normality.
A Wilcoxon signed-rank test is not appropriate if each experimental unit (subject) only receives one of two available treatments. For example, if you would like to see if first-year students scored differently on an exam when compared to second-year students, then each subject only has one of two potential factor levels. If this is the case, then an independent samples t-test or a Mann-Whitney U test would be a more appropriate course of action.
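For the independent-groups case, a minimal sketch of the Mann-Whitney U test (equivalently, the Wilcoxon rank-sum test) uses PROC NPAR1WAY. The data set `demo_ind`, its variables, and the exam scores below are invented for illustration only.

```sas
* Hypothetical independent-groups data: made-up exam scores for illustration only;
data demo_ind;
    input Year $ Score;
    datalines;
First 71
First 64
First 80
First 58
First 75
Second 83
Second 69
Second 90
Second 77
Second 85
;
run;

* The WILCOXON option requests the rank-sum test comparing the two class levels;
proc npar1way data=demo_ind wilcoxon;
    class Year;
    var Score;
run;
```

Each subject appears in only one row and one group here, which is the key structural difference from the paired layout used for the signed-rank test.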
Additional Resources and References
SAS Version 9.4, SAS Institute Inc., Cary, NC.
Higgins, J.J. (2004). Introduction to Modern Nonparametric Statistics, Pacific Grove, CA: Brooks/Cole, Thomson Learning, Inc.
Conover W.J. (1999). Practical Nonparametric Statistics. New York, NY: John Wiley & Sons, Inc.
Littell, R.C., Stroup, W.W., and Freund R.J. (2002). SAS for Linear Models, Fourth Edition. Cary, NC: SAS Institute Inc.
Mitra, A. (1998). Fundamentals of Quality Control and Improvement. Upper Saddle River, NJ: Prentice Hall.
Lapin, L.L. (1997). Modern Engineering Statistics. Belmont, CA: Wadsworth Publishing Company.
Pasman, W.J., van Baak, M.A., Jeukendrup, A.E., and de Haan, A. (1995). The Effect of Different Dosages of Caffeine on Endurance Performance Time. International Journal of Sports Medicine, Vol. 16, pp 225-230.