A Wilcoxon signed-rank test is performed when an analyst would like to test for differences between two related treatments or conditions, but the assumptions of a paired samples t-test are violated. This can occur when the differences between repeated measurements are not normally distributed, or if outliers exist. A Wilcoxon signed-rank test is considered a “within-subject” or “repeated measures” analysis.
Repeated measures can occur over time or space. For example, you may want to know if students in a class scored better on the final exam than they did on a midterm exam. Suppose, however, that many students scored dramatically better on the final exam than on the midterm. This results in outliers and differences in exam scores that are not normally distributed. A Wilcoxon signed-rank test would be more appropriate than a paired samples t-test in this situation. Since the same student is measured at two separate time points, the measurements are considered repeated over time.
Repeated measures over space can be a little more difficult to understand. For example, you may want to check for differences in blood pressure between measurements taken on the right arm and left arm. Since the same study subject is measured with both treatment conditions in two locations, this would be considered a repeated measurement over space.
Like a paired samples t-test, a Wilcoxon signed-rank test is performed when each experimental unit (study subject) receives both available treatment conditions. Thus, the treatment groups have overlapping membership and are considered dependent.
Formally, the two-sided null hypothesis is that the differences between pairs follow a symmetric distribution around zero. That is, the differences between repeated measurements are equally likely to be positive or negative. The alternative hypothesis is that the differences between pairs do not follow a symmetric distribution around zero.
Informally, we are testing to see if the median difference between pairs of observations is equal to zero. Analysts often say that we are testing whether medians differ between repeated measurements, even though this is not formally correct. For this reason, descriptive statistics for the median are usually reported when the Wilcoxon signed-rank test is performed.
Stated compactly, the two-sided hypotheses are:
H0: Paired differences are symmetrically distributed around zero
Ha: Paired differences are not symmetrically distributed around zero
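To make the role of ranks in these hypotheses concrete, here is a minimal sketch of how the signed ranks are built. The before and after vectors are hypothetical values chosen only for illustration; they are not the study data.

```r
# Hypothetical paired measurements (illustrative values only, not the study data)
before <- c(10.1, 12.4, 9.8, 11.0, 13.2, 10.6)
after  <- c(11.3, 12.1, 11.9, 11.8, 15.0, 10.1)

d <- after - before        # paired differences
d <- d[d != 0]             # zero differences are dropped before ranking
r <- rank(abs(d))          # rank the absolute differences (ties get average ranks)

v_plus  <- sum(r[d > 0])   # rank sum for the positive differences
v_minus <- sum(r[d < 0])   # rank sum for the negative differences

# wilcox.test reports the positive rank sum as its statistic V
fit <- wilcox.test(after, before, paired = TRUE)
c(v_plus = v_plus, v_minus = v_minus, V = unname(fit$statistic))
```

If the differences really were symmetric around zero, the positive and negative rank sums would tend to be similar. A large imbalance between them is evidence against the null hypothesis.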
Wilcoxon Signed-Rank Test Assumptions
The following assumptions must be met in order to run a Wilcoxon signed-rank test:
- Data are considered continuous and measured on an interval or ordinal scale.
- Each pair of observations is independent of other pairs.
- Each pair of measurements is chosen randomly from the same population.
- The distribution of the paired differences should be symmetrical in shape.
Wilcoxon Signed-Rank Test Example in R
In this example we will test to see if there is a statistically significant difference in endurance times for nine well-trained cyclists under two treatment conditions. Each cyclist was administered a placebo and a 13 mg dosage of caffeine in random order. Cyclists biked until pedaling frequency decreased below 50 rpm under each condition, and the time until exhaustion was recorded for each training session. All cyclists were measured under both treatment conditions. We would like to check whether there was a statistically significant difference in endurance performance time for each subject with and without a 13 mg dosage of caffeine. This data is a subset of a larger experiment.
Dose0 = Endurance performance time under the effects of the placebo.
Dose13 = Endurance performance under the effects of 13 mg of caffeine.
Diff = Dose13 – Dose0; the difference in endurance performance between the two caffeine dosage levels.
The data for this example is available here:
Wilcoxon Signed-Rank Test R Code
Each package used in the example can be installed with the install.packages commands as follows:
install.packages("MASS", dependencies = TRUE)
install.packages("ggplot2", dependencies = TRUE)
install.packages("qqplotr", dependencies = TRUE)
install.packages("dplyr", dependencies = TRUE)
install.packages("tidyr", dependencies = TRUE)
The R code below computes descriptive statistics for each treatment group, then runs a Shapiro-Wilk normality test and draws a boxplot for the paired differences. Data manipulation and summary statistics are performed using the dplyr and tidyr packages. Boxplots are created using the ggplot2 package. QQ plots can be created with the qqplotr package. The shapiro.test and wilcox.test functions are included in the base stats package.
library("MASS")
library("ggplot2")
library("qqplotr")
library("dplyr")
library("tidyr")

#Import the data
dat <- read.csv("C:/Dropbox/Website/Analysis/Wilcoxon Signed Rank/Data/CaffineWSR.csv")

#Compute the difference
dat$Diff <- dat$Dose13 - dat$Dose0

#Create a 'long' or 'tall' dataset for descriptive statistics
dat_long <- gather(dat, Dose, Minutes, Dose0, Dose13, factor_key = TRUE)

#Produce descriptive statistics by group
dat_long %>%
  select(Minutes, Dose) %>%
  group_by(Dose) %>%
  summarise(n = n(),
            mean = mean(Minutes, na.rm = TRUE),
            sd = sd(Minutes, na.rm = TRUE),
            stderr = sd / sqrt(n),
            LCL = mean - qt(1 - (0.05 / 2), n - 1) * stderr,
            UCL = mean + qt(1 - (0.05 / 2), n - 1) * stderr,
            median = median(Minutes, na.rm = TRUE),
            min = min(Minutes, na.rm = TRUE),
            max = max(Minutes, na.rm = TRUE),
            IQR = IQR(Minutes, na.rm = TRUE))

#Perform the Shapiro-Wilk Test for Normality on the paired differences
shapiro.test(dat$Diff)

#Produce a boxplot of the differences and visually check for outliers
#(fun replaces the deprecated fun.y argument in ggplot2 3.3.0 and later)
ggplot(dat, aes(x = "", y = Diff)) +
  stat_boxplot(geom = "errorbar", width = 0.5) +
  geom_boxplot(fill = "light blue") +
  stat_summary(fun = mean, geom = "point", shape = 10, size = 3.5, color = "black") +
  ggtitle("Boxplot of Dose13 - Dose0 differences") +
  theme_bw() +
  theme(legend.position = "none")

#Perform a Wilcoxon signed rank test
m1 <- wilcox.test(dat$Dose13, dat$Dose0, paired = TRUE, conf.int = TRUE)
print(m1)
Wilcoxon Signed-Rank Test Annotated R Output
Many times, analysts forget to take a good look at their data prior to performing statistical tests. Descriptive statistics are not only used to describe the data but also help determine if any inconsistencies are present. Detailed investigation of descriptive statistics can help answer the following questions (in addition to many others):
- How much missing data do I have?
- Do I have potential outliers?
- Are my standard deviation and standard error values large relative to the mean?
- In what range does most of my data fall for each treatment?
Dose    n  mean   sd     stderr  LCL    UCL    median  min    max    IQR
Dose0   9  46.44  12.49  4.16    36.84  56.04  45.2    28.34  66.38  20.50
Dose13  9  58.14  15.13  5.05    46.52  69.78  59.3    36.20  79.12  22.99
- Dose – This column identifies the levels of the treatment variable.
- n – This column identifies how many data points (cyclists) are in each dose category.
- mean – The mean endurance performance for each dosage.
- sd – The endurance performance standard deviation of each dosage.
- stderr – The endurance performance standard error of each dosage.
- LCL, UCL – The lower and upper limits of the 95% confidence interval for the mean. That is to say, you can be 95% confident that the true mean falls between the lower and upper values specified for each treatment group, assuming the data is normally distributed.
- median – The median endurance performance for each dosage.
- min, max – The minimum and maximum endurance performance observed for each dosage.
- IQR – The interquartile range endurance performance for each dosage level. The interquartile range is the 75th percentile – 25th percentile.
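As a quick check on how the stderr, LCL, and UCL columns are derived, the following sketch recomputes them for the Dose0 row from the rounded n, mean, and sd shown in the table. The variable names are illustrative, and small rounding differences are possible because the inputs are already rounded.

```r
# Rounded values taken from the Dose0 row of the descriptive statistics table
n          <- 9
mean_dose0 <- 46.44
sd_dose0   <- 12.49

stderr <- sd_dose0 / sqrt(n)             # standard error of the mean
t_crit <- qt(1 - 0.05 / 2, df = n - 1)   # two-sided 95% critical value with 8 df
lcl    <- mean_dose0 - t_crit * stderr   # lower confidence limit
ucl    <- mean_dose0 + t_crit * stderr   # upper confidence limit

round(c(stderr = stderr, LCL = lcl, UCL = ucl), 2)
```

The rounded results agree with the stderr, LCL, and UCL values reported for Dose0 (4.16, 36.84, and 56.04).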
Prior to performing a paired t-test or a Wilcoxon signed-rank test, it is important to validate our assumptions to ensure that we are performing an appropriate and reliable comparison. Normality should be tested on the dosage differences using a Shapiro-Wilk normality test (or equivalent), and/or a QQ plot for large sample sizes. Histograms can also be helpful.
Shapiro-Wilk normality test
data: dat$Diff
W = 0.83208, p-value = 0.04712
- W – The Shapiro-Wilk (W) test statistic for the paired differences.
- p-value – The p-value for the test. A p-value < 0.05 would indicate that we should reject the assumption of normality. Since the Shapiro-Wilk test p-value = 0.047, we conclude that the differences are not normally distributed. Thus, a Wilcoxon signed-rank test is more appropriate than a paired t-test to evaluate this data.
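The normality check described above can be sketched on its own as follows. The differences here are simulated and deliberately right-skewed; they are hypothetical and not the study data. Base R's qqnorm and qqline are used for the QQ plot instead of qqplotr to keep the example free of extra packages.

```r
set.seed(1)
# Simulated paired differences (hypothetical, right-skewed on purpose)
diffs <- rexp(30, rate = 0.2) - 2

# Shapiro-Wilk test; H0: the differences come from a normal distribution
sw <- shapiro.test(diffs)
print(sw)

# QQ plot: points far from the reference line suggest non-normality
qqnorm(diffs, main = "Normal QQ plot of simulated differences")
qqline(diffs)

# Decision rule from the text, at alpha = 0.05
use_nonparametric <- sw$p.value < 0.05
```

With small samples the Shapiro-Wilk test has limited power, so it is reasonable to weigh the plot alongside the p-value rather than rely on either alone.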
Boxplots to Visually Check for Outliers
The ggplot2 package provides a boxplot of the dosage differences. This can help visually identify outliers. The boxplot below shows one potential outlier outside the upper whisker of the plot. Rank-based tests such as the Wilcoxon signed-rank test are appropriate when outliers are present.
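The same screening can be done numerically. The sketch below uses base R's boxplot.stats, which flags points lying more than 1.5 times the hinge spread beyond the hinges, the usual boxplot whisker rule. The differences are hypothetical values with one deliberately large observation.

```r
# Hypothetical paired differences with one unusually large value
diffs <- c(2.1, 5.4, 1.3, 8.9, 3.2, 0.7, 4.8, 25.6, 2.9)

# coef = 1.5 is the usual whisker multiplier
bs <- boxplot.stats(diffs, coef = 1.5)

bs$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out     # values that would be drawn as individual points beyond the whiskers
```

A flagged point is only a candidate outlier. Whether it reflects a data entry error or a genuine extreme response is a judgment that requires looking at the record itself.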
Wilcoxon Signed-Rank Test in R
So far, we have determined that the differences between dosage levels are not normally distributed and that there is a potential outlier. Our next step is to perform the Wilcoxon signed-rank test to determine if there is a statistically significant difference in endurance performance between caffeine dosage levels.
Wilcoxon signed rank exact test
data: dat$Dose13 and dat$Dose0
V = 45, p-value = 0.003906
alternative hypothesis: true location shift is not equal to 0
- V – This is the Wilcoxon signed-rank test statistic. It is the sum of the ranks of the absolute differences for the pairs with a positive difference (Dose13 greater than Dose0). With nine cyclists, the largest possible value is the sum of the ranks 1 through 9, which is 45, so V = 45 means every cyclist lasted longer with caffeine than with the placebo.
- p-value – This is the p-value associated with the Wilcoxon signed-rank test. If the p-value < 0.05 (assuming alpha = 0.05), then there is a statistically significant difference in endurance performance between the 0 and 13 mg caffeine dosage levels. For our example, the p-value = 0.003906. Thus, we reject the null hypothesis that the endurance performance differences are symmetrically distributed around zero, and we conclude that a difference between caffeine dosage levels exists. Practically speaking, you conclude that there are differences in median endurance performance between caffeine dosage levels.
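The reported exact p-value can be reproduced from the two numbers in the output alone. This is a minimal sketch that assumes there are no tied or zero differences, which is consistent with R having used the exact test. It relies on psignrank, the signed-rank distribution function in the base stats package.

```r
n <- 9     # number of cyclists (pairs), from the descriptive statistics
V <- 45    # reported test statistic

# Largest possible positive rank sum: 1 + 2 + ... + n
max_V <- n * (n + 1) / 2

# Two-sided exact p-value: twice the upper-tail probability P(V >= 45)
p_exact <- 2 * psignrank(V - 1, n, lower.tail = FALSE)

c(max_V = max_V, p_exact = p_exact)
```

Only one of the 2^9 = 512 equally likely sign patterns gives V = 45, so the two-sided p-value is 2/512 = 0.00390625, which matches the reported 0.003906.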
Wilcoxon Signed-Rank Test Interpretation and Conclusions
A p-value of 0.003906 indicates that we should reject the null hypothesis that the endurance performance differences are symmetrically distributed around zero, and we conclude that a difference between caffeine dosage levels exists. Practically speaking, you conclude that there are differences in median endurance performance between caffeine dosage levels. Descriptive statistics show a median endurance time of 45.2 minutes with the placebo and 59.3 minutes with a 13 mg caffeine dosage. Thus, cyclist endurance performance improved under the effects of caffeine.
What to do When Assumptions are Broken or Things Go Wrong
When the differences between repeated measurements are normally distributed and do not contain outliers, a paired t-test is preferable to the Wilcoxon signed rank test since a paired t-test is considered more statistically powerful.
Additional options include permutation/randomization tests, bootstrap confidence intervals, and transforming the data, but each option has its own stipulations.
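As a rough illustration of the first two options, the sketch below runs a sign-flip permutation test on the mean paired difference and builds a percentile bootstrap confidence interval for the median difference. The differences are hypothetical values, and the number of resamples (5000) is an arbitrary choice.

```r
set.seed(42)
# Hypothetical paired differences (illustrative values only, not the study data)
diffs <- c(2.1, 5.4, 1.3, 8.9, 3.2, 0.7, 4.8, 15.6, 2.9)

# Sign-flip permutation test: under H0 each difference is equally likely to be
# positive or negative, so flip signs at random and recompute the mean
n_perm   <- 5000
obs_stat <- mean(diffs)
perm_stats <- replicate(n_perm, {
  signs <- sample(c(-1, 1), length(diffs), replace = TRUE)
  mean(signs * diffs)
})
# Two-sided Monte Carlo p-value (the +1 terms count the observed data itself)
p_perm <- (sum(abs(perm_stats) >= abs(obs_stat)) + 1) / (n_perm + 1)

# Percentile bootstrap 95% confidence interval for the median difference
n_boot <- 5000
boot_medians <- replicate(n_boot, median(sample(diffs, replace = TRUE)))
ci_median <- quantile(boot_medians, c(0.025, 0.975))

p_perm
ci_median
```

With only nine pairs, the bootstrap interval is coarse because the resampled medians can take few distinct values, which is one of the stipulations to keep in mind.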
If you need to compare more than two dependent groups, a single-factor repeated measures analysis of variance (ANOVA) or a nonparametric Friedman test would be appropriate.
Furthermore, if you have one between-subject factor and one within-subject factor to consider simultaneously, then a repeated measures mixed model ANOVA would be appropriate.
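For the Friedman test, a minimal sketch using friedman.test from the base stats package is shown here. The scores are hypothetical: six subjects, each measured under three conditions labeled A, B, and C.

```r
# Hypothetical scores for 6 subjects measured under three conditions
scores <- matrix(
  c(31, 27, 24,
    31, 28, 33,
    45, 29, 46,
    21, 18, 48,
    42, 36, 46,
    32, 17, 40),
  nrow = 6, byrow = TRUE,
  dimnames = list(subject = 1:6, condition = c("A", "B", "C"))
)

# Rows are blocks (subjects) and columns are the related groups (conditions)
fr <- friedman.test(scores)
print(fr)
```

The matrix layout matters: each row must hold one subject's measurements, since the test ranks the conditions within each subject.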
Missing values can severely impact a Wilcoxon signed-rank test because the entire row of data will generally be excluded. If you have a lot of missing data, one alternative would be to perform a single factor repeated measures mixed model ANOVA. This would allow for the computation of estimated marginal means to compensate for the uneven replication between groups.
A Wilcoxon signed-rank test is not appropriate if each experimental unit (subject) only receives one of two available treatments. For example, if you would like to see if first-year students scored differently on an exam when compared to second-year students, then each subject only has one of two potential factor levels. If this is the case, then an independent samples t-test, or its nonparametric counterpart, the Wilcoxon rank-sum (Mann-Whitney U) test, would be a more appropriate course of action.
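For the independent-groups situation just described, a minimal sketch might look like the following. The exam scores are hypothetical, and the Wilcoxon rank-sum test is included as a rank-based counterpart to the t-test.

```r
# Hypothetical exam scores for two independent groups of students
first_year  <- c(72, 85, 64, 90, 78, 69, 81)
second_year <- c(88, 76, 93, 84, 79, 91, 87, 95)

# Welch two-sample t-test (R's default; does not assume equal variances)
tt <- t.test(first_year, second_year)

# Wilcoxon rank-sum (Mann-Whitney U) test; paired = FALSE is the default
rs <- wilcox.test(first_year, second_year)

print(tt)
print(rs)
```

Note that the groups have different sizes here, which is fine for independent samples but impossible for a paired design.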
Additional Resources and References
Muenchen, R.A. (2011). R for SAS and SPSS Users, Second Edition. New York, NY: Springer, LLC.
Higgins, J.J. (2004). Introduction to Modern Nonparametric Statistics. Pacific Grove, CA: Brooks/Cole, Thomson Learning, Inc.
Conover, W.J. (1999). Practical Nonparametric Statistics. New York, NY: John Wiley & Sons, Inc.
Pasman, W.J., Baak, M.A., Jeukendrup, A.E., Haan, A. (1995). The Effect of Different Dosages of Caffeine on Endurance Performance Time. Int. J. Sports Med., Vol. 16, No. 4, pp. 225-230.