# Mann-Whitney U Test in R

## Introduction

A Mann-Whitney U test is typically performed when an analyst would like to test for differences between two independent treatments or conditions, but the continuous response variable of interest is not normally distributed. For example, you may want to know if first-year students scored differently on an exam when compared to second-year students, but the exam scores for at least one group do not follow a normal distribution. The Mann-Whitney U test is often considered a nonparametric alternative to an independent samples t-test. It is also known as the Mann-Whitney-Wilcoxon test, the Wilcoxon-Mann-Whitney test, and the Wilcoxon rank sum test.

A Mann-Whitney U test is typically performed when each experimental unit (study subject) is assigned only one of the two available treatment conditions. Thus, the treatment groups do not have overlapping membership and are considered independent. A Mann-Whitney U test is considered a “between-subjects” analysis.

Formally, the null hypothesis is that the distribution functions of both populations are equal. The alternative hypothesis is that the distribution functions are not equal.

Informally, we are testing to see if mean ranks differ between groups. Since mean ranks approximate the median, analysts often say that we are testing for median differences, even though this is not formally correct. For this reason, descriptive statistics regarding median values are usually provided when the Mann-Whitney U test is performed.

H0: distribution1 = distribution2

Ha: distribution1 ≠ distribution2
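To make "mean ranks" concrete, the sketch below pools two groups, ranks every observation together, and averages the ranks within each group. It uses sprays C and D from R's built-in `InsectSprays` dataset purely as illustration data (an assumption; any two independent samples would do), and the name `dat` is a placeholder.

```r
# Illustration data: sprays C and D from the built-in InsectSprays dataset
# (an assumption made for this sketch; substitute your own two groups)
dat <- droplevels(subset(datasets::InsectSprays, spray %in% c("C", "D")))

# Rank all observations together; tied values receive the average of their ranks
dat$rank <- rank(dat$count)

# Mean rank per group: the quantity the test informally compares
mean_ranks <- tapply(dat$rank, dat$spray, mean)
print(mean_ranks)

# Group medians, shown next to the mean ranks for comparison
group_medians <- tapply(dat$count, dat$spray, median)
print(group_medians)
```

A group whose values tend to be larger ends up with the larger mean rank, which is why the result is often described loosely in terms of medians.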

## Mann-Whitney U Test Assumptions

The following assumptions must be met in order to run a Mann-Whitney U test:

1. Treatment groups are independent of one another. Experimental units only receive one treatment and they do not overlap.
2. The response variable of interest is ordinal or continuous.
3. Both samples are random.

## Mann-Whitney U Test Example in R

In this example, we will test to see if there is a statistically significant difference in the number of insects that survived when treated with one of two available insecticide treatments.

Dependent response variable:
bugs = number of bugs

Categorical independent variable:
spray = two different insecticide treatments (C or D)

The data for this example is available here and represents a subset of a larger experiment:
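If the download is not at hand, a stand-in data frame can be built from R's built-in `InsectSprays` dataset. This is an assumption rather than something the text states: the group sizes, means, and medians reported later in this example are consistent with sprays C and D of that dataset. The sketch renames the `count` column to `bugs` so that it matches the variable names used here.

```r
# Assumption: the example data corresponds to sprays C and D of the built-in
# InsectSprays dataset; replace this with your own import if it differs
dat <- subset(datasets::InsectSprays, spray %in% c("C", "D"))

# Rename the response to match the variable name used in this example
names(dat)[names(dat) == "count"] <- "bugs"

# Drop the unused spray levels (A, B, E, F) so only C and D remain
dat$spray <- droplevels(dat$spray)

# Quick structural check: 24 rows, a numeric response, and a two-level factor
str(dat)
```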

## Mann-Whitney U Test R Code

Each package used in the example can be installed with the install.packages commands as follows:

```r
install.packages("gmodels", dependencies = TRUE)
install.packages("car", dependencies = TRUE)
install.packages("DescTools", dependencies = TRUE)
install.packages("ggplot2", dependencies = TRUE)
install.packages("qqplotr", dependencies = TRUE)
install.packages("dplyr", dependencies = TRUE)
```

The R code below includes Shapiro-Wilk normality tests and QQ plots for each treatment group. Data manipulation and summary statistics are performed using the dplyr package. Boxplots are created using the ggplot2 package. QQ plots are created with the qqplotr package. The wilcox.test function is included in the base stats package. Median confidence intervals are computed by the DescTools package.

Here is the annotated code for the example.  All assumption checks are provided along with the Mann-Whitney U test:

```r
library("gmodels")
library("car")
library("DescTools")
library("ggplot2")
library("qqplotr")
library("dplyr")

#Import the data into a data frame named dat with columns spray and bugs

#Designate spray as a categorical factor
dat$spray <- as.factor(dat$spray)

#Produce descriptive statistics by group
dat %>% select(spray, bugs) %>% group_by(spray) %>%
summarise(n = n(),
mean = mean(bugs, na.rm = TRUE),
sd = sd(bugs, na.rm = TRUE),
stderr = sd/sqrt(n),
LCL = mean - qt(1 - (0.05 / 2), n - 1) * stderr,
UCL = mean + qt(1 - (0.05 / 2), n - 1) * stderr,
median = median(bugs, na.rm = TRUE),
min = min(bugs, na.rm = TRUE),
max = max(bugs, na.rm = TRUE),
IQR = IQR(bugs, na.rm = TRUE),
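#MedianCI returns three values (median, lower limit, upper limit); keep one limit per column
LCLmed = MedianCI(bugs, na.rm=TRUE)[[2]],
UCLmed = MedianCI(bugs, na.rm=TRUE)[[3]])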

#Produce Boxplots and visually check for outliers
ggplot(dat, aes(x = spray, y = bugs, fill = spray)) +
stat_boxplot(geom ="errorbar", width = 0.5) +
geom_boxplot(fill = "light blue") +
stat_summary(fun=mean, geom="point", shape=10, size=3.5, color="black") +
ggtitle("Boxplot of Treatments C and D") +
theme_bw() + theme(legend.position="none")

#Test each group for normality
dat %>%
group_by(spray) %>%
summarise(`W Stat` = shapiro.test(bugs)$statistic,
p.value = shapiro.test(bugs)$p.value)

#Perform QQ plots by group
ggplot(data = dat, mapping = aes(sample = bugs, color = spray, fill = spray)) +
stat_qq_band(alpha=0.5, conf=0.95, qtype=1, bandType = "boot") +
stat_qq_line(identity=TRUE) +
stat_qq_point(col="black") +
facet_wrap(~ spray, scales = "free") +
labs(x = "Theoretical Quantiles", y = "Sample Quantiles") + theme_bw()

#Perform the Mann-Whitney U test
#(the formula interface is unpaired by default and missing values are dropped automatically)
m1 <- wilcox.test(bugs ~ spray, data=dat, exact=FALSE, conf.int=TRUE)
print(m1)

#Hodges Lehmann Estimator
m1$estimate
```

## Mann-Whitney U Test Annotated R Output

### Descriptive Statistics

Many times, analysts forget to take a good look at their data prior to performing statistical tests. Descriptive statistics are not only used to describe the data but also help determine if any inconsistencies are present. Detailed investigation of descriptive statistics can help answer the following questions (in addition to many others):

• How much missing data do I have?
• Do I have potential outliers?
• Are my standard deviation and standard error values large relative to the mean?
• In what range does most of my data fall for each treatment?
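
| spray | n | mean | sd | stderr | LCL | UCL | median | min | max | IQR | LCLmed | UCLmed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| C | 12 | 2.08 | 1.98 | 0.570 | 0.828 | 3.34 | 1.5 | 0 | 7 | 2 | 1 | 3 |
| D | 12 | 4.92 | 2.50 | 0.723 | 3.33 | 6.51 | 5 | 2 | 12 | 1.25 | 3 | 5 |
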
1. spray – The treatment levels corresponding to our independent variable ‘spray’.
2. n – The number of observations for each treatment.
3. mean – The mean value for each treatment.
4. sd – The standard deviation of each treatment.
5. stderr – The standard error of each treatment. That is, the standard deviation / sqrt(n).
6. LCL, UCL – The lower and upper limits of the 95% confidence interval of the mean. That is to say, you can be 95% confident that the true mean falls between the lower and upper values specified for each treatment group, assuming a normal distribution.
7. median – The median value for each treatment.
8. min, max – The minimum and maximum value for each treatment.
9. IQR – The interquartile range of each treatment. That is, the 75th percentile – 25th percentile.
10. LCLmed, UCLmed – The lower and upper limits of the 95% confidence interval for the median.
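The standard error and the t-based confidence limits described above can be reproduced by hand in base R. The sketch below does this for a single group, assuming (this is not stated in the text) that spray C from the built-in `InsectSprays` dataset matches the example data; the variable names are placeholders.

```r
# Assumption: spray C from the built-in InsectSprays dataset stands in for the example data
x <- with(datasets::InsectSprays, count[spray == "C"])

n    <- length(x)
xbar <- mean(x)
se   <- sd(x) / sqrt(n)                 # standard error of the mean

# t-based 95% confidence limits, the same formula used for LCL and UCL
tcrit <- qt(1 - 0.05 / 2, df = n - 1)   # critical value with n - 1 degrees of freedom
lcl   <- xbar - tcrit * se
ucl   <- xbar + tcrit * se

print(round(c(n = n, mean = xbar, stderr = se, LCL = lcl, UCL = ucl), 3))
```

Because these limits rely on the t distribution, they are only a rough guide when the data are clearly non-normal, which is one reason the median and its confidence interval are reported alongside them.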

### Boxplots

Side-by-side boxplots are provided by ggplot2.  The boxplots below seem to indicate one outlier in each treatment group. Furthermore, both the mean (circle with +) and median (middle line) values are at the 75th percentile.  This indicates that the data is highly skewed by the effects of the outlier(s).

### Normality Tests

Prior to performing the Mann-Whitney U, it is important to evaluate our assumptions to ensure that we are performing an appropriate and reliable comparison. If normality is present, an independent samples t-test would be a more appropriate test.

Normality should be tested using a Shapiro-Wilk normality test (or equivalent) and/or, for large sample sizes, QQ plots. Many times, histograms can also be helpful. However, this data set is so small that histograms did not add value.

In this example, we will use the shapiro.test function from the stats package to produce our Shapiro-Wilk normality test for each treatment group, and the stat_qq_band, stat_qq_line, and stat_qq_point functions from the qqplotr package to produce QQ plots. These functions are wrapped with “tidyverse” dplyr and ggplot2 syntax to easily produce separate analyses for each treatment group.

| spray | W Stat | p.value |
|---|---|---|
| C | 0.859 | 0.0476 |
| D | 0.751 | 0.0027 |

1. spray – This column identifies the levels of the treatment variable.
2. W Stat – The Shapiro-Wilk (W) test statistic for each group.
3. p.value – The p-value for each test. A p-value < 0.05 indicates that we should reject the assumption of normality. The Shapiro-Wilk test p-values for treatments C and D are both < 0.05, so we conclude that neither group is normally distributed.

### QQ Plots

The vast majority of points should follow the theoretical normal reference line and fall within the curved 95% bootstrapped confidence bands to be considered normally distributed. However, for spray D, a small deviation from normality can be observed which supports our Shapiro-Wilk normality test conclusion.

Since the Shapiro-Wilk test p-values are < 0.05 for both treatment groups, and the QQ plot for spray D shows a deviation from the theoretical normal diagonal line, we conclude the data is not normally distributed.

### Mann-Whitney U Test Results and Hodges-Lehmann Estimate in R

So far, we have determined that the data for each treatment group is not normally distributed, and we have major influential outliers. As a result, a Mann-Whitney U test would be more appropriate than an independent samples t-test to test for significant differences between treatment groups. Our next step is to officially perform a Mann-Whitney U test to determine which bug spray is more effective. The wilcox.test function performs this test in R.

        Wilcoxon rank sum test with continuity correction

    data:  bugs by spray
    W = 20, p-value = 0.002651
    alternative hypothesis: true location shift is not equal to 0
    95 percent confidence interval:
     -4.000018 -1.000009
    sample estimates:
    difference in location
                 -2.999922

1. W – This value represents the Wilcoxon test statistic. The Wilcoxon test statistic is the sum of the ranks in sample 1 minus n1*(n1+1)/2, where n1 is the number of observations in sample 1.
2. p-value – The p-value corresponding to the two-sided test. Because exact=FALSE was specified, it is based on the normal (Z) approximation with a continuity correction.
3. 95% confidence interval – The 95% confidence interval for the location shift in the number of bugs that survived, computed as spray C minus spray D.
4. difference in location – This value corresponds to the Hodges-Lehmann Estimate of the location parameter differences between sprays C and D.
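Both W and the Hodges-Lehmann estimate can be computed directly from their definitions, which helps show what the output is reporting. The sketch below assumes (this is not stated in the text) that sprays C and D from the built-in `InsectSprays` dataset match the example data; `x`, `y`, and the other names are placeholders.

```r
# Assumption: sprays C and D from the built-in InsectSprays dataset stand in for the example data
x <- with(datasets::InsectSprays, count[spray == "C"])  # sample 1
y <- with(datasets::InsectSprays, count[spray == "D"])  # sample 2

# W: sum of the sample 1 ranks in the pooled data, minus n1 * (n1 + 1) / 2
n1 <- length(x)
pooled_ranks <- rank(c(x, y))            # tied values receive average ranks
W <- sum(pooled_ranks[seq_len(n1)]) - n1 * (n1 + 1) / 2

# Hodges-Lehmann estimate: median of all n1 * n2 pairwise differences x_i - y_j
pairwise_diffs <- outer(x, y, "-")
hl <- median(pairwise_diffs)

print(c(W = W, hodges_lehmann = hl))

# Cross-check the hand-computed W against wilcox.test()
fit <- wilcox.test(x, y, exact = FALSE)
stopifnot(isTRUE(all.equal(unname(fit$statistic), W)))
```

When `exact = FALSE` and `conf.int = TRUE` are used, `wilcox.test` obtains its estimate numerically, which is why the reported value (-2.999922) is very slightly different from the median of the pairwise differences computed directly.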

## Mann-Whitney U Test Interpretation and Conclusions

We have concluded that the number of bugs in each treatment group is not normally distributed. In addition, outliers exist in each group. As a result, a Mann-Whitney U test is more appropriate than a traditional independent samples t-test to compare the effectiveness of two separate insecticide treatments.

The Mann-Whitney U test results in a two-sided p-value = 0.0027. This indicates that we should reject the null hypothesis that the distributions are equal and conclude that there is a significant difference in insecticide effectiveness. Descriptive statistics indicate that the median value is 1.5 for spray C and 5.0 for spray D, a difference of about 3.5 bugs. The Hodges-Lehmann estimate, which is the median of all pairwise differences between the two groups rather than the difference of the two medians, indicates that about 3 more bugs can be expected to survive when spray D is used instead of spray C. We are 95% confident that this difference between spray D and C across the population is between 1 and 4 bugs. Thus, spray C is more effective than spray D at controlling the bug population.

## What to do When Assumptions are Broken or Things Go Wrong

The Mann-Whitney U test is typically used as a last resort. This is because it has somewhat lower power than the independent samples t-test when the t-test's normality assumption actually holds.

More modern alternatives include permutation/randomization tests, bootstrap confidence intervals, and transforming the data, but each option has its own stipulations.
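As one illustration of these alternatives, a permutation test can be sketched in a few lines of base R. The version below is a minimal sketch, not a definitive implementation: it assumes sprays C and D from the built-in `InsectSprays` dataset as stand-in data, uses the difference in group means as the test statistic (a choice made for this sketch), and uses an arbitrary seed and number of permutations.

```r
# Assumption: sprays C and D from the built-in InsectSprays dataset stand in for the example data
dat <- droplevels(subset(datasets::InsectSprays, spray %in% c("C", "D")))

# Test statistic (a choice made for this sketch): difference in group means, D minus C
mean_diff <- function(values, groups) {
  mean(values[groups == "D"]) - mean(values[groups == "C"])
}

observed <- mean_diff(dat$count, dat$spray)

set.seed(123)     # arbitrary seed, for reproducibility only
n_perm <- 9999    # number of random relabelings (arbitrary)

# Shuffle the group labels and recompute the statistic each time
perm_stats <- replicate(n_perm, mean_diff(dat$count, sample(dat$spray)))

# Two-sided p-value; the +1 counts the observed labeling as one of the permutations
p_perm <- (sum(abs(perm_stats) >= abs(observed)) + 1) / (n_perm + 1)
print(c(observed_difference = observed, p_value = p_perm))
```

Shuffling the labels is only valid if the observations would be exchangeable under the null hypothesis, which is the main stipulation attached to this approach.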

If you need to compare more than two independent groups, a one-way Analysis of Variance (ANOVA) or Kruskal-Wallis test may be appropriate.
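Both options are available in the base stats package. The sketch below runs them on all six sprays of the built-in `InsectSprays` dataset, which is used here only as illustration data and is not the data file from this example.

```r
# Illustration only: all six sprays (A-F) from the built-in InsectSprays dataset
sprays <- datasets::InsectSprays

# Nonparametric comparison of more than two independent groups
kw <- kruskal.test(count ~ spray, data = sprays)
print(kw)

# Parametric counterpart: one-way ANOVA
anova_fit <- aov(count ~ spray, data = sprays)
print(summary(anova_fit))
```

A significant result from either test only says that at least one group differs; identifying which groups differ requires follow-up pairwise comparisons.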

A Mann-Whitney U test is not appropriate if you have repeated measurements taken on the same experimental unit (subject).  For example, if you have a pre-test post-test study, then each subject would be measured at two different time points.  If this is the case, then a paired t-test or corresponding nonparametric Wilcoxon signed-rank test may be a more appropriate course of action.
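For paired data, the same `wilcox.test` function performs the Wilcoxon signed-rank test when the two measurements are passed as separate vectors with `paired = TRUE`. The sketch below uses the built-in `sleep` dataset (10 subjects, each measured under two drugs) purely as illustration data; the variable names are placeholders.

```r
# Illustration only: the built-in sleep dataset has 10 subjects measured under two drugs
s <- datasets::sleep
s <- s[order(s$group, s$ID), ]     # make sure both vectors are in the same subject order

drug1 <- s$extra[s$group == 1]
drug2 <- s$extra[s$group == 2]

# Nonparametric paired test (Wilcoxon signed-rank); exact = FALSE because of ties
signed_rank <- wilcox.test(drug1, drug2, paired = TRUE, exact = FALSE)
print(signed_rank)

# Parametric counterpart: paired t-test
paired_t <- t.test(drug1, drug2, paired = TRUE)
print(paired_t)
```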