One-way Analysis of Variance (ANOVA) in R

Introduction

A one-way analysis of variance (ANOVA) is typically performed when an analyst would like to test for mean differences between three or more treatments or conditions.  For example, you may want to see if first-year students scored differently than second or third-year students on an exam.

A one-way ANOVA is appropriate when each experimental unit, (study subject) is only assigned one of the available treatment conditions. Thus, the treatment groups do not have overlapping membership and are considered independent. A one-way ANOVA is considered a “between-subjects” analysis.

The null hypothesis is that there is no difference between treatment group means, while the alternative hypothesis is that the mean of at least one treatment group differs.

H0: μ1 = μ2 = …  = μk

Ha: μ1μ2 ≠ … ≠ μk

One-way ANOVA Assumptions

In order to run a one-way ANOVA the following assumptions must be met:

  1. The response of interest is continuous and normally distributed for each treatment group.
  2. Treatment groups are independent of one another. Experimental units only receive one treatment, and they do not overlap.
  3. There are no major outliers.
  4. A check for unequal variances will help determine which version of a one-way ANOVA is most appropriate:
    • If variances are equal, then the assumptions of a standard one-way ANOVA are met.
    • If variances are unequal, then a Welch’s one-way ANOVA is appropriate.

One-way ANOVA Example in R

In this example, an experiment is performed to compare the dry weight of plants with one of three potential treatments. Each plant is treated with one out of three available treatments to enhance the weight of each plant. We want to answer the question, “Which treatment is optimal for enhancing the plant weight”?

Dependent response variable:
Plant_weight = The dried weight of each plant

Independent categorical variable:
Treatment = One of three available treatments to enhance plant weight

  • Ctrl = Control
  • Trt1 = Treatment 1
  • Trt2 = Treatment 2

The data for this example is available here:

One-way ANOVA R Code

Each package used in the example can be installed with the install.packages commands as follows:

install.packages("gmodels", dependencies = TRUE)
install.packages("car", dependencies = TRUE)
install.packages("ggplot2", dependencies = TRUE)
install.packages("qqplotr", dependencies = TRUE)
install.packages("dplyr", dependencies = TRUE)
install.packages("emmeans", dependencies = TRUE)

The R code below includes Shapiro-Wilk Normality Tests and QQ plots for each treatment group.  Data manipulation and summary statistics are performed using the dplyr package. Boxplots are created using the ggplot2 package. QQ plots are created with the qqplotr package. The ‘lm’ (Linear Models) function is included in the base stats package.

Two versions of Levene’s Test for Equality of Variances are performed in order to demonstrate the traditional solution along with a more robust form of the test.  In the leveneTest statement, the center=”mean” option will correspond to the traditional test as reported by other commercially available software. The center=”median” option is the default and can result in a slightly more robust solution to Levene’s Test.

Here is the annotated code for the example.  All assumption checks are provided along with the one-way ANOVA and post-hoc tests:

library("gmodels")
library("car")
library("ggplot2")
library("qqplotr")
library("dplyr")
library("emmeans")

#Import the data
dat<-read.csv("C:/Dropbox/Website/Analysis/One-way ANOVA/data/PlantGrowth.csv")

#Designate treatment as a categorical factor
dat$treatment<-as.factor(dat$treatment)

#Produce descriptive statistics by treatment
dat %>% select(plant_weight, treatment) %>% group_by(treatment) %>% 
  summarise(n = n(), 
            mean = mean(plant_weight, na.rm = TRUE), 
            sd = sd(plant_weight, na.rm = TRUE),
            stderr = sd/sqrt(n), 
            LCL = mean - qt(1 - (0.05 / 2), n - 1) * stderr,
            UCL = mean + qt(1 - (0.05 / 2), n - 1) * stderr,
            median = median(plant_weight, na.rm = TRUE),
            min = min(plant_weight, na.rm = TRUE), 
            max = max(plant_weight, na.rm = TRUE),
            IQR = IQR(plant_weight, na.rm = TRUE))

#Perform the Shapiro-Wilk Test for Normality on each group
dat %>%
  group_by(treatment) %>%
  summarise(`W Stat` = shapiro.test(plant_weight)$statistic,
            `p-value` = shapiro.test(plant_weight)$p.value)

#Perform QQ plots by group
ggplot(data = dat, mapping = aes(sample = plant_weight, color = treatment, fill = treatment)) +
  stat_qq_band(alpha=0.5, conf=0.95, qtype=1, bandType = "boot", B=5000) +
  stat_qq_line(identity=TRUE) +
  stat_qq_point(col="black") +
  facet_wrap(~ treatment, scales = "free") +
  labs(x = "Theoretical Quantiles", y = "Sample Quantiles") + theme_bw() 

#Perform Levene's Test of Equality of Variances
lev1<-leveneTest(plant_weight ~ treatment, data=dat, center="mean")
lev2<-leveneTest(plant_weight ~ treatment, data=dat, center="median")
print(lev1)
print(lev2)

#Produce boxplots and visually check for outliers
ggplot(dat, aes(x = treatment, y = plant_weight, fill = treatment)) +
  stat_boxplot(geom ="errorbar", width = 0.5) +
  geom_boxplot(fill = "light blue") + 
  stat_summary(fun.y=mean, geom="point", shape=10, size=3.5, color="black") + 
  ggtitle("Boxplots of plant weight for each treatment") + 
  theme_bw() + theme(legend.position="none")

#Perform an ANOVA to check for plant weight differences between treatments
m1<-lm(plant_weight ~ treatment, data=dat, contrasts = c("contr.sum", "contr.poly"))
Anova(m1, type=3)

#Compute expected marginal means post-hoc tests
posthocs<-emmeans(m1, pairwise ~ treatment, adjust="tukey")

#Display post-hoc letter groupings
CLD(posthocs$emmeans, details=TRUE, sort=TRUE, alpha=0.05, Letters = letters, adjust="tukey")

#Plot estimated marginal means
emm <- summary(posthocs)$emmeans
ggplot(emm, aes(treatment)) +
  geom_line(aes(y = emmean, group = 1)) +
  geom_errorbar(aes(ymin = lower.CL, ymax = upper.CL), width = 0.2) +
  geom_point(aes(y = emmean), size = 2, color = "blue") +
  labs(x = "Treatment", y = "Estimated marginal mean", 
       title = "Estimated marginal means with 95% confidence intervals") +
  theme_bw()

#Plot contrasts
plot(posthocs$contrasts) +
  geom_vline(xintercept = 0) + 
  theme_bw() +
  labs(y = "Contrast", 
       x = "Estimated marginal mean difference", 
       title = "Estimated marginal mean differences with 95% confidence intervals")

One-way ANOVA Annotated R Output

Descriptive Statistics

Many times, analysts forget to take a good look at their data prior to performing statistical tests. Descriptive statistics are not only used to describe the data but also help determine if any inconsistencies are present. Detailed investigation of descriptive statistics can help answer the following questions (in addition to many others):

  • How much missing data do I have?
  • Do I have potential outliers?
  • Are my standard deviation and standard error values large relative to the mean?
  • In what range most of my data fall for each treatment?
treatmenta    nb   meanc     sdd     stderre   LCLf    UCLf   mediang   minh   maxh   IQRi
ctrl         10   5.03    0.583   0.184    4.61   5.45   5.15  4.17   6.11  0.743
trt1         10   4.66    0.790   0.250    4.09   5.22   4.55  3.59   6.03  0.662
trt2         10   5.53    0.443   0.140    5.21   5.84   5.44  4.92   6.31  0.467
  1. treatment – Each treatment level of our independent variable.
  2. n – The number of observations for each treatment.
  3. mean – The mean value for each treatment.
  4. sd – The standard deviation of each treatment.
  5. stderr – The standard error of each treatment.  That is the standard deviation / sqrt (n).
  6. LCL, UCL – The lower and upper confidence intervals of the mean.  That is to say, you can be 95% certain that the true mean falls between the lower and upper values specified for each treatment group, assuming the data is normally distributed. 
  7. Median – The median value for each treatment.
  8. Minimum, Maximum – The minimum and maximum value for each treatment.
  9. Quartile Range – The inner quartile range of each treatment. That is the 75th percentile –  25th percentile.

Normality Tests

Prior to performing a one-way ANOVA, it is important to validate our assumptions to ensure that we are performing an appropriate and reliable comparison. Testing normality should be performed using a Shapiro-Wilk normality test (or equivalent), and/or a QQ plot for large sample sizes. Many times, histograms can also be helpful especially for large sample sizes.

In this example, we will use the shapiro.test function from the stats package to produce our Shapiro-Wilk normality test for each treatment group, and the qqPlot function from the qqplotr package to produce QQ plots. These functions are wrapped with “tidyverse” dplyr syntax to easily produce separate analyses for each treatment group.

treatmenta    W Statb   p-valuec
ctrl          0.957     0.747
trt1          0.933     0.478
trt2          0.941     0.564

Shapiro-Wilk Test annotations:

  1. Test – Four different normality tests are presented.
  2. Statistic – The test statistics for each test is provided here.
  3. p Value – The p-value for each test is provided.  A p-value < 0.05 would indicate that we should reject the assumption of normality. Since the Shapiro-Wilk Test p-values are > 0.05 for each group(p=.75, p=.48, p=.56, respectively), we conclude the data is normally distributed.

QQ Plots

The vast majority of points should follow the theoretical normal reference line and fall within the curved 95% bootstrapped confidence bands to be considered normally distributed.

QQ Plot of each treatment group

Since the Shapiro-Wilk Test p-values are > 0.05, and the QQ Plot for each treatment group follows the QQ plot theoretical normal diagonal line, we conclude the data is normally distributed.

Levene’s Test for Homogeneity of Variances

Levene’s Test for Homogeneity of Variance is performed using the traditional mean centered methodology and using R’s default median centered methodology. The null hypothesis for this test is that variances are equal across groups. The alternative hypothesis is that variances are unequal for at least one of our treatment groups.

If we reject the null hypothesis of equal variances and conclude variances are unequal, then a Welch’s ANOVA is appropriate. However, if we do not reject the null hypothesis, the standard one-way ANOVA results would be considered appropriate.

More clearly, if the Levene’s test p-value P ≤ 0.05, use a Welch’s t-test.  If P > 0.05, use the standard one-way ANOVA results.

In our example, a p-value = 0.3093 or 0.3445 indicates that we fail to reject the null hypothesis and conclude that variances are equal. Therefore we will be using the standard ANOVA results after assumptions are checked.

Levene's Test for Homogeneity of Variance (center = "mean")
      Dfa F valueb Pr(>F)c
group  2  1.2258 0.3093
      27               
Levene's Test for Homogeneity of Variance (center = "median")
      Dfa F valueb Pr(>F)c
group  2  1.1089   0.3445
      27               

Levene’s Test for Homogeneity of Variance annotations:

  1. Df – The degrees of freedom associated with each variable and overall error.
  2. F Value – The F statistic for which the p-value is computed.
  3. Pr > F -Levene’s Test for Equality of Variances shows a p-value of 0.3445. A significant p-value (P ≤ 0.05) would indicate that Welch’s ANOVA should be used in place of a standard one-way ANOVA. However since variances are considered equal (P > 0.05), the standard one-way ANOVA results can be reported.

Boxplots to Visually Check for Outliers

The ggplot2 package provides side-by-side boxplots. Boxplots can help visually identify major outliers and help visually show if variances might be unequal.  The boxplot below seems to indicate one minor outlier but subjectively, not enough evidence to suggest we move to a different analysis method.

Boxplot of plant weight for each treatment

One-way ANOVA Results

So far, we have determined that the data for each treatment group is normally distributed, variances are equal, and we do not have major influential outliers. Our next step is to officially perform the ANOVA to determine whether there is a statistically significant difference in plant weight between treatments.

Anova Table (Type III tests)

Response: plant_weight
             Sum Sqa Dfb  F valuec  Pr(>F)d    
(Intercept) 253.210   1 654.5976   < 2e-16 ***
treatment     3.783   2   4.8897   0.01541 *  
Residuals    10.444  27                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

ANOVA annotations:

Unlabeled First Column– This column identifies the partitioned sources of variability for the ANOVA calculation. We will ignore the ‘(Intercept)’. The ‘treatment’ measures represent the variation that can be explained by the model.  The source ‘Residuals’ is the leftover variation not explained by measurable sources.

  1. Sum Sq – The sum of squares represents the sums of squares calculation for each source of variation. I will attempt to provide the primarily non-mathematical version of the interpretation of each source.  However, if you prefer the mathematical interpretation, I encourage you to read the post here.
    • Model – This value is a calculation of the variation in our treatment group means around the overall plant weight (grand) mean.
    • Residuals – This value is a calculation of the variation in individual plant weight values around the mean of the corresponding treatment. This is an overall measure of the discrepancy between the data and the model.
  2. Df – This column displays information corresponds to the degrees of freedom associated with each source of variability. 
    • Model – This represents our number of treatment levels – 1.  Thus, 3 – 1 = 2.
    • Residuals – This represents our Total Sample Size – 1 – treatment dfs = 30 – 1 – 2 = 27
  3. F value – This value is computed by dividing the mean square of our model by the mean square error and represents our ANOVA test statistic.  This can be thought of as the signal/noise ratio.  The higher the value, the more significant the effect of our treatment is at explaining variability in our outcome (plant weight).
  4. Pr(>F) – The p-value associated with our F test statistic. The p-value for our test is 0.01541.  Given our alpha =0.05, we would reject our null hypothesis and conclude that there is a statistically significant difference in plant weight between treatments.

General Information About Post-hoc Tests

In our example, an ANOVA p-value=0.0154 indicates that there is an overall difference in mean plant weight between at least two of our treatments groups.  However, it does not provide us with an indication of which groups are different without also performing post-hoc tests. Post-hoc tests are typically adjusted for the number of tests performed in order to control for Type I errors. As a result, we will apply Tukey’s post-hoc test p-value adjustment. This will allow us to determine which groups significantly differ while controlling for Type I errors.

In R, the emmeans package is typically used to perform post-hoc tests. Furthermore, this statement will compute the estimated marginal mean values for each treatment group and the corresponding differences between treatment group combinations. This approach is preferred especially for analyses where there are missing values, or you have an unequal number of subjects (or experimental units) within each treatment. You can find more information on estimated marginal means using the following link.

Post-hoc Test Results

$emmeans
 treatmenta  emmeanb    SEc   dfd  lower.CLe upper.CLe  .groupf
 trt1        4.66     0.197  27     4.16     5.16        a    
 ctrl        5.03     0.197  27     4.53     5.53        ab   
 trt2        5.53     0.197  27     5.03     6.03         b   

Confidence level used: 0.95 
Conf-level adjustment: sidak method for 3 estimates 
P value adjustment: tukey method for comparing a family of 3 estimates 
significance level used: alpha = 0.05 
$contrasts
 contrastg    estimateh  SEi    dfd t.ratioj p.valuek
 ctrl - trt1    0.373   0.278  27   1.341   0.3854 
 ctrl - trt2   -0.494   0.278  27  -1.776   0.1966 
 trt1 - trt2   -0.867   0.278  27  -3.117   0.0116 

P value adjustment: tukey method for comparing a family of 3 estimates 

Post-hoc annotations:

  1. treatment – This column designates each categorical level of our treatment effect. 
  2. emmean – Estimated marginal mean (also known as least squares mean) values for each treatment group.
  3. SE – The standard error associated with the estimated marginal mean of each treatment group.
  4. df – The degrees of freedom for each comparison.
  5. lower.CL, upper.CL– The upper and lower confidence intervals of the estimated marginal (least squares) mean.  That is to say, you can be 95% certain that the true estimated marginal mean falls between the lower and upper values specified for each treatment group. 
  6. .group – Post-hoc letter groupings.  Treatments that share the same letter are not significantly different. For example, trt1 and ctrl share the letter a, and do not significantly differ.  However, trt1 and trt2 do not share the same letter, thereby indicating a statistically significant difference.
  7. contrasts – This column indicates the paired comparison that is being performed. For example, trt1 – trt2 indicates that we are comparing treatments 1 and 2 by subtracting trt2 from trt1.
  8. estimate – This value represents the marginal mean difference between treatment combinations identified in the adjoining contrast column. For example, trt1 – trt2 indicates that we are comparing treatments 1 and 2 by subtracting trt2 from trt1= 4.66 – 5.53 = -0.87. So on average, plants treated with treatment 1 weight about 0.867 less (lbs?) than those treated with treatment 1 (p=0.0116).
  9. SE – The standard error associated with the estimated marginal mean difference between treatment group.
  10. t.ratio – The t-test statistic for which the p-value is computed.
  11. p.value – The Tukey adjusted p.value associated with each individual paired comparison.

The “takeaway” here is that treatments 1 and 2 significantly differ (p = 0.0116).  However, no differences exist between the control group and treatment 1 (p = 0.3854) or the control group and treatment 2 (p=.1966). Furthermore, the expected mean difference between treatments 1 and 2 is approximately -0.87.  We can reverse this difference and say that on average, plants in treatment group 2 weight about 0.87 more (lbs?) than those in treatment group 1.

While this information can be helpful, in larger studies it can be hard to mentally process this information without additional details. The ggplot2 package can be used to create additional helpful plots which can help visually describe the results.

Estimated Marginal Means Plot with 95% Confidence Intervals

Estimated marginal means plot with 95% confidence intervals

The estimated marginal means plot provides a visual aid to help interpret the numerical information provided by our post-hoc tests.  Treatments are identified on the X-axis and mean plant weights are provided on the Y-axis.  The circular point identifies the marginal mean of each treatment and while the high-low bars represent 95% confidence intervals. Confidence intervals from different treatments that do not overlap are generally of interest and warrants further investigation in the post-hoc tests.

In the plot above, a substantial increase between treatment 2 and 1 can be observed, thus reflecting a significant p-value = 0.0116. The difference in plant yield between treatments 2 and 1 is approximately 0.87.

Estimated Marginal Mean Differences with 95% Confidence Intervals

Estimated marginal mean difference plot with 95% confidence intervals

The estimated marginal mean difference plot with 95% confidence intervals attempts to convey the post-hoc test results in graphical form. A statistically significant difference between two treatment levels does not exist when the blue 95% confidence interval overlaps the vertical zero line.  However, when the confidence interval does not overlap 0, a statistically significant difference does exist, as is the case for trt1 – trt2.

One-way ANOVA Interpretation and Conclusions

The overall ANOVA p-value = 0.01541. This indicates a statistically significant difference exists between plant weights of least two treatment groups. Post-hoc tests reveal that significant differences exist between treatments 1 and 2 (p = 0.0116). However, significant differences do not exist between our control group and treatment 1, or our control group and treatment 2.

We can quantify this difference by reviewing our estimated marginal (LS) mean differences from the post-hoc tests. The estimated marginal mean difference between treatments 1 and 2 is -0.87 with a 95% confidence interval on the difference of [-1.56, -0.17]. For simpler interpretation, it is possible to turn this comparison around and say that the difference between treatment 2 and treatment 1 is about 0.87 with a 95% CI [0.17, 1.56]. This can be confirmed visually by evaluating the estimated marginal mean plots with 95% confidence intervals.

As a result, we can conclude that treatment 2 is superior to treatment 1 at enhancing plant weight.  However, neither treatment is statistically different than our control group.  

What to do When Assumptions are Broken or Things Go Wrong

The lack of normality or severe impact of outliers can violate ANOVA assumptions and ultimately the results.  If this happens, there are several available options:

Perform a nonparametric Kruskal-Wallis test is the most popular alternative. This test is considered robust to violations of normality and outliers (among others) and tests for differences in mean ranks. The Kruskal-Wallis test can have slightly decreased power when compared to ANOVA. So while the Kruskal-Wallis test is an acceptable option, transforming the dependent (using a log, square root, etc.) should result in a slightly higher power. A permutation ANOVA would be a more modern, but lesser known, nonparametric alternative.

If your data is normally distributed, but you have unequal variances, then a Welch’s ANOVA is a viable option.

While a one-way ANOVA is appropriate if you have a between-subjects design (each experimental only receives only one treatment), a one-way ANOVA is not appropriate for a within-subjects design.  A within-subjects design can be analyzed with a repeated measures ANOVA. This is appropriate when each experimental unit (subject) receives more than one treatment.  For example, if you wanted to see if students exam scores differed between 3 tests, then a single factor repeated measures ANOVA would be an appropriate analysis.

Many variations exist for both within and between measures designs.  If an analyst needs to compare two between-subject factors, a two-way ANOVA would be appropriate.  If you have one between-subject factor, and one within-subject factor then a repeated measures split-plot ANOVA would be the way to go.  If you have two within-subject factors then a doubly repeated measures ANOVA would be appropriate.  This goes on…

Additionally, if you have a continuous outside source of measurable variability, then an analysis of covariance (ANCOVA) can be performed to capture both the effects of a categorical treatment while also accounting for the continuous covariate.

As I update this site, I will provide the alternatives listed and much more.

Additional Resources and References

SAS Version 9.4, SAS Institute Inc., Cary, NC.

Hicks, C.R. & Turner, K.V. (1999). Fundamental Concepts in the Design of Experiments. New York, NY: Oxford University Press.

Mitra, A. (1998). Fundamentals of Quality Control and Improvement. Upper Saddle River, NJ: Prentice Hall.

Laplin, L.L. (1997). Modern Engineering Statistics. Belmont, CA: Wadsworth Publishing Company.

Dobson, A. J. (1983) An Introduction to Statistical Modelling. London: Chapman and Hall