Side-by-side Boxplots in R

Boxplots have multiple uses in statistics. Two of the most common are to determine whether outliers may be present in a dataset, and to visualize whether the data seem to be relatively uniformly distributed across different groups. While statistical modeling procedures such as regression, multilevel models, and mixed model analysis of variance (ANOVA) have more comprehensive ways to evaluate outliers and validate assumptions as a part of the modeling process, a simple boxplot can help visualize the data and look for potential outliers prior to performing more advanced analyses.

KD Nuggets does a nice job of describing boxplots comprehensively. In short, values outside of the upper and lower fences of the boxplot are considered outliers. The upper fence is generally defined as Q3 + 1.5*IQR and the lower fence is generally defined as Q1 – 1.5*IQR. IQR is the interquartile range, which is the difference between the 75th percentile (Q3) and the 25th percentile (Q1); these two percentiles form the edges of the box. The image from KD Nuggets is provided below:

How to Create Side-By-Side Boxplots and Label Outliers Using R

The R ‘ggplot2’ package can be used to create side-by-side boxplots for the ‘PlantGrowth’ dataset. However, a separate function is necessary to identify outliers so that they can be labeled with their values. Below is the code for loading the ‘PlantGrowth’ dataset along with a function used to identify outliers less than Q1 – 1.5*IQR or greater than Q3 + 1.5*IQR.

library("dplyr")
library("datasets")
library("ggplot2")

data("PlantGrowth")

#R function to identify outliers
find_outlier <- function(x) {
  return(x < quantile(x, .25) - 1.5*IQR(x) | x > quantile(x, .75) + 1.5*IQR(x))
}
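
As a quick sanity check, the function can be tried on a small made-up vector. The values below are hypothetical (they are not from ‘PlantGrowth’) and were chosen so that one point clearly falls above the upper fence. The function is repeated so that the snippet runs on its own, and the fences rely on R's default quantile method:

```r
# Same outlier rule as above, repeated so this snippet is self-contained
find_outlier <- function(x) {
  return(x < quantile(x, .25) - 1.5*IQR(x) | x > quantile(x, .75) + 1.5*IQR(x))
}

# Hypothetical example values (not from PlantGrowth)
x <- c(2, 3, 4, 5, 6, 7, 8, 9, 30)

# Fences based on R's default quantile definition
q1 <- quantile(x, .25, names = FALSE)
q3 <- quantile(x, .75, names = FALSE)
lower_fence <- q1 - 1.5 * IQR(x)
upper_fence <- q3 + 1.5 * IQR(x)

c(lower = lower_fence, upper = upper_fence)

# Values flagged as outliers
x[find_outlier(x)]
```

Here Q1 is 4 and Q3 is 8, so the IQR is 4 and the fences are -2 and 14; only the value 30 is flagged.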

Next we can incorporate this function into a ‘dplyr’ mutate statement in order to create a side-by-side boxplot of plant weights by treatment group while identifying and labeling outliers. Here ‘group’ is our categorical grouping variable, and ‘weight’ is our continuous numeric measure. The mean of each treatment group is shown as a circle with a ‘+’ sign, while the median, or 50th percentile, is represented by the bold line in the center of the box.

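# Flag outliers within each treatment group, then plot
PlantGrowth %>%
  group_by(group) %>%
  # Keep the weight for outliers only; other rows get NA and are not labeled
  mutate(outlier = ifelse(find_outlier(weight), weight, NA_real_)) %>%
  ggplot(aes(x = group, y = weight)) +
  # Horizontal caps at the ends of the whiskers
  stat_boxplot(geom = "errorbar", width = 0.5) +
  geom_boxplot(fill = "light blue") +
  # Group means, drawn as a circle with a plus sign (shape 10)
  stat_summary(fun = mean, geom = "point", shape = 10, size = 3.5) +
  # Label each outlier with its weight
  geom_text(aes(label = outlier), na.rm = TRUE, hjust = -.5) +
  ggtitle("Boxplots of plant weight for each treatment") +
  theme_bw() + theme(legend.position = "none")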
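
Because the mutate step runs after group_by, the outlier rule is evaluated within each treatment group rather than on the pooled weights. A minimal sketch for inspecting the flagged rows without plotting is shown here, assuming ‘dplyr’ is installed; the object name flagged is only illustrative, and the function is repeated so the snippet runs on its own:

```r
library("dplyr")

# Same outlier rule as above, repeated so this snippet is self-contained
find_outlier <- function(x) {
  return(x < quantile(x, .25) - 1.5*IQR(x) | x > quantile(x, .75) + 1.5*IQR(x))
}

# Apply the rule per group and keep only the rows flagged as outliers
flagged <- PlantGrowth %>%
  group_by(group) %>%
  mutate(outlier = ifelse(find_outlier(weight), weight, NA_real_)) %>%
  ungroup() %>%
  filter(!is.na(outlier))

flagged
```

The rows returned here should correspond to the points that are labeled in the plot.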

The resulting side-by-side boxplot is as follows:

Side-by-side boxplot with labeled outliers

From the plot, we can observe that treatment group 1 (trt1) contains two outliers that warrant further investigation. Plant weights of 5.87 and 6.03 are above Q3 + 1.5*IQR and should be investigated further to determine if the data points are true outliers, or if there was a typographical error impacting the data.
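
The fences behind this observation can be recomputed for each group. The following base R sketch applies the same 1.5*IQR rule with R's default quantile method; the name fence_table is hypothetical and used only for illustration:

```r
# Split the weights by treatment group
weights_by_group <- split(PlantGrowth$weight, PlantGrowth$group)

# Compute Q1, Q3, and the 1.5*IQR fences for each group
fence_table <- do.call(rbind, lapply(weights_by_group, function(w) {
  q <- quantile(w, c(.25, .75), names = FALSE)
  iqr <- q[2] - q[1]
  data.frame(q1 = q[1],
             q3 = q[2],
             lower_fence = q[1] - 1.5 * iqr,
             upper_fence = q[2] + 1.5 * iqr)
}))

fence_table

# Weights in trt1 that lie above that group's upper fence
trt1 <- weights_by_group[["trt1"]]
trt1[trt1 > fence_table["trt1", "upper_fence"]]
```

For trt1 the upper fence works out to roughly 5.86, so the weight of 5.87 exceeds it only narrowly, which is worth keeping in mind when deciding how to treat that observation.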
