Many statistical analysis techniques require either the raw data or the model residuals to be normally distributed. Basic bivariate methods such as the t-test or correlation require each raw measure to be normally distributed, while modeling procedures such as regression, multilevel models, or mixed-model ANOVA require the model residuals to be normally distributed.
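To make the residual case concrete, here is a minimal sketch using the built-in `PlantGrowth` data that the examples below also use. The model formula and the names `fit` and `res` are illustrative assumptions, not part of the original walkthrough.

```r
# Illustrative only: the model and object names are assumptions.
data("PlantGrowth")

# Fit a simple linear model (equivalent to a one-way ANOVA)
fit <- lm(weight ~ group, data = PlantGrowth)

# For modeling procedures, normality is assessed on the residuals,
# not on the raw outcome variable
res <- residuals(fit)

# Base R Q-Q plot of the residuals with a reference line
qqnorm(res)
qqline(res)
```

The same checks shown in the rest of this article can then be applied to `res` instead of the raw measure.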
Q-Q Plots in R
Using a Q-Q (Quantile-Quantile) plot in R is a great way to assess whether data is normally distributed. Below we will evaluate plant weights from the plant growth dataset for normality.
The quickest way to check for normality is as follows:
library("dplyr")
library("ggplot2")
library("qqplotr")
library("ggpubr")
library("datasets")
library("MASS")
data("PlantGrowth")
ggqqplot(PlantGrowth, x = "weight", main = "Q-Q Plot with 95% confidence bands")
Notice how the weight data follow the black diagonal reference line and the data points fall within the shaded upper and lower 95% confidence bands. This indicates that the data is consistent with a normal distribution.
If there is a significant deviation outside of the confidence bands or the data does not follow the diagonal identity line, then the assumption of normality should be rejected.
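For readers who want to see what the plot is made of, the following base R sketch rebuilds the two axes by hand. It assumes the plotting positions returned by `ppoints()`, which is what base R's `qqnorm()` uses; the variable names are illustrative.

```r
data("PlantGrowth")
x <- PlantGrowth$weight
n <- length(x)

# Sample quantiles: the observed values in ascending order
sample_q <- sort(x)

# Theoretical quantiles: standard normal quantiles at the
# plotting positions returned by ppoints()
theo_q <- qnorm(ppoints(n))

# Plot one against the other and add a reference line
plot(theo_q, sample_q,
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")
qqline(x)
```

If the sample is close to normal, the points lie close to a straight line.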
A Nicer Q-Q Plot
A nicer plot of the same data with bootstrapped confidence bands can be produced using ‘ggplot2’ in conjunction with the ‘qqplotr’ package as follows:
ggplot(data = PlantGrowth, mapping = aes(sample = weight)) +
  # bootstrapped 95% confidence band (5000 resamples)
  stat_qq_band(fill = "light blue", alpha = 0.5, conf = 0.95, bandType = "boot", B = 5000) +
  # reference line and observed points
  stat_qq_line() +
  stat_qq_point() +
  ggtitle("Normal Q-Q Plot with 95% confidence bands") +
  labs(x = "Theoretical Quantiles", y = "Sample Quantiles") +
  theme_bw()
The Shapiro-Wilk Test in R
For sample sizes less than 100, the Shapiro-Wilk test can be used to formally test data for deviations from normality. Note that this test is very sensitive: for larger sample sizes, it may report a p-value < 0.05 even when the departure from normality is too small to matter in practice.
shapiro.test(PlantGrowth$weight)
Shapiro-Wilk normality test
W = 0.98268, p-value = 0.8915
As shown above, with a p-value of 0.8915 we fail to reject the null hypothesis of normality, so the data can be treated as normally distributed.
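When the result feeds into a script rather than being read off the console, the test statistic and p-value can be pulled out of the object that `shapiro.test()` returns. This is a small sketch; the 0.05 cutoff and the variable names are assumptions chosen for illustration.

```r
data("PlantGrowth")

# Store the test result instead of only printing it
sw <- shapiro.test(PlantGrowth$weight)

# Extract the W statistic and the p-value
w_stat <- unname(sw$statistic)
p_val <- sw$p.value

# Assumed significance level; adjust to your own analysis plan
alpha <- 0.05
reject_normality <- p_val < alpha
reject_normality
```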
How to Perform Q-Q Plots by Group in R
Sometimes it may be necessary to evaluate data for normality within each of several groups. This is commonly done to check the normality assumption for analyses such as an independent-samples t-test or one-way ANOVA. This can easily be done using the ‘group’ variable as follows:
ggplot(data = PlantGrowth, mapping = aes(sample = weight, color = group, fill = group)) +
  stat_qq_band(alpha = 0.5, conf = 0.95, bandType = "boot", B = 5000) +
  stat_qq_line(identity = TRUE) +
  stat_qq_point(col = "black") +
  facet_wrap(~ group, scales = "free") +
  ggtitle("Normal Q-Q Plot with 95% confidence bands by group") +
  labs(x = "Theoretical Quantiles", y = "Sample Quantiles") +
  theme_bw()
The Shapiro-Wilk Test by Group in R
Similarly, the Shapiro-Wilk test can be run for several groups at once in R using ‘dplyr’ by-group processing, as shown below:
PlantGrowth %>%
  group_by(group) %>%
  summarise(`W Stat` = shapiro.test(weight)$statistic,
            `p-value` = shapiro.test(weight)$p.value)

  group `W Stat` `p-value`
  <fct>    <dbl>     <dbl>
1 ctrl     0.957     0.747
2 trt1     0.930     0.452
3 trt2     0.941     0.564
Since all p-values are > 0.05, we fail to reject normality in any group, so each individual group can be treated as normally distributed.
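If ‘dplyr’ is not available, the same by-group test can be sketched in base R. This is an alternative to the pipeline above, not the method the article uses; the names `by_group` and `sw_by_group` are illustrative. It also runs the test only once per group.

```r
data("PlantGrowth")

# Split the weights into one numeric vector per group
by_group <- split(PlantGrowth$weight, PlantGrowth$group)

# Run the Shapiro-Wilk test once per group and keep W and the p-value
sw_by_group <- t(sapply(by_group, function(w) {
  sw <- shapiro.test(w)
  c(W = unname(sw$statistic), p.value = sw$p.value)
}))

# One row per group, one column per quantity
round(sw_by_group, 3)
```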