Want to know the easiest way to perform descriptive statistics by group in R?
The ‘rstatix’ package has you covered.
The code below uses the employee dataset from the ‘stima’ package to compute descriptive statistics for salary by gender and minority group, using the get_summary_stats() function from the ‘rstatix’ package.
Basic summary statistics for salary by minority and gender are displayed below.
By Group Descriptive Statistics Using the ‘rstatix’ Package
library("stima")
library("rstatix")
library("dplyr")
library("DescTools")
data("employee")
out <- employee %>% group_by(gender, minority) %>% get_summary_stats(salary, type="common")
df <- data.frame(out)
df
By Group Descriptive Statistics ‘rstatix’ Output
  minority gender variable   n   min    max median     iqr     mean        sd       se       ci
1      min      f   salary  40 16350  35100  23775  5025.0 23062.50  3972.369  628.087 1270.425
2   no_min      f   salary 176 15750  58125  24450  7500.0 26706.79  8011.894  603.919 1191.902
3      min      m   salary  64 19650 100000  29025  5062.5 32246.09 13059.881 1632.485 3262.261
4   no_min      m   salary 193 21300 135000  36000 26650.0 44524.77 20371.882 1466.400 2892.322
Note: The confidence interval (ci) reported above can be misleading. It is the distance above or below the mean, sometimes referred to as the half-width. So the 95% confidence interval for the first group (minority females) would be approximately:
- LCL = 23062.50 – 1270.425 = 21792.075
- UCL = 23062.50 + 1270.425 = 24332.925
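The same arithmetic can be applied to every group at once. The sketch below is a minimal illustration that retypes the mean and ci values from the ‘rstatix’ output by hand, so it runs without the ‘stima’ package; the data frame name `stats` is only a placeholder.

```r
# Means and half-widths (ci) copied by hand from the rstatix output above
stats <- data.frame(
  minority = c("min", "no_min", "min", "no_min"),
  gender   = c("f", "f", "m", "m"),
  mean     = c(23062.50, 26706.79, 32246.09, 44524.77),
  ci       = c(1270.425, 1191.902, 3262.261, 2892.322)
)

# Lower and upper 95% confidence limits: mean minus / plus the half-width
stats$LCL <- stats$mean - stats$ci
stats$UCL <- stats$mean + stats$ci

stats
```

With the real `out` object, the same two lines could be applied directly to its `mean` and `ci` columns.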
Descriptive Statistics By Group Using the ‘dplyr’ Package
Another way to compute descriptive statistics is by using the R ‘dplyr’ package. While this method is more flexible, it is also a little bit more complicated. Using ‘dplyr’ allows you to compute descriptive statistics beyond what is typically provided by other packages.
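For example, 95% confidence intervals of the median are computed below with MedianCI() from the ‘DescTools’ package. LCLmed is the lower confidence limit and UCLmed is the upper confidence limit. Note that MedianCI() uses its exact method by default; a bootstrapped interval requires passing method = "boot".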
By Group Descriptive Statistics ‘dplyr’ Code
out <- employee %>%
  select(salary, gender, minority) %>%
  group_by(gender, minority) %>%
  summarise(n = n(),
            mean = mean(salary, na.rm = TRUE),
            sd = sd(salary, na.rm = TRUE),
            stderr = sd/sqrt(n),
            LCL = mean - qt(1 - (0.05 / 2), n - 1) * stderr,
            UCL = mean + qt(1 - (0.05 / 2), n - 1) * stderr,
            median = median(salary, na.rm = TRUE),
            min = min(salary, na.rm = TRUE),
            max = max(salary, na.rm = TRUE),
            IQR = IQR(salary, na.rm = TRUE),
            LCLmed = MedianCI(salary, na.rm=TRUE)[2],
            UCLmed = MedianCI(salary, na.rm=TRUE)[3])
df <- data.frame(out)
df
By Group Descriptive Statistics ‘dplyr’ Output
  gender minority   n     mean        sd    stderr      LCL      UCL median   min    max     IQR LCLmed UCLmed
1      f      min  40 23062.50  3972.369  628.0866 21792.07 24332.93  23775 16350  35100  5025.0  21150  24750
2      f   no_min 176 26706.79  8011.894  603.9192 25514.89 27898.69  24450 15750  58125  7500.0  23550  25500
3      m      min  64 32246.09 13059.881 1632.4852 28983.83 35508.36  29025 19650 100000  5062.5  27750  30750
4      m   no_min 193 44524.77 20371.882 1466.4001 41632.44 47417.09  36000 21300 135000 26650.0  33750  40200
Note: LCL and UCL are the 95% confidence limits of the mean. LCLmed and UCLmed are the 95% confidence limits of the median as calculated by MedianCI() from the ‘DescTools’ package.
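As a quick sanity check on the LCL and UCL formula, the sketch below recomputes a t-based 95% confidence interval by hand and compares it with base R's t.test(). The `salary` vector here is made up for illustration and is not taken from the employee dataset; `se`, `half_width`, `manual`, and `builtin` are placeholder names.

```r
# Hypothetical salaries for illustration only (not from the employee dataset)
salary <- c(21000, 23500, 24750, 26000, 27300, 30150, 35100)

n  <- length(salary)
se <- sd(salary) / sqrt(n)                      # standard error of the mean

# Half-width of the 95% interval: t quantile with n - 1 degrees of freedom
half_width <- qt(1 - 0.05 / 2, df = n - 1) * se

manual  <- c(LCL = mean(salary) - half_width,
             UCL = mean(salary) + half_width)
builtin <- t.test(salary, conf.level = 0.95)$conf.int

manual
# TRUE when the hand-computed limits agree with t.test()
all.equal(unname(manual), as.numeric(builtin))
```

The half-width computed here is the same quantity that ‘rstatix’ reports in its ci column.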
Three-way Frequency Tables for Categorical Data
The R ‘stats’ package contains the ftable() function, which provides a flexible way to create multi-way contingency (frequency) tables of counts. Below we create a frequency table using the gender, minority, and job category variables from the employee dataset.
Three-way Frequencies ‘dplyr’ Code
employee %>% select(minority, gender, jobcat) %>% ftable()
Three-way Frequencies ‘dplyr’ Output
                jobcat Clerical Custodial manager
minority gender
min      f                   40         0       0
         m                   47        13       4
no_min   f                  166         0      10
         m                  109        14      70
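The layout of the table can also be set explicitly through the row.vars and col.vars arguments of ftable(). The sketch below uses a small made-up data frame (`toy`, not the employee dataset) with the same three column names, so it runs with base R alone.

```r
# Hypothetical toy data for illustration only (not the employee dataset)
toy <- data.frame(
  minority = c("min", "min", "no_min", "no_min", "no_min", "min"),
  gender   = c("f", "m", "f", "m", "m", "f"),
  jobcat   = c("Clerical", "Custodial", "Clerical",
               "manager", "Clerical", "Clerical")
)

# Put minority and gender on the rows and job category on the columns
tab <- ftable(toy, row.vars = c("minority", "gender"), col.vars = "jobcat")
tab

# Long format, one row per combination, convenient for further processing
as.data.frame(tab)
```

Swapping the names between row.vars and col.vars rearranges the table without changing any of the counts.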