## Introduction

An independent samples t-test is typically performed when an analyst would like to test for mean differences between two treatments or conditions. For example, you may want to see if first-year students scored differently than second-year students on an exam.

An independent samples t-test is typically used when each experimental unit (study subject) is assigned only one of the two available treatment conditions. Thus, the treatment groups do not have overlapping membership and are considered independent. An independent samples t-test is the simplest form of a “between-subjects” analysis.

The two-sided null hypothesis is that there is no difference between treatment group means, while the alternative hypothesis is that mean values differ between treatment groups.

$$H_0: \mu_1 = \mu_2$$

$$H_a: \mu_1 \neq \mu_2$$

## Independent Samples T-test Assumptions

An independent samples t-test requires the following assumptions:

- The response of interest is continuous and normally distributed for each treatment group.
- Treatment groups are independent of one another. Experimental units only receive one treatment, and they do not overlap.
- There are no major outliers.
- A check for unequal variances will help determine which version of the t-test is most appropriate:
  - If variances are equal, then a pooled t-test is appropriate.
  - If variances are unequal, then a Satterthwaite (also known as Welch’s) t-test is appropriate.
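The choice between the two versions can be sketched in a few lines of base R. This is only an illustration under stated assumptions: it uses the 4- and 8-cylinder cars from R's built-in `mtcars` dataset as a stand-in for the article's data file, so the numbers may differ slightly from the tables in this article.

```r
# Stand-in data (an assumption): 4- and 8-cylinder cars from built-in mtcars
dat <- subset(mtcars, cyl %in% c(4, 8), select = c(mpg, cyl))
dat$cyl <- factor(dat$cyl)

# Pooled (Student) t-test: assumes the two group variances are equal
pooled <- t.test(mpg ~ cyl, data = dat, var.equal = TRUE)

# Welch (Satterthwaite) t-test: does not assume equal variances
welch <- t.test(mpg ~ cyl, data = dat, var.equal = FALSE)

# The pooled test uses n1 + n2 - 2 degrees of freedom; the Welch test
# uses an approximate, usually fractional, value that is never larger
c(pooled = unname(pooled$parameter), welch = unname(welch$parameter))
```

Both calls use the same data and formula; only the `var.equal` flag changes which standard error and degrees of freedom are used.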

## Independent Samples T-test Example in R

In this example, we will test to see if there is a statistically significant difference in the miles per gallon (mpg) of 4-cylinder automobiles and 8-cylinder automobiles.

Dependent response variable:

mpg = Miles per gallon

Independent categorical variable:

cyl = 4 or 8 cylinder automobiles

The data for this example is available here:

## Independent Samples T-test R Code

Each package used in the example can be installed with the install.packages commands as follows:

    install.packages("gmodels", dependencies = TRUE)
    install.packages("car", dependencies = TRUE)
    install.packages("ggplot2", dependencies = TRUE)
    install.packages("qqplotr", dependencies = TRUE)
    install.packages("dplyr", dependencies = TRUE)

The R code below includes Shapiro-Wilk Normality Tests and QQ plots for each treatment group. Data manipulation and summary statistics are performed using the dplyr package. Boxplots are created using the ggplot2 package. QQ plots are created with the qqplotr package. The t.test function is included in the base stats package.

Two versions of Levene’s Test for Equality of Variances are performed in order to demonstrate the traditional solution along with a more robust form of the test. In the leveneTest statement, the center="mean" option corresponds to the traditional test as reported by other commercially available software. The center="median" option is the default and can result in a slightly more robust solution to Levene’s Test.
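To make the difference between the two centering options concrete, here is a minimal base R sketch of the idea behind Levene's test: a one-way ANOVA on the absolute deviations of each observation from its group center. The helper name `levene_by_hand` is hypothetical, and the data are the 4- and 8-cylinder cars from the built-in `mtcars` dataset, used as a stand-in for the article's file.

```r
# Stand-in data (an assumption): 4- and 8-cylinder cars from built-in mtcars
dat <- subset(mtcars, cyl %in% c(4, 8), select = c(mpg, cyl))
dat$cyl <- factor(dat$cyl)

# Hypothetical helper: Levene-type test as a one-way ANOVA on the
# absolute deviations from each group's center (mean or median)
levene_by_hand <- function(y, group, center = median) {
  abs_dev <- abs(y - ave(y, group, FUN = center))
  anova(lm(abs_dev ~ group))
}

lev_mean   <- levene_by_hand(dat$mpg, dat$cyl, center = mean)    # traditional Levene
lev_median <- levene_by_hand(dat$mpg, dat$cyl, center = median)  # median-centered (Brown-Forsythe)

print(lev_mean)
print(lev_median)
```

Centering on the median makes the deviations less sensitive to skewness and outliers, which is why it is often described as the more robust variant.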

Here is the annotated code for the example. All assumption checks are provided along with the independent samples t-test:

    library("gmodels")
    library("car")
    library("ggplot2")
    library("qqplotr")
    library("dplyr")

    # Import the data
    dat <- read.csv("C:/Dropbox/Website/Analysis/Independent Samples T-test/Data/cars_ttest.csv")

    # Designate cyl as a categorical factor
    dat$cyl <- as.factor(dat$cyl)

    # Perform the Shapiro-Wilk Test for Normality on each group
    dat %>%
      group_by(cyl) %>%
      summarise(`W Statistic` = shapiro.test(mpg)$statistic,
                `p-value` = shapiro.test(mpg)$p.value)

    # Perform QQ plots by group
    ggplot(data = dat, mapping = aes(sample = mpg, color = cyl, fill = cyl)) +
      stat_qq_band(alpha = 0.5, conf = 0.95, qtype = 1, bandType = "ts") +
      stat_qq_line(identity = TRUE) +
      stat_qq_point(col = "black") +
      facet_wrap(~ cyl, scales = "free") +
      labs(x = "Theoretical Quantiles", y = "Sample Quantiles") +
      theme_bw()

    # Perform Levene's Test of Equality of Variances
    lev1 <- leveneTest(mpg ~ cyl, data = dat, center = "mean")
    lev2 <- leveneTest(mpg ~ cyl, data = dat, center = "median")
    print(lev1)
    print(lev2)

    # Produce boxplots and visually check for outliers
    # (fun replaces the deprecated fun.y argument in ggplot2 >= 3.3.0)
    ggplot(dat, aes(x = cyl, y = mpg, fill = cyl)) +
      stat_boxplot(geom = "errorbar", width = 0.5) +
      geom_boxplot(fill = "light blue") +
      stat_summary(fun = mean, geom = "point", shape = 10, size = 3.5, color = "black") +
      ggtitle("Boxplots of 4 and 8 Cylinder Groups") +
      theme_bw() +
      theme(legend.position = "none")

    # Produce descriptive statistics by group
    dat %>%
      select(mpg, cyl) %>%
      group_by(cyl) %>%
      summarise(n = n(),
                mean = mean(mpg, na.rm = TRUE),
                sd = sd(mpg, na.rm = TRUE),
                stderr = sd / sqrt(n),
                LCL = mean - qt(1 - (0.05 / 2), n - 1) * stderr,
                UCL = mean + qt(1 - (0.05 / 2), n - 1) * stderr,
                median = median(mpg, na.rm = TRUE),
                min = min(mpg, na.rm = TRUE),
                max = max(mpg, na.rm = TRUE),
                IQR = IQR(mpg, na.rm = TRUE))

    # Perform an Independent Samples T-test
    m1 <- t.test(mpg ~ cyl, data = dat, var.equal = FALSE, na.rm = TRUE)
    print(m1)

## Independent Samples T-Test Annotated R Output

### Descriptive Statistics

Many times, analysts forget to take a good look at their data prior to performing statistical tests. Descriptive statistics are not only used to describe the data but also help determine if any inconsistencies are present. Detailed investigation of descriptive statistics can help answer the following questions (in addition to many others):

- How much missing data do I have?
- Do I have potential outliers?
- Are my standard deviation and standard error values large relative to the mean?
- In what range do most of my data fall for each treatment?

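| cyl<sup>a</sup> | n<sup>b</sup> | mean<sup>c</sup> | sd<sup>d</sup> | stderr<sup>e</sup> | LCL<sup>f</sup> | UCL<sup>f</sup> | median<sup>g</sup> | min<sup>h</sup> | max<sup>h</sup> | IQR<sup>i</sup> |
|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 10 | 27.0 | 4.56 | 1.44 | 23.8 | 30.3 | 26.6 | 21.4 | 33.9 | 7.20 |
| 8 | 14 | 15.1 | 2.56 | 0.684 | 13.6 | 16.6 | 15.2 | 10.4 | 19.2 | 1.85 |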

- **cyl** – This column identifies the levels of the treatment variable (4 or 8 cylinders).
- **n** – This column identifies how many data points (cars) are in each cylinder group.
- **mean** – The mean value for each treatment group.
- **sd** – The standard deviation of each treatment group.
- **stderr** – The standard error of the mean for each treatment group.
- **LCL, UCL** – The lower and upper 95% confidence limits of the mean. That is to say, you can be 95% confident that the true mean falls between the lower and upper values specified for each treatment group, assuming the data is normally distributed.
- **median** – The median value for each treatment group.
- **min, max** – The minimum and maximum values observed for each treatment group.
- **IQR** – The interquartile range of each treatment group. The interquartile range is the 75th percentile minus the 25th percentile.
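The confidence limits in the table follow directly from the mean, the standard error, and a t critical value. The sketch below reproduces that arithmetic for one group in base R and checks it against the interval reported by a one-sample `t.test`. It assumes the 4-cylinder cars from the built-in `mtcars` dataset as stand-in data, so the values will not exactly match the table above.

```r
# Stand-in data (an assumption): mpg of 4-cylinder cars from built-in mtcars
x <- mtcars$mpg[mtcars$cyl == 4]

n      <- length(x)
se     <- sd(x) / sqrt(n)            # standard error of the mean
t_crit <- qt(1 - 0.05 / 2, df = n - 1)  # two-sided 95% critical value

ci_by_hand <- c(LCL = mean(x) - t_crit * se,
                UCL = mean(x) + t_crit * se)
ci_by_hand

# The same interval as reported by a one-sample t-test
ci_ttest <- as.numeric(t.test(x)$conf.int)
ci_ttest
```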

### Normality Tests

Prior to performing a t-test, it is important to validate our assumptions to ensure that we are performing an appropriate and reliable comparison. Normality should be checked using a Shapiro-Wilk normality test (or equivalent) and, for large sample sizes, a QQ plot. Histograms can also be helpful.

| cyl<sup>a</sup> | W Stat<sup>b</sup> | p-value<sup>c</sup> |
|---|---|---|
| 4 | 0.928 | 0.433 |
| 8 | 0.932 | 0.323 |

- **cyl** – This column identifies the levels of the treatment variable (4 or 8 cylinders).
- **W Stat** – The Shapiro-Wilk (W) test statistic for each group.
- **p-value** – The p-value for each test. A p-value < 0.05 would indicate that we should reject the assumption of normality. Since the Shapiro-Wilk Test p-values are > 0.05 for each group, we do not reject the assumption of normality and proceed as though the data are normally distributed.
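For readers who prefer not to load dplyr, the same per-group normality check can be sketched in base R with `split` and `sapply`. The example assumes the 4- and 8-cylinder cars from the built-in `mtcars` dataset as stand-in data, so the W statistics and p-values may differ from the table above.

```r
# Stand-in data (an assumption): 4- and 8-cylinder cars from built-in mtcars
dat <- subset(mtcars, cyl %in% c(4, 8), select = c(mpg, cyl))

# Run the Shapiro-Wilk test separately within each cylinder group
shapiro_by_group <- sapply(split(dat$mpg, dat$cyl), function(x) {
  res <- shapiro.test(x)
  c(W = unname(res$statistic), p.value = res$p.value)
})

# One row per group, with the W statistic and its p-value
t(shapiro_by_group)
```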

### QQ Plots

The vast majority of points should follow each line and stay within the curved 95% confidence bands for the data to be considered normally distributed.

The Shapiro-Wilk Test p-value is > 0.05 for each treatment group, and the QQ plot data points for each group primarily fall within the 95% confidence bounds. However, if this were anything other than a theoretical example, we would want to investigate points 6 and 7 from the cylinder 8 group further and potentially consider other options. We will proceed in spite of this since this is a theoretical example and our Shapiro-Wilk normality tests seem to indicate each group is normally distributed.

### Levene’s Test for Homogeneity of Variance

Levene’s Test for Homogeneity of Variance is performed using the traditional mean-centered methodology and using R’s default median-centered methodology. Both tests indicate that between-group variances are unequal (p < 0.05).

Levene's Test for Homogeneity of Variance (center = "mean")

| | Df<sup>a</sup> | F value<sup>b</sup> | Pr(>F)<sup>c</sup> |
|---|---|---|---|
| group | 1 | 6.6349 | 0.01724 * |
| | 22 | | |

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Levene's Test for Homogeneity of Variance (center = "median")

| | Df<sup>a</sup> | F value<sup>b</sup> | Pr(>F)<sup>c</sup> |
|---|---|---|---|
| group | 1 | 6.5299 | 0.01804 * |
| | 22 | | |

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

- **Df** – The degrees of freedom associated with each variable and overall error.
- **F value** – The F statistic for which the p-value is computed.
- **Pr(>F)** – The mean-centered Levene’s Test for Equality of Variances shows a p-value of 0.0172. A significant p-value (p < 0.05) indicates that the Satterthwaite (also known as Welch’s) t-test results should be used instead of the pooled t-test results.

Note that the results of this test determine which var.equal flag should be used in the R t.test code. If equal variances are assumed (P > 0.05) then the following code is appropriate:

    t.test(mpg ~ cyl, data=dat, var.equal=TRUE, na.rm=TRUE)

However, in our example, we conclude unequal variances are present (p = 0.01804). As a result, the following version is appropriate:

    t.test(mpg ~ cyl, data=dat, var.equal=FALSE, na.rm=TRUE)

### Boxplots to Visually Check for Outliers

The ggplot2 package provides side-by-side boxplots. Boxplots can help visually identify major outliers and show whether variances might be unequal. The boxplot for this example seems to indicate one minor outlier but, subjectively, there is not enough evidence to suggest we move to a different analysis method.
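A numeric companion to the visual check can be sketched with `boxplot.stats`, which flags points lying more than 1.5 times the box length beyond the hinges. The example assumes the 4- and 8-cylinder cars from the built-in `mtcars` dataset as stand-in data. Note that `boxplot.stats` computes hinges slightly differently from the quartiles ggplot2 uses, so the flagged points may not always match the plot exactly.

```r
# Stand-in data (an assumption): 4- and 8-cylinder cars from built-in mtcars
dat <- subset(mtcars, cyl %in% c(4, 8), select = c(mpg, cyl))

# For each group, list the values beyond 1.5 box lengths from the hinges
outliers_by_group <- lapply(split(dat$mpg, dat$cyl), function(x) {
  boxplot.stats(x, coef = 1.5)$out
})

outliers_by_group
```

An empty numeric vector for a group means no points were flagged under this rule.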

### Independent Samples T-test Results in R

So far, we have determined that the data for each cylinder group is normally distributed, variances are unequal, and we do not have major influential outliers. Our next step is to formally perform an independent samples t-test to determine whether 4 and 8 cylinder cars show a significant difference in average mpg.

Welch Two Sample t-test

data: mpg by cyl

t = 7.49<sup>a</sup>, df = 13.054<sup>b</sup>, p-value = 4.453e-06<sup>c</sup>

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

8.504657 15.395343<sup>d</sup>

sample estimates:

| mean in group 4 | mean in group 8 |
|---|---|
| 27.05<sup>e</sup> | 15.10<sup>e</sup> |

- **t** – This is the t-statistic. It is the ratio of the difference in sample means to the standard error of the difference.
- **df** – The appropriate degrees of freedom. This varies between each type of independent samples t-test.
- **p-value** – This is the p-value associated with the test. If the p-value < 0.05 (assuming alpha = 0.05), then the treatments have a statistically significant mean difference. For our example, we have a p-value = 4.453e-06. Thus, we reject the null hypothesis that the mean mpg of the 4 and 8 cylinder groups are equal and conclude that there is a mean difference between groups.
- **95% confidence interval** – The confidence interval for the difference between the treatment group means. That is to say, you can be 95% confident that the true mean difference in mpg between the 4 cylinder and 8 cylinder groups falls between 8.5 and 15.4.
- **sample means** – The sample means of the 4 and 8 cylinder treatment groups.
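To show where these quantities come from, the following base R sketch computes the Welch t-statistic, the Satterthwaite degrees of freedom, the p-value, and the confidence interval by hand, then compares them with `t.test`. It assumes the 4- and 8-cylinder cars from the built-in `mtcars` dataset as stand-in data, so the results may differ slightly from the output above.

```r
# Stand-in data (an assumption): mpg by cylinder group from built-in mtcars
x <- mtcars$mpg[mtcars$cyl == 4]
y <- mtcars$mpg[mtcars$cyl == 8]

# Per-group variance of the sample mean
v1 <- var(x) / length(x)
v2 <- var(y) / length(y)

# Welch t-statistic: difference in means over its standard error
se_diff <- sqrt(v1 + v2)
t_stat  <- (mean(x) - mean(y)) / se_diff

# Satterthwaite approximation to the degrees of freedom
df_welch <- (v1 + v2)^2 / (v1^2 / (length(x) - 1) + v2^2 / (length(y) - 1))

# Two-sided p-value and 95% confidence interval for the mean difference
p_value <- 2 * pt(-abs(t_stat), df = df_welch)
ci <- (mean(x) - mean(y)) + c(-1, 1) * qt(0.975, df = df_welch) * se_diff

# Reference result from the built-in function
ref <- t.test(x, y, var.equal = FALSE)
c(t = t_stat, df = df_welch, p = p_value)
```

The fractional degrees of freedom are what distinguish the Welch version from the pooled test, which would use n1 + n2 - 2.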

## Independent Samples T-test Interpretation and Conclusions

We have concluded that the Satterthwaite (also known as Welch’s) version of the independent samples t-test is appropriate since our variances are considered unequal between the 4 and 8 cylinder treatment groups. A p-value < 0.05 indicates that we should reject the null hypothesis that the mean mpg is equal across the 4 and 8 cylinder treatment groups and conclude that there is a statistically significant difference in mean mpg between the two groups.

## What to do When Assumptions are Broken or Things Go Wrong

A lack of normality or the severe impact of outliers can violate independent samples t-test assumptions and ultimately invalidate the results. If this happens, there are several available options:

Performing a nonparametric Mann-Whitney U test is the most popular alternative. This is also known as the Mann-Whitney-Wilcoxon or the Wilcoxon Rank Sum test. This test is considered robust to violations of normality and outliers (among others) and tests for differences in mean ranks.
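In R, this test is available through the base `wilcox.test` function. The sketch below assumes the 4- and 8-cylinder cars from the built-in `mtcars` dataset as stand-in data. Because the mpg values contain ties, `exact = FALSE` is set so that the normal approximation is requested explicitly.

```r
# Stand-in data (an assumption): 4- and 8-cylinder cars from built-in mtcars
dat <- subset(mtcars, cyl %in% c(4, 8), select = c(mpg, cyl))
dat$cyl <- factor(dat$cyl)

# Wilcoxon rank sum (Mann-Whitney U) test; exact = FALSE because of tied values
mw <- wilcox.test(mpg ~ cyl, data = dat, exact = FALSE)
print(mw)
```

The reported W statistic counts the pairs in which a value from the first group exceeds a value from the second group.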

Additional options include permutation/randomization tests, bootstrap confidence intervals, and data transformations, but each option has its own stipulations.
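As one illustration of these options, a simple two-sided permutation test for a difference in means can be sketched in base R. The number of permutations (5000), the seed, and the stand-in data (4- and 8-cylinder cars from the built-in `mtcars` dataset) are all assumptions chosen for the example.

```r
# Stand-in data (an assumption): 4- and 8-cylinder cars from built-in mtcars
dat <- subset(mtcars, cyl %in% c(4, 8), select = c(mpg, cyl))
dat$cyl <- factor(dat$cyl)

set.seed(1)  # for reproducibility of the random shuffles

# Observed difference in group means (8-cylinder minus 4-cylinder)
obs_diff <- unname(diff(tapply(dat$mpg, dat$cyl, mean)))

# Shuffle the group labels many times and recompute the difference
perm_diffs <- replicate(5000, {
  shuffled <- sample(dat$cyl)
  unname(diff(tapply(dat$mpg, shuffled, mean)))
})

# Two-sided p-value: share of shuffles at least as extreme as observed
# (adding 1 to numerator and denominator avoids a p-value of exactly 0)
p_perm <- (sum(abs(perm_diffs) >= abs(obs_diff)) + 1) / (length(perm_diffs) + 1)
p_perm
```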

If you need to compare more than two independent groups, a one-way Analysis of Variance (ANOVA) or a Kruskal-Wallis test may be appropriate.
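Both alternatives are available in base R. The sketch below assumes the built-in `mtcars` dataset with all three cylinder groups (4, 6, and 8) as illustrative data.

```r
# Illustrative data (an assumption): all three cylinder groups in mtcars
dat3 <- transform(mtcars, cyl = factor(cyl))

# One-way ANOVA: compares the mean mpg across the three groups
fit <- aov(mpg ~ cyl, data = dat3)
summary(fit)

# Kruskal-Wallis test: rank-based alternative to the one-way ANOVA
kw <- kruskal.test(mpg ~ cyl, data = dat3)
print(kw)
```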

An independent samples t-test is not appropriate if you have repeated measurements taken on the same experimental unit (subject). For example, if you have a pre-test post-test study, then each subject was measured at two different time intervals. If this is the case, then a paired t-test may be a more appropriate course of action.
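A paired analysis can be sketched with R's built-in `sleep` dataset, in which the same ten subjects were measured under two drugs. This dataset is an assumption chosen for illustration and is unrelated to the cars example. The sketch also shows that a paired t-test is equivalent to a one-sample t-test on the within-subject differences.

```r
# Illustrative data (an assumption): built-in sleep dataset,
# ten subjects each measured under two conditions (rows ordered by subject ID)
drug1 <- sleep$extra[sleep$group == 1]
drug2 <- sleep$extra[sleep$group == 2]

# Paired t-test on the two sets of measurements
paired <- t.test(drug1, drug2, paired = TRUE)
print(paired)

# Equivalent one-sample t-test on the within-subject differences
one_sample <- t.test(drug1 - drug2)
```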

## Additional Resources and References

Muenchen, R.A. (2011). *R for SAS and SPSS Users, Second Edition*. New York, NY: Springer, LLC.

Littell, R.C., Stroup, W.W., and Freund R.J. (2002). *SAS for Linear Models, Fourth Edition*. Cary, NC: SAS Institute Inc.

Mitra, A. (1998). *Fundamentals of Quality Control and Improvement*. Upper Saddle River, NJ: Prentice Hall.

Lapin, L.L. (1997). *Modern Engineering Statistics*. Belmont, CA: Wadsworth Publishing Company.

Henderson, H.V. and Velleman, P.F. (1981). Building multiple regression models interactively. *Biometrics*, **37**, 391–411.