Introduction
An independent samples t-test is typically performed when an analyst would like to test for mean differences between two treatments or conditions. For example, you may want to see if first-year students scored differently than second-year students on an exam.
An independent samples t-test is typically used when each experimental unit (study subject) is assigned only one of the two available treatment conditions. Thus, the treatment groups do not have overlapping membership and are considered independent. An independent samples t-test is the simplest form of a “between-subjects” analysis.
The two-sided null hypothesis is that there is no difference between treatment group means, while the alternative hypothesis is that mean values differ between treatment groups.
H0: μ1 = μ2
Ha: μ1 ≠ μ2
Independent Samples T-test Assumptions
The following assumptions must be met in order to run an independent samples t-test:
- The response of interest is continuous and normally distributed for each treatment group.
- Treatment groups are independent of one another. Experimental units only receive one treatment and they do not overlap.
- There are no major outliers.
- A check for unequal variances will help determine which version of an independent samples t-test is most appropriate:
  - If variances are equal, then a pooled t-test is appropriate.
  - If variances are unequal, then a Satterthwaite (also known as Welch’s) t-test is appropriate.
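For reference, the two versions differ in how the standard error and the degrees of freedom are computed. The block below gives the standard textbook forms. The symbols (group means \bar{x}_i, group standard deviations s_i, group sizes n_i) are notation introduced here for illustration and do not appear in the SAS output.

```latex
% Pooled t-test (equal variances assumed)
s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2},
\qquad
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}},
\qquad
\mathrm{df} = n_1 + n_2 - 2

% Satterthwaite (Welch) t-test (unequal variances allowed)
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
\mathrm{df} \approx
\frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}
     {\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}
```

Because the Satterthwaite degrees of freedom are an approximation, they are usually not a whole number.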
Independent Samples T-test Example
In this example, we will test to see if there is a statistically significant difference in the miles per gallon (mpg) of 4-cylinder automobiles and 8-cylinder automobiles.
Dependent response variable:
mpg = Miles per gallon
Independent categorical variable:
cyl = 4 or 8 cylinder automobiles
The data for this example is available here:
Independent Samples T-test SAS Code
PROC TTEST includes QQ plots for each treatment group along with a folded F-test to help identify unequal variances. While this information can aid in validating assumptions, the Shapiro-Wilk Normality Test, in addition to QQ plots, should be used to help evaluate normality. Furthermore, Levene’s Test for Equality of Variances is generally preferred when evaluating whether variances between groups are equal. Thus, SAS code has been provided to demonstrate both the Shapiro-Wilk and Levene’s tests.
Here is the annotated code for the example. All assumption checks are provided along with the independent samples t-test:
*Import the data;
proc import datafile='C:\Dropbox\Website\Analysis\Independent Samples T-test\Data\cars_ttest.csv'
out=work.cars
dbms=csv
replace;
run;
*Produce descriptive statistics;
proc means data=cars nmiss mean std stderr lclm uclm median min max qrange maxdec=2;
class cyl;
var mpg;
run;
*Test for normality;
proc univariate data=cars normal;
class cyl;
var mpg;
run;
*Test for equality of variances;
proc glm data=cars;
class cyl;
model mpg = cyl;
means cyl / hovtest=levene(type=abs) welch;
run;
quit;
*Independent Samples T-test;
proc ttest data=cars;
class cyl;
var mpg;
run;
Independent Samples T-test Annotated SAS Output
Descriptive Statistics
Many times, analysts forget to take a good look at their data before performing statistical tests. Descriptive statistics are not only used to describe the data but also help determine if any inconsistencies are present. Detailed investigation of descriptive statistics can help answer the following questions (in addition to many others):
- How much missing data do I have?
- Do I have potential outliers?
- Are my standard deviation and standard error values large relative to the mean?
- In what range do most of my data fall for each treatment?

- cyl – Each treatment level of our independent variable.
- N obs – The number of observations for each treatment.
- N Miss – The number of missing observations for each treatment.
- Mean – The mean value for each treatment.
- Std Dev – The standard deviation of each treatment.
- Std Error – The standard error of each treatment. That is, the standard deviation divided by sqrt(n).
- Lower and Upper 95% CL for Mean – The lower and upper 95% confidence limits of the mean. That is to say, you can be 95% confident that the true mean falls between the lower and upper values specified for each treatment group.
- Median – The median value for each treatment.
- Minimum, Maximum – The minimum and maximum value for each treatment.
- Quartile Range – The interquartile range of each treatment. That is, the 75th percentile minus the 25th percentile.
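The standard error and confidence limits listed above can be written out as follows. This is a sketch of the usual formulas, assuming the default two-sided 95% limits; the symbols \bar{x}, s, and n (mean, standard deviation, and number of non-missing observations within one treatment group) are notation introduced here.

```latex
\mathrm{SE} = \frac{s}{\sqrt{n}},
\qquad
\text{95\% CL for the mean} = \bar{x} \pm t_{0.975,\, n-1} \cdot \mathrm{SE}

% Quartile range (interquartile range), with Q_1 and Q_3 the 25th and 75th percentiles
\mathrm{IQR} = Q_3 - Q_1
```

Here t_{0.975, n-1} is the 97.5th percentile of a t distribution with n - 1 degrees of freedom, so smaller groups get wider limits for the same standard deviation.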
Normality Tests
Prior to performing the t-test, it is important to validate our assumptions to ensure that we are performing an appropriate and reliable comparison. Testing normality should be performed using a Shapiro-Wilk normality test (or equivalent), and/or a QQ plot for large sample sizes. Many times, histograms can also be helpful. In this example, we will use PROC UNIVARIATE to produce our Shapiro-Wilk normality test for each cylinder group, and PROC TTEST will produce our corresponding QQ plots.
The Shapiro-Wilk normality test for the 4-cylinder group:

The Shapiro-Wilk normality test for the 8-cylinder group:

- Test – Four different normality tests are presented.
- Statistic – The test statistic for each test is provided here.
- p Value – The p-value for each test is provided. A p-value < 0.05 would indicate that we should reject the assumption of normality. Since the Shapiro-Wilk Test p-values are > 0.05 for each group, we fail to reject normality and proceed as though the data are normally distributed.
QQ Plots
The ‘4’ and ‘8’ in the top left corner of each plot indicate which group each QQ plot corresponds to. The vast majority of points should follow each line.

Since the Shapiro-Wilk Test p-value is > 0.05 for each group, and the points in each group's QQ plot follow the theoretical normal diagonal line, we treat the data as normally distributed.
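PROC UNIVARIATE prints a good deal of output beyond the normality tests. As a minimal sketch, assuming the same work.cars dataset used above, an ODS SELECT statement can limit the output to the table of normality tests for each cylinder group:

```sas
*Print only the table of normality tests for each cylinder group;
ods select TestsForNormality;
proc univariate data=cars normal;
class cyl;
var mpg;
run;
```

The ODS SELECT statement applies only to the procedure step that follows it, so later procedures print their full output as usual.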
Levene’s Test for Homogeneity of Variances
PROC TTEST does not perform Levene’s Test for Equality of Variances automatically. In its place, PROC TTEST will perform a Folded F test which is considered inferior to Levene’s Test for Equality of Variances. However, the GLM procedure can be used to perform this test. I recommend using Levene’s Test for Equality of Variances rather than the Folded F test included by default in PROC TTEST.

- Source – This column designates the variable on which Levene’s Test for Homogeneity is performed.
- DF – The degrees of freedom associated with each variable and overall error.
- Sum of Squares – The sums of squares calculation for Levene’s Test.
- Mean Square – The mean square calculation for Levene’s Test.
- F Value – The F statistic for which the p-value is computed. This value is (24.3781/3.6742) = 6.63.
- Pr > F – Levene’s Test for Equality of Variances shows a p-value of 0.0172. A significant p-value (p < 0.05) indicates that the variances differ between groups, so the Satterthwaite (also known as Welch’s) t-test results should be used instead of the pooled t-test results.
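To make the table concrete: with the type=abs option used in the code above, Levene's test amounts to a one-way ANOVA on the absolute deviations of each observation from its group mean. The sketch below uses notation introduced here (y_{ij} for observation j in group i, \bar{y}_i for the group mean) and plugs in the mean squares reported in the output.

```latex
% Absolute deviations from the group mean (hovtest=levene(type=abs))
z_{ij} = \left| y_{ij} - \bar{y}_{i} \right|

% F statistic from the one-way ANOVA on z_{ij}
F = \frac{\mathrm{MS}_{\text{cyl}}}{\mathrm{MS}_{\text{Error}}}
  = \frac{24.3781}{3.6742} \approx 6.63
```

A large F means the typical spread around the group mean differs between the cylinder groups, which is what the p-value of 0.0172 reflects.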
Boxplots to Visually Check for Outliers
Side-by-side boxplots are provided by the GLM procedure. They can help visually identify major outliers and show whether variances might be unequal. The boxplot below seems to indicate one minor outlier, but subjectively this is not enough evidence to suggest that we move to a different analysis method.
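If a standalone plot is preferred, side-by-side boxplots can also be requested directly. The following is a minimal sketch using PROC SGPLOT, assuming the same work.cars dataset; it is an alternative to the GLM plot, not the code that produced the figure here.

```sas
*Side-by-side boxplots of mpg for each cylinder group;
proc sgplot data=cars;
vbox mpg / category=cyl;
run;
```

Points drawn beyond the whiskers are the candidate outliers to inspect.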

Independent Samples T-test
So far, we have determined that the data for each cylinder group is normally distributed, variances are unequal, and we do not have major influential outliers. Our next step is to perform the independent samples t-test itself to determine whether 4 and 8 cylinder cars differ significantly in average mpg.

- cyl – This column identifies the levels of the treatment variable of interest along with the mean differences between the levels.
- Method – This column indicates whether a row corresponds to the Pooled or the Satterthwaite t-test formula. If equal variances are assumed, the row representing the Pooled difference is appropriate. However, if variances are considered unequal, the row representing the Satterthwaite difference is appropriate. Levene’s Test for Equality of Variances, or the built-in Folded F test, will dictate which row is most appropriate.
- N – This column identifies how many data points (cars) are in each cylinder group.
- Mean – The first two rows in this column correspond to mean values for each treatment group. The last two rows correspond to the mean difference between the 4 and 8 cylinder groups (27.05 - 15.10 = 11.95).
- Std Dev – The first two rows in this column identify the standard deviation of each treatment group. The third row is an estimate of the pooled standard deviation across both treatment groups.
- Std Err – The first two rows in this column are the standard errors of each treatment group respectively. The second two rows represent the Pooled (assuming equal variances) and Satterthwaite (assuming unequal variances) standard error estimates respectively.
- Min, Max – The minimum and maximum values observed for each treatment group.
- 95% CL Mean – The first two rows in this column represent the lower and upper confidence limits of the mean. That is to say, you can be 95% confident that the true mean mpg of the 4 cylinder group falls between 23.79 and 30.31. Furthermore, you can be 95% confident that the true mean mpg of the 8 cylinder group falls between 13.62 and 16.58. The third and fourth rows indicate the confidence interval of the mean difference between treatment groups.
- 95% CL Std Dev – The first two rows indicate the confidence interval of the standard deviation for each treatment group. The third row indicates the 95% confidence interval of the pooled standard deviation across both treatment groups.
Independent Samples T-test Results in SAS

- Method – This column designates each type of independent samples t-test.
- Variances – An independent samples pooled t-test is appropriate when variances are assumed equal. An independent samples Satterthwaite t-test is appropriate when variances are considered unequal. The appropriate row to evaluate will be based on the results of the Levene’s Test for Homogeneity of Variances above. In our example, the “Satterthwaite” test is appropriate since variances are considered unequal between the 4 and 8 cylinder treatment groups.
- DF – The appropriate degrees of freedom vary between each type of independent samples t-test.
- t Value – This is the t-statistic. It is the ratio of the difference in means to the standard error of the difference. For the Satterthwaite t-test this value is (11.95/1.5955) = 7.49.
- Pr > |t| – This is the p-value associated with the test. That is to say, if the p-value is < 0.05 (assuming alpha=0.05), then the treatments have a statistically significant mean difference. For our example, we have a p-value < 0.0001. Thus, we reject the null hypothesis that the mean mpg of the 4 and 8 cylinder groups are equal and conclude that there is a mean difference between groups.
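The arithmetic behind the Satterthwaite row can be written out using only the values reported in the output above. The symbol \nu for the Satterthwaite degrees of freedom (the DF column) is notation introduced here.

```latex
% t statistic: difference in means divided by its standard error
t = \frac{\bar{x}_{4} - \bar{x}_{8}}{\mathrm{SE}_{\text{diff}}}
  = \frac{27.05 - 15.10}{1.5955}
  = \frac{11.95}{1.5955} \approx 7.49

% Two-sided p-value from a t distribution with \nu degrees of freedom
p = 2 \, P\!\left(T_{\nu} \ge |t|\right) < 0.0001
```

Since p falls below alpha = 0.05, the null hypothesis of equal means is rejected.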
Independent Samples T-test Interpretation and Conclusions
We have concluded that the Satterthwaite version of the independent samples t-test is appropriate since our variances are considered unequal between the 4 and 8 cylinder treatment groups. A p-value < 0.0001 indicates that we should reject the null hypothesis that the mean mpg is equal across the 4 and 8 cylinder treatment groups and conclude that there is a mean difference. We know that the average difference between the 4 and 8 cylinder groups for this sample is 11.95 mpg. That is to say, the 4 cylinder group gets, on average, 11.95 more miles per gallon than the 8 cylinder group. We are 95% confident that the mean difference between the 4 and 8 cylinder groups across the population is between 8.5047 and 14.9702.
What to do When Assumptions are Broken or Things Go Wrong
A lack of normality or the presence of outliers violates the independent samples t-test assumptions and can ultimately distort the results. If this happens, there are several available options:
- Perform a nonparametric Mann-Whitney U test, also known as the Mann-Whitney-Wilcoxon or the Wilcoxon Rank Sum test. This is the most popular and best-known alternative. It is considered robust to violations of normality and to outliers (among others) and tests for differences in mean ranks.
- Consider permutation/randomization tests, bootstrap confidence intervals, or transforming the data; each option has its own stipulations.
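As a minimal sketch of the rank-based alternative, assuming the same work.cars dataset with the class variable cyl and response mpg, the Wilcoxon Rank Sum (Mann-Whitney U) test can be requested with PROC NPAR1WAY:

```sas
*Wilcoxon Rank Sum (Mann-Whitney U) test comparing mpg between cylinder groups;
proc npar1way data=cars wilcoxon;
class cyl;
var mpg;
run;
```

The WILCOXON option restricts the output to the rank-sum analysis rather than every test the procedure offers.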
If you need to compare more than two independent groups, a one-way Analysis of Variance (ANOVA) or a Kruskal-Wallis test may be appropriate.
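For illustration only, the sketch below assumes a hypothetical dataset named cars3 in which cyl has three levels (for example 4, 6, and 8); that dataset is not part of this example. It shows one way to request a one-way ANOVA and a Kruskal-Wallis test.

```sas
*One-way ANOVA on a hypothetical dataset with three cylinder groups;
proc glm data=cars3;
class cyl;
model mpg = cyl;
run;
quit;

*Kruskal-Wallis test, reported by NPAR1WAY when the class variable has more than two levels;
proc npar1way data=cars3 wilcoxon;
class cyl;
var mpg;
run;
```

The same assumption checks described above (normality, equal variances, outliers) still apply when choosing between the two.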
An independent samples t-test is not appropriate if you have repeated measurements taken on the same experimental unit (subject). For example, if you have a pre-test post-test study, then each subject was measured at two different time points. If this is the case, then a paired t-test may be a more appropriate course of action.
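As a sketch of the paired case, assume a hypothetical dataset named scores with one row per subject and two numeric columns, pre and post; neither the dataset nor the variable names come from this example. PROC TTEST handles this layout with the PAIRED statement:

```sas
*Paired t-test on hypothetical pre-test and post-test scores for the same subjects;
proc ttest data=scores;
paired pre*post;
run;
```

Here the test is performed on the within-subject differences (pre minus post), so no CLASS statement is used.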