## Introduction

A paired samples t-test is performed when an analyst would like to test for mean differences between two related treatments or conditions. If the same experimental unit (subject) is measured multiple times and you would like to test for differences, then you may need to perform a repeated measures analysis such as a paired t-test. A paired samples t-test is the simplest version of a “within-subject” or “repeated measures” analysis.

Repeated measures can occur over time or space. For example, you may want to see if students improved their knowledge over the course of a term by checking for differences between mid-term and final exam scores. Since the same student is measured at two separate time points, the measurements are considered repeated over time.

Repeated measures over space can be a little more difficult to understand. For example, you may want to check for differences in blood pressure between measurements taken on the right arm and left arm. Since the same study subject is measured with both treatment conditions in two locations, this would be considered a repeated measurement over space.

A paired samples t-test is performed when each experimental unit (study subject) receives both available treatment conditions. Thus, the treatment groups have overlapping membership and are considered dependent.

The two-sided null hypothesis is that the mean treatment difference is equal to zero. The alternative hypothesis is that the mean treatment difference is not equal to zero.

$\mu_{diff} = \mu_{1} - \mu_{2}$

$H_{0}: \mu_{diff} = 0$

$H_{a}: \mu_{diff} \neq 0$
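The hypotheses above are tested with the usual paired t-statistic, sketched here for reference. The symbols $d_i$, $\bar{d}$, $s_d$, and $n$ are notation introduced for this sketch and do not appear elsewhere in the article: $d_i$ is the difference for subject $i$, $\bar{d}$ and $s_d$ are the mean and standard deviation of those differences, and $n$ is the number of complete pairs.

$$
d_i = x_{1i} - x_{2i}, \qquad
t = \frac{\bar{d}}{s_d / \sqrt{n}}, \qquad
df = n - 1
$$

Under the null hypothesis, $t$ follows a t-distribution with $n - 1$ degrees of freedom, provided the assumptions of the test hold.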

## Paired Samples T-test Assumptions

The following assumptions must be met in order to run a paired samples t-test:

- The response of interest should be a continuous measure.
- The differences between the two related treatment groups should be normally distributed.
- The differences between groups should contain no major outliers.

## Paired Samples T-test Example

In this example, we will test whether there is a statistically significant difference in the reaction times of participants in a sleep deprivation study. Study participants were allowed only 3 hours of sleep per night. After waking, a series of tests was administered and average reaction times were recorded for each subject. We would like to check whether reaction times differed significantly between day 1 and day 3 of the study. This data is a subset of a larger experiment.

Variables:

- day1 = the reaction times on day 1
- day3 = the reaction times on day 3
- diff = the difference day1 – day3
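If your own data set holds only the two measurements, the difference variable can be derived in a DATA step. This is a minimal sketch that assumes a data set named `work.sleep` with numeric variables `day1` and `day3`; the output name `work.sleep_diff` is hypothetical.

```sas
*Derive the difference variable (day 1 minus day 3);
*ASSUMPTION: work.sleep contains numeric variables day1 and day3;
data work.sleep_diff;          /* hypothetical output data set name */
    set work.sleep;
    diff = day1 - day3;        /* missing if either measurement is missing */
run;
```

Keeping the subtraction order explicit matters later, because the sign of the mean difference depends on it.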

The data for this example is available here:

## Paired Samples T-test SAS Code

PROC TTEST includes QQ plots for the differences between day 1 and day 3. While this information can aid in validating assumptions, the Shapiro-Wilk normality test of the differences should also be used to help evaluate normality. Thus, PROC UNIVARIATE code has been provided to perform the Shapiro-Wilk test on the differences.

Here is the annotated code for the example. All assumption checks are provided along with the paired t-test:

```sas
*Import the data;
proc import datafile='C:\Dropbox\Website\Analysis\Paired T\Data\sleep.csv'
    out=work.sleep
    dbms=csv
    replace;
run;

*Produce descriptive statistics;
proc means data=sleep nmiss mean std stderr lclm uclm median min max qrange maxdec=2;
    var day1 day3;
run;

*Test for the normality of the differences;
proc univariate data=sleep normal;
    var diff;
run;

*Run the Paired T-test;
proc ttest data=sleep plots=all;
    paired day1*day3;
run;
```

## Paired Samples T-test Annotated SAS Output

### Descriptive Statistics

Many times, analysts forget to take a good look at their data before performing statistical tests. Descriptive statistics are not only used to describe the data but also help determine if any inconsistencies are present. Detailed investigation of descriptive statistics can help answer the following questions (in addition to many others):

- How much missing data do I have?
- Do I have potential outliers?
- Are my standard deviation and standard error values large relative to the mean?
- In what range does most of my data fall for each treatment?

- **Variable** – Each treatment level of our independent variable.
- **N** – The number of observations for each treatment.
- **N Miss** – The number of missing observations for each treatment.
- **Mean** – The mean value for each treatment.
- **Std Dev** – The standard deviation of each treatment.
- **Std Error** – The standard error of each treatment. That is, the standard deviation / sqrt(n).
- **Lower and Upper 95% CL for Mean** – The lower and upper confidence limits of the mean. That is to say, you can be 95% confident that the true mean falls between the lower and upper values specified for each treatment group.
- **Median** – The median value for each treatment.
- **Minimum, Maximum** – The minimum and maximum value for each treatment.
- **Quartile Range** – The interquartile range of each treatment. That is, the 75th percentile – 25th percentile.
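For reference, the standard error and confidence limits listed above are related as follows. The symbols $\bar{x}$, $s$, and $n$ (sample mean, standard deviation, and number of non-missing observations for a treatment) are notation introduced for this sketch, and the formula assumes the default two-sided 95% limits.

$$
SE = \frac{s}{\sqrt{n}}, \qquad
\text{95\% CL} = \bar{x} \pm t_{0.975,\, n-1} \cdot SE
$$

Here $t_{0.975,\, n-1}$ is the 97.5th percentile of the t-distribution with $n - 1$ degrees of freedom.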

### Normality Tests

Prior to performing the paired t-test, it is important to validate our assumptions to ensure that we are performing an appropriate and reliable comparison. Testing normality should be performed on the day differences using a Shapiro-Wilk normality test (or equivalent), and/or a QQ plot for large sample sizes. Many times, histograms can also be helpful. In this example, we will use PROC UNIVARIATE to produce our Shapiro-Wilk normality test for the daily difference, and PROC TTEST will produce our corresponding QQ plots.

The Shapiro-Wilk normality test on the difference between days:

- **Test** – Four different normality tests are presented.
- **Statistic** – The test statistic for each test is provided here.
- **p Value** – The p-value for each test is provided. A p-value < 0.05 would indicate that we should reject the assumption of normality. Since the Shapiro-Wilk p-value for the differences is > 0.05, we do not reject normality and treat the differences as normally distributed.
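If only the normality tests are of interest, the PROC UNIVARIATE output can be narrowed with an ODS SELECT statement. This is a small sketch that assumes the `sleep` data set and its `diff` variable from this example.

```sas
*Show only the table of normality tests for the differences;
*ASSUMPTION: data set sleep with numeric variable diff, as in this example;
ods select TestsForNormality;
proc univariate data=sleep normal;
    var diff;
run;
```

The ODS SELECT statement applies to the next procedure step only, so later procedures print their full output again.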

### QQ Plots

PROC TTEST provides a QQ Plot of the differences between days. The vast majority of points should follow the theoretical normal line.

Since the Shapiro-Wilk test p-value is > 0.05 and the points in the QQ plot of the differences follow the theoretical normal diagonal line, we conclude that the differences between days are normally distributed.
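The same visual checks can also be requested directly from PROC UNIVARIATE, which may be convenient when the t-test itself has not been run yet. This is a sketch under the assumption that the `sleep` data set and `diff` variable are as described above; the reference line uses the sample mean and standard deviation as estimates.

```sas
*Histogram and QQ plot of the differences with a fitted normal reference;
*ASSUMPTION: data set sleep with numeric variable diff, as in this example;
proc univariate data=sleep noprint;
    var diff;
    histogram diff / normal;                      /* overlay a fitted normal curve */
    qqplot diff / normal(mu=est sigma=est);       /* reference line from estimated mean and SD */
run;
```

The NOPRINT option suppresses the tables so that only the plots are produced.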

### Boxplots to Visually Check for Outliers

PROC TTEST will provide a horizontal box plot of the difference between days. This can help visually identify outliers. The boxplot below shows no points outside the whiskers of the plot. As a result, we conclude there are no major outliers present in our differences.
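A standalone box plot of the differences can also be drawn with PROC SGPLOT, for instance when you want to inspect outliers before running the t-test. This is a minimal sketch assuming the `sleep` data set and `diff` variable from this example.

```sas
*Horizontal box plot of the differences to screen for outliers;
*ASSUMPTION: data set sleep with numeric variable diff, as in this example;
proc sgplot data=sleep;
    hbox diff;
run;
```

Points drawn beyond the whiskers are candidates for closer inspection rather than automatic removal.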

### Paired Samples T-test

So far, we have determined that the differences between days are normally distributed and that we do not have major influential outliers. Our next step is to perform the paired samples t-test to determine if there is a statistically significant difference in reaction times between day 1 and day 3. PROC TTEST will produce descriptive statistics on the differences as a part of the paired t-test output:

- **N** – The number of paired observations for which we are taking differences. That is, the number of subjects that were measured on both day 1 and day 3.
- **Mean** – The average of the differences between days 1 and 3.
- **Std Dev** – The standard deviation of the differences between days 1 and 3.
- **Std Err** – The standard error of the differences between days 1 and 3.
- **Min, Max** – The minimum and maximum difference.
- **95% CL Mean** – The 95% confidence interval around the mean difference. That is to say, you can be 95% confident that the true mean difference in reaction times between day 1 and day 3 falls between -31.66 and -5.32.
- **95% CL Std Dev** – The 95% confidence interval of the standard deviation of the differences between days.

### Paired Samples T-test Results in SAS

- **DF** – The degrees of freedom equal the number of paired observations (subjects) minus 1. Pairs that were dropped due to missing values are not counted.
- **t Value** – This is the t-statistic. It is the ratio of the mean difference to the standard error of the differences. This value is computed as follows: -18.5 / 6.24 = -2.96.
- **Pr > |t|** – This is the p-value associated with the paired samples t-test. If the p-value is < 0.05 (assuming alpha = 0.05), then there is a statistically significant difference between days 1 and 3. In essence, we are testing whether the mean difference between days is different from zero. For our example, we have a p-value = 0.0087. Thus, we reject the null hypothesis that the mean difference in reaction times is equal to zero and we conclude that a difference between days exists.
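The arithmetic behind the t value and its two-sided p-value can be reproduced in a short DATA step. This is only a sketch: the mean difference and standard error are the rounded values quoted above, and `df = 17` is an assumption (18 complete pairs) that should be replaced with the DF value shown in your own output.

```sas
*Recompute the t-statistic and two-sided p-value from summary values;
data _null_;
    mean_diff = -18.5;    /* mean difference quoted in the output above */
    std_err   = 6.24;     /* standard error quoted in the output above */
    df        = 17;       /* ASSUMPTION: replace with the DF from your output */
    t = mean_diff / std_err;
    p = 2 * (1 - probt(abs(t), df));   /* two-sided p-value from the t distribution */
    put t= p=;            /* write both values to the SAS log */
run;
```

Because the inputs are rounded, the result in the log may differ slightly from the values PROC TTEST reports.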

## Paired T-test Interpretation and Conclusions

A p-value = 0.0087 indicates that we should reject the null hypothesis that the average difference between day 1 and day 3 reaction times is equal to zero. Thus, we conclude there is a difference in reaction times between days. In a paired samples t-test, the challenge can be correctly interpreting the direction of the difference. It is important to note that day 3 was subtracted from day 1 as follows: diff = day 1 – day 3, and the mean difference was approximately -18.5. Looking back at our descriptive statistics, we can see that the average reaction time for day 1 was approximately 264.5 while the average for day 3 was 283. Thus, on average, study subjects reacted about 18.5 milliseconds slower after 3 days of sleep deprivation than after 1 day. Furthermore, we are 95% confident that the true mean difference in reaction times between day 1 and day 3 falls between -31.66 and -5.32.

## What to do When Assumptions are Broken or Things Go Wrong

The lack of normality of group differences or the existence of major outliers can violate the paired sample t-test assumptions and ultimately impact the results. If this happens, there are several available options:

Performing a nonparametric Wilcoxon signed-rank test is the most popular and well-known alternative. This test is considered robust to violations of normality and to outliers. The Wilcoxon signed-rank test performs a comparison similar to that of a paired samples t-test, but on the ranks of the differences.
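In SAS, one way to obtain the signed-rank test is through the tests for location that PROC UNIVARIATE reports for the difference variable. This is a sketch assuming the `sleep` data set and `diff` variable from this example.

```sas
*Wilcoxon signed-rank test on the paired differences;
*ASSUMPTION: data set sleep with numeric variable diff, as in this example;
ods select TestsForLocation;
proc univariate data=sleep mu0=0;    /* test whether the location of diff equals 0 */
    var diff;
run;
```

The resulting table lists the Student's t, sign, and signed-rank tests side by side; the "Signed Rank" row is the Wilcoxon signed-rank result.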

Additional options include permutation/randomization tests, bootstrap confidence intervals, and transforming the data, but each option has its own stipulations.

If you need to compare more than two dependent groups, a single factor repeated measures analysis of variance (ANOVA) or a nonparametric Friedman test would be appropriate.

Furthermore, if you have one between-subject factor and one within-subject factor to consider simultaneously, then a repeated measures split-plot design and corresponding mixed model ANOVA would be appropriate.

Missing values can severely impact a paired sample t-test because the entire row of data will generally be excluded. If you have a lot of missing data, one alternative would be to perform a single factor repeated measures mixed model ANOVA. This would allow for the computation of estimated marginal means to compensate for the uneven replication between groups.
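A mixed model of this kind needs the data in long format, with one row per subject and day. The following is a rough sketch, not a definitive analysis: it assumes `work.sleep` has one row per subject, and the names `sleep_long`, `subject`, `day`, and `reaction` are hypothetical.

```sas
*Reshape to long format: one row per subject and day;
*ASSUMPTION: work.sleep has one row per subject with variables day1 and day3;
data work.sleep_long;                 /* hypothetical data set name */
    set work.sleep;
    subject = _n_;                    /* hypothetical subject ID taken from the row number */
    day = 1; reaction = day1; output;
    day = 3; reaction = day3; output;
    keep subject day reaction;
run;

*Random-intercept mixed model with estimated marginal means by day;
proc mixed data=work.sleep_long;
    class subject day;
    model reaction = day;
    random intercept / subject=subject;
    lsmeans day / diff cl;            /* least squares means and their difference */
run;
```

Unlike the paired t-test, this model keeps a subject who is missing one of the two measurements, since the remaining row still contributes to the fit.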

A paired samples t-test is not appropriate if each experimental unit (subject) only receives one of two available treatments. For example, if you would like to see if first-year students scored differently on an exam when compared to second-year students, then each subject only has one of two potential factor levels. If this is the case, then an independent samples t-test would be a more appropriate course of action.
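For contrast, an independent samples t-test in PROC TTEST uses a CLASS statement instead of a PAIRED statement. The data set `exams` and the variables `year` and `score` below are hypothetical names chosen to match the example in the paragraph above.

```sas
*Independent samples t-test: each student belongs to exactly one group;
*ASSUMPTION: hypothetical data set exams with a two-level variable year and numeric score;
proc ttest data=exams;
    class year;      /* grouping variable with exactly two levels */
    var score;       /* response compared between the two groups */
run;
```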

## Additional Resources and References

SAS Version 9.4, SAS Institute Inc., Cary, NC.

Gregory Belenky, Nancy J. Wesensten, David R. Thorne, Maria L. Thomas, Helen C. Sing, Daniel P. Redmond, Michael B. Russo and Thomas J. Balkin (2003) Patterns of performance degradation and restoration during sleep restriction and subsequent recovery: a sleep dose-response study. *Journal of Sleep Research* **12**, 1–12.

Littell, R.C., Stroup, W.W., and Freund R.J. (2002). *SAS for Linear Models, Fourth Edition*. Cary, NC: SAS Institute Inc.

Mitra, A. (1998). *Fundamentals of Quality Control and Improvement*. Upper Saddle River, NJ: Prentice Hall.

Lapin, L.L. (1997). *Modern Engineering Statistics*. Belmont, CA: Wadsworth Publishing Company.