## Introduction

A paired samples t-test is performed when an analyst would like to test for mean differences between two related treatments or conditions. If the same experimental unit (subject) is measured multiple times and you would like to test for differences, you may need a repeated measures analysis such as a paired t-test. A paired samples t-test is the simplest form of “within-subject” or “repeated measures” analysis.

Repeated measures can occur over time or space. For example, you may want to see if students improved their knowledge over the course of a term by checking for differences between mid-term and final exam scores. Since the same student is measured at two separate time points, the measurements are considered repeated over time.

Repeated measures over space can be a little more difficult to understand. For example, you may want to check for differences in blood pressure between measurements taken on the right arm and left arm. Since the same study subject is measured with both treatment conditions in two locations, this would be considered a repeated measurement over space.

A paired samples t-test is performed when each experimental unit (study subject) receives both available treatment conditions. Thus, the treatment groups have overlapping membership and are considered dependent.

The two-sided null hypothesis is that the mean treatment difference is equal to zero. The alternative hypothesis is that the mean treatment difference is not equal to zero.

$$\mu_{diff} = \mu_{1} - \mu_{2}$$

$$H_{0}: \mu_{diff} = 0 \qquad H_{a}: \mu_{diff} \neq 0$$
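For reference, the test statistic behind these hypotheses can be written out as follows. This is the standard textbook form; the symbols $d_i$, $\bar{d}$, $s_d$, and $n$ are notation introduced here (not used elsewhere in this article) for the per-subject differences, their mean, their standard deviation, and the number of pairs.

$$
d_i = x_{1i} - x_{2i}, \qquad
\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i, \qquad
s_d = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(d_i - \bar{d}\right)^2}
$$

$$
t = \frac{\bar{d}}{s_d/\sqrt{n}}, \qquad df = n - 1
$$

Under the null hypothesis, and assuming the differences are normally distributed, $t$ follows a t-distribution with $n - 1$ degrees of freedom.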

## Paired Samples T-test Assumptions

The following assumptions must be met in order to run a paired samples t-test:

- The response of interest should be a continuous measure.
- The difference between the two related treatment groups should be normally distributed.
- The differences between the two related groups should contain no major outliers.
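As a rough sketch of how these assumptions might be screened in base R, the snippet below uses simulated differences. The vector `diffs` is hypothetical and is not the study data; `shapiro.test` and `boxplot.stats` are both part of base R.

```r
# Hypothetical paired differences (simulated; not the sleep study data)
set.seed(42)
diffs <- rnorm(18, mean = -18, sd = 25)

# Shapiro-Wilk test: a small p-value suggests the differences are not normal
sw <- shapiro.test(diffs)
sw$p.value

# Values beyond the boxplot whiskers (1.5 * IQR rule) are flagged as potential outliers
outliers <- boxplot.stats(diffs)$out
length(outliers)
```

Both checks are applied to the differences rather than to each day separately, because the paired t-test only makes assumptions about the differences.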

## Paired Samples T-test Example in R

In this example, we will test whether there is a statistically significant difference in the reaction times of participants in a sleep deprivation study. Study participants were allowed only 3 hours of sleep per night. After waking, a series of tests was administered and the average reaction time was recorded for each subject. We would like to check whether reaction times differed significantly between day 1 and day 3 of the study. These data are a subset of a larger experiment.

Variables:

Day1, Day3 = Reaction times on day 1 and day 3


## Paired Samples T-test R Code

Each package used in the example can be installed with the install.packages commands as follows:

```
install.packages("gmodels", dependencies = TRUE)
install.packages("car", dependencies = TRUE)
install.packages("ggplot2", dependencies = TRUE)
install.packages("qqplotr", dependencies = TRUE)
install.packages("dplyr", dependencies = TRUE)
install.packages("tidyr", dependencies = TRUE)
```

The R code below includes a Shapiro-Wilk normality test and a QQ plot of the paired differences. Data manipulation and summary statistics are performed using the dplyr and tidyr packages. Boxplots are created using the ggplot2 package. QQ plots are created with the qqplotr package. The shapiro.test and t.test functions are included in the base stats package.

Levene’s Test for Equality of Variances, which is commonly run before an independent samples t-test, is not needed here. A paired samples t-test analyzes the within-subject differences, so it makes no assumption about equal variances between the two days. The code below therefore checks only the normality of the differences and the presence of outliers.

Here is the annotated code for the example. All assumption checks are provided along with the paired t-test:

```
library("gmodels")
library("car")
library("ggplot2")
library("qqplotr")
library("dplyr")
library("tidyr")
# Import the data
dat <- read.csv("C:/Dropbox/Website/Analysis/Paired T/Data/sleep.csv")

# Compute the within-subject differences (day 1 - day 3) used by the assumption checks
dat$diff <- dat$Day1 - dat$Day3

# Create a 'long' or 'tall' dataset for descriptive statistics
dat_long <- gather(dat, Day, Activity, Day1, Day3, factor_key = TRUE)

# Produce descriptive statistics by group
dat_long %>%
  select(Activity, Day) %>%
  group_by(Day) %>%
  summarise(n = n(),
            mean = mean(Activity, na.rm = TRUE),
            sd = sd(Activity, na.rm = TRUE),
            stderr = sd / sqrt(n),
            LCL = mean - qt(1 - (0.05 / 2), n - 1) * stderr,
            UCL = mean + qt(1 - (0.05 / 2), n - 1) * stderr,
            median = median(Activity, na.rm = TRUE),
            min = min(Activity, na.rm = TRUE),
            max = max(Activity, na.rm = TRUE),
            IQR = IQR(Activity, na.rm = TRUE))

# Perform the Shapiro-Wilk Test for Normality on the differences
shapiro.test(dat$diff)

# Produce a QQ plot of the differences
ggplot(data = dat, mapping = aes(sample = diff)) +
  stat_qq_band(alpha = 0.25, conf = 0.95, qtype = 1, bandType = "boot", B = 5000, fill = "red") +
  stat_qq_line(identity = TRUE) +
  stat_qq_point(col = "black") +
  labs(x = "Theoretical Quantiles", y = "Sample Quantiles") + theme_bw()

# Produce a boxplot of the differences and visually check for outliers
ggplot(dat, aes(x = "", y = diff)) +
  stat_boxplot(geom = "errorbar", width = 0.5) +
  geom_boxplot(fill = "light blue") +
  # 'fun' replaces the deprecated 'fun.y' argument in ggplot2 >= 3.3.0
  stat_summary(fun = mean, geom = "point", shape = 10, size = 3.5, color = "black") +
  ggtitle("Boxplot of day 1 - day 3 differences") +
  theme_bw() + theme(legend.position = "none")

# Perform a paired samples t-test (pairs with missing values are dropped automatically)
m1 <- t.test(x = dat$Day1, y = dat$Day3, paired = TRUE)
print(m1)
```
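A paired t-test is equivalent to a one-sample t-test on the within-subject differences, which is why the assumption checks focus on the differences. The sketch below illustrates this with a small set of made-up reaction times; the vectors `day1` and `day3` are hypothetical and are not the study data.

```r
# Hypothetical reaction times for 6 subjects (illustrative values only)
day1 <- c(250, 270, 265, 240, 300, 280)
day3 <- c(275, 268, 290, 262, 310, 305)

# Paired t-test on the two columns
paired <- t.test(day1, day3, paired = TRUE)

# One-sample t-test on the differences against a mean of zero
one_sample <- t.test(day1 - day3, mu = 0)

# Both calls yield the same t statistic, degrees of freedom, and p-value
c(paired$statistic, one_sample$statistic)
c(paired$parameter, one_sample$parameter)
c(paired$p.value, one_sample$p.value)
```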

## Paired Samples T-Test Annotated R Output

### Descriptive Statistics

Many times, analysts forget to take a good look at their data prior to performing statistical tests. Descriptive statistics are not only used to describe the data but also help determine if any inconsistencies are present. Detailed investigation of descriptive statistics can help answer the following questions (in addition to many others):

- How much missing data do I have?
- Do I have potential outliers?
- Are my standard deviation and standard error values large relative to the mean?
- In what range does most of my data fall for each treatment?

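| Day^{a} | n^{b} | mean^{c} | sd^{d} | stderr^{e} | LCL^{f} | UCL^{f} | median^{g} | min^{h} | max^{h} | IQR^{i} |
|---|---|---|---|---|---|---|---|---|---|---|
| Day1 | 18 | 264. | 33.4 | 7.88 | 248. | 281. | 273. | 194. | 314. | 45.4 |
| Day3 | 18 | 283. | 38.9 | 9.16 | 264. | 302. | 281. | 205. | 347. | 55.3 |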

- **Day**^{a} – This column identifies the levels of the treatment variable (day 1 and day 3).
- **n**^{b} – This column identifies how many data points (subjects) are in each day group.
- **mean**^{c} – The mean value for each treatment group.
- **sd**^{d} – The standard deviation of each treatment group.
- **stderr**^{e} – The standard error of each treatment group, computed as sd / sqrt(n).
- **LCL, UCL**^{f} – The lower and upper 95% confidence limits of the mean. That is to say, you can be 95% confident that the true mean falls between the lower and upper values specified for each treatment group, assuming the data are normally distributed.
- **median**^{g} – The median value for each treatment group.
- **min, max**^{h} – The minimum and maximum values observed for each treatment group.
- **IQR**^{i} – The interquartile range of each treatment group. The interquartile range is the 75^{th} percentile – 25^{th} percentile.
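To make the standard error and confidence limit columns concrete, the following sketch recomputes them for Day1 from the rounded values shown in the table. Because the inputs are rounded to three significant digits, the results should only be expected to agree with the table to within rounding; the variable names are chosen here for illustration.

```r
# Summary values for Day1 as reported in the table (rounded)
n <- 18
mean_day1 <- 264
sd_day1 <- 33.4

# Standard error of the mean
stderr_day1 <- sd_day1 / sqrt(n)

# Two-sided 95% critical value from the t-distribution with n - 1 df
t_crit <- qt(1 - 0.05 / 2, df = n - 1)

# Lower and upper 95% confidence limits of the mean
lcl <- mean_day1 - t_crit * stderr_day1
ucl <- mean_day1 + t_crit * stderr_day1

round(c(stderr = stderr_day1, LCL = lcl, UCL = ucl), 1)
```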

### Normality Tests

Prior to performing a paired t-test, it is important to validate our assumptions to ensure that we are performing an appropriate and reliable comparison. Normality should be tested on the day differences using a Shapiro-Wilk normality test (or equivalent), and/or a QQ plot for large sample sizes. Histograms can also be helpful.

```
Shapiro-Wilk normality test
data: dat$diff
W^{a} = 0.95927, p-value^{b} = 0.5877
```

- **W**^{a} – The Shapiro-Wilk (W) test statistic for the paired differences.
- **p-value**^{b} – The p-value for the test. A p-value < 0.05 would indicate that we should reject the assumption of normality. Since the Shapiro-Wilk Test p-value (0.5877) is > 0.05, we do not reject normality and treat the differences as normally distributed.

### QQ Plots

The vast majority of points should follow the diagonal theoretical normal line and stay within the curved 95% bootstrapped confidence bands to be considered normally distributed.

Since the Shapiro-Wilk Test p-value is > 0.05, and the QQ Plot of the differences follows the QQ plot theoretical normal diagonal line, we conclude the daily difference is normally distributed.

### Boxplots to Visually Check for Outliers

The ggplot2 package provides a box plot of the paired day differences. This can help visually identify outliers. The boxplot shows no points outside the whiskers of the plot. As a result, we conclude there are no major outliers present in our differences.

### Paired Samples T-test Results in R

So far, we have determined that the differences between days are normally distributed and that we do not have major influential outliers. Our next step is to perform the paired samples t-test to determine if there is a statistically significant difference in activity scores between day 1 and day 3.

```
Paired t-test
data: dat$Day1 and dat$Day3
t = -2.9635^{a}, df = 17^{b}, p-value = 0.008705^{c}
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-31.664248^{d} -5.328263^{d}
sample estimates:
mean of the differences
-18.49626^{e}
```

- **t**^{a} – This is the t-statistic. It is the ratio of the mean difference to the standard error of the differences. This value is computed as follows: -18.5 / 6.24 = -2.96.
- **df**^{b} – The degrees of freedom equal the number of paired observations (subjects) minus 1, here 18 – 1 = 17. Pairs that were dropped due to missing values do not count toward this total.
- **p-value**^{c} – This is the p-value associated with the paired samples t-test. If the p-value is < 0.05 (assuming alpha = 0.05), then there is a statistically significant difference between days 1 and 3. In essence, we are testing whether the mean difference between days is different from zero. For our example, we have a p-value = 0.0087. Thus, we reject the null hypothesis that the mean difference between activity scores is equal to zero and we conclude that a difference between days exists.
- **95% confidence interval**^{d} – The 95% confidence interval around the mean difference. That is to say, you can be 95% confident that the true mean difference in activity scores between day 1 and day 3 falls between -31.66 and -5.33.
- **sample estimates**^{e} – The average of the differences between days 1 and 3.
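The reported quantities are tied together by simple arithmetic, which can be checked directly. The sketch below starts from the mean difference, t statistic, and degrees of freedom printed in the output and recovers the implied standard error, the two-sided p-value, and the 95% confidence interval. The variable names are chosen here for illustration, and the results should match the output only to within rounding.

```r
# Values taken from the paired t-test output
mean_diff <- -18.49626   # mean of the differences
t_stat <- -2.9635        # t statistic
df <- 17                 # 18 pairs - 1

# Standard error implied by t = mean_diff / se (about 6.24)
se_diff <- mean_diff / t_stat

# Two-sided p-value from the t-distribution
p_value <- 2 * pt(-abs(t_stat), df)

# 95% confidence interval: mean_diff +/- critical value * standard error
ci <- mean_diff + c(-1, 1) * qt(0.975, df) * se_diff

round(c(se = se_diff, p = p_value, lower = ci[1], upper = ci[2]), 4)
```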

## Paired Samples T-test Interpretation and Conclusions

A p-value = 0.0087 indicates that we should reject the null hypothesis that the average difference between day 1 and day 3 activity scores is equal to zero. Thus, we conclude there is a difference in activity between days. In a paired samples t-test, the challenge can be correctly interpreting the direction of the difference. It is important to note that day 3 was subtracted from day 1 as follows: diff = day 1 – day 3, and the mean difference was approximately -18.5. Looking back at our descriptive statistics, we can see that the average activity score for day 1 was approximately 264.5 while the average for day 3 was 283. Thus, on average, study subjects responded approximately 18.5 milliseconds slower after 3 days of sleep deprivation than after 1 day. Furthermore, we are 95% confident that the true mean difference in activity scores between day 1 and day 3 falls between -31.66 and -5.33.
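The direction of the difference depends only on the order of the arguments passed to `t.test`. As a minimal sketch with made-up values (the vectors `day1` and `day3` are hypothetical, not the study data), swapping the arguments flips the sign of the estimate and the confidence interval but leaves the p-value unchanged:

```r
# Hypothetical reaction times (illustrative values only)
day1 <- c(250, 270, 265, 240, 300, 280)
day3 <- c(275, 268, 290, 262, 310, 305)

d13 <- t.test(day1, day3, paired = TRUE)  # estimate is mean(day1 - day3)
d31 <- t.test(day3, day1, paired = TRUE)  # estimate is mean(day3 - day1)

# Same magnitude, opposite sign; identical p-values
c(d13$estimate, d31$estimate)
c(d13$p.value, d31$p.value)
```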

## What to do When Assumptions are Broken or Things Go Wrong

The lack of normality of group differences or the existence of major outliers can violate the paired sample t-test assumptions and ultimately impact the results. If this happens, there are several available options:

Performing a nonparametric Wilcoxon signed-rank test is the most popular and well-known alternative. This test is considered robust to violations of normality and to outliers. The Wilcoxon signed-rank test performs a comparison similar to that of a paired samples t-test, but on the ranks of the differences.
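A minimal sketch of the Wilcoxon signed-rank test using the base R `wilcox.test` function is shown below. The data are made up for illustration (the vectors `day1` and `day3` are hypothetical) and were chosen so that no two absolute differences are tied.

```r
# Hypothetical reaction times for 8 subjects (illustrative values only)
day1 <- c(250.1, 270.4, 265.3, 240.8, 300.2, 280.6, 255.9, 291.7)
day3 <- c(275.5, 268.0, 290.9, 262.1, 310.0, 305.3, 259.4, 301.2)

# paired = TRUE requests the signed-rank test on the differences day1 - day3
w <- wilcox.test(day1, day3, paired = TRUE)

w$statistic  # V: sum of the ranks of the positive differences
w$p.value
```

With ties or zero differences, `wilcox.test` falls back to a normal approximation and issues a warning, so the exact p-value is only available for data like the above.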

Additional options include permutation/randomization tests, bootstrap confidence intervals, and data transformations, but each option has its own stipulations.
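As one possible sketch of the resampling options, the code below computes a percentile bootstrap confidence interval for the mean difference and a sign-flip permutation p-value, using only base R. The vector `diffs` is hypothetical, and the number of resamples `B` is an arbitrary choice made here.

```r
set.seed(123)

# Hypothetical paired differences (illustrative values only)
diffs <- c(-25.4, 2.4, -25.6, -21.3, -9.8, -24.7, -3.5, -9.5)
B <- 5000

# Percentile bootstrap: resample the differences with replacement
boot_means <- replicate(B, mean(sample(diffs, replace = TRUE)))
boot_ci <- quantile(boot_means, probs = c(0.025, 0.975))

# Sign-flip permutation test: under H0 the sign of each difference is arbitrary
obs <- mean(diffs)
perm_means <- replicate(
  B, mean(diffs * sample(c(-1, 1), length(diffs), replace = TRUE))
)
perm_p <- mean(abs(perm_means) >= abs(obs))

boot_ci
perm_p
```

Both procedures resample the differences, not the individual measurements, so the pairing between days is preserved.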

If you need to compare more than two dependent groups, a single factor repeated measures analysis of variance (ANOVA) or a nonparametric Friedman test would be appropriate.
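For illustration, a Friedman test on three repeated measurements might look like the sketch below, using the base R `friedman.test` function. The matrix `rt` is made up (rows are subjects, columns are days), and the `Day2` column is a hypothetical third condition that is not part of this article's example.

```r
# Hypothetical reaction times: rows = subjects (blocks), columns = days (treatments)
rt <- matrix(c(250, 262, 275,
               270, 268, 281,
               265, 280, 290,
               240, 255, 262,
               300, 304, 310,
               280, 291, 305),
             ncol = 3, byrow = TRUE,
             dimnames = list(paste0("s", 1:6), c("Day1", "Day2", "Day3")))

fr <- friedman.test(rt)

fr$statistic  # Friedman chi-squared
fr$parameter  # df = number of treatments - 1
fr$p.value
```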

Furthermore, if you have one between-subject factor and one within-subject factor to consider simultaneously, then a repeated measures split-plot design and corresponding mixed model ANOVA would be appropriate.

Missing values can severely impact a paired sample t-test because the entire row of data will generally be excluded. If you have a lot of missing data, one alternative would be to perform a single factor repeated measures mixed model ANOVA. This would allow for the computation of estimated marginal means to compensate for the uneven replication between groups.
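The effect of missing values on the number of usable pairs can be seen in a small sketch. The vectors below are hypothetical; each has one missing value in a different row, so two of the six pairs are incomplete.

```r
# Hypothetical data with one missing value per day, in different rows
day1 <- c(250, 270, NA, 240, 300, 280)
day3 <- c(275, 268, 290, NA, 310, 305)

# Only rows observed on both days can be used
n_complete <- sum(complete.cases(day1, day3))

# t.test drops incomplete pairs, so df = complete pairs - 1
res <- t.test(day1, day3, paired = TRUE)

n_complete
res$parameter
```

Here two missing values cost two full pairs, which is why heavy missingness can erode the power of a paired analysis quickly.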

A paired samples t-test is not appropriate if each experimental unit (subject) only receives one of two available treatments. For example, if you would like to see if first-year students scored differently on an exam when compared to second-year students, then each subject only has one of two potential factor levels. If this is the case, then an independent samples t-test would be a more appropriate course of action.
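For contrast, an independent samples comparison in base R simply omits the `paired` argument. The exam scores below are made up for illustration; note that the two groups do not even need to be the same size.

```r
# Hypothetical exam scores for two separate groups of students
first_year  <- c(71, 65, 80, 77, 69, 74, 82)
second_year <- c(78, 85, 73, 88, 90, 79)

# By default t.test runs Welch's test, which does not assume equal variances
ind <- t.test(first_year, second_year)

ind$method
ind$p.value
```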

## Additional Resources and References

Muenchen, R.A. (2011). *R for SAS and SPSS Users, Second Edition*. New York, NY: Springer, LLC.

Belenky, G., Wesensten, N.J., Thorne, D.R., Thomas, M.L., Sing, H.C., Redmond, D.P., Russo, M.B., and Balkin, T.J. (2003). Patterns of performance degradation and restoration during sleep restriction and subsequent recovery: a sleep dose-response study. *Journal of Sleep Research* **12**, 1–12.

Littell, R.C., Stroup, W.W., and Freund R.J. (2002). *SAS for Linear Models, Fourth Edition*. Cary, NC: SAS Institute Inc.

Mitra, A. (1998). *Fundamentals of Quality Control and Improvement*. Upper Saddle River, NJ: Prentice Hall.

Lapin, L.L. (1997). *Modern Engineering Statistics*. Belmont, CA: Wadsworth Publishing Company.