I noticed a pattern at one university. The students in the business program were using a P-P plot to examine the distribution of residuals in regression models, even though the Q-Q plot is the one widely referenced in statistics textbooks. So I looked deeper into the differences between P-P and Q-Q plots with simulated data.

**Data Creation**

First, I used R to create a normally distributed data set (*N* = 100) with *M* = 3.0 and *SD* = 1.0.

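```
set.seed(092920)

# Simulate the sample and store it in a data frame named test,
# which is the object the plotting code refers to
ndata <- rnorm(n = 100, mean = 3.0, sd = 1.0)
test <- data.frame(ndata = ndata)
```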

**Review of Histogram**

Next, I used **ggplot2** (Wickham et al., 2020) to create a histogram of the data.

The data look approximately normal; however, note the gap between the points in each tail and the rest of the data.
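The histogram code is not shown in the post, so here is a minimal sketch of how it could be drawn with **ggplot2**. It assumes the sample is regenerated with the same seed and `rnorm()` parameters and stored in a data frame called `test`. The bin count, colours, and axis labels are illustrative choices, not the settings behind the original figure.

```r
library(ggplot2)

# Recreate the simulated sample (same seed and rnorm() parameters as the post)
set.seed(092920)
test <- data.frame(ndata = rnorm(n = 100, mean = 3.0, sd = 1.0))

# Histogram of the simulated values; bins = 15 is an assumed, illustrative choice
hist_plot <- ggplot(data = test, mapping = aes(x = ndata)) +
  geom_histogram(bins = 15, colour = "black", fill = "grey70") +
  labs(x = "Simulated value", y = "Count")

print(hist_plot)
```

Changing `bins` can make the tails look more or less isolated, which is one reason a histogram alone is a weak basis for judging normality.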

**Tests of Normality**

Next, I’ll perform a series of statistical tests to see whether the data follow a theoretical normal distribution. For this illustration, I’ll use six different tests: the Shapiro-Wilk test (found in the **stats** package, which is loaded automatically by R), and the Anderson-Darling, Cramer-von Mises, Kolmogorov-Smirnov with Lilliefors correction, Pearson chi-square, and Shapiro-Francia tests (all found in the **nortest** package; Gross & Ligges, 2015).

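```
library(nortest)  # provides every test below except shapiro.test()

shapiro.test(test$ndata)  # Shapiro-Wilk (stats package)
ad.test(test$ndata)       # Anderson-Darling
cvm.test(test$ndata)      # Cramer-von Mises
lillie.test(test$ndata)   # Kolmogorov-Smirnov with Lilliefors correction
pearson.test(test$ndata)  # Pearson chi-square
sf.test(test$ndata)       # Shapiro-Francia
```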

```
	Shapiro-Wilk normality test

data:  test$ndata
W = 0.99155, p-value = 0.02226


	Anderson-Darling normality test

data:  test$ndata
A = 0.72545, p-value = 0.05808


	Cramer-von Mises normality test

data:  test$ndata
W = 0.10841, p-value = 0.0859


	Lilliefors (Kolmogorov-Smirnov) normality test

data:  test$ndata
D = 0.042662, p-value = 0.07763


	Pearson chi-square normality test

data:  test$ndata
P = 62.88, p-value = 1.344e-06


	Shapiro-Francia normality test

data:  test$ndata
W = 0.99237, p-value = 0.03868
```

Interesting. Three of the tests (Anderson-Darling, Cramer-von Mises, and Kolmogorov-Smirnov with Lilliefors correction) did not reject the hypothesis that the data follow a theoretical normal distribution (*p* > .05), while the three others (Shapiro-Wilk, Pearson chi-square, and Shapiro-Francia) did (*p* < .05). What to do?

One could pick a single test and make a decision, but the histogram and the conflicting test results may show the reader that the decision was subjective. Let’s try plotting the data against a theoretical normal distribution.
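When several tests disagree, it can help to put their results side by side. The sketch below assumes the **nortest** package is installed and regenerates the sample with the same seed and `rnorm()` parameters. The object names `results` and `summary_table` are hypothetical. The printed values depend on the simulated sample and need not match the output shown above.

```r
library(nortest)  # ad.test(), cvm.test(), lillie.test(), pearson.test(), sf.test()

# Recreate the simulated sample (same seed and rnorm() parameters as the post)
set.seed(092920)
ndata <- rnorm(n = 100, mean = 3.0, sd = 1.0)

# Run each test once and keep the returned "htest" objects in a named list
results <- list(
  "Shapiro-Wilk"     = shapiro.test(ndata),
  "Anderson-Darling" = ad.test(ndata),
  "Cramer-von Mises" = cvm.test(ndata),
  "Lilliefors (K-S)" = lillie.test(ndata),
  "Pearson chi-sq"   = pearson.test(ndata),
  "Shapiro-Francia"  = sf.test(ndata)
)

# Collect the statistic and p-value of every test in one table
summary_table <- data.frame(
  test      = names(results),
  statistic = vapply(results, function(r) unname(r$statistic), numeric(1)),
  p_value   = vapply(results, function(r) unname(r$p.value), numeric(1)),
  row.names = NULL
)

# Flag the tests that reject normality at the conventional .05 level
summary_table$rejects_normality <- summary_table$p_value < 0.05

print(summary_table)
```

A table like this makes the split between tests explicit instead of leaving the reader to scan six separate printouts.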

**The P-P Plot**

Using **ggplot2** and **qqplotr** (Almeida et al., 2020), I created a P-P plot based on the data and plotted a 95% CI band around the reference line:

```
library(ggplot2)
library(qqplotr)

# P-P plot: 95% confidence band, reference line, and points
ggplot(data = test, mapping = aes(sample = ndata)) +
  stat_pp_band() +
  stat_pp_line() +
  stat_pp_point() +
  labs(x = "Probability Points", y = "Cumulative Probability")
```

Note the “submarine sandwich” shape of the 95% CI band around the data. A P-P plot compares cumulative probabilities, so both axes run from 0 to 1 and the plot is most sensitive to departures in the middle of the distribution, such as skewness or asymmetry around the mode; differences in the tails are compressed. An emerging researcher relying on a P-P plot could point to some of the statistical tests to state that the distribution follows a normal distribution and use the P-P plot to support that conclusion.
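To make concrete what a P-P plot displays, here is a base R sketch that computes the plotted coordinates by hand, following the axis labels used above (probability points on the x-axis, cumulative probability on the y-axis). It assumes the normal parameters are estimated with `mean()` and `sd()`; **qqplotr** may estimate them differently, so treat this as an approximation of the idea rather than a reproduction of the package's internals.

```r
# Recreate the simulated sample (same seed and rnorm() parameters as the post)
set.seed(092920)
ndata <- rnorm(n = 100, mean = 3.0, sd = 1.0)

n <- length(ndata)
sorted <- sort(ndata)

# x-axis: evenly spaced probability points, (i - 0.5) / n for n > 10
prob_points <- ppoints(n)

# y-axis: normal CDF evaluated at each ordered observation,
# using the sample mean and SD as parameter estimates (an assumption)
cum_prob <- pnorm(sorted, mean = mean(ndata), sd = sd(ndata))

pp <- data.frame(prob_points, cum_prob)

# Largest vertical distance from the identity line y = x
max_gap <- max(abs(pp$cum_prob - pp$prob_points))

# Base R version of the plot, with the identity line as reference
plot(pp$prob_points, pp$cum_prob,
     xlab = "Probability Points", ylab = "Cumulative Probability")
abline(a = 0, b = 1)
```

Because both coordinates are probabilities, an extreme observation can move a point only a small distance near 0 or 1, which is why tail departures are hard to see in this plot.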

**The Q-Q Plot**

Next, let’s draw a Q-Q plot using the same parameters:

```
# Q-Q plot: 95% confidence band, reference line, and points
ggplot(data = test, mapping = aes(sample = ndata)) +
  stat_qq_band() +
  stat_qq_line() +
  stat_qq_point() +
  labs(x = "Theoretical Quantiles", y = "Sample Quantiles")
```

Interesting. In the Q-Q plot, points at both tails fall outside the 95% CI band of a theoretical normal distribution. A Q-Q plot magnifies deviations at the tails. Thus, an emerging scholar looking at a Q-Q plot alongside certain tests of normality could decide that a residual or a variable did (or did not) follow a normal distribution.
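The same hand computation for a Q-Q plot shows where the tail magnification comes from. This base R sketch pairs the ordered sample with standard normal quantiles at the same probability points and then compares the spacing of neighbouring theoretical quantiles in the tail and in the centre. The variable names are hypothetical, and the reference line comes from base R's `qqline()` (which passes through the first and third quartiles), not from **qqplotr**.

```r
# Recreate the simulated sample (same seed and rnorm() parameters as the post)
set.seed(092920)
ndata <- rnorm(n = 100, mean = 3.0, sd = 1.0)
n <- length(ndata)

# x-axis: standard normal quantiles at the probability points
theoretical <- qnorm(ppoints(n))

# y-axis: the ordered observations (sample quantiles)
sample_q <- sort(ndata)

qq <- data.frame(theoretical, sample_q)

# Distance between neighbouring theoretical quantiles:
# wide in the tails, narrow in the centre
spacing <- diff(theoretical)
tail_spacing <- spacing[1]
centre_spacing <- spacing[n %/% 2]

# Base R version of the plot, with a line through the quartiles
plot(qq$theoretical, qq$sample_q,
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")
qqline(ndata)
```

The quantile scale stretches the tails, so the same extreme observations that barely move a point on a P-P plot are spread out and easy to see here.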

It appears a P-P plot is best when used to explore extremely peaked distributions, while a Q-Q plot is best used to explore the influence of tails of a distribution.

**Why is a P-P Plot chosen more frequently at this school?**

I corresponded with a methodologist at this University and she shared a few thoughts:

- Many universities (and students) use SPSS in their coursework. In the regression menu option, there is a Probability Plot option box. If checked, it creates a P-P plot.
  *Note: A Q-Q plot is not offered within the regression menu. See this link on how to create a Q-Q plot from regression residuals in SPSS.*
- Field (2018) is used as the associated textbook when teaching SPSS in doctoral business programs. The author prominently discusses P-P plots in this edition of the textbook.
  *Note: He also covers Q-Q plots, but less prominently: the discussion is buried in a graphics section and refers back to an earlier discussion of quantiles and quartiles. In the R version of the book (Field et al., 2012), the Q-Q plot is referenced and there is no reference to a P-P plot.*

**Student Notes: Don’t be a slave to a single author’s view: Expand your knowledge by reading different points of view.** **Don’t be a slave to a menu-based system: Learn about the statistical tests, how they are interpreted, and what the plots represent.**

References:

Almeida, A., Loy, A., & Hofmann, H. (2020, February 4). *qqplotr: Quantile-quantile plot extensions for ‘ggplot2’*. https://cran.r-project.org/web/packages/qqplotr/qqplotr.pdf

Field, A. (2018). *Discovering statistics using IBM SPSS Statistics* (5th ed.). SAGE Publications.

Field, A., Miles, J., & Field, Z. (2012). *Discovering statistics using R*. SAGE Publications.

Gross, J., & Ligges, U. (2015, July 29). *nortest: Tests for normality*. https://cran.r-project.org/web/packages/nortest/nortest.pdf

Wickham, H., Chang, W., Henry, L., Pedersen, T. L., Takahashi, K., Wilke, C., Woo, K., Yutani, H., & Dunnington, D. (2020, June 19). *ggplot2: Create elegant data visualisations using the Grammar of Graphics*. https://cloud.r-project.org/web/packages/ggplot2/ggplot2.pdf