Lukas Soenning
Lehrstuhl für englische Sprachwissenschaft einschließlich Sprachgeschichte
Research seminar
January 8th, 2015

This short tutorial shows how to construct useful plots in R using the lattice package (Sarkar 2014).
It is an online appendix to the presentation Graphical methods for data analysis.

## Contents

1. Plotting a single distribution
   1. Histogram
   2. Density plot
   3. Boxplot
   4. Quantile plot

2. Comparing two distributions
   1. Histogram
   2. Density plot
   3. Boxplot

3. Scatterplots
   1. Regression lines and smoothers
   2. Reversing the x-axis
   3. Avoiding overplotting: Jittering and transparency
   4. Superposition

4. Multipanel conditioning
   1. Conditioning on a categorical variable
   2. Conditioning on a quantitative variable
   3. Conditioning on two variables

5. Multivariate plots
   1. Scatterplot matrix
   2. Parallel coordinates plot

6. Categorical variables
   1. Spine plot
   2. Mosaic plot

#### Preparations: R

Install the lattice package.

install.packages("lattice")

library(lattice)

The following code changes the settings to produce black and white output (and a transparent background for the strips of the panels).

bw.theme <- canonical.theme(color = FALSE)
bw.theme$strip.background$col <- "transparent"
lattice.options(default.theme = bw.theme)  

#### Preparations: Data

Create objects named malta and items with the data (adjust the file paths to the location of the files on your computer).

malta <- read.csv(file="C:/Users/ba4rh5/Desktop/malta.csv")
items <- read.csv(file="C:/Users/ba4rh5/Desktop/items.csv")

## Quantitative variables

### 1 Plotting a single distribution

#### 1.1 Histogram

Histograms can be drawn with the function histogram(). A histogram of the distribution of the mean questionnaire scores:

histogram(malta$Mean)

You can control the number of bins with the nint= argument:

histogram(malta$Mean, nint=30)

#### 1.2 Density plot

Density plots avoid the arbitrary choice of intervals for binning the data. They are essentially smoothed histograms.

densityplot(malta$Mean)

The smoothness of the density plot can be controlled with the bandwidth parameter, which can be set with the argument bw=.

densityplot(malta$Mean, bw=.07)

The following arguments suppress the data points and add a reference line at zero:
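
A minimal sketch of such a call, assuming the lattice arguments plot.points= and ref= are the ones meant here. The data frame malta.sim is a simulated stand-in for the malta data (hypothetical values) so that the example runs on its own; the bandwidth is carried over from the previous call.

```r
library(lattice)

# Simulated stand-in for the malta data (hypothetical values, not the real scores)
set.seed(1)
malta.sim <- data.frame(Mean = rnorm(100, mean = 3, sd = 0.5))

# plot.points = FALSE hides the individual observations,
# ref = TRUE draws a horizontal reference line at zero density
p <- densityplot(malta.sim$Mean, bw = .07,
                 plot.points = FALSE, ref = TRUE)
print(p)
```

With the real data, malta.sim$Mean would simply be replaced by malta$Mean.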

#### 1.4 Quantile plot

Quantile plots can be drawn with the function qqmath(). The additional argument distribution="qunif" is necessary to draw the type of quantile plot shown in the presentation.
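
A minimal sketch of a quantile plot, again using a simulated stand-in (malta.sim, hypothetical values) in place of the malta data. Here the quantile function qunif is passed directly rather than by name, which is the form used in the lattice documentation.

```r
library(lattice)

# Simulated stand-in for the malta data (hypothetical values, not the real scores)
set.seed(1)
malta.sim <- data.frame(Mean = rnorm(100, mean = 3, sd = 0.5))

# distribution = qunif plots the sorted data against evenly spaced
# fractions between 0 and 1, instead of against normal quantiles (the default)
p <- qqmath(malta.sim$Mean, distribution = qunif)
print(p)
```

Leaving out the distribution= argument would give a normal quantile-quantile plot instead.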

#### 3.3 Avoiding overplotting: Jittering and transparency

If quantitative variables take on only a few different values, overplotting is an issue. It is difficult to judge where most observations concentrate.

xyplot(Preposition ~ Idiom, data=malta)

Jittering adds a bit of random noise to the variables:

xyplot(jitter(Preposition) ~ jitter(Idiom), data=malta)

For more random noise:

xyplot(jitter(Preposition, 3) ~ jitter(Idiom, 3), data=malta)

Another solution is to use transparency. The argument alpha= specifies the degree of transparency. For example, .5 means that 1/.5 = 2 points add up to black; .2 means that 1/.2 = 5 points add up to black. Jittering and transparency can be combined:

xyplot(jitter(Preposition, 3) ~ jitter(Idiom, 3), alpha=.2, data=malta)

#### 3.4 Superposition

We can compare two groups by plotting them into the same panel. The argument groups= specifies which groups to compare:

xyplot(Mean ~ Age, groups=Gender, data=malta, type=c("p", "r"))

The argument auto.key=TRUE adds a legend. Unfortunately only the points are shown (it is possible to add the lines to the legend, but this is a bit more complicated).

xyplot(Mean ~ Age, groups=Gender, data=malta, type=c("p", "r"), auto.key=TRUE)
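
One comparatively simple way to get both points and lines into the legend is to pass a list to auto.key= instead of TRUE. This is only a sketch and may not be the approach the remark above has in mind; malta.sim is a simulated stand-in for the malta data with hypothetical values.

```r
library(lattice)

# Simulated stand-in for the malta data (hypothetical values)
set.seed(1)
malta.sim <- data.frame(
  Age    = sample(18:70, 80, replace = TRUE),
  Gender = factor(sample(c("female", "male"), 80, replace = TRUE))
)
malta.sim$Mean <- 3 + 0.01 * malta.sim$Age + rnorm(80, sd = 0.4)

# auto.key as a list: request a point symbol and a line sample for each group
p <- xyplot(Mean ~ Age, groups = Gender, data = malta.sim,
            type = c("p", "r"),
            auto.key = list(points = TRUE, lines = TRUE))
print(p)
```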

Of course, you can plot only the regression lines by omitting "p" from the type= argument:

xyplot(Mean ~ Age, groups=Gender, data=malta, type=c("r"), auto.key=TRUE)

### 4 Multipanel conditioning

#### 4.1 Conditioning on categorical variables

Multipanel conditioning is a very powerful tool for data exploration, especially for uncovering and understanding interactions between variables. When conditioning on categorical variables, the groups (categories) are plotted into different panels. Lattice automatically ensures equal scaling in all panels. Here is a box plot showing score by native language of the father:

bwplot(Mean ~ NL.father, data=malta)

For a comparison of male and female subjects we can condition on gender. The “ | ” sign introduces a conditioning variable:

bwplot(Mean ~ NL.father | Gender, data=malta)

Sometimes a rearrangement of the independent variables can be more revealing. We should be flexible and try different combinations:

bwplot(Mean ~ Gender | NL.father, data=malta)

Here is a scatterplot (again) showing SCORE by AGE:

xyplot(Mean ~ Age, data=malta, type=c("p", "r"))

We can condition on GENDER:

xyplot(Mean ~ Age | Gender, data=malta, type=c("p", "r"))

#### 4.2 Conditioning on quantitative variables

We can also condition on quantitative variables. Lattice uses a concept called shingles to do this. A quantitative variable can be divided into a specified number of shingles, which are overlapping categories with an equal number of observations. The function equal.count() is used to create shingles. The arguments number= and overlap= control the number of shingles and the amount of overlap. You can look at the result using the function plot():

Age.shingles <- equal.count(malta$Age, number=5, overlap=1/4)
plot(Age.shingles)

6 shingles with little overlap:

Age.shingles <- equal.count(malta$Age, number=6, overlap=0)
plot(Age.shingles)

4 shingles with more overlap:
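
A sketch of this case, with overlap=1/2 as an assumed value chosen purely for illustration. It also shows how the resulting shingle could be used as a conditioning variable. The data frame malta.sim is a simulated stand-in for the malta data (hypothetical values).

```r
library(lattice)

# Simulated stand-in for the malta data (hypothetical values)
set.seed(1)
malta.sim <- data.frame(Age = runif(80, min = 18, max = 70))
malta.sim$Mean <- 3 + 0.01 * malta.sim$Age + rnorm(80, sd = 0.4)

# overlap = 1/2 is an assumption: neighbouring shingles share about half their range
Age.shingles <- equal.count(malta.sim$Age, number = 4, overlap = 1/2)
plot(Age.shingles)

# Condition on the shingle: one density plot of the scores per age range
p <- densityplot(~ Mean | Age.shingles, data = malta.sim, plot.points = FALSE)
print(p)
```

Each panel then shows the observations whose age falls into the corresponding interval; because the intervals overlap, some observations appear in two panels.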

### 6 Categorical variables

#### Data

We first need to create a hypothetical dataset. Speakers of 5 different varieties were asked which verb form of (to) prove they find acceptable in sentences like “History has prov** him wrong”. The table lists the frequencies of the three possible answers.

prove.opinion <- rbind("American"  = c(28, 22, 44),
                       "British"   = c(83, 22, 17),
                       "Nigerian"  = c(19, 54, 34),
                       "Indian"    = c(23, 14, 4),
                       "Singapore" = c(46, 22, 6))
colnames(prove.opinion) <- c("proved", "either", "proven")
prove.opinion
##           proved either proven
## American      28     22     44
## British       83     22     17
## Nigerian      19     54     34
## Indian        23     14      4
## Singapore     46     22      6

#### 6.1 Spine plot

Spineplots are drawn with the function spineplot().

spineplot(prove.opinion)

#### 6.2 Mosaic plot

The function mosaicplot() draws mosaic plots. For 2-dimensional tables, they are very similar to spineplots. However, they offer additional options.

mosaicplot(prove.opinion)

We can change the orientation of the x-axis labels:

mosaicplot(prove.opinion, las=4)

And we can add shades of grey to distinguish between the categories of the dependent variable (verb form):

mosaicplot(prove.opinion, las=4, col=c("grey90", "white", "grey60"))

The mosaicplot() function allows us to use residual-based shading to see which cells differ from the expected counts:

mosaicplot(prove.opinion, las=4, shade=TRUE)
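
To see the numbers behind the shading, the expected counts and residuals can be computed by hand. The sketch below assumes the default settings of mosaicplot(), where the shading is based on Pearson residuals under independence of variety and verb form; the object names expected and pearson are introduced here for illustration only.

```r
# The hypothetical dataset from above
prove.opinion <- rbind("American"  = c(28, 22, 44),
                       "British"   = c(83, 22, 17),
                       "Nigerian"  = c(19, 54, 34),
                       "Indian"    = c(23, 14, 4),
                       "Singapore" = c(46, 22, 6))
colnames(prove.opinion) <- c("proved", "either", "proven")

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(prove.opinion), colSums(prove.opinion)) / sum(prove.opinion)

# Pearson residuals: (observed - expected) / sqrt(expected)
pearson <- (prove.opinion - expected) / sqrt(expected)
round(pearson, 1)
```

Cells with large positive residuals occur more often than expected, cells with large negative residuals less often. With shade=TRUE the shading changes at absolute values of 2 and 4.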