Graphical methods for data analysis with R (lattice) by Lukas Sönning
Lukas Soenning
Lehrstuhl für englische Sprachwissenschaft einschließlich Sprachgeschichte
Research seminar
January 8th, 2015
This short tutorial shows how to construct useful plots in R using the lattice package (Sarkar 2014).
It is an online appendix to the presentation Graphical methods for data analysis.
Contents
- Plotting a single distribution
  - Histogram
  - Density plot
  - Box plot
  - Quantile plot
  - Dot plot
- Comparing distributions
  - Histogram
  - Density plot
  - Box plot
- Scatterplots
  - Regression lines and smoothers
  - Reversing the x-axis
  - Avoiding overplotting: Jittering and transparency
  - Superposition
- Multipanel conditioning
  - Conditioning on categorical variables
  - Conditioning on quantitative variables
  - Conditioning on two variables
- Multivariate plots
  - Scatterplot matrix
  - Parallel coordinates plot
- Categorical variables
  - Spine plot
  - Mosaic plot
Preparations: R
Install the lattice package.
install.packages("lattice")
Load it into R.
library(lattice)
The following code changes the settings to produce black and white output (and a transparent background for the strips of the panels).
bw.theme <- canonical.theme(color = FALSE)
bw.theme$strip.background$col <- "transparent"
lattice.options(default.theme = bw.theme)
Preparations: Data
Read the two data files into objects named malta and items (adjust the file paths to your own system).
malta <- read.csv(file="C:/Users/ba4rh5/Desktop/malta.csv")
items <- read.csv(file="C:/Users/ba4rh5/Desktop/items.csv")
Quantitative variables
1 Plotting a single distribution
1.1 Histogram
Histograms can be drawn with the function histogram(). A histogram of the distribution of the mean questionnaire scores:
histogram(malta$Mean)
You can control the number of bins with the nint= argument:
histogram(malta$Mean, nint=30)
1.2 Density plot
Density plots avoid the arbitrary choice of intervals for binning the data. They are essentially smoothed histograms.
densityplot(malta$Mean)
The smoothness of the density plot can be controlled with the bandwidth parameter, which can be set with the argument bw=.
densityplot(malta$Mean, bw=.07)
The following arguments suppress the data points and add a reference line at zero:
densityplot(malta$Mean, bw=.07, plot.points=FALSE, ref=TRUE)
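To get a feel for sensible bw= values, it can help to look at the bandwidth R would pick on its own. The following is only a sketch: it uses the built-in faithful data as a stand-in for malta$Mean and assumes that densityplot() passes bw= on to stats::density(), whose default rule is bw.nrd0(). The names bw_default and bw_narrow are made up for this illustration.

```r
library(lattice)

x <- faithful$eruptions          # built-in data, stand-in for malta$Mean

bw_default <- bw.nrd0(x)         # default bandwidth rule of stats::density()
bw_narrow  <- bw_default / 2     # smaller bandwidth: a wigglier curve

p_default <- densityplot(~ x, bw = bw_default, plot.points = FALSE, ref = TRUE)
p_narrow  <- densityplot(~ x, bw = bw_narrow,  plot.points = FALSE, ref = TRUE)

# Show both versions side by side
print(p_default, split = c(1, 1, 2, 1), more = TRUE)
print(p_narrow,  split = c(2, 1, 2, 1))
```

Halving or doubling the default is usually a quicker way to explore the smoothness than guessing absolute values.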
1.3 Box plot
Box plots can be drawn with the function bwplot().
bwplot(malta$Mean)
1.4 Quantile plot
Quantile plots can be drawn with the function qqmath(). The additional argument distribution="qunif" is necessary to draw the type of quantile plot shown in the presentation.
qqmath(items$Mean, distribution="qunif")
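What this kind of quantile plot shows can be reproduced by hand: the sorted data are plotted against their f-values (the fraction of the data at or below each value). The sketch below uses the built-in faithful data as a stand-in for items$Mean and assumes that qqmath() uses the positions returned by ppoints(), which is its default as far as we know. Since qunif() maps every f-value to itself, both panels should look the same.

```r
library(lattice)

x <- faithful$waiting                 # built-in data, stand-in for items$Mean

f_values <- ppoints(length(x))        # plotting positions between 0 and 1
x_sorted <- sort(x)                   # the observed quantiles

# Manual version: sorted data against their f-values
p_manual <- xyplot(x_sorted ~ f_values, xlab = "f-value", ylab = "waiting")

# lattice version: qunif() leaves the f-values unchanged
p_qq <- qqmath(~ waiting, data = faithful, distribution = qunif)

print(p_manual, split = c(1, 1, 2, 1), more = TRUE)
print(p_qq,     split = c(2, 1, 2, 1))
```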
1.5 Dot plot
Dot plots are a very flexible tool for data visualization. In their most basic form, they plot data points with labels. Here we can use them to plot the distribution of the questionnaire items including their labels. The additional function reorder() sorts the items in increasing order.
dotplot(reorder(BrE, Mean) ~ Mean, data=items)
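To see what reorder() does, it helps to inspect the factor it returns. The following sketch uses the built-in mtcars data instead of the questionnaire items; the data frame cars_df and its columns are made up for this illustration.

```r
library(lattice)

# built-in data: one label (car) and one value (mpg) per row
cars_df <- data.frame(car = rownames(mtcars), mpg = mtcars$mpg)

# reorder() returns a factor whose levels are sorted by mpg
car_sorted <- reorder(cars_df$car, cars_df$mpg)
head(levels(car_sorted), 3)    # the three cars with the lowest mpg

p <- dotplot(reorder(car, mpg) ~ mpg, data = cars_df)
print(p)
```

Without reorder(), the labels would appear in alphabetical order, which makes the shape of the distribution much harder to see.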
2 Comparing distributions
2.1 Histogram
We can use histograms to compare two groups. Lattice makes it very easy to compare groups with different panels. The “ | ” sign indicates the variable that is used to split the data into different panels.
histogram(~ Age | Gender, data=malta)
2.2 Density plot
The advantage of density plots is that we can plot the two distributions into the same panel, which makes direct comparisons easier. The groups= argument specifies the groups.
densityplot(~Mean, groups=Gender, data=malta, plot.points=FALSE, ref=TRUE)
2.3 Box plot
The box plot is a very useful method for comparing two or more distributions.
bwplot(Mean ~ NL.father, data=malta)
The variable to the left of the “ ~ ” sign (tilde) is plotted on the y-axis. Swap the two variables to draw a horizontal plot.
bwplot(NL.father ~ Mean, data=malta)
3 Scatterplots
3.1 Regression lines and smoothers
Scatterplots can be drawn with the function xyplot(). The variable to the left of the “ ~ ” sign (tilde) is plotted on the y-axis.
xyplot(Mean ~ Age, data=malta)
Adding a regression line is simple. The type= argument specifies which elements appear in the scatterplot; here "p" draws the points and "r" adds a straight regression line:
xyplot(Mean ~ Age, data=malta, type=c("p", "r"))
Using only “r” omits the points:
xyplot(Mean ~ Age, data=malta, type=c("r"))
“smooth” adds a scatterplot smoother:
xyplot(Mean ~ Age, data=malta, type=c("p", "smooth"))
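The line drawn by "r" is a least-squares fit, and "smooth" adds a loess curve. If you also want the numbers behind the straight line, you can fit the same model with lm(). A small sketch with the built-in cars data standing in for Mean ~ Age:

```r
library(lattice)

fit <- lm(dist ~ speed, data = cars)   # built-in data, stand-in for Mean ~ Age
coef(fit)                              # intercept and slope of the straight line

# Points, regression line and smoother in one panel
p <- xyplot(dist ~ speed, data = cars, type = c("p", "r", "smooth"))
print(p)
```

Where the smoother departs clearly from the straight line, a linear description of the relationship may be too simple.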
3.2 Reversing the x-axis
You can reverse the x-axis by passing the axis limits in decreasing order to the xlim= argument:
xyplot(Mean ~ Age, data=malta, type=c("p", "r"), xlim=rev(extendrange(malta$Age)))
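The pieces of the xlim= expression are easy to check on their own. In this sketch the vector age holds a few hypothetical ages; extendrange() widens the data range by 5% on each side by default, and rev() puts the larger limit first.

```r
age <- c(20, 35, 50, 60)              # hypothetical ages, for illustration only

lims     <- extendrange(age)          # data range, widened by 5% on each side
lims_rev <- rev(lims)                 # decreasing limits reverse the axis

range(age)                            # 20 60
lims                                  # 18 62
lims_rev                              # 62 18
```

Using extendrange() rather than range() keeps the outermost points from sitting directly on the panel border.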
3.3 Avoiding overplotting: Jittering and transparency
If quantitative variables take on only a few different values, overplotting is an issue. It is difficult to judge where most observations concentrate.
xyplot(Preposition ~ Idiom, data=malta)
Jittering adds a bit of random noise to the variables:
xyplot(jitter(Preposition) ~ jitter(Idiom), data=malta)
For more random noise:
xyplot(jitter(Preposition, 3) ~ jitter(Idiom, 3), data=malta)
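How much noise the second argument of jitter() adds can be checked directly. A sketch with hypothetical ratings on a 1-5 scale (the vector x and the names noise_1 and noise_3 are made up for this illustration): for data spaced one unit apart, the default shifts each point by at most 0.2, and a factor of 3 by at most 0.6.

```r
set.seed(1)                         # make the random noise reproducible

x <- rep(1:5, each = 20)            # hypothetical ratings on a 1-5 scale

noise_1 <- jitter(x) - x            # default factor = 1
noise_3 <- jitter(x, 3) - x         # factor = 3, as in the call above

range(noise_1)                      # stays within -0.2 and 0.2
range(noise_3)                      # stays within -0.6 and 0.6
```

Note that with a factor of 3 the noise can exceed half the distance between neighbouring values, so the clouds of adjacent values start to overlap.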
Another solution is to use transparency. The argument alpha= sets the opacity of each point, from 0 (fully transparent) to 1 (fully opaque). As a rule of thumb, with alpha=.5 about 1/.5 = 2 overlapping points look black, and with alpha=.2 about 1/.2 = 5 overlapping points do. Jittering and transparency can be combined:
xyplot(jitter(Preposition, 3) ~ jitter(Idiom, 3), alpha=.2, data=malta)
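The 1/alpha figures are best read as an approximation. Assuming the graphics device blends overlapping points with standard "over" alpha compositing (an assumption; devices may differ), the combined opacity of n stacked points is 1 - (1 - alpha)^n. The helper stacked_opacity() below is made up for this illustration.

```r
# Combined opacity of n overlapping points, assuming "over" compositing
stacked_opacity <- function(alpha, n) 1 - (1 - alpha)^n

stacked_opacity(0.5, 2)       # 0.75
stacked_opacity(0.2, 5)       # about 0.67
stacked_opacity(0.2, 1:10)    # approaches 1 as more points overlap
```

Under this assumption, 1/alpha overlapping points look clearly dark but not fully black, so slightly higher alpha values may be needed for sparse data.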
3.4 Superposition
We can compare two groups by plotting them into the same panel. The argument groups= specifies which groups to compare:
xyplot(Mean ~ Age, groups=Gender, data=malta, type=c("p", "r"))
The argument auto.key=TRUE adds a legend. By default only the points are shown; adding the lines to the legend takes a little more work.
xyplot(Mean ~ Age, groups=Gender, data=malta, type=c("p", "r"), auto.key=TRUE)
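One way to get the lines into the legend is to pass a list instead of TRUE to auto.key=. The sketch below uses the built-in mtcars data, with transmission type (am) standing in for Gender; the group labels are added only for readability.

```r
library(lattice)

# built-in data; 'am' (0 = automatic, 1 = manual) is a stand-in for Gender
p <- xyplot(mpg ~ wt,
            groups = factor(am, labels = c("automatic", "manual")),
            data = mtcars, type = c("p", "r"),
            auto.key = list(points = TRUE, lines = TRUE, columns = 2))
print(p)
```

Setting points=FALSE in the same list gives a legend with lines only, which suits plots drawn with type="r".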
Of course, you can plot only the regression lines by omitting “p” from the type= argument:
xyplot(Mean ~ Age, groups=Gender, data=malta, type=c("r"), auto.key=TRUE)
4 Multipanel conditioning
4.1 Conditioning on categorical variables
Multipanel conditioning is a very powerful tool for data exploration, especially for uncovering and understanding interactions between variables. When conditioning on categorical variables, the groups (categories) are plotted into different panels. Lattice automatically ensures equal scaling in all panels. Here is a box plot showing score by native language of the father:
bwplot(Mean ~ NL.father, data=malta)
For a comparison of male and female subjects we can condition on gender. The “ | ” sign introduces a conditioning variable:
bwplot(Mean ~ NL.father | Gender, data=malta)
Sometimes a rearrangement of the independent variables can be more revealing. We should be flexible and try different combinations:
bwplot(Mean ~ Gender | NL.father, data=malta)
Here is a scatterplot (again) showing SCORE by AGE:
xyplot(Mean ~ Age, data=malta, type=c("p", "r"))
We can condition on GENDER:
xyplot(Mean ~ Age | Gender, data=malta, type=c("p", "r"))
4.2 Conditioning on quantitative variables
We can also condition on quantitative variables. Lattice uses a concept called shingles to do this. A quantitative variable can be divided into a specified number of shingles, which are overlapping categories with an equal number of observations. The function equal.count() is used to create shingles. The arguments number= and overlap= control the number of shingles and the amount of overlap. You can look at the result using the function plot():
Age.shingles <- equal.count(malta$Age, number=5, overlap=1/4)
plot(Age.shingles)
6 shingles with no overlap:
Age.shingles <- equal.count(malta$Age, number=6, overlap=0)
plot(Age.shingles)
4 shingles with more overlap:
Age.shingles <- equal.count(malta$Age, number=4, overlap=1/2)
plot(Age.shingles)
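Besides plotting a shingle, you can inspect its intervals directly. A sketch using the depth variable of the built-in quakes data as a stand-in for malta$Age (the object name depth_shingles is made up for this illustration):

```r
library(lattice)

# built-in data, stand-in for malta$Age
depth_shingles <- equal.count(quakes$depth, number = 4, overlap = 1/2)

levels(depth_shingles)     # lower and upper limit of each of the four intervals

plot(depth_shingles)       # the same intervals shown as bars
```

Looking at the limits makes it easier to describe the panels later, for instance when labelling them in a report.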
We can use the newly created object Age.shingles to condition on AGE. First, here is a boxplot showing SCORE by NL.FATHER:
bwplot(Mean ~ NL.father, data=malta)
We can condition on AGE:
bwplot(Mean ~ NL.father | Age.shingles, data=malta)
You can control the arrangement of the panels with the argument layout=, which takes the form c(columns, rows):
bwplot(Mean ~ NL.father | Age.shingles, data=malta, layout=c(4,1))
Different layout:
bwplot(Mean ~ NL.father | Age.shingles, data=malta, layout=c(1,4))
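The order of the two numbers in layout= is easy to mix up: the first is the number of columns, the second the number of rows. A sketch with the built-in mtcars data, where the number of cylinders (three levels) stands in for the conditioning variable; update() is used here to change the layout of an existing plot.

```r
library(lattice)

# built-in data; 'cyl' has three levels, so there are three panels
p_row <- bwplot(mpg ~ factor(am) | factor(cyl), data = mtcars,
                layout = c(3, 1))            # 3 columns, 1 row
p_col <- update(p_row, layout = c(1, 3))     # 1 column, 3 rows

print(p_row)
print(p_col)
```

A single row makes it easy to compare values along the shared y-axis; a single column does the same for a shared x-axis.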
4.3 Conditioning on two variables
Multipanel conditioning can also involve two conditioning variables:
bwplot(Mean ~ NL.father | Gender * Age.shingles, data=malta, layout=c(2,4))
5 Multivariate plots
5.1 Scatterplot matrix
A scatterplot matrix simultaneously plots several quantitative variables against each other. It is especially useful for investigating the association between several independent variables and for identifying outliers or clusters. For the present dataset it is not very revealing, since we only have one quantitative independent variable (AGE). We will illustrate its use with the mean scores on subsets of the questionnaire data. A scatterplot matrix is drawn with the function splom(). Use square brackets [] to indicate which columns (=variables) of the data frame you want to include:
splom(~malta[c(11:15)])
You can change the size of the plotting symbols:
splom(~malta[c(11:15)], cex=.6)
You can add a regression line easily:
splom(~malta[c(11:15)], cex=.6, type=c("p", "r"))
If overplotting is an issue, you can use transparency:
splom(~malta[c(11:15)], cex=.6, type=c("p", "r"), alpha=.3)
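Selecting columns by position breaks silently if the columns of the data frame are ever reordered. Selecting them by name is a little more robust. A sketch with the built-in iris data, since the column names of malta are not shown here:

```r
library(lattice)

# built-in data; the four measurement columns are selected by name
vars <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

p <- splom(~ iris[vars], cex = .6, type = c("p", "r"), alpha = .3)
print(p)
```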
5.2 Parallel coordinates plot
Parallel coordinates plots are another method for showing several quantitative variables in the same plot. The argument lty=1 selects solid lines for all observations:
parallelplot(~malta[c(11:15)], lty=1)
By default, every variable is scaled according to its own range, thus stretching from its minimum to its maximum. If variables are measured on a common scale (like here), you can add the argument common.scale=TRUE. The data now range from the overall minimum to the overall maximum:
parallelplot(~malta[c(11:15)], lty=1, common.scale=TRUE)
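The difference between the two scalings can be worked out by hand. This is a sketch of the arithmetic only, not of lattice's internal code, using the built-in iris measurements (all in centimetres) as a stand-in for the questionnaire scores; the object names are made up for this illustration.

```r
dat <- iris[1:4]     # built-in data, four measurements on a common scale

# Default: each variable is rescaled by its own minimum and maximum
own_scale <- sapply(dat, function(v) (v - min(v)) / (max(v) - min(v)))

# common.scale=TRUE: all variables share the overall minimum and maximum
lo <- min(dat)
hi <- max(dat)
common_scale <- sapply(dat, function(v) (v - lo) / (hi - lo))

apply(own_scale, 2, range)      # every column spans 0 to 1
apply(common_scale, 2, range)   # only the overall extremes reach 0 or 1
```

With a common scale, differences in level between the variables remain visible; with separate scales, only the shape of each profile can be compared.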
If overplotting is an issue you can use transparency:
parallelplot(~malta[c(11:15)], lty=1, common.scale=TRUE, alpha=.2)
We can use color to distinguish groups. Because we switched to a black-and-white theme at the start, we first have to unload the lattice package and load it again to restore the default color settings. We also add a legend.
detach(package:lattice, unload=TRUE)
library(lattice)
parallelplot(~malta[c(11:15)], lty=1, groups=malta$Gender, common.scale=TRUE, alpha=.2, auto.key=TRUE)
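An alternative to detaching and reloading the package is to switch the theme for a single plot with the par.settings= argument, which leaves the global settings untouched. A sketch with the built-in iris data, where Species stands in for Gender:

```r
library(lattice)

# built-in data; the color theme is applied to this plot only
p <- parallelplot(~ iris[1:4], groups = Species, data = iris,
                  lty = 1, common.scale = TRUE, alpha = .2,
                  auto.key = TRUE,
                  par.settings = canonical.theme(color = TRUE))
print(p)
```

This way a black-and-white default can stay in place for the rest of the session.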
6 Categorical variables
Data
We first need to create a hypothetical dataset. Speakers of 6 different varieties were asked which verb form of (to) prove they find acceptable in sentences like “History has prov** him wrong”. The table lists the frequencies of the three possible answers.
prove.opinion <- rbind("American"  = c(28, 22, 44),
                       "Canadian"  = c(14,  9, 24),
                       "British"   = c(83, 22, 17),
                       "Nigerian"  = c(19, 54, 34),
                       "Indian"    = c(23, 14,  4),
                       "Singapore" = c(46, 22,  6))
colnames(prove.opinion) <- c("proved", "either", "proven")
prove.opinion
##           proved either proven
## American      28     22     44
## Canadian      14      9     24
## British       83     22     17
## Nigerian      19     54     34
## Indian        23     14      4
## Singapore     46     22      6
6.1 Spine plot
Spine plots are drawn with the function spineplot(), which belongs to R's base graphics rather than to lattice.
spineplot(prove.opinion)
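The two quantities a spine plot encodes can be computed directly from the table: the width of each bar reflects the number of respondents per variety, and the segments within a bar show the share of each answer. A sketch that repeats the hypothetical table from above (the names row_totals and row_props are made up for this illustration):

```r
prove.opinion <- rbind("American"  = c(28, 22, 44),
                       "Canadian"  = c(14,  9, 24),
                       "British"   = c(83, 22, 17),
                       "Nigerian"  = c(19, 54, 34),
                       "Indian"    = c(23, 14,  4),
                       "Singapore" = c(46, 22,  6))
colnames(prove.opinion) <- c("proved", "either", "proven")

row_totals <- rowSums(prove.opinion)                  # determine the bar widths
row_props  <- prop.table(prove.opinion, margin = 1)   # determine the segment heights

row_totals
round(row_props, 2)
```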
6.2 Mosaic plot
The function mosaicplot() draws mosaic plots. For 2-dimensional tables, they are very similar to spineplots. However, they offer additional options.
mosaicplot(prove.opinion)
We can change the orientation of the x-axis labels with the las= argument (las=2 draws them perpendicular to the axis):
mosaicplot(prove.opinion, las=2)
And we can add shades of grey to distinguish between the categories of the dependent variable (verb form):
mosaicplot(prove.opinion, las=2, col=c("grey90", "white", "grey60"))
The mosaicplot() function allows us to use residual-based shading to see which cells differ from the expected counts:
mosaicplot(prove.opinion, las=2, shade=TRUE)
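The residuals behind the shading can also be computed by hand. The sketch below assumes that the shading is based on Pearson residuals from the independence model and that cells with an absolute residual below 2 are left unshaded, which is our reading of the mosaicplot() defaults. It repeats the hypothetical table from above; the names expected and pearson are made up for this illustration.

```r
prove.opinion <- rbind("American"  = c(28, 22, 44),
                       "Canadian"  = c(14,  9, 24),
                       "British"   = c(83, 22, 17),
                       "Nigerian"  = c(19, 54, 34),
                       "Indian"    = c(23, 14,  4),
                       "Singapore" = c(46, 22,  6))
colnames(prove.opinion) <- c("proved", "either", "proven")

# Expected counts if variety and verb form were independent
expected <- outer(rowSums(prove.opinion), colSums(prove.opinion)) /
  sum(prove.opinion)

# Pearson residuals: (observed - expected) / sqrt(expected)
pearson <- (prove.opinion - expected) / sqrt(expected)

round(pearson, 1)
abs(pearson) > 2      # cells that deviate notably from independence
```

Positive residuals mark answers that are more frequent than expected in a variety, negative residuals those that are rarer.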