The grammar of graphics
Datasets and mapping
Geometries
Statistical transformation and plotting distribution
Position adjustment and scales
Coordinates and themes
Facets and custom plots
R and Python are the most requested programming languages for data scientists.
ggplot2, a visualization package for R, is becoming an industry standard for visualization.
If you know the grammar of a language, you can create new sentences.
Likewise, if you know the grammar of graphics behind ggplot2, you can create new graphics or tailor a plot to suit your needs or preferences.
Each geom can display certain aesthetics.
Some of them are required.
Line plots
Aesthetics of geom_path, geom_line, geom_step:
We will use the babynames data from the babynames package for demonstration.
Rows: 1,924,665
Columns: 5
$ year <dbl> 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880,…
$ sex <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", …
$ name <chr> "Mary", "Anna", "Emma", "Elizabeth", "Minnie", "Margaret", "Ida",…
$ n <int> 7065, 2604, 2003, 1939, 1746, 1578, 1472, 1414, 1320, 1288, 1258,…
$ prop <dbl> 0.07238359, 0.02667896, 0.02052149, 0.01986579, 0.01788843, 0.016…
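As a minimal sketch of a line plot with this data, the code below tracks a single name over time. The choice of the name "Mary" and the object names mary and p_line are illustrative assumptions, not taken from the slides; the babynames package must be installed.

```r
library(ggplot2)
library(babynames)

# Keep one name and one sex so that each year contributes a single point
mary <- subset(babynames, name == "Mary" & sex == "F")

# geom_line connects the observations in order of x (here: year)
p_line <- ggplot(mary, aes(x = year, y = prop)) +
  geom_line()
p_line
```

Swapping geom_line for geom_step or geom_path keeps the same mappings and changes only how the points are connected.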
Line plots: practice exercise
Scatterplots
We can derive other plots from a scatterplot, for example by adding geom_line.
To avoid overlapping points, use geom_jitter to adjust position.
Scatter plot: practice exercise
Use the mpg data to recreate the plot shown on the right.
ggplot(data = <dataset>,
       mapping = aes(x = <varX>, y = <varY>, ...)) +
  geom_<function>(..., stat = <stat>, position = <position>) +
  stat_<function>(...)
Every layer has a statistical transformation associated with it.
Geoms control the way the plot looks.
Stats control the way the data is transformed.
Geoms and stats
Every geometry has a default stat:
geom_line: default stat is stat_identity
geom_point: default stat is stat_identity
geom_smooth: default stat is stat_smooth
Each stat has a default geom:
stat_smooth: default geom is geom_smooth
stat_count: default geom is geom_bar
stat_sum: default geom is geom_point
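The pairing of geoms and stats can be checked directly. The sketch below, using the mpg data that ships with ggplot2, builds the same bar chart once from the geom side and once from the stat side; the object names are illustrative.

```r
library(ggplot2)

# geom_bar uses stat_count by default ...
p_geom <- ggplot(mpg, aes(x = class)) + geom_bar()

# ... and stat_count uses geom_bar by default
p_stat <- ggplot(mpg, aes(x = class)) + stat_count()

# layer_data() returns the data after the statistical transformation
counts_geom <- layer_data(p_geom)$count
counts_stat <- layer_data(p_stat)$count
identical(counts_geom, counts_stat)
```

Both layers carry the same computed counts, so the two plots look the same.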
Interesting stats
stat_smooth(geom_smooth)
stat_unique(geom_point)
stat_summary(geom_pointrange)
stat_count(geom_bar)
stat_bin(geom_histogram)
stat_density(geom_density)
stat_boxplot(geom_boxplot)
stat_ydensity(geom_violin)
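As one illustration from this list, the sketch below uses stat_summary with its default geom, geom_pointrange. The choice of mean, min, and max as summary functions is an assumption made for the example; the fun, fun.min, and fun.max arguments require ggplot2 3.3.0 or later.

```r
library(ggplot2)

# One point-range per drive type: point at the mean, range from min to max
p_summary <- ggplot(mpg, aes(x = drv, y = hwy)) +
  stat_summary(fun = mean, fun.min = min, fun.max = max)

# The stat computes y, ymin, and ymax for each group
summary_data <- layer_data(p_summary)
summary_data[, c("x", "ymin", "y", "ymax")]
```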
Computed aesthetics
When a stat performs a transformation, new variables are created.
e.g., in geom_histogram, the computed variables are:
count - number of points in bin
density - density of points in bin, scaled to integrate to 1
ncount - count, scaled to maximum of 1
ndensity - density, scaled to maximum of 1
To access:
+ old way: ..<name>..
+ new way: stat(<name>), or after_stat(<name>) in ggplot2 3.3.0 and later
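A computed variable can be mapped to an aesthetic inside aes(). The sketch below maps the computed density to y using after_stat(), the accessor available in ggplot2 3.3.0 and later; the number of bins is an arbitrary choice for illustration.

```r
library(ggplot2)

# Map the computed variable `density` (instead of the default `count`) to y
p_hist <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(aes(y = after_stat(density)), bins = 20)

# Bar height times bar width, summed over all bins, gives the total area
hist_data <- layer_data(p_hist)
total_area <- sum(hist_data$density * (hist_data$xmax - hist_data$xmin))
total_area
```

The total area is 1 up to floating-point error, which is what "scaled to integrate to 1" means.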
Ways to look at distributions:
Histograms
Frequency polygons
Density plots
Boxplots
Violin plots
Density plots
geom_density
a smoothed version of the frequency polygon
different from geom_area, where aesthetic y is needed
geom_density_ridges
available in the ggridges package
creates ridgeline plots
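A minimal sketch of a density plot with the mpg data follows. Only x is mapped, because stat_density computes the y values; the fill mapping and the alpha value are illustrative choices.

```r
library(ggplot2)

# One smoothed density curve per drive type, drawn semi-transparently
p_dens <- ggplot(mpg, aes(x = hwy, fill = drv)) +
  geom_density(alpha = 0.5)

# The computed variable `density` is what ends up on the y axis
dens_data <- layer_data(p_dens)
head(dens_data[, c("x", "density", "group")])
```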
Practice exercise
Use the mpg data to recreate the plot shown on the right. You may use the following parameters:
bins = 10
fill = "cadetblue3"
alpha = 0.5
Boxplot
geom_boxplot
Practice exercise
Use the mpg data to recreate the plot shown on the right.
ggplot(data = <dataset>,
       mapping = aes(x = <varX>, y = <varY>, ...)) +
  geom_<function>(..., stat = <stat>, position = <position>) +
  stat_<function>(...) +
  scale_<aesthetic>_<type>()
Scales control how data values are translated to visual properties.
They can override the defaults for axes, legends, and the transformation of data to aesthetics.
Scales belong to one of these types:
Naming scheme:
scale + aesthetic + name of scale
scale_*_continuous()
scale_*_discrete()
scale_*_manual()
Position scales
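As a hedged sketch of position scales, the code below follows the naming scheme scale_x_continuous and scale_y_continuous to set axis titles, breaks, and limits. The particular titles, breaks, and limits are assumptions chosen for illustration.

```r
library(ggplot2)

p_pos <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  # x axis: custom title and a break at every whole litre
  scale_x_continuous(name = "Engine displacement (litres)", breaks = 2:7) +
  # y axis: custom title and fixed limits
  scale_y_continuous(name = "Highway miles per gallon", limits = c(0, 50))

# The limits are stored on the y scale of the built plot
y_limits <- layer_scales(p_pos)$y$get_limits()
y_limits
```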
Color scales
Continuous
Binned
Discrete
Viridis family
Colorbrewer family
Continuous
Binned
Discrete
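The two colour families can be sketched as follows. In ggplot2 3.3.0 and later, the viridis family uses the suffixes _c (continuous), _b (binned), and _d (discrete), while the ColorBrewer family uses distiller (continuous), fermenter (binned), and brewer (discrete). The palettes and mapped variables are illustrative choices.

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = displ, y = hwy))

# Viridis family
p_vir_c <- base + geom_point(aes(color = cty)) + scale_color_viridis_c()
p_vir_b <- base + geom_point(aes(color = cty)) + scale_color_viridis_b()
p_vir_d <- base + geom_point(aes(color = drv)) + scale_color_viridis_d()

# ColorBrewer family
p_brw_c <- base + geom_point(aes(color = cty)) +
  scale_color_distiller(palette = "Blues")
p_brw_b <- base + geom_point(aes(color = cty)) +
  scale_color_fermenter(palette = "Blues")
p_brw_d <- base + geom_point(aes(color = drv)) +
  scale_color_brewer(palette = "Set1")

# A discrete scale assigns one colour per level of drv
n_colours <- length(unique(layer_data(p_brw_d)$colour))
n_colours
```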
scale_*_manual
For example
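A minimal sketch of a manual scale is shown below. The colour names are arbitrary choices for illustration; the names of the values vector must match the levels of the mapped variable (here the drv column of mpg).

```r
library(ggplot2)

# Named vector: level of drv -> colour
drv_colours <- c("4" = "firebrick", "f" = "steelblue", "r" = "darkorange")

p_manual <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  scale_color_manual(values = drv_colours)

# Colours actually used in the built layer
used_colours <- unique(layer_data(p_manual)$colour)
used_colours
```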
Shortcuts
All layers have a position adjustment that resolves overlapping geoms.
Override the default using the position argument to a geom_ or stat_ function.
position_jitter
?position_jitter
Adds random noise to the data points to avoid overlaps
Useful for scatterplots
geom_jitter
Parameters
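The main parameters of position_jitter are width, height, and seed. The sketch below uses arbitrary values for all three; setting seed makes the random noise reproducible.

```r
library(ggplot2)

# Explicit position adjustment with a fixed seed
p_jitter <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(position = position_jitter(width = 0.3, height = 0.3, seed = 1))

# geom_jitter is a shortcut for geom_point(position = "jitter")
p_jitter2 <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_jitter(width = 0.3, height = 0.3)

# With a seed, building the plot twice gives the same jittered positions
x_first <- layer_data(p_jitter)$x
x_second <- layer_data(p_jitter)$x
identical(x_first, x_second)
```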
position_stack()
?position_stack
Stacks geoms on top of each other.
position_fill()
?position_fill
Stacks geoms on top of each other and standardizes the height.
Parameters
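The difference between stacking and filling can be sketched with a bar chart of the mpg data; the variables mapped to x and fill are illustrative choices.

```r
library(ggplot2)

base_bar <- ggplot(mpg, aes(x = class, fill = drv))

# Stacked bars: total bar height is the number of cars in each class
p_stack <- base_bar + geom_bar(position = "stack")

# Filled bars: every bar is rescaled to height 1, showing proportions
p_fill <- base_bar + geom_bar(position = "fill")

top_stack <- max(layer_data(p_stack)$ymax)
top_fill <- max(layer_data(p_fill)$ymax)
c(stack = top_stack, fill = top_fill)
```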
position_dodge
?position_dodge
Preserves the vertical position of a geom while adjusting the horizontal position.
Parameters
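A minimal sketch of dodging follows. The preserve argument of position_dodge (ggplot2 3.0.0 and later) controls whether the width of a whole group ("total", the default) or of a single bar ("single") is kept constant; the mapped variables are illustrative.

```r
library(ggplot2)

# Bars for each drive type are placed side by side within a class
p_dodge <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar(position = position_dodge(preserve = "single"))

# Vertical positions are untouched: every bar starts at 0
# and its height is still the raw count
dodge_data <- layer_data(p_dodge)
dodge_data[, c("x", "ymin", "ymax", "count")]
```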
Practice exercise
syntax
Coordinates are sets of values that locate points in space
coord_cartesian()
coord_flip()
coord_polar()
coord_cartesian()
Zooming into plots
setting limits using scale: data outside the limits are dropped
setting limits using coordinate system: the proper way to zoom; does not eliminate data outside the plot
Parameters
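The two ways of setting limits can be compared with a small sketch. The limits c(3, 5) are an arbitrary choice; xlim and ylim are the relevant parameters of coord_cartesian.

```r
library(ggplot2)

p_base <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()

# Scale limits: values outside the limits are replaced by NA,
# so those points are lost before drawing
p_scale <- p_base + scale_x_continuous(limits = c(3, 5))

# Coordinate limits: the data are left intact, only the view is zoomed
p_coord <- p_base + coord_cartesian(xlim = c(3, 5))

n_lost_scale <- sum(is.na(layer_data(p_scale)$x))
n_lost_coord <- sum(is.na(layer_data(p_coord)$x))
c(scale = n_lost_scale, coord = n_lost_coord)
```

This matters most when a layer computes a summary such as a smooth line or a boxplot, because dropped data change the result.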
syntax
Facets divide a plot into subplots based on the values of one or more discrete variables.
facet_wrap()
facet_grid()
facet_grid
produces a 2d grid of panels defined by variables which form the rows and columns
. ~ a spreads the values of a across columns
b ~ . spreads the values of b down the rows
a ~ b spreads a down rows and b across columns
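A short sketch of the facet_grid formula with the mpg data is given below. In the formula, the variable on the left of ~ forms the rows and the variable on the right forms the columns; the choice of drv and cyl is illustrative.

```r
library(ggplot2)

p_base <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()

# Columns only: one column of panels per value of cyl
p_cols <- p_base + facet_grid(. ~ cyl)

# Rows only: one row of panels per value of drv
p_rows <- p_base + facet_grid(drv ~ .)

# Both: drv down the rows, cyl across the columns
p_grid <- p_base + facet_grid(drv ~ cyl)

# facet_wrap lays the panels of a single variable out in a wrapped ribbon
p_wrap <- p_base + facet_wrap(~ class)
```

With three values of drv and four values of cyl, p_grid has 3 rows and 4 columns of panels.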
Options
ggplot2 library: theme_gray(), theme_bw(), theme_light(), theme_classic(), ...
ggthemes
Options
Using the built-in themes from the ggplot2 library
Using another package, e.g., ggthemes
theme_economist_white()
theme_fivethirtyeight()
theme_stata()
theme_tufte()
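Applying a theme is a matter of adding it to the plot. The sketch below uses only built-in ggplot2 themes so that it runs without extra packages; the legend adjustment is an illustrative assumption, and the ggthemes line is left as a comment because it requires that package to be installed.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) + geom_point()

# A complete built-in theme replaces the default theme_gray()
p_bw <- p + theme_bw()

# theme() tweaks individual elements on top of a complete theme
p_custom <- p + theme_classic() + theme(legend.position = "bottom")

# With ggthemes installed, a theme from that package is applied the same way:
# p + ggthemes::theme_tufte()
```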
Let’s apply what we have covered!
Use the mpg dataset to recreate the plot.
But first, we need to do some data wrangling!
Use the updated mpg data to mimic the plot.
AgEc 211: Statistical Methods