
Lessons

    • Overview of EDA
    • Centrality and variability
    • Amounts and proportions
    • Comparisons
    • Trends

Lesson 1: What is EDA?

EDA: an introduction

  • EDA is an iterative cycle that helps you understand what your data says. It involves:

    • Generating questions about your data

    • Searching for answers by visualizing, transforming, and modeling your data

    • Using what you learn to refine your questions and/or generate new questions

EDA: an introduction

Your goal during EDA is to develop an understanding of your data.

“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” — John Tukey

EDA: two useful questions

There is no rule about which questions you should ask to guide your research. However, two questions are particularly useful:

  1. What type of variation occurs within my variables?

  2. What type of covariation occurs between my variables?
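
As a rough illustration of these two questions, the sketch below uses R's built-in `mtcars` data. That dataset is an assumption chosen for convenience and is not part of this lesson's data; the variable names are likewise only illustrative.

```r
# Illustration only: mtcars ships with base R and stands in for "your data".
mpg <- mtcars$mpg   # fuel economy (miles per gallon)
wt  <- mtcars$wt    # weight (1000 lbs)

# Question 1: variation within a single variable
mpg_summary <- summary(mpg)   # min, quartiles, mean, max
mpg_sd      <- sd(mpg)        # spread around the mean

# Question 2: covariation between two variables
mpg_wt_cor <- cor(mpg, wt)    # linear association between mpg and weight

print(mpg_summary)
print(mpg_sd)
print(mpg_wt_cor)
```

A histogram of one variable or a scatterplot of two would usually be the next step, since a single number rarely tells the whole story.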

Is EDA a tool for discovery or confirmation?

  • Discovery

  • Confirmation

When you begin to explore data, is it better to formulate one or two high-quality questions to ask, or many, many questions to explore?

  • One or two high-quality questions

  • Many, many questions

Lesson 2: Centrality and variability

Centrality (aka the “Average” value)

A single number representing the middle of a set of numbers


  • Mean: \(\frac{\text{Sum of values}}{\text{Number of values}}\)

  • Median: “Middle” value (50% of data above & below)

  • Mode: Most frequent value (usually for categorical data)
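
The three measures above can be sketched in base R as follows. The vectors are made-up illustrative data, and `stat_mode()` is a hypothetical helper written for this example, since base R does not provide a statistical mode function.

```r
# Hypothetical helper: base R has no built-in statistical mode
# (base::mode() reports an object's storage mode instead).
stat_mode <- function(x) {
  counts <- table(x)                      # frequency of each distinct value
  names(counts)[counts == max(counts)]    # every value tied for most frequent
}

values  <- c(3, 3, 4, 5, 5, 6, 6, 7, 8, 20)          # illustrative numbers
colours <- c("red", "blue", "red", "green", "red")   # illustrative categories

mean_value   <- sum(values) / length(values)  # sum of values / number of values
median_value <- median(values)                # half the data lie on each side
mode_value   <- stat_mode(colours)            # most frequent category

print(c(mean = mean_value, median = median_value))
print(mode_value)
```

Note that `stat_mode()` returns all tied values as character labels, which is one of several reasonable ways to handle ties.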

Centrality (aka the “Average” value)

The mean is not always the “best” choice


wildlife_impacts |>
  filter(!is.na(height)) |>   # drop records with no recorded height
  summarise(
    mean = mean(height),
    median = median(height))
# A tibble: 1 × 2
   mean median
  <dbl>  <dbl>
1  984.     50

Percent of data below mean:

percentiles <- ecdf(wildlife_impacts$height)
meanP <- percentiles(mean(wildlife_impacts$height, na.rm = TRUE))
paste0(round(100*meanP, 1), '%')
[1] "73.9%"

Variability (“spread”)

  • Standard deviation: typical distance of values from the mean

    • \(s = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \bar{x})^2}{N - 1}}\)
  • Interquartile range (IQR): \(Q_3 - Q_1\) (middle 50% of data)

  • Range: max - min
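
To connect the formulas above to R's built-in functions, here is a minimal sketch that computes each measure by hand. It uses the warehouse B shipping times from this lesson; the variable names are illustrative.

```r
x <- c(1, 1, 1, 3, 3, 4, 5, 5, 5, 6, 7, 10)  # warehouse B days to ship
n <- length(x)

# Standard deviation written out from the formula, with N - 1 in the denominator
s_manual <- sqrt(sum((x - mean(x))^2) / (n - 1))

# IQR as Q3 - Q1, using R's default quantile method
q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
iqr_manual <- q[2] - q[1]

# Range as max - min (note that range() itself returns c(min, max))
range_manual <- max(x) - min(x)

print(c(sd = s_manual, iqr = iqr_manual, range = range_manual))
```

The hand-written values should agree with `sd(x)` and `IQR(x)`. Other quantile methods (the `type` argument of `quantile()`) can give slightly different quartiles on small samples.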

Variability (“spread”)

  • Complaints are coming in about orders shipped from warehouse B, so you collect some data.

  • … here averages are misleading

days_to_ship
# A tibble: 12 × 3
   order warehouseA warehouseB
   <int>      <dbl>      <dbl>
 1     1          3          1
 2     2          3          1
 3     3          3          1
 4     4          4          3
 5     5          4          3
 6     6          4          4
 7     7          5          5
 8     8          5          5
 9     9          5          5
10    10          5          6
11    11          5          7
12    12          5         10
days_to_ship |> 
  pivot_longer(-order, names_to = "warehouse", values_to = "days") |> 
  group_by(warehouse) |>
  summarise(
    mean = mean(days),
    median = median(days))
# A tibble: 2 × 3
  warehouse   mean median
  <chr>      <dbl>  <dbl>
1 warehouseA  4.25    4.5
2 warehouseB  4.25    4.5

Variability (“spread”)

  • Complaints are coming in about orders shipped from warehouse B, so you collect some data:

  • Variability reveals the difference in days to ship

days_to_ship |> 
  pivot_longer(-order, names_to = "warehouse", values_to = "days") |> 
  group_by(warehouse) |>
  summarise(
    mean = mean(days),
    sd = sd(days),
    iqr = IQR(days),
    range = max(days) - min(days))
# A tibble: 2 × 5
  warehouse   mean    sd   iqr range
  <chr>      <dbl> <dbl> <dbl> <dbl>
1 warehouseA  4.25 0.866  1.25     2
2 warehouseB  4.25 2.70   2.75     9


Outliers

Mean and standard deviation are sensitive to outliers

  • Outliers: values below \(Q_1 - 1.5 \times \text{IQR}\) or above \(Q_3 + 1.5 \times \text{IQR}\)

  • Extreme values: values below \(Q_1 - 3 \times \text{IQR}\) or above \(Q_3 + 3 \times \text{IQR}\)
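
The two fence rules above can be sketched as a small function. `outside_fences()` is a hypothetical helper written for this example, and the quartiles use R's default `quantile()` method, which is an assumption about how the fences are computed.

```r
# Hypothetical helper: flags values below Q1 - k*IQR or above Q3 + k*IQR.
# k = 1.5 gives the outlier rule; k = 3 gives the extreme-value rule.
outside_fences <- function(x, k = 1.5) {
  q1 <- quantile(x, 0.25, names = FALSE)
  q3 <- quantile(x, 0.75, names = FALSE)
  iqr <- q3 - q1
  x < q1 - k * iqr | x > q3 + k * iqr
}

values <- c(3, 3, 4, 5, 5, 6, 6, 7, 8, 20)  # illustrative values

outliers <- values[outside_fences(values, k = 1.5)]
extremes <- values[outside_fences(values, k = 3)]

print(outliers)
print(extremes)
```

Here Q1 = 4.25 and Q3 = 6.75, so the 1.5 × IQR fences sit at 0.5 and 10.5 and the 3 × IQR fences at -3.25 and 14.25; the value 20 falls outside both.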


data1 <- c(3,3,4,5,5,6,6,7,8,9)
  • Mean: 5.6

  • Standard deviation: 2.01

  • Median: 5.5

  • IQR: 2.5

data2 <- c(3,3,4,5,5,6,6,7,8,20)
  • Mean: 6.7

  • Standard deviation: 4.95

  • Median: 5.5

  • IQR: 2.5
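
The figures above can be reproduced with a short base R sketch. `summarise_vec()` is a hypothetical helper introduced only to collect the four statistics in one place.

```r
data1 <- c(3, 3, 4, 5, 5, 6, 6, 7, 8, 9)
data2 <- c(3, 3, 4, 5, 5, 6, 6, 7, 8, 20)  # same values, with 9 replaced by 20

# Hypothetical helper that collects the four statistics in one named vector
summarise_vec <- function(x) {
  c(mean = mean(x), sd = sd(x), median = median(x), iqr = IQR(x))
}

comparison <- rbind(data1 = summarise_vec(data1), data2 = summarise_vec(data2))
print(round(comparison, 2))
```

Changing a single value moves the mean and standard deviation but leaves the median and IQR unchanged.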

Outliers


Source: Data Science Discovery

Outliers

Robust statistics for continuous data (less sensitive to outliers)

  • Centrality: use median rather than mean

  • Variability: use IQR rather than standard deviation

“Visualizing data helps us think”

anscombe |> tibble()
# A tibble: 11 × 8
      x1    x2    x3    x4    y1    y2    y3    y4
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1    10    10    10     8  8.04  9.14  7.46  6.58
 2     8     8     8     8  6.95  8.14  6.77  5.76
 3    13    13    13     8  7.58  8.74 12.7   7.71
 4     9     9     9     8  8.81  8.77  7.11  8.84
 5    11    11    11     8  8.33  9.26  7.81  8.47
 6    14    14    14     8  9.96  8.1   8.84  7.04
 7     6     6     6     8  7.24  6.13  6.08  5.25
 8     4     4     4    19  4.26  3.1   5.39 12.5 
 9    12    12    12     8 10.8   9.13  8.15  5.56
10     7     7     7     8  4.82  7.26  6.42  7.91
11     5     5     5     8  5.68  4.74  5.73  6.89


anscombe_summary_stats
# A tibble: 2 × 9
  statistic    x1    x2    x3    x4    y1    y2    y3    y4
  <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 mean       9     9     9     9     7.50  7.50  7.5   7.50
2 sd         3.32  3.32  3.32  3.32  2.03  2.03  2.03  2.03

Anscombe’s Quartet

  • Stephen Few (2009, p6)
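
One possible way to build a summary like the one above is sketched here in base R. The slide's `anscombe_summary_stats` object is not defined in the text, so this is an assumption about its construction rather than the original code; the correlations are an extra check added for illustration.

```r
# anscombe ships with base R (datasets package)
x_cols <- paste0("x", 1:4)
y_cols <- paste0("y", 1:4)

# Means and standard deviations are nearly identical across the four sets
summary_stats <- rbind(
  mean = sapply(anscombe, mean),
  sd   = sapply(anscombe, sd)
)

# So are the x-y correlations, one per set
correlations <- mapply(function(x, y) cor(anscombe[[x]], anscombe[[y]]),
                       x_cols, y_cols)

print(round(summary_stats, 2))
print(round(correlations, 3))
```

Only a plot of each pair, for example `plot(anscombe$x1, anscombe$y1)`, shows how different the four sets really are.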

Data type determines how to summarize it


Nominal (categorical)

  • Measures: frequency counts, proportions
  • Charts: bars

Ordinal (categorical)

  • Measures: frequency counts, proportions, median, mode, IQR
  • Charts: bars

Numerical (continuous)

  • Measures: mean, median, range, standard deviation, IQR
  • Charts: histogram, boxplot

Summarizing Nominal data

Summarize with counts/percentages

wildlife_impacts |> 
  count(operator, sort = TRUE) |> 
  mutate(percent = n / sum(n))
# A tibble: 4 × 3
  operator               n percent
  <chr>              <int>   <dbl>
1 SOUTHWEST AIRLINES 17970   0.315
2 UNITED AIRLINES    15116   0.265
3 AMERICAN AIRLINES  14887   0.261
4 DELTA AIR LINES     9005   0.158

Visualize with bars

wildlife_impacts |> 
  count(operator, sort = TRUE) |> 
  ggplot(aes(x = fct_reorder(operator, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Operator", y = "Count") +
  theme_minimal()

Summarizing Ordinal data

Summarize: counts/percentages

wildlife_impacts |> 
  count(incident_month, sort = TRUE) |> 
  mutate(percent = n / sum(n))
# A tibble: 12 × 3
   incident_month     n percent
            <dbl> <int>   <dbl>
 1              9  7980  0.140 
 2             10  7754  0.136 
 3              8  7104  0.125 
 4              5  6161  0.108 
 5              7  6133  0.108 
 6              6  4541  0.0797
 7              4  4490  0.0788
 8             11  4191  0.0736
 9              3  2678  0.0470
10             12  2303  0.0404
11              1  1951  0.0342
12              2  1692  0.0297

Visualize: bars

wildlife_impacts |> 
  count(incident_month, sort = TRUE) |> 
  ggplot(aes(x = as.factor(incident_month), y = n)) +
  geom_col() +
  labs(x = "Incident Month", y = "Count") 

Summarizing continuous data

Histograms:

  • Skewness

  • Number of modes


Boxplots:

  • Outliers

  • Comparing variables
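
As a minimal sketch of what each chart is used for, the example below works on simulated right-skewed data. The data are an assumption for illustration (not the wildlife impacts data), and only base R functions are used.

```r
set.seed(42)                 # reproducible simulated data (not the lesson's data)
x <- rexp(1000, rate = 1)    # exponential draws are right-skewed

# Histogram bins computed without drawing the chart
h <- hist(x, breaks = 30, plot = FALSE)

# Right skew pulls the mean above the median
right_skewed <- mean(x) > median(x)

# Points beyond the boxplot whiskers (1.5 * IQR from the hinges)
bp <- boxplot.stats(x)
n_outliers <- length(bp$out)

print(h$counts)      # bin counts: tall bars on the left, long tail on the right
print(right_skewed)
print(n_outliers)
```

Calling `plot(h)` and `boxplot(x)` draws the two charts. `boxplot.stats()` uses hinges rather than the default `quantile()` quartiles, so its fences can differ slightly from a hand calculation.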

Histogram: Identify Skewness & # of Modes

Summarise:

  • Mean, median, sd, range, & IQR:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    0.0     0.0    50.0   983.8  1000.0 25000.0   18038 

Visualize:

  • Histogram (identify skewness & modes)

Histogram: Identify Skewness & # of Modes

[Figure: histograms of Height and Speed]

Boxplot: Identify outliers

[Figure: boxplots of Height and Speed]

Histogram and Boxplot

Histogram

  • Skewness
  • Modes

Boxplot

  • Outliers