class: center, middle, inverse, title-slide # Statistical modelling ## #1.c Exploratory Data Analysis ###
Dr. Léo Belzile
HEC Montréal --- layout: true <div class="my-footer"> <span> <a href="https://lbelzile.github.io/modstat/" target="_blank"><b>Statistical Modelling</b></a> - Dr. Léo Belzile, HEC Montréal </span> </div> --- # Type of data Data base typically comprise many different *variable types*. Distinguishing between the later is needed for - correct model choice, - proper graphical representation, - interpretation of effects. --- ## Numerical variables <img src="img/c1/EDA/continuous_discrete.png" width="55%" style="display: block; margin: auto;" /> .center[ .small[ Drawing by Allison Horst of continuous (left) and discrete (right) numerical variables. ] ] --- ## Categorical variables <img src="img/c1/EDA/nominal_ordinal_binary.png" width="70%" style="display: block; margin: auto;" /> .center[ .small[ Drawing by Allison Horst of nominal (left), ordinal (middle) and binary (right) categorical variables. ] ] --- ## Graphics and data > *The simple graph has brought more information to the data analyst's mind than any other device.* .right[— John Tukey] -- ### What is a good graph? > *Graphical excellence consists of complex ideas communicated with clarity, precision, and efficiency ... Graphical excellence is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space.* .right[— Tufte, 1983] --- ## Grammar of graphics > Wilkinson, L. (2005), *The Grammar of Graphics(2nd ed.)* Statistics and Computing, New York: Springer. - Elements (layers): - data - mapping (variables -> aesthetics) - geometric objects - transformations - positioning - Scale / guide - Coordinates (facets, coordinate system) --- Some golden rules for effective visualization ### Rule 1: the choice of graphic depends on the variable type -- <br> #### One variable .midi[ - continuous: histogram/density plot - discrete: bar plot - categorical: bar plot (frequency or percentage) ] -- #### Two variables .midi[ - continuous: scatterplot - categorical: bar plots (one group via color), heatmap - continuous `\(\times\)` categorical: box-and-whisker plot, violin plot ] --- .panelset[ .panel[.panel-name[**R** graph] .center[.small[Renfe data: bar plot of ticket class]] <img src="MATH60604A_w1c_eda_files/figure-html/renfe_barplot-1.png" width="70%" style="display: block; margin: auto;" /> ] .panel[.panel-name[**R** code] #### One categorical variable: bar plot ```r ggplot(data = renfe, aes(x = forcats::fct_infreq(classe))) + geom_bar() + geom_text(stat='count', aes(label=..count..), vjust=-0.5) + labs(x = "class", y = "count") + scale_y_continuous(expand = c(.125, 0)) + theme(panel.grid.major.x = element_blank()) ``` .small[ - Order values by frequency. - If the labels are too long, rotate the axis (`+ coord_flip()`). ] ] .panel[.panel-name[**SAS** graph] <img src="img/c1/EDA/EDA_sas_g1.png" width="55%" style="display: block; margin: auto;" /> ] .panel[.panel-name[**SAS** code] ```sas proc sgplot data=statmod.renfe; vbar class / categoryorder=respdesc; xaxis label="ticket class"; run; ``` ] ] --- .panelset[ .panel[.panel-name[**R** graph] .center[ .small[Renfe data: histogram of Promo ticket price ] ] <img src="MATH60604A_w1c_eda_files/figure-html/renfe_hist-1.png" width="70%" style="display: block; margin: auto;" /> ] .panel[.panel-name[**R** code] #### One continuous variable: histogram/density ```r renfe %>% subset(fare == "Promo") %>% ggplot(aes(x = price)) + geom_histogram(aes(y = ..density..), bins = 30) + geom_density() + geom_rug(sides = "b") + labs(x = "price of Promo tickets (in euros)", y = "density") ``` ] .panel[.panel-name[**SAS** graph] <img src="img/c1/EDA/EDA_sas_g2.png" width="55%" style="display: block; margin: auto;" /> ] .panel[.panel-name[**SAS** code] ```sas data renfe_promo; set statmod.renfe; where fare="Promo"; run; proc sgplot data=renfe_promo noautolegend; histogram price; density price / type=kernel; xaxis label = "Promo ticket price (in euros)"; run; ``` ] ] --- .panelset[ .panel[.panel-name[**R** graph] .center[ .small[ Renfe data: box-and-whiskers plot of Promo tickets price as a function of class ] ] <img src="MATH60604A_w1c_eda_files/figure-html/renfe_boxplot-1.png" width="55%" style="display: block; margin: auto;" /> ] .panel[.panel-name[**R** code] #### Two variables (continuous and categorical): boxplot ```r renfe %>% subset(fare == "Promo") %>% ggplot(aes(y = price, x = class, col = type)) + geom_boxplot() + labs(y = "price (in euros)", col = "train type") + theme(legend.position = "bottom") + scale_colour_viridis_d() ``` .small[ - We added a categorical variable (`type`) through use of colour. - Use an appropriate color palette (for color-blind people and for black and white printing). ] ] .panel[.panel-name[**SAS** graph] <img src="img/c1/EDA/EDA_sas_g3.png" width="55%" style="display: block; margin: auto;" /> ] .panel[.panel-name[**SAS** code] ```sas proc sgplot data=renfe_promo; vbox price / category=class group=type; yaxis label = "price (in euros)"; run; ``` ] ] --- .panelset[ .panel[.panel-name[**R** graph] .center[ .small[ Renfe data: scatterplot of ticket price as a function of travel time for high speed trains ] ] <img src="MATH60604A_w1c_eda_files/figure-html/renfe_nuagepts-1.png" width="55%" style="display: block; margin: auto;" /> ] .panel[.panel-name[**R** code] #### Two continuous variables and one categorical variable: scatterplot ```r renfe %>% subset(type != "REXPRESS") %>% ggplot(aes(x = duration, y = price, col = type)) + geom_point() + labs(y = "price (in euros)", x = "travel duration (in minutes)", col = "train type") + theme(legend.position = "bottom") + scale_colour_viridis_d() ``` .midi[ What is wrong with the previous display? How could we fix the graph? ] ] .panel[.panel-name[**SAS** graph] <img src="img/c1/EDA/EDA_sas_g4.png" width="55%" style="display: block; margin: auto;" /> ] .panel[.panel-name[**SAS** code] ```sas data renfe_ave; set statmod.renfe; where type NE "REXPRESS"; run; proc sgplot data=renfe_ave; scatter y=price x=duration / group=type; xaxis label="travel duration (in minutes)"; yaxis label="price (in euros)"; run; ``` ] ] --- ### Rule 2: A graph tells a story by itself Your graphic must be standalone with the legend. - some visualization choices are more effective/adequate than others - include both variable name **and** units if ambiguous - add a description in the text and cross-reference - pay attention to scale (adequate font size for legibility) --- ### Rule 3: pay attention to human visual perception Avoid junk chart + ratio length/width + spacing between bands + axis limits (with or without zero) + choice of color (grayscale, color-blind friendly palettes) + comparing areas is difficult + avoid 3D graphs / rotation --- # Graphical exploratory data analysis > *Numerical quantities focus on expected values, graphical summaries on unexpected values.* .right[— John Tukey] - Ask questions related to the data - Look for answers using graphs - Infirm or confirm your intuitions - Refine questions based on preliminary findings - Rinse and repeat - Write a summary of key findings --- ### References - [*Fundamentals of Data Visualization* par Claus O. Wilke](https://serialmentor.com/dataviz/) - [Chapter 3 of _**R** for Data Science_ by Garrett Grolemund and Hadley Wickham](https://r4ds.had.co.nz/) - [Chapter 1 of *Data Visualization: A practical introduction* by Kieran Healy](https://socviz.co/lookatdata.html#lookatdata)