Data summary
Name diamonds
Number of rows 53940
Number of columns 10
_______________________
Column type frequency:
factor 3
numeric 7
________________________
Group variables None

Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
cut                    0              1  TRUE            5  Ide: 21551, Pre: 13791, Ver: 12082, Goo: 4906
color                  0              1  TRUE            7  G: 11292, E: 9797, F: 9542, H: 8304
clarity                0              1  TRUE            8  SI1: 13065, VS2: 12258, SI2: 9194, VS1: 8171

Variable type: numeric

skim_variable  n_missing  complete_rate     mean       sd     p0     p25      p50      p75      p100  hist
carat                  0              1     0.80     0.47    0.2    0.40     0.70     1.04      5.01  ▇▂▁▁▁
depth                  0              1    61.75     1.43   43.0   61.00    61.80    62.50     79.00  ▁▁▇▁▁
table                  0              1    57.46     2.23   43.0   56.00    57.00    59.00     95.00  ▁▇▁▁▁
price                  0              1  3932.80  3989.44  326.0  950.00  2401.00  5324.25  18823.00  ▇▂▁▁▁
x                      0              1     5.73     1.12    0.0    4.71     5.70     6.54     10.74  ▁▁▇▃▁
y                      0              1     5.73     1.14    0.0    4.72     5.71     6.54     58.90  ▇▁▁▁▁
z                      0              1     3.54     0.71    0.0    2.91     3.53     4.04     31.80  ▇▁▁▁▁

Univariate

p1 <- ggplot(data=diamonds, aes(x=cut))
  1. R code: _______________________________

  2. R code: _______________________________

  3. R code: _______________________________


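# ..count..: special variable holding the frequency computed by the stat
# (since ggplot2 3.4.0 the preferred spelling is after_stat(count); ..count.. still works but is deprecated)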
p1 <- ggplot(data=diamonds, aes(x=cut, y=..count..))
p2 <- ggplot(data=diamonds, aes(x=cut, y=..count../sum(..count..)))
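One possible way to draw the two mappings above, as a minimal sketch. It assumes ggplot2 3.4.0 or later and therefore uses after_stat(count) in place of ..count..; the object names bar_count and bar_prop are illustrative only.

```r
library(ggplot2)

# Counts on the y axis: after_stat(count) is the frequency computed by stat_count()
bar_count <- ggplot(data = diamonds, aes(x = cut, y = after_stat(count))) +
  geom_bar()

# Proportions: each count divided by the total of all counts
bar_prop <- ggplot(data = diamonds, aes(x = cut, y = after_stat(count / sum(count)))) +
  geom_bar() +
  ylab("proportion")
```

Printing bar_count or bar_prop draws the chart. The bar heights in bar_prop add up to 1.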
  4. R code: _______________________________

  5. R code: _______________________________

  6. R code: _______________________________
        cut percent
1      Fair     3.0
2      Good     9.0
3 Very Good    22.4
4   Premium    25.6
5     Ideal    40.0
p3 <- ggplot(data=cut.percent, aes(x=cut, y=percent))

  7. R code: _______________________________
# Need to rerun this once you change the factor levels (can't use p3)
ggplot(data=cut.percent, aes(x=cut, y=percent))+geom_bar(stat="identity")

  8. Labeling bars: which geom should be added here?

  9. R code:___________

Hint: use coord_flip()

  10. R code:____________
        cut percent  prop
1      Fair     3.0 0.030
2      Good     9.0 0.090
3 Very Good    22.4 0.224
4   Premium    25.6 0.256
5     Ideal    40.0 0.400
ggplot(data=cut.prop, aes(x="", y=prop, fill=cut))+geom_bar(stat="identity", width=1)

ggplot(data=cut.prop, aes(x="", y=prop, fill=cut))+geom_bar(stat="identity", width=1, position = "dodge")

  11. R code:____________

Pie charts are controversial in statistics.

Some extra work is needed to make a pie chart appealing to the human eye.
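A hedged sketch of that extra work: a pie chart in ggplot2 is a single stacked bar drawn in polar coordinates. The cut.prop data frame here is rebuilt by hand from the printed table, which is an assumption about the original object; theme_void() is one styling choice among many.

```r
library(ggplot2)

# cut.prop rebuilt from the printed table (assumed to match the original object)
cut.prop <- data.frame(
  cut  = factor(c("Fair", "Good", "Very Good", "Premium", "Ideal"),
                levels = c("Fair", "Good", "Very Good", "Premium", "Ideal")),
  prop = c(0.030, 0.090, 0.224, 0.256, 0.400)
)

pie <- ggplot(data = cut.prop, aes(x = "", y = prop, fill = cut)) +
  geom_bar(stat = "identity", width = 1) +  # one stacked bar
  coord_polar(theta = "y") +                # map bar height to angle
  theme_void()                              # drop axes and grid lines
```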


Bi-variate

Stacked bar chart

Encoding by colour

Position: stack

b1 <- ggplot(data=diamonds, aes(x=cut, fill=color))

12: R code:___________

Grouped bar chart

Encoding by colour

Position: dodge

13: R code:___________

Segmented bar chart

Position: fill

14: R code:___________

15: R code:_______________
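The three position adjustments named above can be compared side by side. This is a sketch of one possible answer, not the only one; the object names stacked, dodged and filled are illustrative.

```r
library(ggplot2)

b1 <- ggplot(data = diamonds, aes(x = cut, fill = color))

stacked <- b1 + geom_bar(position = "stack")  # default: colours stacked, height = count
dodged  <- b1 + geom_bar(position = "dodge")  # colours side by side within each cut
filled  <- b1 + geom_bar(position = "fill")   # each bar rescaled to total height 1
```

With position = "fill" the y axis shows proportions within each cut, so the bars are comparable even though the cut groups differ in size.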

Encoding by position

ggplot(data=diamonds, aes(x=color))+geom_bar()+facet_wrap(~cut)

Categorical vs Quantitative

Cleveland dot chart

Heat map

Summary statistics

  1. R code:___________________
# A tibble: 5 x 2
  cut       mean_carat
  <ord>          <dbl>
1 Fair           1.05 
2 Good           0.849
3 Very Good      0.806
4 Premium        0.892
5 Ideal          0.703
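A table like the one above could be produced with dplyr, for example. This is a sketch that assumes the dplyr package is available; the name carat_by_cut is illustrative.

```r
library(ggplot2)  # provides the diamonds data
library(dplyr)

# Mean carat within each level of cut
carat_by_cut <- diamonds %>%
  group_by(cut) %>%
  summarise(mean_carat = mean(carat))
```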

Plotting summary statistics: stat_summary

  1. R code:___________________

mean_se: mean plus or minus one standard error

g1 <- ggplot(diamonds, aes(x = cut, y = carat)) 
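A minimal sketch of mean_se in use, assuming a point-range display is wanted; the geom is a free choice and g1_se is an illustrative name. mean_se ships with ggplot2, so no extra package is needed.

```r
library(ggplot2)

g1 <- ggplot(diamonds, aes(x = cut, y = carat))

# Point at the group mean, line spanning mean - SE to mean + SE
g1_se <- g1 + stat_summary(fun.data = mean_se, geom = "pointrange")
```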

  1. R code:___________________

mean_cl_normal: 95% confidence interval for the mean, assuming normality. (Requires the Hmisc package)
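A sketch of mean_cl_normal drawn as error bars around the mean. It assumes Hmisc is installed; ggplot2 calls it internally. The names g1_ci and the errorbar width are illustrative choices.

```r
library(ggplot2)

# mean_cl_normal is a ggplot2 wrapper around an Hmisc function
stopifnot(requireNamespace("Hmisc", quietly = TRUE))

g1_ci <- ggplot(diamonds, aes(x = cut, y = carat)) +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = 0.2) +
  stat_summary(fun = mean, geom = "point")
```

The same call pattern works with the other fun.data helpers by swapping the function name.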

  1. R code:___________________

  1. R code:___________________

  1. R code:___________________

  1. R code:___________________

  1. R code:___________________

mean_cl_boot: Bootstrap confidence interval (95%)

  1. R code:___________________

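median_hilow: Median with a lower and an upper quantile. The default gives the 2.5% and 97.5% quantiles; set conf.int = 0.5 to get Q1 and Q3. (Requires the Hmisc package)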
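A sketch showing how the median with Q1 and Q3 could be plotted. It assumes Hmisc is installed; conf.int = 0.5 is passed through fun.args, and g1_iqr is an illustrative name.

```r
library(ggplot2)

# median_hilow is a ggplot2 wrapper around an Hmisc function
stopifnot(requireNamespace("Hmisc", quietly = TRUE))

# conf.int = 0.5 keeps the middle 50% of the data, i.e. Q1 to Q3
g1_iqr <- ggplot(diamonds, aes(x = cut, y = carat)) +
  stat_summary(fun.data = median_hilow,
               fun.args = list(conf.int = 0.5),
               geom = "pointrange")
```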

Design of Experiments

Description

The response is the length of odontoblasts (cells responsible for tooth growth) in 60 guinea pigs. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods, orange juice or ascorbic acid (a form of vitamin C and coded as VC).

Data summary
Name ToothGrowth
Number of rows 60
Number of columns 3
_______________________
Column type frequency:
factor 1
numeric 2
________________________
Group variables None

Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
supp                   0              1  FALSE           2  OJ: 30, VC: 30

Variable type: numeric

skim_variable  n_missing  complete_rate   mean    sd   p0    p25    p50    p75  p100  hist
len                    0              1  18.81  7.65  4.2  13.07  19.25  25.27  33.9  ▅▃▅▇▂
dose                   0              1   1.17  0.63  0.5   0.50   1.00   2.00   2.0  ▇▇▁▁▇
   len supp dose
1  4.2   VC  0.5
2 11.5   VC  0.5
3  7.3   VC  0.5
4  5.8   VC  0.5
5  6.4   VC  0.5
6 10.0   VC  0.5
  1. R code:___________________

  1. R code:___________________

  1. R code:____________________

To avoid overlapping in the last category, use position_dodge(0.1)
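A sketch of the dodge in context, assuming the plot shows mean len per dose for each supp with standard-error bars. That layout is an assumption about the intended figure, and tg_plot and pd are illustrative names. At dose 2 the two group means are nearly equal, so undodged points and bars would sit on top of each other.

```r
library(ggplot2)

pd <- position_dodge(0.1)  # shift the two supp groups slightly apart along x

tg_plot <- ggplot(ToothGrowth, aes(x = dose, y = len, colour = supp, group = supp)) +
  stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.1, position = pd) +
  stat_summary(fun = mean, geom = "line", position = pd) +
  stat_summary(fun = mean, geom = "point", position = pd)
```

The same position object is passed to every layer so that lines, points and error bars stay aligned.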

  1. R code: ___________

Not suitable for this example: Why?

Categorical with two Quantitative variables

  1. R code: ___________

  1. R code: ___________

  1. R code: ___________

  1. R code: ___________

  1. R code: ___________

  1. R code: ___________