Categorical Data

Data summary
Name	diamonds
Number of rows	53940
Number of columns	10
_______________________
Column type frequency:
factor	3
numeric	7
________________________
Group variables	None

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
cut	1	TRUE	5	Ide: 21551, Pre: 13791, Ver: 12082, Goo: 4906
color	1	TRUE	7	G: 11292, E: 9797, F: 9542, H: 8304
clarity	1	TRUE	8	SI1: 13065, VS2: 12258, SI2: 9194, VS1: 8171

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
carat	1	0.80	0.47	0.2	0.40	0.70	1.04	5.01	▇▂▁▁▁
depth	1	61.75	1.43	43.0	61.00	61.80	62.50	79.00	▁▁▇▁▁
table	1	57.46	2.23	43.0	56.00	57.00	59.00	95.00	▁▇▁▁▁
price	1	3932.80	3989.44	326.0	950.00	2401.00	5324.25	18823.00	▇▂▁▁▁
x	1	5.73	1.12	0.0	4.71	5.70	6.54	10.74	▁▁▇▃▁
y	1	5.73	1.14	0.0	4.72	5.71	6.54	58.90	▇▁▁▁▁
z	1	3.54	0.71	0.0	2.91	3.53	4.04	31.80	▇▁▁▁▁

Univariate

p1 <- ggplot(data=diamonds, aes(x=cut))

R code: _______________________________

R code: _______________________________

R code: _______________________________

# Old version (dot-dot notation, deprecated since ggplot2 3.4.0)
# ..count..: special variable representing the frequency
# p1 <- ggplot(data=diamonds, aes(x=cut, y=..count..))
# p2 <- ggplot(data=diamonds, aes(x=cut, y=..count../sum(..count..)))

# New version
p1 <- ggplot(data=diamonds, aes(x=cut, y=after_stat(count))) + geom_bar()
p2 <- ggplot(data=diamonds, aes(x=cut, y=after_stat(count/sum(count)))) + geom_bar()
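To see what after_stat() actually computes, the bar heights can be pulled back out of a plot with layer_data(). The following is a minimal sketch, assuming ggplot2 is installed; the names p_count, p_prop, bar_counts and bar_props are illustrative and not part of the worksheet.

```r
library(ggplot2)

# Bar heights as raw frequencies
p_count <- ggplot(diamonds, aes(x = cut, y = after_stat(count))) + geom_bar()

# Bar heights as proportions of all rows
p_prop <- ggplot(diamonds, aes(x = cut, y = after_stat(count / sum(count)))) + geom_bar()

# layer_data() returns the data frame the stat computed for the first layer
bar_counts <- layer_data(p_count)$y
bar_props  <- layer_data(p_prop)$y

sum(bar_counts)  # equals the number of rows in diamonds
sum(bar_props)   # proportions add up to 1
```

Inspecting the computed values this way is a quick check that a proportion plot really is normalised over the whole data set.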

R code: _______________________________

R code: _______________________________

R code: _______________________________

        cut percent
1      Fair     3.0
2      Good     9.0
3 Very Good    22.4
4   Premium    25.6
5     Ideal    40.0
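A table of this shape could be built in several ways. The sketch below uses base R only and is an assumption about the approach, not necessarily the one intended for the blanks above. The rounding used for the table above is not shown, so individual values may differ in the last digit.

```r
library(ggplot2)  # only needed for the diamonds data

# Frequency of each cut level
cut.counts <- table(diamonds$cut)

# One row per level, with the share of all diamonds in per cent
cut.percent <- data.frame(
  cut     = factor(names(cut.counts), levels = levels(diamonds$cut)),
  percent = round(100 * as.numeric(cut.counts) / sum(cut.counts), 1)
)

cut.percent
```

Keeping the factor levels in the original order (Fair to Ideal) matters, because ggplot2 draws the bars in level order.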

p3 <- ggplot(data=cut.percent, aes(x=cut, y=percent))

R code: _______________________________

# Need to rerun this once you change the factor levels (can't use p3)
ggplot(data=cut.percent, aes(x=cut, y=percent))+geom_bar(stat="identity")

Labeling bars: what is the suitable geom to add here?

R code:___________
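One candidate geom for labeling is geom_text(). The sketch below re-creates cut.percent from the values printed earlier so that it runs on its own; the label format and the vjust offset are illustrative choices.

```r
library(ggplot2)

# Values copied from the cut/percent table shown earlier
cut.levels  <- c("Fair", "Good", "Very Good", "Premium", "Ideal")
cut.percent <- data.frame(
  cut     = factor(cut.levels, levels = cut.levels),
  percent = c(3.0, 9.0, 22.4, 25.6, 40.0)
)

# Bars from precomputed heights, with a text label just above each bar
p_labeled <- ggplot(cut.percent, aes(x = cut, y = percent)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = paste0(percent, "%")), vjust = -0.5)

p_labeled
```

A negative vjust moves the text above the top of the bar; a positive value would place it inside the bar.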

Help: use coord_flip

R Code:____________
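As a hedged illustration of the hint, coord_flip() swaps the two axes so that the bars run horizontally, which often leaves more room for category names. The data frame is re-created from the table shown earlier; p_flipped is an illustrative name.

```r
library(ggplot2)

# Values copied from the cut/percent table shown earlier
cut.levels  <- c("Fair", "Good", "Very Good", "Premium", "Ideal")
cut.percent <- data.frame(
  cut     = factor(cut.levels, levels = cut.levels),
  percent = c(3.0, 9.0, 22.4, 25.6, 40.0)
)

# Same bar chart as before, drawn with horizontal bars
p_flipped <- ggplot(cut.percent, aes(x = cut, y = percent)) +
  geom_bar(stat = "identity") +
  coord_flip()

p_flipped
```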

        cut percent  prop
1      Fair     3.0 0.030
2      Good     9.0 0.090
3 Very Good    22.4 0.224
4   Premium    25.6 0.256
5     Ideal    40.0 0.400

ggplot(data=cut.prop, aes(x="", y=prop, fill=cut))+geom_bar(stat="identity", width=1)

ggplot(data=cut.prop, aes(x="", y=prop, fill=cut))+geom_bar(stat="identity", width=1, position = "dodge")

R Code:____________

Pie charts are controversial in statistics: readers judge angles and areas less accurately than bar lengths.

Some extra work is needed to make the pie chart appealing to the human eye.
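A pie chart in ggplot2 is commonly built as a single stacked bar drawn in polar coordinates. The sketch below assumes this approach; cut.prop is re-created from the values printed above, and theme_void() is one possible way to remove the axes.

```r
library(ggplot2)

# Values copied from the cut/prop table shown above
cut.levels <- c("Fair", "Good", "Very Good", "Premium", "Ideal")
cut.prop   <- data.frame(
  cut  = factor(cut.levels, levels = cut.levels),
  prop = c(0.030, 0.090, 0.224, 0.256, 0.400)
)

# One stacked bar, then map the y axis to the angle
pie <- ggplot(cut.prop, aes(x = "", y = prop, fill = cut)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  theme_void()

pie
```

Note that this only works with the stacked version of the bar; the dodged version has several bars side by side and does not form slices.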

Bi-variate

Stacked bar chart

Encoding by colour

Position: stack

b1 <- ggplot(data=diamonds, aes(x=cut, fill=color))

12: R code:___________

Grouped bar chart

Encoding by colour

Position: dodge

13: R code:___________

Segmented bar chart

Position: fill

14: R code:___________

15: R code:___________
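The three position settings named above can be compared side by side. This is a minimal sketch, not necessarily the intended answers to the blanks; the object names b_stack, b_dodge and b_fill are illustrative.

```r
library(ggplot2)

b1 <- ggplot(data = diamonds, aes(x = cut, fill = color))

# Stacked: colour segments pile up to the total count per cut
b_stack <- b1 + geom_bar(position = "stack")

# Grouped: one bar per colour, placed side by side within each cut
b_dodge <- b1 + geom_bar(position = "dodge")

# Segmented: every bar is rescaled to height 1, so segments show proportions
b_fill <- b1 + geom_bar(position = "fill") + ylab("proportion")
```

Stacking shows totals well, dodging makes individual colours comparable, and filling makes the composition of each cut comparable at the cost of hiding the totals.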

Encoding by position

ggplot(data=diamonds, aes(x=color))+geom_bar()+facet_wrap(~cut)

Categorical vs Quantitative

Cleveland dot chart

Heat map
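As one possible reading of this heading, a heat map of two categorical variables can be sketched with geom_tile(), using the cell counts as the fill. The choice of cut and color as the two variables, and the names cut.color and heat, are assumptions for illustration.

```r
library(ggplot2)

# Cross-tabulate cut by color; as.data.frame() gives one row per cell
cut.color <- as.data.frame(table(cut = diamonds$cut, color = diamonds$color))

# One tile per combination, shaded by the number of diamonds in it
heat <- ggplot(cut.color, aes(x = color, y = cut, fill = Freq)) +
  geom_tile()

heat
```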

Summary statistics

R code:___________________

# A tibble: 5 × 2
  cut       mean_carat
  <ord>          <dbl>
1 Fair           1.05 
2 Good           0.849
3 Very Good      0.806
4 Premium        0.892
5 Ideal          0.703
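A tibble like the one above can be produced with a grouped summary. The sketch below assumes the dplyr package is installed; carat.by.cut is an illustrative name.

```r
library(ggplot2)  # diamonds data
library(dplyr)

# Mean carat within each level of cut
carat.by.cut <- diamonds %>%
  group_by(cut) %>%
  summarise(mean_carat = mean(carat))

carat.by.cut
```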

Plotting summary statistics: `stat_summary`

R code:___________________

mean_se: mean and standard error

g1 <- ggplot(diamonds, aes(x = cut, y = carat))

R code:___________________
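As a hedged example, mean_se can be passed to stat_summary() through fun.data, and it can also be called directly on a vector to see the three values it returns (y, ymin, ymax). The pointrange geom and the names g_se and fair.se are illustrative choices.

```r
library(ggplot2)

g1 <- ggplot(diamonds, aes(x = cut, y = carat))

# Point at the mean, range of one standard error on each side
g_se <- g1 + stat_summary(fun.data = mean_se, geom = "pointrange")

# The same summary computed by hand for a single group
fair.se <- mean_se(diamonds$carat[diamonds$cut == "Fair"])
fair.se
```

With almost 54,000 rows the standard errors are very small, so the ranges may be barely visible on the plot.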

mean_cl_normal: 95 per cent confidence interval assuming normality (requires the Hmisc package to be installed).

R code:___________________

R code:___________________

R code:___________________

R code:___________________

R code:___________________

mean_cl_boot: Bootstrap confidence interval (95%)

R code:___________________

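median_hilow: Median with lower and upper quantiles (Q1 and Q3 when conf.int = 0.5; the default conf.int = 0.95 gives the 2.5% and 97.5% quantiles)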
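The Hmisc-based summaries plug into stat_summary() in the same way. ggplot2 exposes the median-based one under the name median_hilow. The sketch below only builds the plot objects; drawing them requires the Hmisc package to be installed. The geoms and object names are illustrative choices.

```r
library(ggplot2)

g1 <- ggplot(diamonds, aes(x = cut, y = carat))

# Mean with a 95% confidence interval based on the t distribution
g_normal <- g1 + stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = 0.2)

# Mean with a 95% bootstrap confidence interval
g_boot <- g1 + stat_summary(fun.data = mean_cl_boot, geom = "pointrange")

# Median with Q1 and Q3: conf.int = 0.5 selects the middle 50% of the data
g_quart <- g1 + stat_summary(fun.data = median_hilow,
                             fun.args = list(conf.int = 0.5),
                             geom = "pointrange")
```

Arguments for the summary function go in fun.args, which is how the quartile version differs from the default 95% range.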

Design of Experiments

Description

The response is the length of odontoblasts (cells responsible for tooth growth) in 60 guinea pigs. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods, orange juice or ascorbic acid (a form of vitamin C and coded as VC).
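The layout described above can be checked directly, since ToothGrowth ships with base R. This is a small sketch; design is an illustrative name.

```r
# Cross-tabulate delivery method by dose to see the design
design <- table(supp = ToothGrowth$supp, dose = ToothGrowth$dose)
design

# Two methods by three doses, with the 60 animals spread evenly
dim(design)
sum(design)
```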

Data summary
Name	ToothGrowth
Number of rows	60
Number of columns	3
_______________________
Column type frequency:
factor	1
numeric	2
________________________
Group variables	None