Original Authors: Martin Morgan
Presenting Authors: Martin Morgan
Date: 22 July, 2019
Back: Monday labs
Objective: Gain confidence working with ‘tidy’ R commands and data structures.
Lessons learned:
gather()
to ‘tidy’ data.The ‘tidyverse’ is a a recent approach to working with data in R. The basic principle is that data is often best represented in ‘long-form’ data.frames, with data transformations implemented with a few central functions.
Start by loading some important packages in the tidyverse, readr for data input, tibble for data representation, dplyr for manipulation, and ggplot2 for visualization.
library(readr)
library(tibble)
library(dplyr)
library(ggplot2)
Read the BRFSS data into a tibble
(data.frame) using read_csv()
;
this function is similar to read.csv()
but standardized the argument
naming convention.
fname <- file.choose() ## BRFSS-subset.csv
stopifnot(file.exists(fname))
brfss <- read_csv(fname)
Note that by default character values were not interpreted as factors, that the display is more informative (the class of each column is indicated under the title), and only a ‘preview’ of all the data is displayed.
brfss
## # A tibble: 20,000 x 5
## Age Weight Sex Height Year
## <dbl> <dbl> <chr> <dbl> <dbl>
## 1 31 49.0 Female 157. 1990
## 2 57 81.6 Female 157. 1990
## 3 43 80.3 Male 178. 1990
## 4 72 70.3 Male 170. 1990
## 5 31 49.9 Female 155. 1990
## 6 58 54.4 Female 155. 1990
## 7 45 69.9 Male 173. 1990
## 8 37 68.0 Male 180. 1990
## 9 33 65.8 Female 170. 1990
## 10 75 70.8 Female 152. 1990
## # … with 19,990 more rows
A tibble()
can be manipulated like a data.frame, but usually
operations in the tidyverse are ‘piped’ using the %>%
symbol from
one representation to another. We start with some clean-up using the
mutate()
function to update or add individual columns. Start with an
interactive exploration…
brfss %>% mutate( Sex = factor(Sex), Year = factor(Year) )
## # A tibble: 20,000 x 5
## Age Weight Sex Height Year
## <dbl> <dbl> <fct> <dbl> <fct>
## 1 31 49.0 Female 157. 1990
## 2 57 81.6 Female 157. 1990
## 3 43 80.3 Male 178. 1990
## 4 72 70.3 Male 170. 1990
## 5 31 49.9 Female 155. 1990
## 6 58 54.4 Female 155. 1990
## 7 45 69.9 Male 173. 1990
## 8 37 68.0 Male 180. 1990
## 9 33 65.8 Female 170. 1990
## 10 75 70.8 Female 152. 1990
## # … with 19,990 more rows
…and when the pipeline looks good re-assign the updated tibble.
brfss <- brfss %>% mutate( Sex = factor(Sex), Year = factor(Year) )
Common operations are to filter()
rows to those containing only
certain values, and to select()
a subset of columns.
brfss %>% filter(Sex == "Female", Year == "1990") %>%
select(Age, Weight, Height)
## # A tibble: 5,718 x 3
## Age Weight Height
## <dbl> <dbl> <dbl>
## 1 31 49.0 157.
## 2 57 81.6 157.
## 3 31 49.9 155.
## 4 58 54.4 155.
## 5 33 65.8 170.
## 6 75 70.8 152.
## 7 68 66.2 160.
## 8 87 49.9 152.
## 9 44 115. 160.
## 10 57 63.5 157.
## # … with 5,708 more rows
Another common operation is to group_by()
one or more columns, and
summarize()
data in other columns, e.g.,
brfss %>%
group_by(Sex, Year) %>%
summarize(
AveAge = mean(Age, na.rm=TRUE),
AveWeight = mean(Weight, na.rm=TRUE)
)
## # A tibble: 4 x 4
## # Groups: Sex [2]
## Sex Year AveAge AveWeight
## <fct> <fct> <dbl> <dbl>
## 1 Female 1990 46.2 64.8
## 2 Female 2010 57.1 73.0
## 3 Male 1990 43.9 81.2
## 4 Male 2010 56.2 88.8
Note that the output of each pipe is itself a tibble, so that the output can be further transformed using tidyverse functions.
The main features of ‘tidy’ data are
Standard representation as a long-format data.frame-like tibble()
Restricted vocabulary of core functions – mutate()
, filter()
,
select()
, group_by()
, summarize()
, … These functions are
‘isomorphic’, meaning that the return value is the same type (a
tibble!) as the first argument of the function.
Short ‘pipes’ summarizing data transformation steps.
Revisit case study 2.1, ALL Phenotype Data in Lab 1.1: Introduction to R using a tidy approach to data exploration.
Input the data “ALLphenoData.tsv” using read_tsv()
(note: we use
_tsv()
because the columns are _t_ab _s_eparated). Compare column
types, display, etc with base R functions. Note that columns are
never factor()
by default.
Use filter()
to create a subset of the data consisting only of
female individuals over 40. Compare this approach with that used in
base R. Likewise, create a subsect with only “BCR/ABL” and “NEG”
in the mol.biol
column.
Use mutate()
to further transform the bcrabl subset by recoding
the BT
column to be either B or T, based on the first letter of
the column.
Use group_by()
and summarize()
to deterimine the number of
individuals in each combination of BT
and mol.biol
. Can you
perform this same computation using only count()
?
t.test()
We’d like to compare the average age of males and females in the study
using t.test()
. In base R , we can write
pdata <- read_tsv("ALLphenoData.tsv")
## Parsed with column specification:
## cols(
## .default = col_character(),
## age = col_double(),
## `t(4;11)` = col_logical(),
## `t(9;22)` = col_logical(),
## cyto.normal = col_logical(),
## ccr = col_logical(),
## relapse = col_logical(),
## transplant = col_logical()
## )
## See spec(...) for full column specifications.
t.test(age ~ sex, pdata)
##
## Welch Two Sample t-test
##
## data: age by sex
## t = 1.6034, df = 79.88, p-value = 0.1128
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.022660 9.504142
## sample estimates:
## mean in group F mean in group M
## 35.16667 30.92593
t.test()
takes a formula age ~ sex
(age
as a
function of sex
) describing the comparison that we’d like to
make. It also takes an argument data=
that contains the data we’d
like to perform the t-test on. Unlike functions we’ve encountered so
far where the data to be processed is the first argument, t.test()
expects the data as the second argument. To adapt t.test()
for use,
we need to explicitly indicate that the data should be the second
argument. One way of doing this is to use the special symbol .
to
represent the location of the incoming data, invoking t.test(age ~ sex, data = .)
:
pdata %>% t.test(age ~ sex, data = .)
Exercise Perform a t-test to ask whether there is evidence of
differences in ages between the sexes. How can we change the default
value of var.equal
to TRUE
? Is this appropriate?
Exercise Write a function that makes it easier to use t.test()
in a ‘tidy’ work flow. Do this by arranging it so that the first
argument of your function is the data set, the second argument the
formula, and then allow for any number of additional arguments. Pass
these to t.test()
, along the lines of
t_test <- function(data, formula, ...) {
t.test(formula, data, ...)
}
Verify that you can use your t_test()
function in a straight-forward way
pdata %>% t_test(age ~ sex)
##
## Welch Two Sample t-test
##
## data: age by sex
## t = 1.6034, df = 79.88, p-value = 0.1128
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.022660 9.504142
## sample estimates:
## mean in group F mean in group M
## 35.16667 30.92593
Exercise (advanced) The return value of t.test()
doesn’t fit
well with tidy
data analysis, because it is a complicated object
that is not represented as a tibble
and hence cannot be computed
upon using the common tidy verbs. Update t_test()
so that it is more
tidy-friendly, accepting data =
as it’s first argument, using
t.test()
internally to compute results, and returning a tibble
containing results formatted for subsequent computation. One way to
accomplish the last task is through use of broom::tidy()
function to transform many base R objects into tidy-friendly data
structures.
We’ll encounter the ‘airway’ data set extensively later in the
course. Here we read in the description of 8 samples used in a RNAseq
analysis, usnig select()
to choose specific columns to work with.
pdata_file <- file.choose() # airway-sample-sheet.csv
count_file <- file.choose() # airway-read-counts.csv
pdata <- read_csv(pdata_file)
## Parsed with column specification:
## cols(
## SampleName = col_character(),
## cell = col_character(),
## dex = col_character(),
## albut = col_character(),
## Run = col_character(),
## avgLength = col_double(),
## Experiment = col_character(),
## Sample = col_character(),
## BioSample = col_character()
## )
pdata <-
pdata %>%
select(Run, cell, dex)
pdata
## # A tibble: 8 x 3
## Run cell dex
## <chr> <chr> <chr>
## 1 SRR1039508 N61311 untrt
## 2 SRR1039509 N61311 trt
## 3 SRR1039512 N052611 untrt
## 4 SRR1039513 N052611 trt
## 5 SRR1039516 N080611 untrt
## 6 SRR1039517 N080611 trt
## 7 SRR1039520 N061011 untrt
## 8 SRR1039521 N061011 trt
Now read in the ‘airway-read_counts.csv’ file. Each row is a sample, each column (other than the first, the gene identifier) is a gene, and each entry the number of RNA-seq reads overlapping each gene and sample – a measure of gene expression.
count_file <- "airway-read-counts.csv"
counts <- read_csv(count_file)
## Parsed with column specification:
## cols(
## .default = col_double(),
## Run = col_character()
## )
## See spec(...) for full column specifications.
eg <- counts[, 1:6] # make it easy to work with
eg
## # A tibble: 8 x 6
## Run ENSG00000000003 ENSG00000000419 ENSG00000000457 ENSG00000000460
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 SRR1… 679 467 260 60
## 2 SRR1… 448 515 211 55
## 3 SRR1… 873 621 263 40
## 4 SRR1… 408 365 164 35
## 5 SRR1… 1138 587 245 78
## 6 SRR1… 1047 799 331 63
## 7 SRR1… 770 417 233 76
## 8 SRR1… 572 508 229 60
## # … with 1 more variable: ENSG00000000938 <dbl>
An advanced tidy concept is to join to data sets, as described on
the help page ?left_join
. Use left_join()
to combine the
phenotypic and expression data.
data <- left_join(pdata, eg)
## Joining, by = "Run"
data
## # A tibble: 8 x 8
## Run cell dex ENSG00000000003 ENSG00000000419 ENSG00000000457
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 SRR1… N613… untrt 679 467 260
## 2 SRR1… N613… trt 448 515 211
## 3 SRR1… N052… untrt 873 621 263
## 4 SRR1… N052… trt 408 365 164
## 5 SRR1… N080… untrt 1138 587 245
## 6 SRR1… N080… trt 1047 799 331
## 7 SRR1… N061… untrt 770 417 233
## 8 SRR1… N061… trt 572 508 229
## # … with 2 more variables: ENSG00000000460 <dbl>, ENSG00000000938 <dbl>
This is how a statistician might organize their data – in a ‘wide’ format, with each line representing a sample and each column a variable measured on that sample.
Tidy data is usually transformed into ‘long’ format, where ‘Count’ is interpreted as a measurement, with ‘Gene’ and ‘Run’ as indicator variables describing which experimental entity the count is associated with.
Use gather()
to transform the wide format data into long-format
‘tidy’ data. gather()
is a tricky function to work with, so study
the help page carefully.
library(tidyr)
tbl <- gather(data, "Gene", "Count", -(1:3))
tbl
## # A tibble: 40 x 5
## Run cell dex Gene Count
## <chr> <chr> <chr> <chr> <dbl>
## 1 SRR1039508 N61311 untrt ENSG00000000003 679
## 2 SRR1039509 N61311 trt ENSG00000000003 448
## 3 SRR1039512 N052611 untrt ENSG00000000003 873
## 4 SRR1039513 N052611 trt ENSG00000000003 408
## 5 SRR1039516 N080611 untrt ENSG00000000003 1138
## 6 SRR1039517 N080611 trt ENSG00000000003 1047
## 7 SRR1039520 N061011 untrt ENSG00000000003 770
## 8 SRR1039521 N061011 trt ENSG00000000003 572
## 9 SRR1039508 N61311 untrt ENSG00000000419 467
## 10 SRR1039509 N61311 trt ENSG00000000419 515
## # … with 30 more rows
Use group_by()
and summarize()
to calculate the library size,
i.e., the total number of reads mapped per run. Likewise use
group_by()
and summarize()
to descrbe average and log-transformed
counts.
tbl %>%
group_by(Run) %>%
summarize(lib_size = sum(Count))
## # A tibble: 8 x 2
## Run lib_size
## <chr> <dbl>
## 1 SRR1039508 1466
## 2 SRR1039509 1229
## 3 SRR1039512 1799
## 4 SRR1039513 972
## 5 SRR1039516 2049
## 6 SRR1039517 2240
## 7 SRR1039520 1496
## 8 SRR1039521 1369
tbl %>%
group_by(Gene) %>%
summarize(
ave_count = mean(Count),
ave_log_count = mean(log(1 + Count))
)
## # A tibble: 5 x 3
## Gene ave_count ave_log_count
## <chr> <dbl> <dbl>
## 1 ENSG00000000003 742. 6.55
## 2 ENSG00000000419 535. 6.26
## 3 ENSG00000000457 242 5.48
## 4 ENSG00000000460 58.4 4.05
## 5 ENSG00000000938 0.375 0.224
Now tidy all the data.
counts_tbl <- gather(counts, "Gene", "Count", -Run)
and join to the pdata
data_tbl <- left_join(pdata, counts_tbl)
## Joining, by = "Run"
data_tbl
## # A tibble: 267,752 x 5
## Run cell dex Gene Count
## <chr> <chr> <chr> <chr> <dbl>
## 1 SRR1039508 N61311 untrt ENSG00000000003 679
## 2 SRR1039508 N61311 untrt ENSG00000000419 467
## 3 SRR1039508 N61311 untrt ENSG00000000457 260
## 4 SRR1039508 N61311 untrt ENSG00000000460 60
## 5 SRR1039508 N61311 untrt ENSG00000000938 0
## 6 SRR1039508 N61311 untrt ENSG00000000971 3251
## 7 SRR1039508 N61311 untrt ENSG00000001036 1433
## 8 SRR1039508 N61311 untrt ENSG00000001084 519
## 9 SRR1039508 N61311 untrt ENSG00000001167 394
## 10 SRR1039508 N61311 untrt ENSG00000001460 172
## # … with 267,742 more rows
Summarize library size (what are the maximum and minimum library sizes?)
data_tbl %>%
group_by(Run) %>%
summarize(lib_size = sum(Count))
## # A tibble: 8 x 2
## Run lib_size
## <chr> <dbl>
## 1 SRR1039508 20637971
## 2 SRR1039509 18809481
## 3 SRR1039512 25348649
## 4 SRR1039513 15163415
## 5 SRR1039516 24448408
## 6 SRR1039517 30818215
## 7 SRR1039520 19126151
## 8 SRR1039521 21164133
and average ‘expression’ of each gene.
gene_summaries <-
data_tbl %>%
group_by(Gene) %>%
summarize(
ave_count = mean(Count),
ave_log_count = mean(log(1 + Count))
)
gene_summaries
## # A tibble: 33,469 x 3
## Gene ave_count ave_log_count
## <chr> <dbl> <dbl>
## 1 ENSG00000000003 742. 6.55
## 2 ENSG00000000419 535. 6.26
## 3 ENSG00000000457 242 5.48
## 4 ENSG00000000460 58.4 4.05
## 5 ENSG00000000938 0.375 0.224
## 6 ENSG00000000971 6035. 8.63
## 7 ENSG00000001036 1305 7.15
## 8 ENSG00000001084 615. 6.40
## 9 ENSG00000001167 392. 5.89
## 10 ENSG00000001460 188. 5.21
## # … with 33,459 more rows
And visualize using ggplot2
library(ggplot2)
ggplot(gene_summaries, aes(ave_log_count)) +
geom_density()
sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Linux Mint 18.3
##
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] tidyr_0.8.3 ggplot2_3.2.0 dplyr_0.8.3 tibble_2.1.3
## [5] readr_1.3.1 BiocStyle_2.12.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.1 pillar_1.4.2 compiler_3.6.1
## [4] BiocManager_1.30.4 tools_3.6.1 zeallot_0.1.0
## [7] digest_0.6.20 evaluate_0.14 gtable_0.3.0
## [10] pkgconfig_2.0.2 rlang_0.4.0 cli_1.1.0
## [13] yaml_2.2.0 xfun_0.8 withr_2.1.2
## [16] stringr_1.4.0 knitr_1.23 vctrs_0.2.0
## [19] hms_0.5.0 grid_3.6.1 tidyselect_0.2.5
## [22] glue_1.3.1 R6_2.4.0 fansi_0.4.0
## [25] rmarkdown_1.14 bookdown_0.12 purrr_0.3.2
## [28] magrittr_1.5 backports_1.1.4 scales_1.0.0
## [31] codetools_0.2-16 htmltools_0.3.6 assertthat_0.2.1
## [34] colorspace_1.4-1 labeling_0.3 utf8_1.1.4
## [37] stringi_1.4.3 lazyeval_0.2.2 munsell_0.5.0
## [40] crayon_1.3.4
Research reported in this tutorial was supported by the National Human Genome Research Institute and the National Cancer Institute of the National Institutes of Health under award numbers U41HG004059 and U24CA180996.
This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement number 633974)