R Bootcamp - Day 1

R & RStudio overview

Matt Taliaferro

RNA Bioscience Initiative | CU Anschutz

2024-06-24

Course overview

People

Instructors (me, Neel Mukherjee, Kent Riemondy)
TAs (Kathryn Walters and Brandon Buck)

Syllabus

Read the syllabus.
Your grades are based on attendance / participation, problem sets, and a final project. Your lowest problem set grade will be dropped.
If you are sick, let me and Neel know, and stay home. We will record all classes and make them available on Panopto.

Getting help

All course details are on the website.
We use Canvas for problem set submission & grading.
If you get stuck during class: use the #class channel in slack. TAs will come over.
If you need help outside of class (in order):
1. Ask a question on slack.
2. Use calendly to schedule time with the TAs.
3. Use calendly to schedule time with the instructors.
4. E-mail ?var:instructor.main.

Structure of a class

Prior to each block (and sometimes prior to a class), check and complete material in the “Prepare” column on the class schedule.
On the day of class and before class starts, start the day’s “assignment” in Posit Cloud. This will contain blank exercises that you’ll fill in during class.
You’ll also have access to the slides, but it’s probably better for the first few classes to just have the exercises open.

Problem sets

You’ll have a problem set assigned at the end of each class. Our expectation is that you spend a 30-90 minutes on each problem set.
Problem sets will get progressively more difficult.
You can work in groups for problem sets (see the Syllabus), but during the Bootcamp you should avoid it.
If you feel like you’re stuck on something silly, reach out through slack or office hours.
We’ll talk about the problem set at the end of each class. You are welcome to use the remaining class time to start and possibly finish the problem set.

Learning Objectives for the R Bootcamp

Learn the fundamentals of R programming (class 1)
Become familiar with “tidyverse” suite of packages
- tidyr: “Tidy” a messy dataset (class 2)
- dplyr: Transform data to derive new information (classes 3 and 6)
- ggplot2: Visualize and communicate results (classes 4 and 5)
- Putting all of these to use with real data sets (classes 7 and 8)
Practice reproducible analysis using Quarto/Rmarkdown (Rigor & Reproducibility)

Today’s class outline

Review R basics
- R vs Rstudio (Exercises #1-2)
- Functions & Arguments (Exercises #3-4)
- Data types (Exercise #5)
- Data structures (Exercises #6-7)
- R Packages (Exercise #8)
Review Quarto/Rmarkdown (Exercise #9)

RStudio - Exercise 1

We are using RStudio through Posit Cloud for the class.
Look at RStudio panels one at a time.
Environment, History, Console, Files, Plots, Packages, Help, etc.

See menu:

Help > Cheat Sheets > RStudio IDE Cheat Sheet

R as a calculator - Exercise 2

R can function like an advanced calculator

Try simple math.

# This is a comment line
# Note the order of operations (PEMDAS).
2 + 3 * 5

[1] 17

# value of 3-7
3 - 7

[1] -4

# division
3 / 2

[1] 1.5

# 5 raised to the power of 2
5^2

[1] 25

Assign a numeric value to an object.

<- and = are assignment operators.
By convention, R programmers use <-.
x <- 1 reads “set the value of x to 1”.

# create `num` object
num <- 5^2
num

[1] 25

= and == are two different operators.

a = is used for assignment (e.g., x = 1)
a == tests for equivalence (e.g. x == 1 says “does x equal 1?”)

x <- 1
x == 1

[1] TRUE

x == 10

[1] FALSE

# `x` NOT equals 5?
x != 5

[1] TRUE

Functions and arguments - Exercise 3

Functions are fundamental building blocks of R
Most functions take one or more arguments and transform an input object in a specific way.
Use tab-completion to find functions!

?log
log(4)

[1] 1.386294

log(4, base = 2)

[1] 2

Writing a simple function - Exercise 4

addtwo <- function(x) {
  num <- x + 2
  return(num)
}

addtwo(4)

[1] 6

f <- function(x, y) {
  z <- 3 * x + 4 * y
  return(z)
}

f(2, 3)

[1] 18

Data types - Exercise 5

There are many data types in R.
We’ll mainly use numeric, character, and logical.

class(4)

[1] "numeric"

class("jay")

[1] "character"

class(TRUE)

[1] "logical"

# coerce one type to another
class(as.character(TRUE))

[1] "character"

Vectors - Exercise 6

Vectors are a core R data structure.

A vector is an ordered collection of elements of the same type (e.g. numeric, character, or logical).
Later you will see that every column of a data.table / tibble is a vector.
Operations on vectors propagate to all the elements of the vectors.

Let’s create some vectors.

The c function combines values together (e.g., c(1,2,3))

x <- c(1, 3, 2, 10, 5)
x

[1]  1  3  2 10  5

class(x)

[1] "numeric"

y <- 1:5

y + 2

[1] 3 4 5 6 7

2 * y

[1]  2  4  6  8 10

y^2

[1]  1  4  9 16 25

# `y` has not changed!
y

[1] 1 2 3 4 5

# this will update the value of `y`
y <- y * 2
y

[1]  2  4  6  8 10

Data frames

A data.frame is a rectangle, where each column is a vector, and each row is a slice across vectors.
data.frame columns are vectors, and can have different types (numeric, character, factor, etc.).
A data.frame is constructed with data.frame().

class(iris)

[1] "data.frame"

iris

data.frame(x = c(1, 2, 3), y = c(2, 4, 6))

Tibbles

A tibble is a modern reimagining of the data.frame, keeping what time has proven to be effective, and throwing out what is not.
Tibbles are data.frames that are lazy and surly: they do less (i.e. they don’t change variable names or types, and don’t do partial matching) and complain more (e.g. when a variable does not exist).
This forces you to confront problems earlier, typically leading to cleaner, more expressive code. Tibbles also have an enhanced print() method which makes them easier to use with large datasets containing complex objects.

tibble() does much less than data.frame():
- it never changes the type of the inputs
- it never changes the names of variables
- it never creates row.names()

Data frames & tibbles - Exercise 7

Create a data.frame and tibble.

chrom <- c("chr1", "chr1", "chr2")
start <- c(200, 4000, 100)
end <- c(250, 410, 200)
strand <- c("-", "-", "+")

df <- data.frame(chrom, start, end, strand)

tbl <- tibble(chrom, start, end, strand)

Now echo the contents of df and tbl to the console and inspect

R packages - Exercise 8

An R package is a collection of code, data, documentation, and tests that is easily shareable.
A package often has a collection of custom functions that enable you to carry out a workflow. eg. DESeq for RNA-seq analysis.
The most popular places to get R packages from are CRAN, Bioconductor, and Github.
Once a package is installed, one still has to “load” them into the environment using a library(<package>) call.

Let’s do the following to explore R packages:

Look at the “Environment” panel in Rstudio
Explore Global Environment
Explore the contents of a package

Quarto Exercise - Exercise 9

Quarto is a fully reproducible authoring framework to create, collaborate, and communicate your work.
Quarto lets you render Rmarkdown documents (in addition to Jupyter notebooks, etc.)
Quarto supports a number of output formats including pdfs, word documents, slide shows, html, etc.

A Quarto document is a plain text file with the extension .qmd and contains the following basic components:
- A YAML header surrounded by ---.
- Chunks of R code surrounded by ```.
- Plain text structured with markdown formatting like # heading and *italics*.

Let’s do the following to explore Quarto documents:

Create a new Quarto document
Render the document to see the output

Problem sets and submission

Your first problem set is in problem-sets/class-01.qmd