R Intro Part

Data structures and functions

Ilan Man
Business Intelligence @ Squarespace

Agenda

  1. R intro
  2. Data structures
  3. Control structures
  4. Functions
  5. Commonly used built-in functions
  6. String manipulation
  7. Miscellaneous Tips and tricks

R Intro

Background

  • Derivative of S language, developed at Bell Laboratories by John Chambers
  • R was created by two statisticians at the University of Auckland, New Zealand
  • R is written in C, Fortran and R
  • Open source (Revolution Analytics offers commercial software)
  • Originally command line, but graphical interfaces (including RStudio and Rattle) becoming new norm
  • Very popular, especially among academics and statisticians
  • Interpreted language - easier to write code, but slower computations
  • Packages available to speed up R code - Rcpp, ff, snow, parallel
  • R holds all data in RAM. Problematic for large data sets
  • R is excellent for prototyping
  • ?help -> use this to get help. ? is easily the most useful function in R.

R Intro

Background

  • Installing packages
install.packages('ggplot2')    ## do this once only

require('ggplot2')             ## do this every time you load up an R session

library()                      ## shows you every package in your standard package location

R Intro

Styling

  • CRAN and Google style guide
  • R Coding convention is another resource

  • Use <- NOT = for assignment

  • Spaces between operators like +, %*%, <, > and after closing brackets ), }

  • Don't write functions named rep(), sample(), plot() or any other built-in R names

  • c should not be used for any variable names

  • i and j should only be used in loops, conditionals, etc...

  • Use camel case for functions: myFirstFunction() is better than my.first.function()

  • Use 'hello' or "hello" for strings, but be consistent.

Data structures

  1. Vectors
  2. Matrices
  3. Lists
  4. Data.frames
  5. Factors

Data structures

Vectors

  • Fundamental R data type: Everything is a vector in R (including scalars)
  • Vector elements must be of the same type, or mode in R. Known as atomic.
  • Common ways to initialize a vector
x <- c(1,2,3,4,5,6,7,8,9,10) ## vector from 1 to 10 - class numeric
x <- 1:10                    ## alternative - class integer
x <- seq(from=1,to=10,by=1)  ## alternative - class numeric

n <- 10
x <- numeric(n)             ## preallocate a vector of length n
for (i in 1:n) x[i] <- i    ## works, but as n gets large a loop is very slow (compared to the alternatives)

x <- numeric(0)
for (i in 1:n) x <- c(x,i)  ## avoid: growing the vector copies it on every iteration
                            ## to the extent possible, provide the size of your object when
                            ## initializing it
  • Six atomic vector types:
      logical, character, integer, double, complex, raw
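A minimal sketch of how the atomic types behave, using illustrative values: typeof() reports the type of a vector, and mixing types inside c() coerces every element to the most general type present.

```r
# typeof() reports the atomic type of a vector
typeof(1:3)              # "integer"
typeof(c(1, 2, 3))       # "double"
typeof(c(TRUE, FALSE))   # "logical"

# mixing types inside c() coerces every element to the most general type
mixed <- c(1, "a", TRUE)
typeof(mixed)            # "character"
mixed                    # "1" "a" "TRUE"
```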

Data structures

Vectors

  • Vectors obviate need for loops (most of the time!)
x <- seq(from = 1, to = 10, by = 1)
y <- 0
for (i in c(1:length(x))) y[i] <- x[i] * 5
print(y)
##  [1]  5 10 15 20 25 30 35 40 45 50
## alternatively....
y <- x * 5
print(y)
##  [1]  5 10 15 20 25 30 35 40 45 50

Data structures

Vector Indexing

  • Important, but only if you like using vectors. And R.
  • Indexing begins at 1, not 0.
  • Can index a vector by name, if elements are named.
x <- 1:10
x[ c( 1:5 , 8:10 ) ]         
[1] 1  2  3  4  5  8  9 10

x[ c(TRUE , FALSE) ]         ## recycling - common R feature. R will not give you a warning!
[1] 1 3 5 7 9                ## very useful, but make sure you are comparing vectors of same length

x > 5                        ## Boolean vector. mode = "logical"
[1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

any(x > 5)              
[1] TRUE
all(x < 8)              
[1] FALSE
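Indexing by name is mentioned above but not shown. A small sketch, where the vector scores and its names are made up for illustration:

```r
# a named numeric vector (names and values are illustrative)
scores <- c(alice = 90, bob = 72, carol = 85)

scores["bob"]                       # the element named "bob"
scores[c("carol", "alice")]         # several names, returned in the order requested
names(scores)[scores > 80]          # "alice" "carol"

by_name <- unname(scores["bob"])    # drop the name, keep the value: 72
```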

Data structures

Vectorized Operations

  • Easiest way to achieve speed in R - apply a function to a vector
f <- function(a, b) return(a^b)
f(x, 2)
##  [1]   1   4   9  16  25  36  49  64  81 100
  • Even operators such as +, -, * are functions
"*"(x,5)              ## returns 5 * x[1], 5 * x[2], ...
'['(x, x > 5 )        ## returns vector of values where x[1] > 5, x[2] > 5, ..., x[10] > 5 is TRUE

ifelse(x < 5, x^2, 0) ## if (condition) { do something } else { do something else }
[1]  1  4  9 16  0  0  0  0  0  0

Data structures

Vectorized Operations

  • When coming from a different language, probably best NOT to translate code verbatim
  • Loops are your friend in C. In R, loops are like a bad friend - time-consuming at best.
  • Under the hood, a vectorized operation is running a loop - in C. Much faster than in R.
  • Vectorization also provides clarity (but don't get carried away one-lining everything)
logsum <- 0
x <- seq(100,1000000,by=10)
for (i in 1:length(x)){
  logsum <- logsum + log(x[i])
}
logsum
[1] 1281524                 ## this calculation takes about 0.17 seconds

# R translation
logsum <- sum(log(x))       ## this calculation takes about 0.002 seconds.
[1] 1281524
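The timings quoted above depend on the machine, so it may be worth measuring them yourself. A sketch using system.time(), with hypothetical variable names logsum_loop and logsum_vec; no timings are shown because they will differ from run to run.

```r
x <- seq(100, 1000000, by = 10)

# loop version
loop_time <- system.time({
  logsum_loop <- 0
  for (i in seq_along(x)) logsum_loop <- logsum_loop + log(x[i])
})

# vectorized version
vec_time <- system.time(logsum_vec <- sum(log(x)))

all.equal(logsum_loop, logsum_vec)   # both approaches give the same answer
loop_time["elapsed"]                 # compare with vec_time["elapsed"]
```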

Data structures

Vectorized Operations

  • Be careful when thinking you are vectorizing
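  • Many R functions take a single vector as their first argument; extra arguments mean something else
  • sum, max, min, ... are exceptions: they accept any number of values
mean(1,3,2)
[1] 1                 ## huh?? only the first argument is averaged; 3 and 2 are read as other arguments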

mean(c(1,3,2))
[1] 2                 ## that's better

max(1,3,2)
[1] 3
  • Vectorization might not work when the current iteration depends on the previous (think $ \sum \sum $)
  • Try to put code outside of loops when possible
  • Use built-in functions such as rowSums(x) instead of apply(x,1,sum)...more on this later!
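Two of the points above can be sketched with small, made-up inputs: a loop where each step depends on the previous one often has a built-in equivalent (here cumsum()), and rowSums() gives the same answer as apply() with sum.

```r
x <- c(3, 1, 4, 1, 5)

# each iteration depends on the previous one...
running <- numeric(length(x))          # preallocate
running[1] <- x[1]
for (i in 2:length(x)) running[i] <- running[i - 1] + x[i]
running                                # 3 4 8 9 14

# ...but a built-in already covers this case
cumsum(x)                              # 3 4 8 9 14

m <- matrix(1:6, nrow = 2)
rowSums(m)                             # 9 12
apply(m, 1, sum)                       # 9 12
```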

Data structures

Vectors: NA and NULL

  • NA appears often in messy data, especially when a value doesn't exist
  • R will attempt to calculate NA, and therefore return NA
  • If R sees NULL, it skips it. NULL is nonexistent. Yet it exists as a NULL. ?philosophy.
x <- c(5, 10, NA, 20, 25)
mean(x)                
[1] NA

is.na(x)                ## commonly used when cleaning data sets
[1] FALSE FALSE  TRUE FALSE FALSE
mean(x,na.rm=TRUE)      ## 15

x <- c(5,10,NULL,20,25)
mean(x)                 ## 15

length(NA)              ## NA is a logical constant of length 1
[1] 1
length(NULL)            ## NULL does not take any value. By definition, it's undefined
[1] 0                   ## ?philosophy
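A few more checks that tend to come up when cleaning data, as a sketch on the same kind of vector: counting and dropping NA values, and seeing that comparisons with NA are themselves NA.

```r
x <- c(5, 10, NA, 20, 25)

sum(is.na(x))            # number of missing values: 1
clean <- x[!is.na(x)]    # drop them: 5 10 20 25

NA > 1                   # NA: the comparison is unknown
NA == NA                 # NA, so use is.na() rather than == NA

is.null(NULL)            # TRUE
length(c(5, NULL, 10))   # 2: NULL simply disappears
```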

Data structures

Vector Filtering

  • Extremely useful for quick data analysis. Similar to indexing.
x <- 1:10
x[ x > 5 ]          ## What's happening here?
  • x > 5 is a function call to ">"(a,b) which returns TRUE or FALSE on every element of vector x.
  • Output of x > 5 is logical vector. And when used as an index on x...
x[c(FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE)]
## [1]  6  7  8  9 10

...returns the elements of x at the positions where the index is TRUE

Data structures

Vector Filtering


Common filtering functions include:

subset(x, x > 5)      ## [1]  6  7  8  9 10
which(x > 5)          ## [1]  6  7  8  9 10   (positions of the matches; here they equal the values)
4 %in% x              ## [1]  TRUE
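With x <- 1:10 the positions and the values coincide, which hides the difference between these functions. A sketch with a made-up vector that also contains an NA:

```r
x <- c(12, 3, NA, 20)

which(x > 5)        # positions of the TRUE values: 1 4
x[x > 5]            # logical indexing keeps the NA: 12 NA 20
subset(x, x > 5)    # subset() drops the NA: 12 20
x[which(x > 5)]     # which() also drops it: 12 20
```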

Data structures

Vectors: Summary

  • Everything in R is a vector
  • All elements are of one type, atomic
  • Vectorize whenever possible
  • Filtering and indexing are important concepts
  • Recycling - useful but note that R will not give you an error message
  • seq(), rep(), sample(), runif()
  • any(), all(), which(), subset(), %in%

Data structures

Exercise!


Compute the following:

a) \(\large \sum_{i=1}^{500} \ln{(i^{2})} + \frac{2}{i}\)

b) \(\large \frac{1}{n}\sum_{i=1}^{n} (\bar{X} - X_{i})^{2}\), where X ~ Normal(5,100) and n = 1000

Hint:

?rnorm

c) \(\large \frac{1}{n}\sum_{i=1}^{n} (\bar{X} - X_{i})(\bar{Y} - Y_{i})\), where X ~ Poisson with lambda of 2, Y ~ Exponential with a rate of 1, and n = 1000

Data structures

Matrices

  • Like vectors with two additional attributes: rows and columns
  • Column-major order: insert values in first column, going down, then continuing to second column, going down, and so on
x <- matrix(seq(1, 6, by = 1), nrow = 3, ncol = 2)  ## 3 by 2 matrix
print(x)
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
x <- matrix( seq(1,6,by=1), nrow=3)                ## same as above
x <- matrix( seq(1,6,by=1), nrow=3, byrow=TRUE)    ## row-major order    
x <- matrix( seq(1,6,by=1), nrow=4)                ## is this ok?
x <- matrix( seq(1,6,by=1), nrow=3, ncol=3)        ## is this ok?

Data structures

Matrix operations

x <- matrix(seq(1,9),nrow=3,ncol=3)
x + 5
x * 2
t(x)                        ## transpose
x %*% x                     ## matrix product
crossprod(x,x)              ## t(x) %*% x
x * x                       ## element-wise product
diag(x)                     ## diagonal components (diag(3) would give the 3 x 3 identity matrix)
det(x)                      ## determinant
eigen(x)                    ## list of eigenvalues and eigenvectors
  • Remember your linear algebra!
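The difference between the element-wise and matrix products is easiest to see on a small case. A sketch with a 2 x 2 matrix chosen for illustration:

```r
x <- matrix(1:4, nrow = 2)   # columns are (1, 2) and (3, 4)

elementwise <- x * x         # rows: 1 9 / 4 16
matprod     <- x %*% x       # rows: 7 15 / 10 22

crossprod(x)                 # same as t(x) %*% x
diag(x)                      # diagonal of x: 1 4
diag(2)                      # 2 x 2 identity matrix
```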

Data structures

Matrix indexing and filtering

x[2,1]            ## second row, first column

x[,1]             ## all rows, first column. Vector form, not matrix.

x[,]              ## all rows, all columns. Same as print(x), or just x. 

x[-1,]            ## remove first row. Negative indexing.

Data structures

Matrix class

x <- matrix( c(1:9), nrow=3, ncol=3)
class(x)                      ## matrix (also "array" in R 4.0 and later)

y <- x[1,]                    ## 3 element vector
class(y)                      ## integer
attributes(y)                 ## returns NULL

y <- x[1,, drop=FALSE]     
class(y)                      ## matrix
attributes(y)                 ## $dim is 1 3, i.e. still a 1 by 3 matrix

colnames(x) <- c( 'first col' , 'second col' , 'third col' )
rownames(x) <- c( 'row 1' , 'row 2' , 'row 3' )
  • Higher-dimensional structures are also possible: arrays

Data structures

Exercise!

a) 
x <- matrix(rep(c(1,3,-1,2),5),ncol=4)
(i)  What is returned by the following? Do it by hand before typing it in.
     mean(x[ x[1,] > 1,  c(1:2) ])
(ii) Find the column in x which has the largest total.

b) 
y <- matrix(c(c(1,2,4,8),c(2,3,-1,-7),c(0,5,12,-4),c(3,4,5,0)),ncol=4)
(i)  Calculate the trace of y.
(ii) Replace each element of the 3rd column with the median of the elements of the first, second and 
fourth columns for the same row. 

Data structures

Lists

  • Combine objects of different types. Can have different modes.
  • Forms basis for data.frames
  • Vectors, matrices cannot be broken down into smaller components, hence atomic.
  • Lists can be broken down - known as recursive vectors.
x <- list(title = "R presentation", date = format(as.POSIXlt(Sys.time(), "EDT"), 
    "%m %d %Y"), num_attendees = 10)
print(x)
## $title
## [1] "R presentation"
## 
## $date
## [1] "12 06 2013"
## 
## $num_attendees
## [1] 10

Data structures

Lists

  • Accessing list components
## one bracket - [ - returns a list type
x[1]
## $title
## [1] "R presentation"
## two brackets  -  [[  -  returns the actual element, in this case a character
x[[1]]
x$title
x[['title']]

[1] "R presentation"
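The one-bracket versus two-bracket distinction can be checked with class(). A sketch on a made-up list named talk; the room element is hypothetical:

```r
talk <- list(title = "R presentation", num_attendees = 10)

class(talk[1])                  # "list": still a list, of length 1
class(talk[[1]])                # "character": the element itself

talk[["num_attendees"]] + 5     # 15: arithmetic works on the element
talk$room <- "4B"               # add a new (hypothetical) component
names(talk)                     # "title" "num_attendees" "room"
```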

Data structures

Lists

  • Accessing list components and values
names(x)      
[1] "title"         "date"          "num_attendees"

unlist(x)      ## flattens the list into a character vector

pres_1 <- format(as.POSIXlt(Sys.Date(),"EDT"),"%m %d %Y") 
pres_2 <- format(as.POSIXlt(Sys.Date()+30,"EDT"),"%m %d %Y") 

x <- list(title='1st R presentation', date=pres_1, num_attendees=10)
y <- list(title='2nd R presentation', date=pres_2, num_attendees=20)

z <- list(x,y)       ## list of lists
## z[[1]] is equivalent to x; z[[1]][1] is equivalent to x[1]

Data structures

Lists

  • Most statistical operations in R return a list
  • Knowing how to manipulate lists is important
n <- 100
x <- rnorm(n, mean = 0, sd = 1)          ## sample of 100 random standard normal variables
y <- 1 - 2 * x + rnorm(n)
f <- y ~ x                               ## y ~ x is a formula object
r <- lm(f)                               ## r is linear model object, i.e. linear regression

## the function str() - "structure" - is VERY useful in exploratory data analysis
## structure of r is a bunch of lists
str(r)

r$coefficients                           ## r$coeff also works, through partial matching
r$residuals
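Since the fitted model is a list, the usual list tools apply to it. A sketch that repeats the simulation with a fixed seed (the seed value is arbitrary) and pulls components out by name and through the accessor functions:

```r
set.seed(42)                      # arbitrary seed, only for reproducibility
n <- 100
x <- rnorm(n)
y <- 1 - 2 * x + rnorm(n)
r <- lm(y ~ x)

is.list(r)                        # TRUE
names(r)                          # component names, e.g. "coefficients", "residuals"
coef(r)                           # accessor for r$coefficients
head(residuals(r), 3)             # accessor for r$residuals
```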

Data structures

Data Frames

  • The most useful object in R for data analysis
  • A list of columns (vectors) of equal length, laid out like a matrix
  • Many R functions and packages assume input is in the form of a data.frame
  • Every CSV or Text file you read in is a data.frame, i.e. most real data comes in the form of a data.frame

  • Creating Data Frames

z <- data.frame()  ## data frame with 0 columns and 0 rows
y <- data.frame(col1 = c(1, 2), col2 = c("a", "b"), row.names = c("row1", "row2"))
print(y)
##      col1 col2
## row1    1    a
## row2    2    b

Data structures

Data Frames

x <- data.frame(matrix( sample(c(50:100), size=12, replace=TRUE), nrow=6, ncol=2))

## return first column
x[,1]                 ## type is vector
x$X1                  ## type is vector
x[1]                  ## type data.frame. 
x['X1']               ## type data.frame
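The same access patterns on a small data.frame with known values, as a sketch; the column names exam1 and exam2 are made up. Because a data.frame is a list of columns, list functions such as length() count columns.

```r
df <- data.frame(exam1 = c(80, 90, 70), exam2 = c(60, 100, 75))

is.list(df)              # TRUE
length(df)               # 2: the number of columns

class(df[, 1])           # "numeric": a plain vector
class(df[["exam1"]])     # "numeric"
class(df[1])             # "data.frame"

high <- df[df$exam1 > 75, ]   # filter rows with a logical vector
nrow(high)                    # 2
```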

Data structures

Data Frames

  • Helpful data.frame functions
x <- x[-6,]                   ## remove rows or columns with a "-" sign. Like negative indexing.
y <- data.frame(names = c("dave","jenny","scott","mary","harry") )
z <- cbind(y, x)               ## column bind. Can be used on matrices too.
                               ## if you cbind two vectors you get a matrix, NOT data.frame

## alternatively you can create columns implicitly
x$names <- c("dave" ,"jenny" ,"scott" ,"mary" ,"harry")

w <- data.frame(names="megan", X1=82, X2=85)
z <- rbind(z, w)               ## row bind

## make sure number of elements in row, column are consistent

Data structures

Data Frames

## explicitly set columns names for z. Use rownames() for row names. Shocker.
names(z) <- c("names", "Exam 1","Exam 2")

## get dimensions
dim(z)                        
[1] 6 3

head(z)           ## default to first 6 rows
tail(z)           ## default to last 6 rows
  • While very useful, data.frames are more memory intensive than matrices
  • When initializing, if possible, preallocate data.frame, i.e. set size of data.frame before using it
  • Whenever possible, use matrices

Data structures

Factors

  • Comes from the notion of categorical variables in statistics
  • Can be thought of as a vector with additional information - categories, or levels
  • Used to split up data sets; commonly seen as columns of data.frames
x <- factor(c("finance", "tech", "tech", "auto", "finance", "energy", "tech"))
print(x)
## [1] finance tech    tech    auto    finance energy  tech   
## Levels: auto energy finance tech
y <- factor(x, levels = c(levels(x), "tv"))  ## include new level, even though no tv data exists
print(y)
## [1] finance tech    tech    auto    finance energy  tech   
## Levels: auto energy finance tech tv

Data structures

Factors

  • Use the levels argument to order your levels. Helpful when sorting factors
wday <- c("mon", "tues", "mon", "wed", "fri", "wed")
wdayf <- factor(wday)
sort(wdayf)  ## did this do what we expected?
## [1] fri  mon  mon  tues wed  wed 
## Levels: fri mon tues wed
wdayf <- factor(wday, levels = c("mon", "tues", "wed", "thurs", "fri"))  ## let's add Thursday as well
sort(wdayf)
## [1] mon  mon  tues wed  wed  fri 
## Levels: mon tues wed thurs fri
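A short sketch of what the extra level does in practice: table() reports a zero count for it, the underlying integer codes follow the level order, and droplevels() removes levels with no data.

```r
wday  <- c("mon", "tues", "mon", "wed", "fri", "wed")
wdayf <- factor(wday, levels = c("mon", "tues", "wed", "thurs", "fri"))

table(wdayf)                 # counts per level: 2 1 2 0 1 (thurs is 0)
as.integer(wdayf)            # codes follow level order: 1 2 1 3 5 3
levels(droplevels(wdayf))    # "mon" "tues" "wed" "fri"
```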

Data structures

Factors

  • Common factor functions
z$names2 <- NULL                            ## NULL removes the column from the data.frame (or the element from a list)
z$gender <- c("m","f","m","f","m","f")
z$party <- c("D","D","R","R","D","D")
tbl <- table(z$gender,z$party)              ## contingency table. class "table"
addmargins(tbl)           ## marginal sums

##      D R Sum
##  f   2 1   3
##  m   2 1   3
##  Sum 4 2   6

Data structures

Factors

  • Converting between factors and other types
x <- seq(5,20,by=5)
f <- factor(x)

print(f)
[1] 5  10 15 20
Levels: 5 10 15 20

as.numeric(f)                    ## huh??
[1] 1 2 3 4

as.numeric(as.character(f))
[1]  5 10 15 20                  ## much better

as.numeric(levels(f))[f]         ## more efficient: converts each level once, then indexes by f
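In the example above every value is unique, so the levels happen to line up with the data. A sketch with a repeated value (made up for illustration) shows why the levels need to be indexed by the factor:

```r
f <- factor(c(10, 5, 10, 20))

levels(f)                        # "5" "10" "20"
as.numeric(f)                    # integer codes, not the data: 2 1 2 3
as.numeric(levels(f))            # only the three levels: 5 10 20
as.numeric(levels(f))[f]         # the original values: 10 5 10 20
as.numeric(as.character(f))      # same result
```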

Data structures

Summary

  • Vectors - lifeblood of R
  • Matrices - great for linear algebra and stats functions
  • Lists - store and access elements of complex objects
  • Data.frames - data analysis object of choice
  • Factors - good for statistics and categorization of data into groups

Control Structures

  1. for()
  2. while()
  3. repeat()
  4. try()
  5. if()

Control Structures

for()

x <- seq(0, 20, by=1)           ## default increment is 1 

for (i in c(1:length(x))){
  x[i] <- x[i] * 2
}

## can be written on one line - but careful to not make it too messy
for (i in c(1:length(x))) x[i] <- x[i] * 2

while()

i <- 1
while (i <= 21) {
  x[i] <- x[i] * 2
  i <- i + 1
}

Control Structures

repeat()

x <- seq(0,20,by=1) 

i <- 1
repeat {
  x[i] <- x[i] * 2
  i <- i + 1
  if (i > 21) break
}

try()

try("hello" + 1, silent = FALSE)
try("hello" + 1, silent = TRUE)
tryCatch("hello" + 1, error = function(e) print("don't be ridiculous"))
## [1] "don't be ridiculous"
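A common use of tryCatch() is to return a fallback value instead of stopping. A minimal sketch, where safe_log() is a hypothetical helper name and returning NA is just one possible choice:

```r
# hypothetical helper: return NA instead of failing or warning
safe_log <- function(x) {
  tryCatch(
    log(x),
    warning = function(w) NA_real_,   # e.g. log(-1) warns "NaNs produced"
    error   = function(e) NA_real_    # e.g. log("a") is an error
  )
}

safe_log(exp(2))   # 2
safe_log(-1)       # NA
safe_log("a")      # NA
```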

Control Structures

if()

if (a == b) {
  # do something
} else {                  ## the else statement MUST be on the same line as the 
  # do something else     ## closing bracket of the if()
}

if (a == b) {
  # do something
} 
else {                    ## WRONG. returns lots of headaches. 
  # do something else     
}

if (a == b) do something  ## one-liners

ifelse (a == b, x, y)     ## use ifelse() on vectors
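A runnable sketch of the two forms: if() tests a single condition, while ifelse() works element by element on a vector. The function name sign_label() and the labels are made up for illustration.

```r
# if / else if / else on a single value
sign_label <- function(v) {
  if (v > 0) {
    "positive"
  } else if (v < 0) {
    "negative"
  } else {
    "zero"
  }
}
sign_label(-2)                                          # "negative"

# ifelse() on a whole vector
labels <- ifelse(c(-2, 0, 5) > 0, "positive", "not positive")
labels                                                  # "not positive" "not positive" "positive"
```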

Control Structures

Exercise!

(a) Write a loop to scan through an integer vector and return the index of the 
largest value. The loop should terminate as soon as the index is found. Ignore ties.

(b) Redo the above using built-in R functions such as rank(), sort() and order().

Functions

  1. Functions
  2. Arguments
  3. Environment
  4. Pointers
  5. Generic

Functions

Functions

  • Write functions - it's good practice
  • Each function should perform a specified task - easily understood inputs and outputs
  • function() is a built-in R function whose job is to create functions...#mindblown
exponentiate <- function(x, y) {
    return(x^y)
}

exponentiate(2, 4)
## [1] 16
  • The right-hand side of the assignment to exponentiate has two parts: the parameters and the body

Functions

Arguments

formals(exponentiate)        # $x   $y   These are the arguments to exponentiate()
body(exponentiate)           # { return (x^y) }
exponentiate  # prints out the entire function - good if you forget what's in it!
## function(x, y) {
##     return(x^y)
## }
  • Try it out on any built-in R function to see its innards
  • Note that it won't show any R code for functions that are written in C (e.g. sum()), and generics such as mean() only show a call to UseMethod()

Functions

Arguments

  • Arguments can have default values.
f <- function(x, y=3) { ... }

## Some functions have tons of parameters. You don't need to enter them all.
f <- function(x, ...) {
  plot(x, ...)
}
  • Anonymous functions. Single use. No name. No feelings exchanged.
sapply (x, function(x) x*2)
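The placeholders above can be filled in as follows. This is a sketch with made-up function names (power() and avg()), showing default values, arguments matched by name, passing extra arguments through ..., and an anonymous function.

```r
power <- function(x, y = 2) x^y
power(3)                 # 9: y takes its default
power(3, 3)              # 27
power(y = 3, x = 2)      # 8: named arguments can come in any order

# ... passes any extra arguments on to mean()
avg <- function(x, ...) mean(x, ...)
avg(c(1, NA, 3))                  # NA
avg(c(1, NA, 3), na.rm = TRUE)    # 2

doubled <- sapply(1:3, function(v) v * 2)   # anonymous function: 2 4 6
```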

Functions

Environment and Scope

  • A function consists of its arguments, body and environment
d <- 8
f <- function(y){
  x <- 3 * y
  h <- function (){
    return(y*(x+d))
  }
  return(x+h())
}
f(2)

# d is global to f()
# x is local to f() and global to h()
# h cannot be called at the "top level" since it exists only in f()'s environment
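Tracing the example by hand: with y = 2, x is 6, h() returns 2 * (6 + 8) = 28, and f(2) returns 6 + 28 = 34. The sketch below repeats it and adds a hypothetical function g() to show that assigning to d inside a function does not touch the global d.

```r
d <- 8
f <- function(y) {
  x <- 3 * y
  h <- function() y * (x + d)   # h sees y and x from f(), and d from the global environment
  x + h()
}
f(2)                            # 34

g <- function() {
  d <- 100                      # creates a local d; the global d is untouched
  d
}
g()                             # 100
d                               # still 8
```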

Functions

Exercise!


Write a function that finds the maximum value in corresponding indices for two vectors. For example:

x <- c(1,2,3,4)
y <- c(0,3,5,4)
## output should be 
[1] 1 3 5 4

Functions

Environment and scope

ls()           # returns all the variables in the environment
               # good to know when you've created a ton and are starting to lose track
rm(x)          # rm(x) removes x...rm(list=ls()) is usually not a great idea!
  • You can even make custom operators!
"%powerUp%" <- function(a, b) return(a^b)

3 %powerUp% 2
## [1] 9

Functions

Pointers

  • R does not have pointer-like references the way Python does: functions work on copies, not in place
## in Python
>>> x = [5, 2, 8]
>>> x.sort()             ## sorts in place - this doesn't exist in R
>>> x
[2, 5, 8]

## in R
x <- c(5,2,8)
sort(x)
[1] 2 5 8
x
[1] 5 2 8
x <- sort(x)
x
[1] 2 5 8
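The same behaviour holds for your own functions. A sketch with a hypothetical add_one() helper: the change made inside the function is visible only in the returned value, so the result has to be assigned to keep it.

```r
add_one <- function(v) {
  v[1] <- v[1] + 1     # modifies the function's own copy
  v
}

x <- c(5, 2, 8)
y <- add_one(x)

x                      # unchanged: 5 2 8
y                      # 6 2 8
```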

Functions

Generic functions

  • R is a dialect of S. S3 and S4 are its object systems (S4 is the newer one)
  • S3 has generic functions, such as print(), plot(), summary(). Concept of OOP.
data(cars)              ## load built in dataset
fit <- lm(dist ~ speed, data=cars)
summary(fit)

## same function call, on a different object type
summary(c(1,2,3))

## lists out all the methods for the summary function
methods(summary)  
  • There is a summary.lm() function and a summary.default() function
  • Makes it harder to get at the values that are printed, but this is possible
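To see how dispatch works, you can write a method for a class of your own. A minimal sketch: the class name "account" and its fields are invented for illustration, and print.account() is the method that print() finds for objects of that class.

```r
# an object is just a list with a class attribute (class name is made up)
acct <- structure(list(owner = "dave", balance = 100), class = "account")

# method for the print() generic: named <generic>.<class>
print.account <- function(x, ...) {
  cat("Account:", x$owner, "balance:", x$balance, "\n")
  invisible(x)
}

print(acct)                  # dispatches to print.account()
class(acct)                  # "account"
inherits(acct, "account")    # TRUE
```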

Functions

Exercise!


a) Get the Adjusted R-squared from the regression of distance on speed in the cars dataset

b) Get the t-value of the X variable (i.e. speed)

c) Predict the braking distance if going 200 miles per hour

Commonly used built-in functions

  1. apply()
  2. lapply(), tapply(), sapply()
  3. mapply()
  4. by(), cut(), aggregate(), split()

Commonly used built-in functions

apply()

  • Apply a function to a row or column of a matrix
x <- matrix(sample(c(0:100),20,replace=TRUE),nrow=5,ncol=4)
apply(x,1,sum)                                   ## sum rows
apply(x,1,function(x) x^2)                       ## apply function to every element
                                                 ## What type of function is this?
  • Easy to read
  • Fewer lines of code
  • NOT faster than for loops. Loops are built into these functions.
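Because the example above uses random data, here is a sketch on a fixed matrix so the results can be checked by hand. Note that when the function returns a vector for each row, apply() stores each result as a column, so the output comes back transposed.

```r
m <- matrix(1:6, nrow = 2)             # rows are (1, 3, 5) and (2, 4, 6)

apply(m, 1, sum)                       # row sums: 9 12
apply(m, 2, sum)                       # column sums: 3 7 11

sq <- apply(m, 1, function(r) r^2)     # one column per original row
dim(sq)                                # 3 2
```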

Commonly used built-in functions

lapply(), tapply(), sapply()

  • like apply() for other data structures
lapply(list(z$'Exam 1',z$'Exam 2'),mean)    ## mean of Exam scores; returns a list
                                            ## like apply() but can be used on data.frames
sapply(list(z$'Exam 1',z$'Exam 2'),mean)    ## mean of Exam scores; returns a vector
sapply(list(z$'Exam 1',z$'Exam 2'), mean, simplify=FALSE)   ## same as lapply()
## find mean exam 1 scores, split by party
tapply(z$'Exam 1', z$party, FUN = mean, simplify=TRUE)      ## simplify determines output type

##     D      R 
## 82.25  84.50 
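The data frame z above holds random scores, so here is a sketch with small, made-up inputs where the return types and values are easy to verify.

```r
scores <- list(exam1 = c(80, 90, 70), exam2 = c(60, 90))   # illustrative data

as_list <- lapply(scores, mean)    # a list: exam1 = 80, exam2 = 75
as_vec  <- sapply(scores, mean)    # a named numeric vector: 80 75

# tapply(): mean score within each group
by_party <- tapply(c(80, 90, 70, 50), c("D", "R", "D", "R"), mean)
by_party                           # D = 75, R = 70
```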

Commonly used built-in functions

mapply()

  • Apply a function to corresponding elements of several vectors or lists
a<-c(1:5)
b<-c(6:10)
d<-c(11:15)

mapply(sum,a,b,d)
sum(a[1],b[1],d[1])
sum(a[2],b[2],d[2])

mapply(mean,a,b,d)            ## What's happening here?
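A sketch of mapply() with sum, compared against plain vector addition, plus a second call where the i-th element of one vector controls how the i-th element of another is used. The helper inside the second call is an anonymous function written for this example.

```r
a <- 1:5
b <- 6:10
d <- 11:15

sums <- mapply(sum, a, b, d)     # 18 21 24 27 30
a + b + d                        # same values, fully vectorized

# repeat each letter a different number of times
reps <- mapply(function(ch, times) paste(rep(ch, times), collapse = ""),
               c("a", "b", "c"), 3:1)
unname(reps)                     # "aaa" "bb" "c"
```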

Commonly used built-in functions

by(), cut(), aggregate(), split()

  • Like apply(), but used on data.frames
  • Split, apply, combine is an important concept in data analysis
  • Package plyr is very popular and useful, but important to learn Base R first
aggregate(z[,c(2:3)], by=list(z$party), mean)          ## mean Exam score by party
aggregate(z[,c(2:3)], by=list(z$party, z$gender), sum) ## sum by party and gender

## same as tapply() but for data.frames (instead of arrays)
## returns class "by"
by(z$'Exam 1',z$party,sum)

## convert numeric column of data.frame into factor
## great for binning data
z$age <- c(21,29,38,41,26,50)
cut(z$age,breaks=c(20,30,40,50))

split(z,f = z$gender)                       ## split a dataframe according to a factor
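Split, apply, combine on a small data.frame with made-up values, as a sketch. It uses the formula interface of aggregate(), the explicit split() plus sapply() route, and cut() with labels for binning.

```r
df <- data.frame(party = c("D", "D", "R", "R"),
                 score = c(80, 70, 90, 50),
                 age   = c(21, 35, 42, 29))       # illustrative data

# one step: mean score by party
agg <- aggregate(score ~ party, data = df, FUN = mean)
agg                                               # D 75, R 70

# the same thing spelled out: split, then apply, then combine
means <- sapply(split(df$score, df$party), mean)  # D = 75, R = 70

# bin ages into labelled intervals (20,30], (30,40], (40,50]
bins <- cut(df$age, breaks = c(20, 30, 40, 50), labels = c("20s", "30s", "40s"))
as.character(bins)                                # "20s" "30s" "40s" "20s"
```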

Commonly used built-in functions

Exercise!

a) Create a column for the average grade for each student. Label it.

  • D is between 50 and 59
  • C is between 60 and 69
  • B is between 70 and 79
  • A is between 80 and 100
b)

  • The function system.time() returns timings for R operations. Examine the help documentation for this function.
  • Compute the median standard deviation of every column of a 100 by 100 matrix. Initialize a 100 x 100 matrix using a Random Uniform Variable, between 20 and 50 in each cell. Compute the median standard deviation of each column using:
      • A for() loop
      • An apply() function
  • Which is the fastest?

    String Manipulation

    Basic string functions

    example <- c("THIS IS AN EXAMPLE","and so is this","this is not","hello world","extra")
    grep("an", example)                        ## return index of occurrence "an"
    [1] 2
    grep("an", example, ignore.case=TRUE)
    [1] 1 2
    grep("an", example, ignore.case=TRUE, value=TRUE)  
    [1] "THIS IS AN EXAMPLE" "and so is this" 
    
    nchar(example)
    [1] 18 14 11 11  5
    
    paste(example[1],example[2])
    [1] "THIS IS AN EXAMPLE and so is this"
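    files <- c("ex1","ex2")
    for (i in files){
       fname <- paste("Title", i, ".pdf", sep = "")   ## sep = "" avoids spaces: "Titleex1.pdf"
       print(fname)                                   ## pass fname to whatever writes the file
    }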
    

    String Manipulation

    Basic string functions

    # formatting strings
    sprintf("%f",exp(1))
    [1] "2.718282"
    sprintf("%0.2f",exp(1))
    [1] "2.72"
    sprintf("Today's date is %s",format(Sys.Date(),"%d %b %Y"))
    [1] "Today's date is 31 Oct 2013"
    
    example2 <- "Substring takes a subset of...the string!...It's nuts!"
    paste(substr(example2, 19, 21), substr(example2, 22, 24), sep = "")
    
    ## [1] "subset"
    

    String Manipulation

    Basic string functions

    sp <- strsplit(example2,split="of")
    sp
    [[1]]
    [1] "Substring takes a subset "   "...the string!...It's nuts!"
    
    length(sp)
    [1] 1
    
    length(unlist(sp))
    [1] 2
    
    regexpr("!",example2)     ## first occurrence of "!" in example2
    
    gregexpr("!",example2)    ## all occurrences of "!" in example2
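The return values of regexpr() and gregexpr() are match positions with extra attributes attached. A sketch on a short, made-up string where the positions are easy to count, with regmatches() to pull out the matched text:

```r
s <- "wow! nuts!"

first_pos <- as.integer(regexpr("!", s))          # position of the first "!": 4
all_pos   <- as.integer(gregexpr("!", s)[[1]])    # positions of every "!": 4 10

# regmatches() returns the matched text instead of positions
words <- regmatches(s, gregexpr("[a-z]+", s))[[1]]
words                                             # "wow" "nuts"
```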
    

    String Manipulation

    Exercise!

    a) Convert the following character vector into a 3 column dataframe. Name each column.
    b) Format the numbers to be percentages with 2 decimal places.
    c) Find the total score for people with J-letter first names. 
    d) Find the most common weekday.
    
    char_vec <- c("{'al' 'einst'} score:0.4503-[12302013]",
    "{'isaac' 'knewt'} score:0.0007-[11202013]",
    "{'ralph' 'emerson'} score:0.10321-[09122013]",
    "{'james' 'dean'} score:0.84-[02032012]",
    "{'jim' 'beam'} score:0.2-[10172013]",
    "{'tommy' 'bahamas'} score:0.761-[05212013]",
    "{'george' 'of the jungle'} score:0.9434-[01302013]",
    "{'harry' 'henderson'} score:0.5456-[08112012]",
    "{'johnny' 'walker'} score:0.309118-[08212011]")
    
    e) Print out the following sentence with one word on each line:
    y <- "This is a sentence."
    

    Miscellaneous Tips and tricks

    Basic statistical functions


    x <- rnorm(1000,85,5)
    y <- 2 * runif(1000,0,10)
    mean(x)             ## [1] 85.22434
    median(x)           ## [1] 85.24506
    sd(x)               ## [1] 4.869123
    var(x)              ## [1] 23.70836
    cov(x,y)            ## [1] 0.8053751
    cor(x,y)            ## [1] 0.02883522
    

    Miscellaneous Tips and tricks

    Operator precedence

    1:n-1       ## wrong: evaluates as (1:n) - 1, i.e. 0 to n-1
    1:(n-1)     ## this is what you want
    
    -2.4 ^ 2.5  ## nice    
    [1] -8.923354
    x <- -2.4
    x ^ 2.5     ## not so nice
    [1] NaN
    
    • Parentheses will force the operators to be evaluated in the order you want
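Both cases come down to which operator binds first: ":" binds tighter than "-", and "^" binds tighter than unary minus. A sketch with n set to 5 for illustration:

```r
n <- 5

seq_wrong <- 1:n - 1       # (1:n) - 1 gives 0 1 2 3 4
seq_right <- 1:(n - 1)     # 1 2 3 4

neg_square <- -2^2         # -(2^2) gives -4
pos_square <- (-2)^2       # 4
```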

    Miscellaneous Tips and tricks

    Boolean operations

    • Boolean operations coerce non-zero numbers to TRUE
    x == 4 | 6          ## OR function - returns bogus result: 6 is coerced to TRUE, so every element is TRUE
    x == 4 | TRUE       ## same thing: always TRUE
    
    x == 4 | x == 6     ## better
    
    x %in% c(4,6)       ## best
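The three forms side by side on a small vector chosen for illustration:

```r
x <- c(4, 5, 6)

bogus  <- x == 4 | 6          # 6 is coerced to TRUE: TRUE TRUE TRUE
better <- x == 4 | x == 6     # TRUE FALSE TRUE
best   <- x %in% c(4, 6)      # TRUE FALSE TRUE
```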
    

    Miscellaneous Tips and tricks

    Coercion

    • Sometimes it's good to coerce
    • read.csv() returns a data.frame. What if you want to do math? Matrices are better.
    x <- data.frame(num=c(1,2,3,4))
    mean(x)                             ## Nope: returns NA with a warning
    x <- as.matrix(x)
    mean(x)                             ## Yup
    
    as.numeric()
    as.character()
    as.factor()
    as.data.frame()
    

    Miscellaneous Tips and tricks

    print() vs cat()

    • print() is a generic function
    • cat() is a concatenate function
    x <- 2
    print("One plus one is",x)
    [1] "One plus one is"
    
    ## alternatively...
    print("One plus one is");print(x);
    [1] "One plus one is"
    [1] 2
    
    ## even better...
    cat(paste("One plus one is",x))
    One plus one is 2
    

    Miscellaneous Tips and tricks

    class vs mode

    • An object's mode determines how it's stored in memory
    • An object's class determines its abstract type, a concept borrowed from OOP.
    • Most statistical programs don't have the OOP concept.
    x <- data.frame(scores = c(80, 90, 70))
    y <- as.Date("2013-11-05")
    cat(paste(mode(x), mode(y)))
    
    ## list numeric
    
    cat(paste(class(x), class(y)))
    
    ## data.frame Date
    

    Miscellaneous Tips and tricks

    do.call()

    • lots of real data comes in row form, each element is a different mode
    • many times stored as a list. Use do.call to combine the elements into a data.frame
    a <- list(1.3, 2.5, "jeff")
    b <- list(4.5, 2.8, "jerry")
    d <- list(6.5, 0.8, "joe")
    z <- list(a, b, d)
    df <- data.frame(do.call(rbind, z))
    df
    
    ##    X1  X2    X3
    ## 1 1.3 2.5  jeff
    ## 2 4.5 2.8 jerry
    ## 3 6.5 0.8   joe
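do.call() calls a function with its arguments supplied as a list, so do.call(rbind, z) is the same as rbind(a, b, d). The sketch below shows that on a simple case, then a variation (not the approach used above) that converts each row to a one-row data.frame first so every column keeps its own type. The field names score and name are made up for illustration.

```r
# do.call(f, list(a, b, c)) is the same as f(a, b, c)
do.call(sum, list(1, 2, 3))        # 6

rows <- list(list(score = 1.3, name = "jeff"),
             list(score = 4.5, name = "jerry"))

# one-row data.frame per record, then bind them all together
df <- do.call(rbind, lapply(rows, as.data.frame, stringsAsFactors = FALSE))
sapply(df, class)                  # score is numeric, name is character
```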
    

    Miscellaneous Tips and tricks

    Exercise!

    a)
    ## Create a new column called num2 for which each value is double the corresponding value in num
    ## Make sure num2 is also a factor
    x <- data.frame(num=factor(c(1.0,0.03,8.0, 0.4)))
    
    b)
    ## Find the letters in z corresponding to the indices of even numbers in y
    y <- c(1,2,NA,4,5,8,5,2,3)
    z <- c("f","g","e","i","l","o","p","u")
    

    Not covered

    1. Object oriented programming with S3 and S4
    2. Input/output
    3. Packages: ggplot2, reshape, plyr, forecast, MASS
    4. Debugging: browser(), warnings()
    5. Parallelizing
    6. Much more

    Resources

    PRACTICE, PRACTICE, PRACTICE

    Questions?

    Thank you!