Playing with data in R

It’s coming up to that time of year again when the students get introduced to R and the joys of the command line, which means that they can’t simply click on a cell and type a new value the way they can when working with a spreadsheet.

R uses a variety of data types, and one of them (the data frame) has the grid structure familiar from spreadsheets. I’m going to cover some of the ways we can manipulate data frames within R, as well as one of the ‘gotchas’ that can catch you out.

First of all, we need some data to play with.

sample_id <- c(1:20)
length <- c(rnorm(10, mean=22.1, sd=3), rnorm(10, mean=18.2, sd=3))
weight <- c(rnorm(10, mean=900, sd=125), rnorm(10, mean=700, sd=125))
site <- c(rep(1:2, each=5),rep(1:2, each=5))
sex <- c(rep("M", each=10), rep("F", each=10))
my_sample <- data.frame(sample_id, length, weight, site, sex)

The rnorm function draws random values from a normal distribution, so the length line gives us ten samples around a mean of 22.1 with a standard deviation of 3 and ten samples around a mean of 18.2, also with a standard deviation of 3. Because the values are random, your numbers will differ from the ones shown here. The rep function is a quick way of entering repeating values. The result of all these commands is a data frame called my_sample, which we can examine by typing in the name:

my_sample
  sample_id   length   weight site sex
1         1 23.09771 899.9570    1   M
2         2 19.51399 819.5591    1   M
3         3 21.79052 893.0299    1   M
4         4 24.84175 822.7836    1   M
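The two helper functions used to build the data can be tried on their own. The sketch below is only an illustration: the seed value and the variable names (demo_lengths, site_each, site_times) are made up for this example and are not part of the data set above.

```r
# Fixing the seed makes the "random" draws repeatable between runs.
set.seed(42)  # 42 is an arbitrary choice

# Ten draws from a normal distribution with mean 22.1 and sd 3
demo_lengths <- rnorm(10, mean = 22.1, sd = 3)
length(demo_lengths)               # 10

# each= repeats every element in turn; times= repeats the whole vector
site_each  <- rep(1:2, each = 5)   # 1 1 1 1 1 2 2 2 2 2
site_times <- rep(1:2, times = 5)  # 1 2 1 2 1 2 1 2 1 2
```

Calling set.seed() before the rnorm lines is one way to get the same data frame every time the script is run.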

There are a number of ways we can look at a column. We can use the $ notation (my_sample$weight), which is my preferred method for referring to a single column. I tend not to use attach(my_sample) because if I have two data frames with the same variable name in both, things get confusing when both are attached at the same time.
The second way is my_sample[, 3], which gives me the third column. Note that in R, indexing starts at one rather than at zero as is the case in many programming languages. This second notation has the form dataframe[row, col] and by leaving the row part blank R defaults to giving us all rows.
The third method (and my least favourite) is the double bracket method: my_sample[[3]]. All three methods give exactly the same result.
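A quick way to check that the three notations agree is to compare their results with identical(). This is a minimal sketch on a small stand-in data frame called demo, with made-up values, rather than on my_sample itself.

```r
# A small stand-in data frame (values are made up for illustration)
demo <- data.frame(sample_id = 1:3,
                   length = c(23.1, 19.5, 21.8),
                   weight = c(900.0, 819.6, 893.0))

by_name     <- demo$weight   # $ notation
by_position <- demo[, 3]     # [row, col] with the row left blank
by_double   <- demo[[3]]     # double brackets

identical(by_name, by_position)  # TRUE
identical(by_name, by_double)    # TRUE

# Single brackets with no comma keep the data frame wrapper
class(demo[3])    # "data.frame"
class(demo[[3]])  # "numeric"
```

The last two lines show a related form, demo[3], which returns a one-column data frame rather than a plain vector.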

If we want to look at rows we use the dataframe[row, col] notation but this time with the column part left blank, so for example

my_sample[4,]

gives us the fourth row and

my_sample[1:10,]

gives us the first ten rows.

This technique of selecting only particular rows or columns is called slicing and gives a lot of control over manipulating our data. What if we want rows or (more likely) columns that aren’t next to each other in the data frame? Each position in the [row, col] format expects a single object, so to ask for several rows or columns at once we use the c() function to combine their numbers into one vector:

my_sample[,c(2,3,5)] #gives us length, weight and sex columns.

and these slices can themselves be saved as data frames.

subset_of_sample <- my_sample[,c(2,3,5)]
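Column numbers change if the data frame is rearranged, so it can be safer to select by name. The sketch below assumes a small stand-in data frame called demo with made-up values; it also shows a negative index, which drops columns instead of keeping them.

```r
# A small stand-in data frame (values are made up for illustration)
demo <- data.frame(sample_id = 1:3,
                   length = c(23.1, 19.5, 21.8),
                   weight = c(900.0, 819.6, 893.0),
                   site = c(1, 1, 2),
                   sex = c("M", "M", "F"))

by_number <- demo[, c(2, 3, 5)]
by_name   <- demo[, c("length", "weight", "sex")]
identical(by_number, by_name)   # TRUE

# A negative index drops the named position and keeps everything else
without_id <- demo[, -1]
names(without_id)   # "length" "weight" "site" "sex"
```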

Suppose I want all the male samples as one subset and all the females as another. Because the males and females are grouped within the data set I could use row slicing to do it:

female_sample <- my_sample[11:20,] #rows 11-20

but what if that wasn’t the case? We can use the values in a column to decide which rows to select.

males <- my_sample$sex == "M"
males

This gives a variable that is the same length as the column and consists only of TRUE or FALSE values depending on whether the value was “M” or not, and we can use this variable when we slice the data frame. A row is selected if the corresponding value in ‘males’ is TRUE.

male_sample <- my_sample[males,]
head(male_sample) # head displays just the top few rows.
head(male_sample, 3) # or we can be specific about how many
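Logical vectors can also be combined, which lets us select on more than one column at a time. This is a sketch on a stand-in data frame called demo with made-up values; the names is_female and is_heavy and the 700 cut-off are chosen purely for illustration.

```r
# A small stand-in data frame (values are made up for illustration)
demo <- data.frame(sample_id = 1:4,
                   weight = c(900, 650, 710, 880),
                   sex = c("M", "F", "F", "M"))

is_female <- demo$sex == "F"      # FALSE TRUE TRUE FALSE
is_heavy  <- demo$weight > 700    # TRUE FALSE TRUE TRUE

# & keeps only the rows where both conditions are TRUE
heavy_females <- demo[is_female & is_heavy, ]
heavy_females$sample_id           # 3

# subset() is a base R alternative that does the same selection
same_rows <- subset(demo, sex == "F" & weight > 700)
```

Because TRUE counts as 1, sum(is_female) is a handy way of counting how many rows match a condition.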

We can look at a particular row and column using:

my_sample$weight[4]
my_sample[[3]][4] # third column (weight), fourth row

which is good if we suddenly discover from our lab notebook that this particular value is incorrect, because we can write a single value directly into that place without affecting the rest of the data frame.

my_sample$weight[4] <- 900.2154
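The same correction can be written with the [row, col] notation, or by looking the row up from its ID instead of relying on its position. A minimal sketch, assuming a stand-in data frame called demo with made-up values:

```r
# A small stand-in data frame (values are made up for illustration)
demo <- data.frame(sample_id = 1:4,
                   weight = c(900, 650, 710, 880))

# Row 4, column "weight": same effect as demo$weight[4] <- 900.2154
demo[4, "weight"] <- 900.2154

# Find the row from its sample ID rather than its position
demo$weight[demo$sample_id == 2] <- 655

demo$weight   # 900.0000 655.0000 710.0000 900.2154
```

Looking the row up by ID is less likely to hit the wrong record if the data frame has been sorted or filtered since the value was written down.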

The next stage is that I want to add the weight:length ratio to the data. I can calculate the values using the existing columns and then put the results into a new column that I create at the same time:

my_sample$ratio <- my_sample$weight / my_sample$length
head(my_sample)

I can also do the same thing if I need to change an existing column. Suppose I have multiple samples and I need to renumber the sample IDs to avoid confusion.

my_sample$sample_id <- my_sample$sample_id+100
head(my_sample)
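Both of these assignments work because arithmetic on columns is done element by element, one row at a time. The sketch below uses a stand-in data frame called demo with round, made-up numbers so the results are easy to check by eye.

```r
# A small stand-in data frame (round numbers chosen for easy checking)
demo <- data.frame(sample_id = 1:3,
                   length = c(20, 25, 10),
                   weight = c(800, 1000, 500))

# Each weight is divided by the length in the same row
demo$ratio <- demo$weight / demo$length
demo$ratio       # 40 40 50

# A single value is recycled across every row of the column
demo$sample_id <- demo$sample_id + 100L
demo$sample_id   # 101 102 103
```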

Now I want to add some new data to the bottom of the data frame. First I’m going to add a row using a method that works, and then I’ll cover one of the surprises that R can spring on you.

mean(my_sample$length) # just testing
new_row <- list(121, 18.72391, 710.1846, 1, "F", 710.1846/18.72391)
my_sample <- rbind(my_sample, new_row)
mean(my_sample$length) # still works
tail(my_sample) # bottom few rows

The two lines doing the work here are the second and third. First I create the data for a new row using the list() function, and then use the rbind() function (which stands for row bind) to add it to the bottom of the data frame. The list() function is the critical part as we’ll see later. Now for the gotcha. I try to add a new row:

new_row2 <- c(122, 17.99321, 698.7815, 1, "F", 698.7815/17.99321)
my_sample <- rbind(my_sample,new_row2)

This time I’m using the c() function. It seems to have worked, but if I try to find the mean of my length column I get a warning and NA instead of a number:

mean(my_sample$length) # returns NA with a warning

So we might think we can remove the ‘bad’ row and have a rethink:

my_sample <- my_sample[1:21,]
mean(my_sample$length)

but that still doesn’t work. We then take a look at the row we added and find that R has converted the values to text.

new_row2
class(new_row2) # class tells what data type the variable is

That’s because the c() function needs data that’s all of the same type. R got around the problem by converting the other values into text without telling us, although we should at least be grateful it did the calculation for the ratio before the conversion. That’s called ‘type coercion’ by R or ‘casting’ by programmers, but why can’t we calculate the mean even though we’ve deleted the bad row?

str(my_sample)

See where it says ‘chr’? R wants all the values in a single column to be the same data type so it’s converted each of the columns to text as well. So that little ‘c’ in the second example of adding a row has had the effect of basically converting our entire data frame to text values rather than numbers. Depending on the size and importance of our data set that’s either an ‘oopsie’ or a moment of sheer blind panic. The good news is that we can switch it back.

my_sample$sample_id <- as.integer(my_sample$sample_id)
my_sample$length <- as.numeric(my_sample$length)
my_sample$weight <- as.numeric(my_sample$weight)
my_sample$site <- as.factor(my_sample$site)
my_sample$ratio <- as.numeric(my_sample$ratio)

These ‘as.type’ functions convert from one data type to another (where possible), and can get us out of trouble if R does some type coercion in the background without us noticing. R has its quirks, but it’s a powerful tool and well worth spending some time on if you need to move beyond basic statistics.
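The coercion behind the gotcha can be reproduced on its own, along with one way of avoiding it. This is a sketch rather than the only approach: the data frame demo and its values are made up, and appending a one-row data frame is offered here as an alternative to the list() method used earlier.

```r
# c() builds an atomic vector, so mixed values are all coerced to text
mixed_vector <- c(122, 17.99321, "F")
class(mixed_vector)          # "character"

# list() lets every element keep its own type
mixed_list <- list(122, 17.99321, "F")
sapply(mixed_list, class)    # "numeric" "numeric" "character"

# A small stand-in data frame (values are made up for illustration)
demo <- data.frame(sample_id = c(101, 102),
                   length = c(23.1, 19.5),
                   sex = c("M", "F"))

# A one-row data frame with matching column names keeps the column types
new_row <- data.frame(sample_id = 103, length = 18.7, sex = "F")
demo <- rbind(demo, new_row)

# Check the column types straight after appending
sapply(demo, class)
mean(demo$length)            # still a number, not NA
```

Running sapply(demo, class) or str(demo) after every rbind() is a cheap habit that catches an unwanted conversion before it spreads through an analysis.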