Playing with data in R

It’s coming up to that time of year again when the students get introduced to R and the joys of the command line, which means they can’t simply click and type a new value the way they can when working with a spreadsheet.

R uses a variety of data types, and one of them (the data frame) has the grid structure familiar from spreadsheets. I’m going to cover some of the ways we can manipulate data frames within R, as well as one of the ‘gotchas’ that can catch you out.

First of all, we need some data to play with.

sample_id <- c(1:20)
length <- c(rnorm(10, mean=22.1, sd=3), rnorm(10, mean=18.2, sd=3))
weight <- c(rnorm(10, mean=900, sd=125), rnorm(10, mean=700, sd=125))
site <- c(rep(1:2, each=5),rep(1:2, each=5))
sex <- c(rep("M", each=10), rep("F", each=10))
my_sample <- data.frame(sample_id, length, weight, site, sex)

The rnorm function draws random samples from a normal distribution, so the length line gives us ten samples from a distribution with mean 22.1 and standard deviation 3, and ten more with mean 18.2 and the same standard deviation. The rep function is a quick way of entering repeating values. The result of all these commands is a data frame called my_sample.
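(One aside, my own suggestion rather than part of the original commands: rnorm draws random numbers, so the values you get will differ from those shown below. If you want reproducible numbers when re-running the block above, set the seed first.)

set.seed(42) # any fixed integer will do; makes subsequent rnorm() draws reproducible

With or without a seed, we can examine the data frame by typing its name: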

my_sample
  sample_id   length   weight site sex
1         1 23.09771 899.9570    1   M
2         2 19.51399 819.5591    1   M
3         3 21.79052 893.0299    1   M
4         4 24.84175 822.7836    1   M

There are a number of ways we can look at a column. We can use the $ notation (my_sample$weight), which is my preferred method for referring to a single column. I tend not to use attach(my_sample), because if I have two data frames with the same variable name in both, things start to get confusing when both are attached at the same time.
The second way is my_sample[, 3], which gives me the third column. Note that indexing in R starts at one, rather than at zero as in many programming languages. This notation has the form dataframe[row, col], and by leaving the row part blank we ask R for all rows.
The third method (and my least favourite) is the double-bracket method: my_sample[[3]]. All three methods give exactly the same result.
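We can check that claim directly (a quick test of my own, not from the original post):

identical(my_sample$weight, my_sample[, 3]) # TRUE
identical(my_sample$weight, my_sample[[3]]) # TRUE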

If we want to look at rows we use the dataframe[row, col] notation but this time with the column part left blank, so for example

my_sample[4,]

gives us the fourth row and

my_sample[1:10,]

gives us the first ten rows.

This technique of selecting only particular rows or columns is called slicing, and it gives us a lot of control when manipulating our data. What if we want rows or (more likely) columns that aren’t next to each other in the data frame? The [row, col] format expects a single object on each side of the comma, so we can use the c() function to create a vector containing more than one value:

my_sample[,c(2,3,5)] #gives us length, weight and sex columns.

and these slices can themselves be saved as data frames.

subset_of_sample <- my_sample[,c(2,3,5)]

Suppose I want all the male samples as one subset and all the females as another. Because the males and females are grouped within the data set I could use row slicing to do it:

female_sample <- my_sample[11:20,] #rows 11-20

but what if that wasn’t the case? We can use the values in the sex column to decide whether each row is selected:

males <- my_sample$sex == "M"
males
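Because the first ten rows are male and the last ten female, this should print something like:

 [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE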

This gives a variable that is the same length as the column and consists only of TRUE and FALSE values, depending on whether the value in sex was “M” or not, and we can use this variable when we slice the data frame: a row is selected wherever the corresponding value in males is TRUE.

male_sample <- my_sample[males,]
head(male_sample) # head displays just the top few rows.
head(male_sample, 3) # or we can be specific about how many

We can look at a particular row and column using:

my_sample$weight[4]
my_sample[[3]][4] # third column (weight), fourth row

which is handy if we suddenly discover from our lab notebook that this particular value is incorrect, because we can write a single value directly into that place without affecting the rest of the data frame.

my_sample$weight[4] <- 900.2154

The next stage is to add the weight:length ratio to the data. I can calculate the values using the existing columns and put the results into a new column that I create at the same time:

my_sample$ratio <- my_sample$weight / my_sample$length
head(my_sample)

I can also do the same thing if I need to change an existing column. Suppose I have multiple samples and I need to renumber the sample IDs to avoid confusion.

my_sample$sample_id <- my_sample$sample_id+100
head(my_sample)

Now I want to add some new data to the bottom of the data frame. First I’m going to add a row using a method that works, and then I’ll cover one of the surprises that R can spring on you.

mean(my_sample$length) # just testing
new_row <- list(121, 18.72391, 710.1846, 1, "F", 710.1846/18.72391)
my_sample <- rbind(my_sample, new_row)
mean(my_sample$length) # still works
tail(my_sample) # bottom few rows

The two lines doing the work here are the second and third. First I create the data for the new row using the list() function, and then use the rbind() function (short for ‘row bind’) to add it to the bottom of the data frame. The list() function is the critical part, as we’ll see shortly. Now for the gotcha – I try to add another new row:

new_row2 <- c(122, 17.99321, 698.7815, 1, "F", 698.7815/17.99321)
my_sample <- rbind(my_sample,new_row2)

This time I’m using the c() function. It seems to have worked, but if I try to find the mean of my length column I get NA and a warning instead of a number:

mean(my_sample$length) # returns NA with a warning

So we might think we can remove the ‘bad’ row and have a rethink:

my_sample <- my_sample[1:21,]
mean(my_sample$length)

but that still doesn’t work. We then take a look at the row we added and find that R has converted the values to text.

new_row2
class(new_row2) # class tells what data type the variable is

That’s because the c() function needs all its data to be of the same type. R got around the problem by converting the other values to text without telling us, although we should at least be grateful that it did the ratio calculation before the conversion. R calls this ‘type coercion’; programmers call it ‘casting’. But why can’t we calculate the mean even though we’ve deleted the bad row?

str(my_sample)

See where it says ‘chr’? R wants all the values in a single column to be the same data type, so it has converted each of the columns to text as well. That little ‘c’ in the second attempt at adding a row has effectively converted our entire data frame from numbers to text. Depending on the size and importance of our data set, that’s either an ‘oopsie’ or a moment of sheer blind panic. The good news is that we can switch it back.

my_sample$sample_id <- as.integer(my_sample$sample_id)
my_sample$length <- as.numeric(my_sample$length)
my_sample$weight <- as.numeric(my_sample$weight)
my_sample$site <- as.factor(my_sample$site)
my_sample$ratio <- as.numeric(my_sample$ratio)
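A quick way to confirm everything is back as it should be (a handy check of my own, not part of the original fix) is to ask for the class of every column in one go:

sapply(my_sample, class)

And a pattern worth knowing for next time, which sidesteps the gotcha entirely: build the new row as a one-row data frame rather than with c(), so each column keeps its own type when rbind() joins the two data frames. A sketch of what the new_row2 step could have looked like:

new_row_df <- data.frame(sample_id=122, length=17.99321, weight=698.7815,
                         site=1, sex="F", ratio=698.7815/17.99321)
my_sample <- rbind(my_sample, new_row_df)
mean(my_sample$length) # the column stays numeric, so this works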

These ‘as.type‘ functions convert from one data type to another (where possible), and can get us out of trouble if R does some type coercion in the background without us noticing. R has its quirks, but it’s a powerful package and well worth spending some time on if you need to move beyond basic statistics.

But … but … that’s not what I said

I’m coming to the end of week three of Coursera’s Statistics One course, and this week we’ve been looking at regression and hypothesis testing. Once again, the subject matter has been very good. I’ll need to take a closer look at getting the regression coefficients using matrices to make sure I fully understand it, but apart from that it’s been a good week. I still have the quiz and assignment to do, but that’s a job for the weekend.

Last weekend, I did the quiz and assignment for week two, and a rather critical error became apparent: the answers recorded by the system didn’t match the answers I submitted. Yes, that’s right: the online marking system (from an organisation top-heavy with computer scientists) doesn’t record a student’s submissions correctly. I became aware of this when looking back at previous attempts to work out which questions I was getting wrong, so I could go back and look at the data analysis again. I posted to the forum to report the issue and received an email saying that a solution was being worked on and would be applied to existing submissions retrospectively. Surely this was tested before deployment? And if so, how did such a fundamental bug get past testing? Can you imagine how frustrated and angry students would be if they were only allowed one attempt and a completion certificate were being offered? Last week I said:

  • Is it too much to ask that something as well funded as Coursera, using video as the primary teaching method, could actually produce videos without errors in them?

to which I can add this week:

  • Is it too much to ask that when using an automated online marking system it marks what I actually submitted?

Week three and again the quality of the subject matter is being let down by pedagogy and planning issues.

Coursera Statistics Week Two – Mr Grumpy Comes to Town

I’m coming to the end of week two of the Coursera Statistics One course, with just the quiz and assignment to do over the weekend. There have been a lot of forum postings because people are having difficulty using R, with many saying they’re dropping out because of the problems they’re having getting the software to run and produce results, particularly since there were some errors in the main lectures – for example, hist(someVar) was used when it should have been hist(someObject$someVar). I’ve been posting to the forums and helping out where I can, which has fitted nicely with the eModerating course I’ve also been taking over the last two weeks.

In response, Coursera has posted a number of video tutorials on using R by a female staff member. She’s very good – the tutorials are detailed and comprehensive without being confusing. For example, she demos common mistakes and what the corresponding error messages look like.

But this is where Mr Grumpy makes an appearance. This is week two, and these videos have been created specifically to help people with R, yet there are mistakes in them. At one point, list.files() is shown as all one word (listfiles(), without the dot), which would give an error.

  • Is it too much to ask that something as well funded as Coursera, using video as the primary teaching method, could actually produce videos without errors in them?
  • This is week two – surely anyone who’s used R would see the need to give support to students who’ve never encountered it before (and who are probably strangers to the command line as well) from the beginning of the course, possibly as a week 0 activity.
  • There is no certificate of achievement (not an issue for me), but quiz and assignment submissions were initially restricted to one attempt only. If there’s no certificate, why not allow multiple attempts from the start so that students can master the materials and formatively assess their own progress?

Whatever happened to learning design? How does the initial course presentation meet Professor Conway’s aim of maximising retention? And just to make it clear, I’m criticising the pedagogy here, not the content or the presentation of the content, which I find to be very good.

I’d be interested in hearing perspectives from others on the course.

Statistical MOOCing

I’ve recently started yet another MOOC. This time it’s Coursera’s Statistics One. It’s early days yet (I’m only on lecture two), but there are some interesting contrasts with another statistics MOOC I recently did, Udacity’s ST101 Introduction to Statistics.

The Coursera offering consists of videos, typically about fifteen to twenty minutes long and totalling around three to four hours per week, plus one quiz and one assignment per week. The quiz and assignment allow only one attempt.

What I like is that the content looks more formal and rigorous than the Udacity offering, and, critically, we’ll actually be doing meaningful calculations using the R statistical software – the same package we use with the first-year students at Leicester University. With Udacity, I felt their statistics course was more of a ‘look how interesting statistics is’ course than a ‘this is how you use statistics’ one.

My concern is with the assessment. With Udacity, the videos were short – sometimes only a few seconds long – before an in-video quiz appeared, often to take the student step by step through a process or the development of an idea rather than simply to recall information. With the Coursera MOOC, there are a couple of quiz slides at the end of each video, but the course notes specifically state:

‘The purpose of these “in-video quizzes” is to motivate you to engage in the material and to practice retrieving newly learned information. Your performance on these questions will be monitored for course evaluation purposes only.’

In other words, they are there to promote recall and aid course management, and by having the quizzes at the end, the student simply sits and listens rather than learning by doing. It may be that once we get into actually calculating things the in-video quizzes will require more interaction, but at the moment I’m disappointed. Even at the scale of these MOOCs, Udacity shows that there are alternatives to the ultra-didactic route. What I’d like is the rigour of the Coursera content combined with the engagement of the Udacity formative assessment. What I’d really like is a stats MOOC that is more ‘task-based’, to use Lisa M Lane’s terminology.

Using R – my initial thoughts.

I’m helping out on a course that teaches IT and numeracy skills to first-year undergraduates in Biological Sciences. As I’m new to the university, this is the first time I’ve been involved with this course, and it coincides with a switch to teaching statistics using R.

R is open source, and I’m a big fan of open source software for two reasons. First, my laptop and home PC are both dual-boot machines running Kubuntu and Windows XP. Secondly, until recently I’d spent seven years with my working week split across different locations, departments or universities, which meant that having cross-platform software to keep my data as portable as possible was a high priority.

I understand that using R gives the students a tool that they’ll be able to use during their studies and afterwards without having to worry about licences or fees, and that as a tool it probably has more power than they’re ever likely to need. There are a couple of sticking points. The first is that its working environment is not a ‘tick and click’ interface – it involves working on the command line, which to those used to touchscreens probably seems like going back to the stone age. I can see the benefits, though: for example, I often convert videos on the command line using ffmpeg, and the insights I’ve gained doing that mean I much better understand the options and alternatives when I use a conversion tool with a graphical interface. It means I understand what I’m doing rather than picking some options and seeing what pops out the other end, and it’s that understanding that using a tool like R will develop.

The second point is that when you’re using R you’re not running a program to manipulate data and perform statistical tests – you’re writing one, albeit one command at a time. That may not be immediately apparent when someone first starts using R, but it’s a shift in attitude that’s critical to getting the best out of it. Without this shift, commands lack a context and data structures seem an irrelevance, so a user might run the command:

> myData = c("row1", 2, 4, 6, 8)

and then wonder why:

> mean(myData)

gives us:

[1] NA
Warning message:
In mean.default(myData) : argument is not numeric or logical: returning NA

The reason is that when a vector is created using the combine function c(), the individual elements need to be of the same type. To enforce this, R has silently converted (‘coerced’, in R-speak) the numbers to strings, which we can see by looking at the contents of myData:

> myData
[1] "row1" "2" "4" "6" "8"

So what advice would I give a new user of R? First, get more of a programmer’s mindset – you’re coding, not calculating; the calculation is just the end result of the coding. Secondly, explore the different data types. There’s a good overview in the Quick-R section of the statmethods.net site (http://www.statmethods.net/input/datatypes.html) and in the introduction section of r-tutor.com (http://www.r-tutor.com/r-introduction). Thirdly, ask questions: what’s the difference between a vector and a list? How do I import data from a spreadsheet and name the columns? Then fire up R, open up your favourite search engine and find the answers.
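As a quick illustration of the first of those questions (a sketch of my own, not taken from either site): a vector forces all its elements to a single type, while a list lets each element keep its own.

> v <- c(1, "a")
> l <- list(1, "a")
> class(v[1]) # "character" - the 1 was coerced to the string "1"
> class(l[[1]]) # "numeric" - the 1 kept its type

This is exactly the coercion that bit us with myData above, and it’s why a list is often the safer container for mixed data.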