Using R – my initial thoughts.

I’m helping out on a course that teaches IT and numeracy skills to first-year undergraduates in Biological Sciences. As I’m new to the university, this is the first time I’ve been involved with this course, and it coincides with a switch to teaching statistics using R.

R is open source, and I’m a big fan of open source software for two reasons. First, my laptop and home PC are both dual-boot machines, running Kubuntu and Windows XP. Secondly, until recently I’d spent seven years with my working week split across different locations, departments or universities. That made software that ran cross-platform, keeping my data as portable as possible, a high priority.

I understand that using R gives the students a tool that they’ll be able to use during their studies and afterwards without having to worry about licences or fees, and that as a tool it probably has more power than they’re ever likely to need. There are a couple of sticking points, though. The first is that its working environment is not a ‘tick and click’ interface – it involves working on the command line, which to those used to touchscreens probably seems like going back to the Stone Age. Even so, I can see the benefits. For example, I often convert videos on the command line using ffmpeg, and the insights I’ve gained doing that mean I understand the options and alternatives much better when I use a conversion tool that has a graphical interface. I understand what I’m doing rather than picking some options and seeing what pops out the other end, and it’s that understanding that using a tool like R will develop.

The second point is that when you’re using R you’re not running a program to manipulate data and perform statistical tests, you’re writing one, albeit one command at a time. That may not be immediately apparent when someone first starts using R, but it’s a shift in attitude that’s critical to getting the best out of it. Without this shift, commands lack context and data structures seem an irrelevance, so that a user might run the command:

> myData = c("row1", 2, 4, 6, 8)

and then wonder why:

> mean(myData)

gives us:

[1] NA
Warning message:
In mean.default(myData) : argument is not numeric or logical: returning NA

The reason is that when a vector is created using the combine function c(), the individual elements need to be the same type. To enforce this, R has silently converted (coerced, in R-speak) the numbers to strings, which we can see by looking at the contents of myData:

> myData
[1] "row1" "2"    "4"    "6"    "8"
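
One way to catch this sort of coercion early is to check the class of a vector before calculating with it. The lines below are only a sketch: they assume that "row1" was meant as a label rather than a value, and the names myValues and converted are my own, chosen for illustration.

```r
# Mixing a string with numbers makes c() coerce everything to character
myData <- c("row1", 2, 4, 6, 8)
class(myData)      # "character"

# Keep the values in a vector of their own so they stay numeric
myValues <- c(2, 4, 6, 8)
class(myValues)    # "numeric"
mean(myValues)     # 5

# If the data are already character, drop the label and convert the rest
converted <- as.numeric(myData[-1])
mean(converted)    # 5
```

Calling class() on anything that behaves oddly is a quick habit to pick up, and it fits the idea that you are writing a program rather than just running one.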

So what advice would I give a new user of R? First, get more of a programmer’s mindset – you’re coding, not calculating. Calculating is just the end result of the coding. Secondly, explore the different data types. There’s a good overview in the Quick-R section of the statmethods.net site (http://www.statmethods.net/input/datatypes.html) and the introduction section of r-tutor.com (http://www.r-tutor.com/r-introduction). Thirdly, ask questions – what’s the difference between a vector and a list? How do I import data from a spreadsheet and name the columns? And then fire up R, open up your favourite search engine and find the answers.
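
To give a flavour of where those two questions lead, here is a rough sketch rather than a definitive answer. It assumes the spreadsheet has been saved as a CSV file with no header row; the file contents and the names growth, day and height are all made up for illustration.

```r
# A vector holds a single type, so mixed values are coerced to character
v <- c(1, "a", TRUE)
class(v)            # "character"

# A list lets each element keep its own type
l <- list(1, "a", TRUE)
sapply(l, class)    # "numeric" "character" "logical"

# Hypothetical spreadsheet export, written to a temporary file
# so that the example runs on its own
csvFile <- tempfile(fileext = ".csv")
writeLines(c("1,2.5", "2,3.1", "3,4.8"), csvFile)

# Read the header-less CSV and name the columns as it is imported
growth <- read.csv(csvFile, header = FALSE, col.names = c("day", "height"))
names(growth)       # "day" "height"
mean(growth$height)
```

With a real file you would pass its path to read.csv() instead of the temporary one, and use header = TRUE if the first row already holds the column names.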
