Hannah got the sweets, who got indigestion?

Last Thursday in the UK around half a million 15- and 16-year-olds took a GCSE maths exam, specifically the non-calculator paper. By Friday the exam was trending on Twitter (#EdexcelMaths), with one particular question attracting attention:

There are n sweets in a bag.
6 of the sweets are orange.
The rest of the sweets are yellow.

Hannah takes at random a sweet from the bag.
She eats the sweet.

Hannah then takes at random another sweet from the bag.
She eats the sweet.

The probability that Hannah eats two orange sweets is 1/3.

(a) Show that n² – n – 90 = 0

I had a quick attempt and after one unproductive sidetrack I got the answer. So why am I writing about this? Because it fits in with the other posts on assessment I’m doing, and to explore some of the issues around it.

First, the actual mathematical content is pretty straightforward – you only need to know how to do three things: calculate a probability without replacement, multiply fractions and rearrange an equation. This is hardly Sheldon Cooper territory.

The exam board has two tiers for the qualification (foundation and higher), and probability without replacement is only explicitly mentioned for the higher tier. The board has been quoted as saying the question was aimed at those students who would achieve the highest grades (A and A*), and I think grade discrimination is a fair approach. I did ask my daughter (who’s currently revising for her A-level maths) and she said she wasn’t sure she would have been able to answer it at 16. For non-UK readers, GCSE exams are taken at the end of compulsory schooling and A-levels are taken at 18, typically as a route to studying at university.

So why my unease with students finding it difficult? There’s always the charge of dumbing down levelled at exams, but I don’t think that’s it. True, when I did my maths exam at that age the syllabus included calculus of polynomials and their applications, which is now only introduced at A-level, but they were different qualifications – GCSEs were only introduced after my school career had ended. I think my unease comes from the fact that this shouldn’t have been seen as a difficult question. Donald Clark has blogged seven reasons why he agrees with the children and thinks it wasn’t a fair question, some of which I agree with and most of which I don’t.

There are a couple of factors involved here. Firstly, I recall reading a study that compared questions with the same mathematical content written in different ways: questions posed as word problems rather than equations were consistently harder to answer, even though the underlying maths was identical. Secondly, I think it’s the way that maths is taught as rules and recipes to follow rather than as a creative problem-solving activity. This is not a criticism of the teachers, because I think it’s taught that way precisely because of the pressures that have (politically) been placed on education. As I’ve mentioned before, I’m a big fan of Jo Boaler’s approach and its emphasis on flexibility and application of technique rather than stamp-collecting formulae. Donald Clark makes the distinction between functional maths (maths for a practical purpose such as employment) and the type of maths typically found in exams, but I think that’s a false dichotomy in this case. As Stephen Downes said, “… what this question tells me is the difference between learning some mathematics and thinking mathematically.” The difference between functional and theoretical maths (at this level) starts to disappear when we think mathematically – maths becomes a toolbox of skills to be applied to the problem at hand, rather than a particular formula in a particular topic to be remembered.

And if you’re wondering what the answer was:

[Image: the solution to Hannah’s sweets]
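
In outline (a sketch of the standard working, not a transcription of the image): the probability that the first sweet is orange is 6/n, and if it is, the probability that the second is also orange is 5/(n – 1), so

6/n × 5/(n – 1) = 1/3
30/(n(n – 1)) = 1/3
n(n – 1) = 90
n² – n – 90 = 0

Factorising gives (n – 10)(n + 9) = 0, and since n must be positive there were 10 sweets in the bag.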

Times tables – a matter of life and death?

Recently I took my youngest daughter to visit a university in the north-east of the UK, which involved a round trip of nearly 500 miles and an overnight stay. There’s a general election due in less than three months, which means we’re into that ‘interesting’ phase of the electoral cycle where all the parties try to outcompete each other, either with bribes (‘incentives’) for certain groups (‘Unicorns for every five-year-old!’) or by outdoing each other in demonising whatever group is the scapegoat this month. If you’ve ever seen Monty Python’s Four Yorkshiremen sketch, you’ll know what I mean.

So what has this to do with times tables? Well, one of the announcements was for every child to know their times tables up to 12 by the time they leave primary school (i.e. by age 11), and by ‘know’ they appear to mean memorise.

I have a number of misgivings about this. Firstly, rote learning without understanding isn’t particularly useful: memorisation isn’t education. Secondly, as Jo Boaler’s work has shown, students perform much better at maths when they learn to interact with numbers more flexibly (number sense) rather than simply remembering the answers. As she points out, calculating under stress works less well when you’re relying on memory, which is presumably why politicians refuse to answer maths questions when interviewed, as Nicky Morgan the education secretary did recently. In one of my previous jobs I worked in a health sciences department and the statistics on drug errors (such as calculating dosages) were frightening, and there are few things more stressful than someone potentially dying if the answer to a maths problem is wrong.

The outcome of all this memorisation is that the application suffers. As we travelled back there was a radio phone-in quiz, and as times tables were in the news one of the questions was zero times eight. The caller answered eight, and was told they were wrong. A few minutes later someone else called to tell the presenter that they were wrong because zero times eight was eight, but eight times zero was zero. And this is the real problem. While maths is seen (and taught) as a recipe, a set of instructions to follow, misconceptions like this will continue to prosper. Personally, I see maths as more of a Lego set – a creative process where you combine different components in different ways to get to the end result you want. As Jo Boaler has said, “When we emphasize memorization and testing in the name of fluency we are harming children, we are risking the future of our ever-quantitative society and we are threatening the discipline of mathematics”. Unfortunately, I’m doubtful whether that will count for anything against the one-upmanship in the closing months of an election campaign.

Playing with data in R

It’s coming up to that time of year again when the students get introduced to R and the joys of the command line, which means they can’t simply click and type a new value the way they can when working with a spreadsheet.

R uses a variety of data types, and one of them (the data frame) has the grid structure familiar from spreadsheets. I’m going to cover some of the ways we can manipulate data frames within R, as well as one of the ‘gotchas’ that can catch you out.

First of all, we need some data to play with.

sample_id <- c(1:20)
length <- c(rnorm(10, mean=22.1, sd=3), rnorm(10, mean=18.2, sd=3))
weight <- c(rnorm(10, mean=900, sd=125), rnorm(10, mean=700, sd=125))
site <- c(rep(1:2, each=5), rep(1:2, each=5))
sex <- c(rep("M", each=10), rep("F", each=10))
my_sample <- data.frame(sample_id, length, weight, site, sex)

The rnorm function draws random values from a normal distribution, so the length line gives us ten samples around a mean of 22.1 with a standard deviation of 3, and ten samples around a mean of 18.2, also with a standard deviation of 3. The rep function is a quick way of entering repeating values (there’s a small illustration of what it produces just after the listing below). The result of all these commands is a data frame called my_sample, which we can examine by typing in the name:

my_sample
  sample_id   length   weight site sex
1         1 23.09771 899.9570    1   M
2         2 19.51399 819.5591    1   M
3         3 21.79052 893.0299    1   M
4         4 24.84175 822.7836    1   M
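
As an aside, if the rep() calls above look cryptic, running one on its own shows what it produces (a quick illustration, not part of the original listing):

rep(1:2, each=5)
[1] 1 1 1 1 1 2 2 2 2 2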

There are a number of ways we can look at a column. We can use the $ notation (my_sample$weight), which is my preferred method for referring to a single column. I tend not to use attach(my_sample), because if I have two data frames with the same variable name in both, things start to get confusing if both have been attached at the same time.
The second way is my_sample[, 3], which gives me the third column. Note that in R, indexing starts at one rather than at zero as is the case in many programming languages. This second notation has the form dataframe[row, col], and by leaving the row part blank R defaults to giving us all rows.
The third method (and my least favourite) is the double bracket method: my_sample[[3]]. All three methods give exactly the same result.
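
To make the three forms concrete, here they are side by side (just an illustration using the data frame built above):

my_sample$weight  # dollar notation
my_sample[, 3]    # [row, col] notation with the row part left blank
my_sample[[3]]    # double bracket notation
identical(my_sample$weight, my_sample[[3]])  # TRUE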

If we want to look at rows we use the dataframe[row, col] notation but this time with the column part left blank, so for example

my_sample[4,]

gives us the fourth row and

my_sample[1:10,]

gives us the first ten rows.
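
The row and column forms can also be combined in a single slice – for example (my own choice of rows and columns, not from the original post):

my_sample[1:5, 2:3]  # first five rows, length and weight columns only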

This technique of selecting only particular rows or columns is called slicing, and it gives us a lot of control over manipulating our data. What about if we want rows or (more likely) columns that aren’t next to each other in the data frame? The [row, col] format expects a single object for the rows and a single object for the columns, separated by a comma, but we can use the c() function to create a vector that contains more than one value:

my_sample[,c(2,3,5)] #gives us length, weight and sex columns.

and these slices can themselves be saved as data frames.

subset_of_sample <- my_sample[,c(2,3,5)]

Suppose I want all the male samples as one subset and all the females as another. Because the males and females are grouped within the data set I could use row slicing to do it:

female_sample <- my_sample[11:20,] #rows 11-20

but what if that wasn’t the case? We can use the value in the column to decide whether to select it.

males <- my_sample$sex == "M"
males

This gives a variable that is the same length as the column and consists only of TRUE or FALSE values, depending on whether the value in sex was “M” or not, and we can use this variable when we slice the data frame: a row is selected wherever the corresponding value in males is TRUE.

male_sample <- my_sample[males,]
head(male_sample) # head displays just the top few rows.
head(male_sample, 3) # or we can be specific about how many
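
The comparison and the slice can also be rolled into a single line, which is the form you’ll often see in other people’s code – the same operation, just without the intermediate variable:

male_sample <- my_sample[my_sample$sex == "M", ]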

We can look at a particular row and column using:

my_sample$weight[4]
my_sample[[3]][4] # third column (weight), fourth row

which is handy if we suddenly discover from our lab notebook that this particular value is incorrect, because we can write a corrected value directly into that position without affecting the rest of the data frame.

my_sample$weight[4] <- 900.2154

The next stage is that I want to add the weight:length ratio to the data. I can calculate the values using the existing columns and then put the results into a new column that I create at the same time:

my_sample$ratio <- my_sample$weight / my_sample$length
head(my_sample)

I can also do the same thing if I need to change an existing column. Suppose I have multiple samples and I need to renumber the sample IDs to avoid confusion.

my_sample$sample_id <- my_sample$sample_id+100
head(my_sample)

Now I want to add some new data to the bottom of the data frame. First I’m going to add a row using a method that works, and then I’ll cover one of the surprises that R can spring on you.

mean(my_sample$length) # just testing
new_row <- list(121, 18.72391, 710.1846, 1, "F", 710.1846/18.72391)
my_sample <- rbind(my_sample, new_row)
mean(my_sample$length) # still works
tail(my_sample) # bottom few rows

The two lines doing the work here are the second and third. First I create the data for a new row using the list() function, and then I use the rbind() function (which stands for row bind) to add it to the bottom of the data frame. The list() function is the critical part, as we’ll see later. Now for the gotcha. I try to add another new row:

new_row2 <- c(122, 17.99321, 698.7815, 1, "F", 698.7815/17.99321)
my_sample <- rbind(my_sample,new_row2)

This time I’m using the c() function. It seems to have worked, but if I try to find the mean of my length column I no longer get a sensible answer:

mean(my_sample$length) # no longer works

So we might think we can remove the ‘bad’ row and have a rethink:

my_sample <- my_sample[1:21,]
mean(my_sample$length)

but that still doesn’t work. We then take a look at the row we added and find that R has converted the values to text.

new_row2
class(new_row2) # class tells what data type the variable is
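
Before explaining why, here’s a quick illustration (my own, with made-up values) of what c() does when it is given a mixture of numbers and text:

c(122, "F")
[1] "122" "F"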

That’s because the c() function creates a vector, and all the values in a vector have to be the same type. R got around the problem by silently converting the other values into text, although we should at least be grateful it did the calculation for the ratio before the conversion. R calls this ‘type coercion’; programmers call it ‘casting’. But why can’t we calculate the mean even though we’ve deleted the bad row?

str(my_sample)

See where it says ‘chr’? R wants all the values in a single column to be the same data type, so it has converted each of the columns to text as well. That little ‘c’ in the second attempt at adding a row has effectively converted our entire data frame to text values rather than numbers. Depending on the size and importance of our data set, that’s either an ‘oopsie’ or a moment of sheer blind panic. The good news is that we can switch it back.

my_sample$sample_id <- as.integer(my_sample$sample_id)
my_sample$length <- as.numeric(my_sample$length)
my_sample$weight <- as.numeric(my_sample$weight)
my_sample$site <- as.factor(my_sample$site)
my_sample$ratio <- as.numeric(my_sample$ratio)

These ‘as.type‘ functions convert from one data type to another (where possible), and can get us out of trouble if R does some type coercion in the background without us noticing. R has its quirks, but it’s a powerful package and well worth spending some time on if you need to move beyond basic statistics.
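
One way to sidestep the gotcha altogether (my own habit rather than something from the walkthrough above, shown with made-up values) is to build the new row as a one-row data frame, so every value keeps its own type before the rbind:

new_row3 <- data.frame(sample_id=123, length=19.45210, weight=723.1108,
                       site=2, sex="F", ratio=723.1108/19.45210)  # hypothetical values
my_sample <- rbind(my_sample, new_row3)  # column names must match the existing data frame
str(my_sample)  # column types are preserved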