Measuring Item Reliability - Pearson's Product Moment Correlation Coefficient

What does the Pearson’s product moment correlation coefficient (PCC) show you?

The PCC measures the strength of the relationship between two variables. When it is used during the analysis of an exam, this would be between candidates’ item/question/scenario marks and their exam marks.

The Pearson Correlation attempts to draw a line of best fit through the data points plotted on a graph, and the coefficient evaluates how far away from the line of best fit these data points are. Pearson’s product moment correlation coefficient is often denoted by ‘r’ and is a number between -1 and 1 – with numbers between 0 and 1 showing a positive correlation and numbers between 0 and -1 showing a negative correlation. For clarity for our users we use PCC to denote the Pearson’s product moment correlation coefficient within Maxexam and we will do the same here.
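As a rough illustration of what is being calculated, the sketch below computes the PCC between a set of item marks and exam marks. The function name pearson_r and the marks themselves are hypothetical and chosen only for this example; this is not a description of how Maxexam implements the calculation.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's product moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least two values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of the products of each pair's deviations from the two means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical marks for six candidates (illustrative values only)
item_marks = [1, 2, 2, 3, 4, 5]
exam_marks = [40, 48, 55, 61, 70, 82]

pcc = pearson_r(item_marks, exam_marks)
print(round(pcc, 3))  # a value close to 1: a strong positive correlation
```

Here the candidates with higher item marks also have higher exam marks, so the result is close to 1.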

What would be considered a ‘good’ correlation?

The level of PCC considered to represent a good correlation very much depends on the field in which it is used. Often the PCC is used to measure whether there is correlation between an environmental factor and another factor, for example the correlation between the temperature and the amount spent on heating. Analysts are looking for high levels of positive or negative correlation, and depending on what is being measured either can be a ‘good correlation’. In this case scientists would expect a negative PCC: as temperature increases, the amount spent on heating decreases.

However, within an exam context a negative PCC is not going to be considered ‘good’. In exams, better performing candidates should do better on good items, so a negative correlation, i.e. one which suggests that the better the candidate the worse they do, is always considered bad.

What would be considered to be a good PCC in an exam context?

Exam administrators may take their own view on this, but as a rough guide the following table illustrates what is commonly accepted in the exam domain:

Rating       PCC range
Very good    0.4 ≤ PCC
Good         0.3 ≤ PCC < 0.4
Fair         0.2 ≤ PCC < 0.3
Poor         -1.0 ≤ PCC < 0.2

Note: ‘Poor’ in the above table refers to the indicated quality of the item rather than the strength of the correlation. A PCC of -1 is a strong correlation but likely indicates a very poor item.
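The bands in the table above can be written as a small lookup. This is a minimal sketch: the function name rate_item is hypothetical, and the thresholds are simply the rough guide given above, which exam administrators may choose to adjust.

```python
def rate_item(pcc):
    """Map a PCC value to the rough item-quality bands from the table above."""
    if not -1.0 <= pcc <= 1.0:
        raise ValueError("a PCC must lie between -1 and 1")
    if pcc >= 0.4:
        return "Very good"
    if pcc >= 0.3:
        return "Good"
    if pcc >= 0.2:
        return "Fair"
    # Everything below 0.2, including strong negative correlations, is rated Poor
    return "Poor"

for value in (0.55, 0.35, 0.2, -0.9):
    print(value, rate_item(value))
```

Note that -0.9 is rated ‘Poor’ even though it is a strong correlation, which matches the note above.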

Examples 

The graphs below clearly illustrate that the PCC is higher where the points are closer to the line of best fit (left), and lower where the points are further from the line of best fit (right).

In both of these graphs there is a positive correlation, indicating that as a candidate’s overall performance increases so does their expected question mark. In the context of an exam, a candidate who does well in an item, question or scenario is likely to also do well in the whole exam.

High PCC (left), Fair PCC (right)

In the graphs below, the points of the left graph are very scattered, producing a low PCC. In the right graph there is a negative correlation. This would mean that the better a candidate does in the exam, the less likely they are to do well on this question, which would suggest a problem with the item, question or scenario!

Low PCC (left), Negative PCC (right)

Why use the Pearson’s product moment correlation coefficient?

The PCC is widely used and recognised as a way to confirm the validity of an exam. The people developing an exam will want to look at items that have a high correlation coefficient (i.e. PCC) to ensure high performing questions are selected. Post exam, poorly performing items can be identified as those with a low PCC and potentially disabled to ensure that the exam is internally consistent – i.e. that the candidates doing well on an item are also likely to do better in the exam as a whole.
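A post-exam check of this kind might look like the sketch below. Everything in it is an assumption made for illustration: the item names and marks are invented, the exam mark is taken to be the simple sum of the item marks, and 0.2 is used as the cut-off because it is the lower bound of ‘Fair’ in the rough guide above.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Return the PCC of two equal-length lists, or None if either list does not vary."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # no variation, so the PCC cannot be calculated
    return sxy / sqrt(sxx * syy)

# Hypothetical marks: one list per item, one entry per candidate
item_marks = {
    "Q1": [1, 2, 3, 4, 5, 5],
    "Q2": [2, 2, 3, 3, 4, 5],
    "Q3": [4, 5, 3, 4, 2, 3],
}
n_candidates = 6

# Assumption: each candidate's exam mark is the sum of their item marks
exam_marks = [sum(marks[i] for marks in item_marks.values())
              for i in range(n_candidates)]

THRESHOLD = 0.2  # assumed cut-off, taken from the rough guide above

item_pcc = {name: pearson_r(marks, exam_marks) for name, marks in item_marks.items()}
flagged = [name for name, r in item_pcc.items() if r is not None and r < THRESHOLD]
print(flagged)
```

With these marks only Q3 is flagged, because candidates with higher exam marks tended to score lower on it.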

Does the PCC show how discriminating an item is?

The short answer to this question is no! You can have two lines of different gradients, both of which have the same PCC value, as in the example below.

The PCC does not indicate how discriminating an item is

The image on the left is more discriminating (a steeper slope and higher DI) than the image on the right, yet they both have the same PCC of 1. In both cases students who did better in the question did better in the exam, but how much better is influenced by how discriminating the question was.
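The same point can be checked numerically. In this sketch the two items are hypothetical: both sets of item marks rise in a perfectly straight line with the exam mark, but at very different rates. The helper names pearson_r and slope are illustrative only.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's product moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def slope(xs, ys):
    """Gradient of the least-squares line of best fit of ys against xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical data: five candidates, two items marked out of 10
exam_marks = [40, 50, 60, 70, 80]
steep_item = [0, 2.5, 5, 7.5, 10]    # marks spread across the full range
shallow_item = [4, 4.5, 5, 5.5, 6]   # marks bunched around the middle

for name, marks in (("steep", steep_item), ("shallow", shallow_item)):
    print(name, round(pearson_r(exam_marks, marks), 3), round(slope(exam_marks, marks), 3))
```

Both items give a PCC of 1, yet the gradient of the first is five times that of the second.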

Having said this, given a well discriminating exam, a good PCC will indirectly indicate a good gradient and therefore a well discriminating item, question or scenario. However, the reverse cannot be said – having a low correlation doesn’t necessarily say anything about the discrimination. It just means it needs to be looked at further to understand why.

Do all items/questions/scenarios in an exam need to be well correlated?

Whilst, as we said above, it is desirable for items, questions or scenarios to have a high PCC indicating they are well correlated with the exam as a whole, essential knowledge questions are less likely to have a high correlation as most people would be expected to get those right. In fact, if an item was essential knowledge and everyone got it right, you wouldn’t be able to calculate the correlation of that item at all: everyone is on the same point on the x axis, and the maths used in the equation does not hold up as it needs two sets of independent values that vary (with no variation in the item marks, the calculation would divide by zero). This is illustrated in the example below, with the PCC being incalculable as everyone got the question right whether they did well or badly in the exam.

The PCC can't be calculated as everyone got the question right

This independence requirement is also why an exam with only 1 scenario cannot have a scenario PCC calculated as its mark is equal to the exam mark.
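The case where everyone gets the item right can be sketched as follows. The function name pcc_or_none and the marks are hypothetical; returning None is just one way of signalling that no value can be calculated.

```python
from math import sqrt

def pcc_or_none(item_marks, exam_marks):
    """Return the PCC, or None when either set of marks has no variation."""
    n = len(item_marks)
    mean_x = sum(item_marks) / n
    mean_y = sum(exam_marks) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(item_marks, exam_marks))
    sxx = sum((x - mean_x) ** 2 for x in item_marks)
    syy = sum((y - mean_y) ** 2 for y in exam_marks)
    if sxx == 0 or syy == 0:
        # The denominator would be zero, so the coefficient is undefined
        return None
    return sxy / sqrt(sxx * syy)

# Hypothetical essential-knowledge item: every candidate scored 1
everyone_right = [1, 1, 1, 1, 1]
exam_marks = [45, 58, 63, 77, 90]

print(pcc_or_none(everyone_right, exam_marks))  # None: the item marks do not vary
```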

PCC and dichotomous items

The graph below shows a typical MCQ where a candidate can either get it right or wrong. In this case the Pearson’s Correlation Coefficient can be quite low because the marking scheme is dichotomous. Middle performing candidates are pushed into the extremes, i.e. 0 or 1, as there are no half marks, which in turn suppresses the PCC value. This is why a PCC of 0.4 can still be considered very good in an exam context.

PCC and dichotomous items
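The effect of right/wrong marking can be seen in a deliberately idealised sketch. The data is hypothetical: ten candidates whose graded item marks follow their exam marks exactly, compared with the same item marked right or wrong, where the top half of candidates score 1 and the rest score 0.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's product moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical exam marks for ten candidates, lowest to highest
exam_marks = list(range(1, 11))

# Graded marking: partial credit that tracks the exam mark exactly
graded_marks = [mark / 10 for mark in exam_marks]

# Right/wrong marking: the same item with no partial credit
right_wrong_marks = [1 if mark > 5 else 0 for mark in exam_marks]

r_graded = pearson_r(exam_marks, graded_marks)
r_right_wrong = pearson_r(exam_marks, right_wrong_marks)
print(round(r_graded, 2), round(r_right_wrong, 2))  # 1.0 0.87
```

Even in this best case, where every stronger candidate gets the item right and every weaker candidate gets it wrong, the PCC falls from 1 to about 0.87. Real items are noisier, so the values seen in practice are lower still.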

A note of warning

Looking at the PCC value on its own without context can be misleading. As with any statistic, the PCC should be used to identify potentially problematic areas, at which point other statistics should be considered. In the context of an exam a subject matter expert may be required to confirm whether the problem lies with the item or the teaching.

As an example, the GIF below illustrates how the PCC, as well as other statistics, can be pretty meaningless if they are not used in context.

The GIF is called the datasaurus dozen. It shows 12 datasets plotted on a graph, each of which has the same x and y means, the same standard deviations and the same PCC, yet the data itself is very different. Even while the points are in transition between the 12 graphs, they still give the same measurements!

 

Datasaurus dozen

 

In conclusion

The Pearson’s product moment correlation coefficient (PCC) is used widely to confirm that the items, questions or scenarios within an exam have high correlation with the total exam mark to ensure that the exam is internally consistent.

It is a flexible statistic that can be calculated at the item, question and scenario levels. Unlike the point biserial correlation, it can handle non-dichotomous items, and unlike the biserial correlation, it can handle more advanced marking schemes. As a result, the option to calculate the PCC is offered within Maxexam at the item, question and scenario levels.

However, in common with other statistics the PCC does need to be used alongside other statistics to confirm that what it is telling you is what you think it is! Maxexam’s post exam analysis tools can help here, by identifying problematic items given a set of desired criteria, one of which is the PCC. This would signal a requirement for further investigation.