Item Response Theory Models (JAGS)

Quarto
R
Academia
Software
Statistics
Published

March 21, 2020

Abstract

This tutorial will focus on the use of Bayesian estimation to fit item response theory (IRT) models …

Keywords

Software, Statistics, JAGS

This tutorial will focus on the use of Bayesian estimation to fit item response theory (IRT) models. BUGS (Bayesian inference Using Gibbs Sampling) is an algorithm and supporting language (resembling R) dedicated to performing the Gibbs sampling implementation of the Markov chain Monte Carlo (MCMC) method. Dialects of the BUGS language are implemented within three main projects:

  1. OpenBUGS - written in Component Pascal.

  2. JAGS - (Just Another Gibbs Sampler) - written in C++.

  3. Stan - a dedicated Bayesian modelling framework written in C++ and implementing Hamiltonian Monte Carlo (HMC) samplers.

Whilst the above programs can be used stand-alone, they do not offer the rich data pre-processing and graphical capabilities of R, and thus they are best accessed from within R itself. As such, there are multiple R packages devoted to interfacing with the various software implementations.

This tutorial will demonstrate how to fit models in JAGS (Plummer 2004) using the package R2jags (Su and Yajima 2015) as an interface, which also requires loading a few other packages.

Overview

Item response theory (IRT) is a paradigm for investigating the relationship between an individual’s response to a single test item and their performance on an overall measure of the ability or trait that item was intended to measure. Many models exist in the IRT field for evaluating how well an item captures an underlying latent trait, but some of the most popular IRT models are logistic IRT models for dichotomous responses. In particular, the main types of models are:

1. The 1 parameter logistic model (1PLM)

2. The 2 parameter logistic model (2PLM)

3. The 3 parameter logistic model (3PLM)

Throughout this tutorial, I assume that the reader has some basic understanding of IRT models and a working knowledge of a software implementation of the JAGS language. However, if this is not the case, excellent sources for learning IRT are Baker and Kim (2004), who provide a mathematically detailed introduction to IRT, and Hambleton, Swaminathan, and Rogers (1991), who give an intuitive introduction to the topic. For an in-depth description of how to implement different types of IRT models in OpenBUGS and JAGS, I also refer to the very nice review by Curtis (2010) and this other online tutorial.

At the core of all the IRT models presented in this tutorial is the Item Response Function (IRF). The IRF gives the probability of getting item \(j\) “correct” as a function of item characteristics and the \(i\)-th individual’s latent trait/ability level (\(\theta_i\)). These item response functions are defined by a logistic curve (i.e. an \(S\)-shaped curve ranging from \(0\) to \(1\)).

1 parameter logistic model (1PLM)

The 1PLM is used for data collected on \(n\) individuals who have each given responses on \(p\) different items. The items have binary outcomes, i.e. the items are scored as \(1\) if correct and \(0\) if not. The \(i\)-th individual in the sample is assumed to have a latent ability \(\theta_i\), and the \(i\)-th individual’s response on the \(j\)-th item is a random variable \(Y_{ij}\) with a Bernoulli distribution. The probability that the \(i\)-th individual correctly answers the \(j\)-th item (i.e. the probability that \(Y_{ij} = 1\)) is assumed to have the following IRF form:

\[ p_{ij} = P(Y_{ij}=1 \mid \theta_i,\delta_j)=\frac{1}{1+\text{exp}(-(\theta_i-\delta_j))}, \]

where \(\delta_j\) is the difficulty parameter for item \(j\) of the test, and is assumed to be normally distributed with mean \(\mu_{\delta}\) and standard deviation \(\sigma_{\delta}\), which can either be fixed by the analyst or given hyperpriors (as in the JAGS code below). Each latent ability parameter \(\theta_i\) is assumed to follow a standard normal distribution.
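To make the formula above concrete, here is a minimal R sketch of the 1PLM item response function (the function name irf_1pl is purely illustrative and is not used elsewhere in this tutorial):

#1PLM item response function: probability of a correct response
irf_1pl <- function(theta, delta) {
  1/(1 + exp(-(theta - delta)))
}
#e.g. a person of average ability facing a fairly easy item (difficulty -1)
irf_1pl(theta = 0, delta = -1)  #approximately 0.73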

Load the data

I read in the data from the file wideformat.csv, which contains (simulated) data from \(n=1000\) individuals taking a \(5\)-item test. Items are coded \(1\) for correct and \(0\) for incorrect responses. When we get descriptives of the data, we see that the items differ in terms of the proportion of people who answered correctly, so we expect that we have some differences in item difficulty here.

data_dicho<-read.csv("wideformat.csv", sep = ",")
head(data_dicho)
NA        ID gender age Item.1 Item.2 Item.3 Item.4 Item.5
NA 1 person1   Male  40      0      0      0      0      0
NA 2 person2 Female  27      0      0      0      0      0
NA 3 person3   Male  13      0      0      0      0      0
NA 4 person4 Female  17      0      0      0      0      1
NA 5 person5 Female  30      0      0      0      0      1
NA 6 person6 Female  46      0      0      0      0      1
#check proportion of correct responses by item
apply(data_dicho[,4:8], 2, sum)/nrow(data_dicho)
NA Item.1 Item.2 Item.3 Item.4 Item.5 
NA  0.924  0.709  0.553  0.763  0.870
#summarise the data
library(psych)
describe(data_dicho)
NA         vars    n   mean     sd median trimmed    mad min  max range  skew
NA ID*        1 1000 500.50 288.82  500.5  500.50 370.65   1 1000   999  0.00
NA gender*    2 1000   1.50   0.50    2.0    1.51   0.00   1    2     1 -0.02
NA age        3 1000  25.37  14.43   25.0   25.36  17.79   1   50    49  0.01
NA Item.1     4 1000   0.92   0.27    1.0    1.00   0.00   0    1     1 -3.20
NA Item.2     5 1000   0.71   0.45    1.0    0.76   0.00   0    1     1 -0.92
NA Item.3     6 1000   0.55   0.50    1.0    0.57   0.00   0    1     1 -0.21
NA Item.4     7 1000   0.76   0.43    1.0    0.83   0.00   0    1     1 -1.24
NA Item.5     8 1000   0.87   0.34    1.0    0.96   0.00   0    1     1 -2.20
NA         kurtosis   se
NA ID*        -1.20 9.13
NA gender*    -2.00 0.02
NA age        -1.21 0.46
NA Item.1      8.22 0.01
NA Item.2     -1.16 0.01
NA Item.3     -1.96 0.02
NA Item.4     -0.48 0.01
NA Item.5      2.83 0.01

Fit the model

We fit the 1PLM to the data. First, I rename and preprocess the data to be passed to JAGS.

Y<-data_dicho[,4:8]
n<-nrow(Y)
p<-ncol(Y)
data_list<-list("Y","n","p")

Then I specify the model using the following JAGS code.

model1<-"
model {
 for (i in 1:n){
  for (j in 1:p){
   Y[i , j] ~ dbern ( prob [i , j])
   logit ( prob [i , j]) <- theta [i] - delta [j]
  }
  theta [i] ~ dnorm (0.0 , 1.0)
 }

 for (j in 1:p){
  delta [j] ~ dnorm (mu.delta , pr.delta )
 }
  pr.delta <- pow(s.delta , -2)
  mu.delta ~ dnorm(0, 5)
  s.delta ~ dunif(0, 10)
  
  for(i in 1:n){
   for(j in 1:p){
    Y.rep[i, j] ~ dbern(prob[i, j])
    loglik_y[i, j]<-logdensity.bern(Y[i,j], prob[i, j])
   }
  }
}
"
## write the model to a text file
writeLines(model1, con = "model1PLM.txt")

Next, I define the nodes (parameters and derivatives) to monitor and the chain parameters.

params <- c("delta", "theta", "prob","loglik_y","Y.rep")
nChains = 2
burnInSteps = 500
thinSteps = 1
numSavedSteps = 2500  #across all chains
nIter = ceiling(burnInSteps + (numSavedSteps * thinSteps)/nChains)
nIter
NA [1] 1750

Start the JAGS model (check the model, load data into the model, specify the number of chains and compile the model). Run the JAGS code via the R2jags interface and the jags function.

library(R2jags)
library(coda)
set.seed(3456)
m1.r2jags <- jags(data = data_list, inits = NULL, parameters.to.save = params,
    model.file = "model1PLM.txt", n.chains = nChains, n.iter = nIter,
    n.burnin = burnInSteps, n.thin = thinSteps)
NA Compiling model graph
NA    Resolving undeclared variables
NA    Allocating nodes
NA Graph information:
NA    Observed stochastic nodes: 5000
NA    Unobserved stochastic nodes: 6007
NA    Total graph size: 26016
NA 
NA Initializing model

Plot the item characteristic curves

Item characteristic curves (ICC) are the logistic curves which result from the fitted models (e.g. estimated item difficulty plugged into the item response function). Latent trait/ability is plotted on the \(x\)-axis (higher values represent higher ability). The probability of a “correct” answer (\(Y_{ij}=1\)) to an item is plotted on the \(y\)-axis.

#see average value of item difficulty
diff<-m1.r2jags$BUGSoutput$sims.list$delta
apply(diff,2,mean)
NA [1] -2.8583972 -1.0620835 -0.2608953 -1.3853516 -2.2159276
#plot icc for each individual with respect to each of the 5 items
theta<-apply(m1.r2jags$BUGSoutput$sims.list$theta, 2, mean)
prob<-apply(m1.r2jags$BUGSoutput$sims.list$prob,c(2,3),mean)
plot(theta,prob[,1], type = "n", ylab = "probability of correct response", xlab="ability",
     xlim = c(-2.5,1), ylim = c(0,1))
lines(theta,prob[,1],col="red")
lines(theta,prob[,2],col="blue")
lines(theta,prob[,3],col="orange")
lines(theta,prob[,4],col="green")
lines(theta,prob[,5],col="black")
legend("bottomright",legend = c("1","2","3","4","5"), lty = c(1), col=c("red","blue","orange","green","black"), bty = "n", cex = 0.5)

We see that item \(3\) is the most difficult item (its curve is farthest to the right), and item \(1\) is the easiest (its curve is farthest to the left). The same conclusions can be drawn by checking the difficulty estimates above.
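As a complementary, purely illustrative check, we can also plug the posterior mean difficulties directly into the IRF over a regular grid of ability values, which gives smooth curves rather than curves traced through the person-level posterior means (this sketch only reuses the diff object computed above):

#sketch: smooth ICCs from the posterior mean difficulties (diff computed above)
theta_grid <- seq(-3, 3, length.out = 200)
delta_hat <- apply(diff, 2, mean)
cols <- c("red","blue","orange","green","black")
plot(NULL, xlim = range(theta_grid), ylim = c(0,1), xlab = "ability",
     ylab = "probability of correct response")
for (j in seq_along(delta_hat)) {
  lines(theta_grid, 1/(1 + exp(-(theta_grid - delta_hat[j]))), col = cols[j])
}
legend("bottomright", legend = 1:5, lty = 1, col = cols, bty = "n", cex = 0.5)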

Plot the item information curves

Item information curves (IIC) show how much “information” about the latent trait an item provides. Mathematically, these are the first derivatives of the ICCs or, equivalently, the product of the probabilities of a correct and an incorrect response. Item information curves peak at the difficulty value (the point where the item discriminates best), with less information at ability levels farther from the difficulty estimate. Practically speaking, a very difficult item will provide very little information about persons with low ability (because the item is already too hard for them), and a very easy item will provide little information about persons with high ability levels.

#plot iic for each individual with respect to each of the 5 items
neg_prob<-1-prob
information<-prob*neg_prob
plot(theta,information[,1], type = "n", ylab = "information", xlab="ability",
     xlim = c(-2.5,1), ylim = c(0,0.3))
lines(theta,information[,1],col="red")
lines(theta,information[,2],col="blue")
lines(theta,information[,3],col="orange")
lines(theta,information[,4],col="green")
lines(theta,information[,5],col="black")
legend("bottomleft",legend = c("1","2","3","4","5"), lty = c(1), col=c("red","blue","orange","green","black"), bty = "n", cex = 0.5)

Similar to the ICCs, we see that item \(3\) provides the most information about high ability levels (the peak of its IIC is farthest to the right) and items \(1\) and \(5\) provide the most information about lower ability levels (the peaks of their IICs are farthest to the left). We have seen that all ICCs and IICs for the items have the same shape in the 1PL model (i.e. all items are equally good at providing information about the latent trait). In the 2PL and 3PL models, we will see that this does not have to be the case.

Next, we plot the information curve for the whole test. This is simply the sum of the individual IICs above. Ideally, we want a test which provides fairly good coverage of a wide range of latent ability levels. Otherwise, the test is only good at identifying a limited range of ability levels.

#plot iic for each individual with respect to whole test
information_test<-apply(information,1,sum)
plot(theta,information_test, type = "n", ylab = "information (test)", xlab="ability",
     xlim = c(-2.5,1), ylim = c(0,1.5))
lines(theta,information_test,col="black",lty=2)

summary(information_test)
NA    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
NA  0.5472  0.5710  0.7669  0.7756  0.9305  1.0786

We see that this test provides the most information about low ability levels (the peak is around ability level \(-1.5\)), and less information about very high ability levels.

Assess fit

We perform posterior predictive checks to test whether individual items fit the 1PLM by comparing quantities computed from the predictions of the model with those from the observed data. If these match reasonably well, then there is indication that the model has a good fit.

library(bayesplot)
library(ggplot2)
Y.rep<-m1.r2jags$BUGSoutput$sims.list$Y.rep

#Bar plot of y with yrep medians and uncertainty intervals superimposed on the bars
ppc_bars(Y[,1],Y.rep[1:8,,1]) + ggtitle("Item 1")

ppc_bars(Y[,2],Y.rep[1:8,,2]) + ggtitle("Item 2")

ppc_bars(Y[,3],Y.rep[1:8,,3]) + ggtitle("Item 3")

ppc_bars(Y[,4],Y.rep[1:8,,4]) + ggtitle("Item 4")

ppc_bars(Y[,5],Y.rep[1:8,,5]) + ggtitle("Item 5")

Plot ability scores

We can conclude by summarising and plotting the latent ability scores of the participants.

#summary stats for theta across both iterations and individuals
summary(theta)
NA      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
NA -2.074587 -0.483828  0.078138  0.001189  0.692185  0.763989
#histogram and kernel density plot of theta averaged across iterations
dens.theta<-density(theta, bw=0.3)
hist(theta, breaks = 5, prob = T)
lines(dens.theta, lwd=2, col="red")

We see that the mean of the ability scores is around \(0\), and the standard deviation about \(1\) (these estimated ability scores are standardised).

2 parameter logistic model (2PLM)

In the 2PLM, the probability that the \(i\)-th individual correctly answers the \(j\)-th item (i.e. the probability that \(Y_{ij} = 1\)) is assumed to have the following IRF form:

\[ p_{ij} = P(Y_{ij}=1 \mid \theta_i,\delta_j,\alpha_j)=\frac{1}{1+\text{exp}(-\alpha_j(\theta_i-\delta_j))}, \]

where \(\alpha_j\) is the discrimination parameter for item \(j\) of the test, and is assumed to be positive and normally distributed (truncated at zero) with mean \(\mu_{\alpha}\) and standard deviation \(\sigma_{\alpha}\), which can be fixed by the analyst or given hyperpriors (as in the JAGS code below). The item discriminability \(\alpha_j\) indicates how well an item is able to discriminate between persons with different ability levels, and it is reflected in the steepness of the slope of the ICC.
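To see how the discrimination parameter controls the steepness of the ICC, here is a small illustrative sketch with two arbitrary values of \(\alpha\) and a common difficulty of zero:

#sketch: same difficulty, two arbitrary discrimination values
theta_grid <- seq(-4, 4, length.out = 200)
p_flat <- 1/(1 + exp(-0.5*(theta_grid - 0)))  #low discrimination: shallow slope
p_steep <- 1/(1 + exp(-2*(theta_grid - 0)))   #high discrimination: steep slope
plot(theta_grid, p_steep, type = "l", col = "blue", xlab = "ability",
     ylab = "probability of correct response")
lines(theta_grid, p_flat, col = "red")
legend("bottomright", legend = c("alpha = 2","alpha = 0.5"),
       col = c("blue","red"), lty = 1, bty = "n", cex = 0.5)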

Fit the model

We fit the 2PLM to the data using the following JAGS code.

model2<-"
model {
 for (i in 1:n){
  for (j in 1:p){
   Y[i , j] ~ dbern ( prob [i , j])
   logit ( prob [i , j]) <- alpha [j] * ( theta [i] - delta [j])
  }
  theta [i] ~ dnorm (0.0 , 1.0)
 }

 # hierarchical priors on item difficulties and (positive) item discriminations
 for (j in 1:p){
  delta [j] ~ dnorm (m.delta , pr.delta )
  alpha [j] ~ dnorm (m.alpha , pr.alpha ) T(0 , )   # truncated at zero
 }
  pr.delta <- pow(s.delta , -2)
  pr.alpha <- pow(s.alpha , -2)
  
  m.delta ~ dnorm(0,5)
  m.alpha ~ dnorm(0,5)
  s.delta ~ dunif(0,10)
  s.alpha ~ dunif(0,10)
  
  for(i in 1:n){
   for(j in 1:p){
    Y.rep[i, j] ~ dbern(prob[i, j])
    loglik_y[i, j]<-logdensity.bern(Y[i,j], prob[i, j])
   }
  }
}
"
## write the model to a text file
writeLines(model2, con = "model2PLM.txt")

Next, I define the nodes (parameters and derivatives) to monitor and the chain parameters.

params <- c("delta", "alpha","theta", "prob","loglik_y","Y.rep")
nChains = 2
burnInSteps = 500
thinSteps = 1
numSavedSteps = 2500  #across all chains
nIter = ceiling(burnInSteps + (numSavedSteps * thinSteps)/nChains)
nIter
NA [1] 1750

Start the JAGS model (check the model, load data into the model, specify the number of chains and compile the model). Run the JAGS code via the R2jags interface and the jags function.

m2.r2jags <- jags(data = data_list, inits = NULL, parameters.to.save = params,
    model.file = "model2PLM.txt", n.chains = nChains, n.iter = nIter,
    n.burnin = burnInSteps, n.thin = thinSteps)
NA Compiling model graph
NA    Resolving undeclared variables
NA    Allocating nodes
NA Graph information:
NA    Observed stochastic nodes: 5000
NA    Unobserved stochastic nodes: 6014
NA    Total graph size: 31024
NA 
NA Initializing model

Plot the item characteristic curves

Item characteristic curves (ICC) are the logistic curves which result from the fitted models (e.g. estimated item difficulty and discrimination plugged into the item response function). Latent trait/ability is plotted on the \(x\)-axis (higher values represent higher ability). The probability of a “correct” answer (\(Y_{ij}=1\)) to an item is plotted on the \(y\)-axis.

discr<-m2.r2jags$BUGSoutput$sims.list$alpha
diff<-m2.r2jags$BUGSoutput$sims.list$delta
#see average value of item difficulty
apply(diff,2,mean)
NA [1] -3.4898494 -1.4275914 -0.3210807 -1.8517091 -3.0020341
#see average value of item discriminability
apply(discr,2,mean)
NA [1] 0.8300819 0.7204114 0.7766516 0.7243196 0.7142145
#plot icc for each individual with respect to each of the 5 items
theta<-apply(m2.r2jags$BUGSoutput$sims.list$theta, 2, mean)
prob<-apply(m2.r2jags$BUGSoutput$sims.list$prob,c(2,3),mean)
plot(theta,prob[,1], type = "n", ylab = "probability of correct response", xlab="ability",
     xlim = c(-2.5,1), ylim = c(0,1))
lines(theta,prob[,1],col="red")
lines(theta,prob[,2],col="blue")
lines(theta,prob[,3],col="orange")
lines(theta,prob[,4],col="green")
lines(theta,prob[,5],col="black")
legend("bottomright",legend = c("1","2","3","4","5"), lty = c(1), col=c("red","blue","orange","green","black"), bty = "n", cex = 0.5)

Unlike the ICCs for the 1PLM, the ICCs for the 2PLM do not all have the same shape. Item curves which are more “spread out” indicate lower discriminability (i.e. that individuals of a range of ability levels have some probability of getting the item correct). Compare this to an item with high discriminability (steep slope): for this item, we have a better estimate of the individual’s latent ability based on whether they got the question right or wrong. Because of the differing slopes, the rank-order of item difficulty changes across different latent ability levels. We can see that item \(3\) is still the most difficult item (i.e. lowest probability of getting correct for most latent trait values, up until about \(\theta=0.2\)). Items \(1\) and \(5\) are the easiest.

Plot the item information curves

Item information curves (IIC) show how much “information” about the latent trait an item provides. Mathematically, these are the first derivatives of the ICCs or, equivalently, the product of the probabilities of a correct and an incorrect response. Item information curves peak at the difficulty value (the point where the item discriminates best), with less information at ability levels farther from the difficulty estimate. Practically speaking, a very difficult item will provide very little information about persons with low ability (because the item is already too hard for them), and a very easy item will provide little information about persons with high ability levels.

#plot iic for each individual with respect to each of the 5 items
neg_prob<-1-prob
information<-prob*neg_prob
information2<-information*(apply(discr,2,mean))^2
plot(theta,information2[,1], type = "n", ylab = "information", xlab="ability",
     xlim = c(-2.5,1), ylim = c(0,0.3))
lines(theta,information2[,1],col="red")
lines(theta,information2[,2],col="blue")
lines(theta,information2[,3],col="orange")
lines(theta,information2[,4],col="green")
lines(theta,information2[,5],col="black")
legend("bottomleft",legend = c("1","2","3","4","5"), lty = c(1), col=c("red","blue","orange","green","black"), bty = "n", cex = 0.5)

The item IICs demonstrate that some items provide more information about latent ability for different ability levels. The higher the item discriminability estimate, the more information an item provides about ability levels around the point where there is a \(50\%\) chance of getting the item right (i.e. the steepest point in the ICC slope). For example, item \(3\) (orange) clearly provides the most information at high ability levels, around \(\theta=-0.5\), but almost no information about low ability levels (\(< -1\)) because the item is already too hard for those participants. In contrast, item \(1\) (red), which has low discriminability, doesn’t give very much information overall, but covers a wide range of ability levels.

Next, we plot the item information curve for the whole test. This is the sum of all the item IICs above.

#plot iic for each individual with respect to whole test
information_test<-apply(information2,1,sum)
plot(theta,information_test, type = "n", ylab = "information (test)", xlab="ability",
     xlim = c(-2.5,1), ylim = c(0,1.5))
lines(theta,information_test,col="black",lty=2)

summary(information_test)
NA    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
NA  0.3242  0.3973  0.4500  0.4541  0.5182  0.7528

The IIC for the whole test shows that the test provides the most information for slightly lower than average ability levels (about \(\theta=-1\)), but does not provide much information about extremely high ability levels.

Assess fit

Next, we check how well the 2PLM fits the data.

Y.rep<-m2.r2jags$BUGSoutput$sims.list$Y.rep

#Bar plot of y with yrep medians and uncertainty intervals superimposed on the bars
ppc_bars(Y[,1],Y.rep[1:8,,1]) + ggtitle("Item 1")

ppc_bars(Y[,2],Y.rep[1:8,,2]) + ggtitle("Item 2")

ppc_bars(Y[,3],Y.rep[1:8,,3]) + ggtitle("Item 3")

ppc_bars(Y[,4],Y.rep[1:8,,4]) + ggtitle("Item 4")

ppc_bars(Y[,5],Y.rep[1:8,,5]) + ggtitle("Item 5")

We can also compare the fit of the 1PLM and 2PLM using relative measures of fit, or information criteria. These are computed from the deviance plus a penalty for model complexity known as the effective number of parameters. Here we consider two Bayesian measures, the widely applicable information criterion (WAIC) and the leave-one-out information criterion (LOOIC), which can easily be obtained through the functions waic and loo in the package loo.

library(loo)
#extract log-likelihood
loglik_m1<-m1.r2jags$BUGSoutput$sims.list$loglik_y
loglik_m2<-m2.r2jags$BUGSoutput$sims.list$loglik_y

#waic
waic_m1<-waic(loglik_m1)
waic_m2<-waic(loglik_m2)

#looic
looic_m1<-loo(loglik_m1)
looic_m2<-loo(loglik_m2)

#compare
table_waic<-rbind(waic_m1$estimates[2:3,1],waic_m2$estimates[2:3,1])
table_looic<-rbind(looic_m1$estimates[2:3,1],looic_m2$estimates[2:3,1])
rownames(table_waic)<-rownames(table_looic)<-c("1PLM","2PLM")
knitr::kable(cbind(table_waic,table_looic), "pandoc", align = "c")
         p_waic       waic      p_loo      looic
-----  ---------  ---------  ---------  ---------
1PLM    1.525000   6.517631   1.867805   7.203242
2PLM    1.395955   6.446221   1.649366   6.953044

Both criteria suggest that the 2PLM has a slightly better fit to the data.

Plot ability scores

Plot the density curve of the estimated ability scores

#summary stats for theta across both iterations and individuals
summary(theta)
NA      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
NA -1.917888 -0.419394  0.078240  0.001822  0.608518  0.677354
#histogram and kernel density plot of theta averaged across iterations
dens.theta<-density(theta, bw=0.3)
hist(theta, breaks = 5, prob = T)
lines(dens.theta, lwd=2, col="red")

We see that the mean of the ability scores is around \(0\), and the standard deviation about \(1\) (these estimated ability scores are standardised).

3 parameter logistic model (3PLM)

In the 3PLM, the probability that the \(i\)-th individual correctly answers the \(j\)-th item (i.e. the probability that \(Y_{ij} = 1\)) is assumed to have the following IRF form:

\[ p_{ij} = P(Y_{ij}=1 \mid \theta_i,\delta_j,\alpha_j,\eta_j)=\eta_j + (1-\eta_j) \frac{1}{1+\text{exp}(-\alpha_j(\theta_i-\delta_j))}, \]

where \(\eta_j\) is the guessing parameter. Under this model, even individuals with very low ability have a nonzero chance of endorsing any item, just by guessing randomly. The guessing parameter is reflected in the lower asymptote (i.e. the minimum probability of a correct response) of the ICC. Since \(\eta_j\) is a probability, it is given a Beta prior with shape parameters \(a_{\eta}\) and \(b_{\eta}\), which can be fixed by the analyst or given hyperpriors (as in the JAGS code below).
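A quick illustrative sketch of the 3PLM IRF (with arbitrary parameter values) makes the role of the guessing parameter visible as a lower asymptote:

#sketch: 3PLM IRF with an arbitrary guessing value of 0.25
theta_grid <- seq(-4, 4, length.out = 200)
eta_ex <- 0.25; alpha_ex <- 1.5; delta_ex <- 0
p3 <- eta_ex + (1 - eta_ex)/(1 + exp(-alpha_ex*(theta_grid - delta_ex)))
plot(theta_grid, p3, type = "l", ylim = c(0,1), xlab = "ability",
     ylab = "probability of correct response")
abline(h = eta_ex, lty = 2)  #the curve never drops below the guessing probability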

Fit the model

We fit the 3PLM to the data using the following JAGS code.

model3<-"
model {
 for (i in 1:n){
  for (j in 1:p){
   Y[i , j] ~ dbern ( prob [i , j])
   # 2PL success probability mixed with the guessing floor eta[j]
   logit ( prob.star [i , j]) <- alpha [j]*( theta [i] - delta [j])
   prob [i , j] <- eta[j] + (1 - eta[j]) * prob.star [i , j]
  }
  theta [i] ~ dnorm (0.0 , 1.0)
 }

 for (j in 1:p){
  delta [j] ~ dnorm (m.delta , pr.delta )
  alpha [j] ~ dnorm (m.alpha , pr.alpha ) T(0 , )
  eta[j] ~ dbeta (a.eta , b.eta)   # guessing probability in (0,1)
 }
  pr.delta <- pow(s.delta , -2)
  pr.alpha <- pow(s.alpha , -2)
  
  m.delta ~ dnorm(0,5)
  m.alpha ~ dnorm(0,5)
  s.delta ~ dunif(0,10)
  s.alpha ~ dunif(0,10)
  a.eta ~ dunif(0,100)
  b.eta ~ dunif(0,100)
    
  for(i in 1:n){
   for(j in 1:p){
    Y.rep[i, j] ~ dbern(prob[i, j])
    loglik_y[i, j]<-logdensity.bern(Y[i,j], prob[i, j])
   }
  }
}
"
## write the model to a text file
writeLines(model3, con = "model3PLM.txt")

Next, I define the nodes (parameters and derivatives) to monitor and the chain parameters.

params <- c("delta", "alpha", "eta","theta", "prob","loglik_y","Y.rep")
nChains = 2
burnInSteps = 500
thinSteps = 1
numSavedSteps = 2500  #across all chains
nIter = ceiling(burnInSteps + (numSavedSteps * thinSteps)/nChains)
nIter
NA [1] 1750

Start the JAGS model (check the model, load data into the model, specify the number of chains and compile the model). Run the JAGS code via the R2jags interface and the jags function.

m3.r2jags <- jags(data = data_list, inits = NULL, parameters.to.save = params,
    model.file = "model3PLM.txt", n.chains = nChains, n.iter = nIter,
    n.burnin = burnInSteps, n.thin = thinSteps)
NA Compiling model graph
NA    Resolving undeclared variables
NA    Allocating nodes
NA Graph information:
NA    Observed stochastic nodes: 5000
NA    Unobserved stochastic nodes: 6021
NA    Total graph size: 41037
NA 
NA Initializing model

Plot the item characteristic curves

Item characteristic curves (ICC) are the logistic curves which result from the fitted models (e.g. estimated item difficulty, discrimination, and guessing plugged into the item response function). Latent trait/ability is plotted on the \(x\)-axis (higher values represent higher ability). The probability of a “correct” answer (\(Y_{ij}=1\)) to an item is plotted on the \(y\)-axis.

discr<-m3.r2jags$BUGSoutput$sims.list$alpha
diff<-m3.r2jags$BUGSoutput$sims.list$delta
gues<-m3.r2jags$BUGSoutput$sims.list$eta
#see average value of item difficulty
apply(diff,2,mean)
NA [1] -2.8361776 -0.6672983  0.3323608 -1.0648926 -2.1758250
#see average value of item discriminability
apply(discr,2,mean)
NA [1] 0.9249227 1.1278880 1.4997368 0.8733855 0.8561385
#see average value of item guessing
apply(gues,2,mean)
NA [1] 0.2446947 0.2360422 0.2150625 0.2438348 0.2395731
#plot icc for each individual with respect to each of the 5 items
theta<-apply(m3.r2jags$BUGSoutput$sims.list$theta, 2, mean)
prob<-apply(m3.r2jags$BUGSoutput$sims.list$prob,c(2,3),mean)
plot(theta,prob[,1], type = "n", ylab = "probability of correct response", xlab="ability",
     xlim = c(-2.5,1), ylim = c(0,1))
lines(theta,prob[,1],col="red")
lines(theta,prob[,2],col="blue")
lines(theta,prob[,3],col="orange")
lines(theta,prob[,4],col="green")
lines(theta,prob[,5],col="black")
legend("bottomright",legend = c("1","2","3","4","5"), lty = c(1), col=c("red","blue","orange","green","black"), bty = "n", cex = 0.5)

The slopes of the ICCs look very similar to those of the 2PLM. We can see that all items have lower asymptotes greater than zero, so that even at very low ability levels there is some chance of getting these items correct (via guessing).

Plot the item information curves

Item information curves (IIC) show how much “information” about the latent trait an item provides. Mathematically, these are the first derivatives of the ICCs or, equivalently, the product of the probabilities of a correct and an incorrect response. Item information curves peak at the difficulty value (the point where the item discriminates best), with less information at ability levels farther from the difficulty estimate. Practically speaking, a very difficult item will provide very little information about persons with low ability (because the item is already too hard for them), and a very easy item will provide little information about persons with high ability levels.

Here I plot the IICs using points rather than lines, to better display the individual-level patterns, which vary substantially depending on how much of the probability of a correct response is attributable to guessing.

#plot iic for each individual with respect to each of the 5 items
neg_prob<-1-prob
information.p1<-neg_prob/prob
information.p2<-(prob-apply(gues,2,mean))^2/(1-apply(gues,2,mean))^2
information3<-(apply(discr,2,mean))^2*(information.p2)*(information.p1)
plot(theta,information3[,1], type = "n", ylab = "information", xlab="ability",
     xlim = c(-2.5,1), ylim = c(0,0.7))
points(theta,information3[,1],col="red")
points(theta,information3[,2],col="blue")
points(theta,information3[,3],col="orange")
points(theta,information3[,4],col="green")
points(theta,information3[,5],col="black")
legend("bottomleft",legend = c("1","2","3","4","5"), lty = c(1), col=c("red","blue","orange","green","black"), bty = "n", cex = 0.5)

Next, we plot the item information curve for the whole test. This is the sum of all the item IICs above.

#plot iic for each individual with respect to whole test
information_test<-apply(information3,1,sum)
plot(theta,information_test, type = "n", ylab = "information (test)", xlab="ability",
     xlim = c(-2.5,1), ylim = c(0,1.5))
points(theta,information_test,col="black",lty=2)

summary(information_test)
NA    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
NA  0.3793  0.4489  0.5062  0.6978  0.7956  1.5096

Assess fit

Next, we check how well the 3PLM fits the data.

Y.rep<-m3.r2jags$BUGSoutput$sims.list$Y.rep

#Bar plot of y with yrep medians and uncertainty intervals superimposed on the bars
ppc_bars(Y[,1],Y.rep[1:8,,1]) + ggtitle("Item 1")

ppc_bars(Y[,2],Y.rep[1:8,,2]) + ggtitle("Item 2")

ppc_bars(Y[,3],Y.rep[1:8,,3]) + ggtitle("Item 3")

ppc_bars(Y[,4],Y.rep[1:8,,4]) + ggtitle("Item 4")

ppc_bars(Y[,5],Y.rep[1:8,,5]) + ggtitle("Item 5")

We can also compare the fit of the 1PLM, 2PLM and 3PLM using relative measures of fit, or information criteria. These are computed from the deviance plus a penalty for model complexity known as the effective number of parameters. Here we again consider the widely applicable information criterion (WAIC) and the leave-one-out information criterion (LOOIC), obtained through the functions waic and loo in the package loo.

#extract log-likelihood
loglik_m3<-m3.r2jags$BUGSoutput$sims.list$loglik_y

#waic
waic_m3<-waic(loglik_m3)

#looic
looic_m3<-loo(loglik_m3)

#compare
table_waic<-rbind(waic_m1$estimates[2:3,1],waic_m2$estimates[2:3,1],waic_m3$estimates[2:3,1])
table_looic<-rbind(looic_m1$estimates[2:3,1],looic_m2$estimates[2:3,1],looic_m3$estimates[2:3,1])
rownames(table_waic)<-rownames(table_looic)<-c("1PLM","2PLM","3PLM")
knitr::kable(cbind(table_waic,table_looic), "pandoc", align = "c")
         p_waic       waic      p_loo      looic
-----  ---------  ---------  ---------  ---------
1PLM    1.525000   6.517631   1.867805   7.203242
2PLM    1.395955   6.446221   1.649366   6.953044
3PLM    1.368535   6.395247   1.654432   6.967042

The two criteria disagree slightly here: WAIC favours the 3PLM, whereas LOOIC marginally favours the 2PLM; in either case the differences between the 2PLM and 3PLM are small.

Plot ability scores

Plot the density curve of the estimated ability scores

#summary stats for theta across both iterations and individuals
summary(theta)
NA       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
NA -1.7435119 -0.4170588  0.0471025  0.0006053  0.6298597  0.7189831
#histogram and kernel density plot of theta averaged across iterations
dens.theta<-density(theta, bw=0.3)
hist(theta, breaks = 5, prob = T)
lines(dens.theta, lwd=2, col="red")

We see that the mean of the ability scores is around \(0\), and the standard deviation about \(1\) (these estimated ability scores are standardised).

Conclusions

The use of JAGS software to estimate IRT models allows the user to alter existing code to fit new variations of current models that cannot be fit in existing software packages. For example, longitudinal or multilevel data can easily be accommodated by small changes to existing JAGS code. The JAGS software takes care of the “grunt work” involved in estimating model parameters by constructing an MCMC algorithm to sample from the posterior distribution. Using JAGS frees the user to experiment with different models that may be more appropriate for specialised data than the models that can currently be fit in other software packages. Of course, more complicated models involve more parameters than simpler models, and the analyst must specify prior distributions for these new parameters. This is a small price to pay, however, for the flexibility that the Bayesian framework and JAGS software provide.
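As an example of the kind of small change involved, here is a hedged sketch (not run in this tutorial) of a multilevel variant of the 1PLM: assuming each person belongs to one of G known groups supplied in the data as a hypothetical index vector group, the only substantive change relative to model1 above is that the abilities get group-specific means.

#sketch only: multilevel 1PLM with group-specific mean abilities
#(group and G are hypothetical data objects, not part of wideformat.csv)
model1_ml <- "
model {
 for (i in 1:n){
  for (j in 1:p){
   Y[i, j] ~ dbern(prob[i, j])
   logit(prob[i, j]) <- theta[i] - delta[j]
  }
  theta[i] ~ dnorm(mu.group[group[i]], 1.0)  # ability centred on the group mean
 }
 for (g in 1:G){
  mu.group[g] ~ dnorm(0, 0.5)                # hyperprior on the group means
 }
 for (j in 1:p){
  delta[j] ~ dnorm(mu.delta, pr.delta)
 }
 pr.delta <- pow(s.delta, -2)
 mu.delta ~ dnorm(0, 5)
 s.delta ~ dunif(0, 10)
}
"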

References

Baker, Frank B., and Seock-Ho Kim. 2004. Item Response Theory: Parameter Estimation Techniques. CRC Press.
Curtis, S. McKay. 2010. “BUGS Code for Item Response Theory.” Journal of Statistical Software 36 (1): 1–34.
Hambleton, Ronald K., Hariharan Swaminathan, and H. Jane Rogers. 1991. Fundamentals of Item Response Theory. Sage.
Plummer, Martyn. 2004. “JAGS: Just Another Gibbs Sampler.”
Su, Yu-Sung, and Masanao Yajima. 2015. “Package ‘R2jags’.” R package version 0.03-08. https://CRAN.R-project.org/package=R2jags.