Better Measurement with Item Response Theory

class: center, middle, inverse, title-slide

# Better Measurement with Item Response Theory
### Ben Stenhaug, Stanford University
### December 17, 2019

---

# Organization

These slides, the code that produced them, and the option of opening that code in an RStudio Cloud environment are available at

**tinyurl.com/irt-basics**

which redirects to

**stenhaug.github.io/irt-basics**

---

# The power of Item Response Theory (IRT)

In a world with more and more big, naturally occurring data, IRT offers a few promises:

1. Understand and leverage item variability

2. More precise measures of latent constructs

3. More information with fewer data points

---

# Wordbank example

Wordbank (wordbank.stanford.edu) provides open source data from over 80k MacArthur-Bates Communicative Development Inventory (MB-CDI) administrations.

---

# Warm up: Answer with a partner

1. Who is the highest ability person? Who is the lowest ability person?
2. Which item is the hardest? Which is the easiest?
3. Which item is the best? Which is the worst?
4. Who has higher ability: person D or person I?
5. Estimate the probability of person G getting item 2 correct.

<small>

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> person </th>
   <th style="text-align:right;"> item 1 </th>
   <th style="text-align:right;"> item 2 </th>
   <th style="text-align:right;"> item 3 </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> A </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 0 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> B </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 1 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> C </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 0 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> D </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 1 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> E </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 0 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> F </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 1 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> G </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> NA </td>
   <td style="text-align:right;"> 0 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> H </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 1 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> I </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> J </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 1 </td>
  </tr>
</tbody>
</table>

</small>

---

# What is measurement?

1. You're interested in a latent construct (math ability, extroversion, anxiety, etc.)

2. You measure that latent construct by giving people items (which we'll call a test)

3. You do some science with that measurement

---

# Relevant questions

1. Is this a good test? Are some items better than others?

2. Does this test measure the latent construct I care about?

3. Is this test fair?

4. How do we get from responses to the items to a measure of the latent trait?

---

# How do I get from responses to the latent trait?

<small>

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> child </th>
   <th style="text-align:right;"> mommy </th>
   <th style="text-align:right;"> yesterday </th>
   <th style="text-align:right;"> trash </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> A </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 0 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> B </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 1 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> C </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 0 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> D </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 1 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> E </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 0 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> F </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 1 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> G </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> NA </td>
   <td style="text-align:right;"> 0 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> H </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 1 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> I </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> J </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 1 </td>
  </tr>
</tbody>
</table>

</small>

---

# The sum score

1. What assumptions does it make?

2. What are its limitations?

---

# The sum score

## Assumptions

1. Items are equally difficult

2. Items are equally related to the latent construct

3. A response of 1 on every item is positively related to the construct

## Limitations

1. How do I handle missing data?

2. How do I make predictions?

3. How do I make an adaptive test?
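
As a minimal sketch in base R, here is the sum score for the warm-up table. The matrix `responses` is typed in by hand from that table, and the object names are only illustrative. How to treat the `NA` is a choice the sum score itself cannot make for us.

```r
# Responses from the warm-up table (rows = persons A-J, columns = items 1-3)
responses <- matrix(
  c(1, 0, 0)[c(2, 2, 2)],
  ncol = 3
)
responses <- matrix(
  c(0, 0, 0,
    1, 0, 1,
    1, 0, 0,
    1, 0, 1,
    1, 0, 0,
    1, 0, 1,
    1, NA, 0,
    1, 0, 1,
    1, 1, 0,
    1, 1, 1),
  ncol = 3, byrow = TRUE,
  dimnames = list(LETTERS[1:10], paste("item", 1:3))
)

# The plain sum score is undefined for person G because of the missing response
sum_score <- rowSums(responses)

# Dropping the NA silently treats the missing response as incorrect
sum_score_na_rm <- rowSums(responses, na.rm = TRUE)

# D and I get the same sum score despite different response patterns
sum_score[c("D", "I")]
```

Persons D and I are indistinguishable here even though I answered the item that almost nobody else did.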

---

# Item Response Theory (IRT) to the rescue!

A parametric framework for item response data

Each person `\(p\)` has an ability `\(\theta_p\)`

Each item `\(i\)` has an easiness `\(b_i\)`

These combine to give the probability of a correct response

---

# The logistic function

We use the logistic function `\(\sigma(x) = \dfrac{\exp(x)}{1 + \exp(x)}\)` to map the sum of ability and easiness to the probability of a correct response
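
A quick sketch of what the mapping does, using arbitrary illustrative inputs: very negative sums go to probabilities near 0, a sum of 0 goes to exactly 0.5, and very positive sums go to probabilities near 1.

```r
logistic <- function(x) exp(x) / (1 + exp(x))

# Ability + easiness values from very negative to very positive
x <- c(-4, -2, 0, 2, 4)
p <- logistic(x)
round(p, 3)

# Base R's plogis() computes the same function
all.equal(p, plogis(x))
```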

---

# Looking at easiness

---

# Question: Probability of responses

For three items with easiness 2, 0, and -2 (in that order):

1. Calculate P(correct, correct, incorrect | ability = 0)

2. Calculate P(correct, correct, incorrect | ability = 1)

---

# Answer: Probability of responses

```r
logistic <- function(x) {exp(x) / (1 + exp(x))}
```

1. Calculate P(correct, correct, incorrect | ability = 0)

```r
logistic(2 + 0) * logistic(0 + 0) * (1 - logistic(-2 + 0))
```

```
## [1] 0.3879017
```

2. Calculate P(correct, correct, incorrect | ability = 1)

```r
logistic(2 + 1) * logistic(0 + 1) * (1 - logistic(-2 + 1))
```

```
## [1] 0.5091
```
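
Both answers follow the same recipe, so it can be wrapped in a function. This is a sketch: `pattern_likelihood` is a made-up helper name, the easiness values 2, 0, and -2 are the ones used in the answers above, and skipping `NA` responses is an assumption (it treats a missing response as carrying no information).

```r
logistic <- function(x) exp(x) / (1 + exp(x))

# Likelihood of a 0/1 response pattern given ability, under the Rasch model.
# Missing responses (NA) are skipped, i.e. they contribute no information.
pattern_likelihood <- function(theta, easiness, responses) {
  observed <- !is.na(responses)
  p <- logistic(theta + easiness[observed])
  y <- responses[observed]
  prod(p^y * (1 - p)^(1 - y))
}

easiness <- c(2, 0, -2)

pattern_likelihood(0, easiness, c(1, 1, 0))  # same calculation as answer 1
pattern_likelihood(1, easiness, c(1, 1, 0))  # same calculation as answer 2

# A pattern with a missing response still has a well-defined likelihood
pattern_likelihood(0, easiness, c(1, NA, 0))
```

Unlike the sum score, nothing special is needed for missing data: the likelihood simply uses the responses that were observed.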

---

# Who uses IRT?

Basically any measurement that happens in education:

- The Programme for International Student Assessment (PISA)

- State tests

- GRE

- Department of Motor Vehicles

Very common in other fields as well:

- Psychology

- Health

- Economics

---

# IRT in practice

We'll show the power of IRT with the Wordbank data (wordbank.stanford.edu)

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> sex </th>
   <th style="text-align:right;"> age </th>
   <th style="text-align:right;"> yum yum </th>
   <th style="text-align:right;"> bee </th>
   <th style="text-align:right;"> cockadoodledoo </th>
   <th style="text-align:right;"> buy </th>
   <th style="text-align:right;"> camping </th>
   <th style="text-align:right;"> moo </th>
   <th style="text-align:right;"> ouch </th>
   <th style="text-align:right;"> aunt </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Female </td>
   <td style="text-align:right;"> 27 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 1 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Female </td>
   <td style="text-align:right;"> 21 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 1 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Female </td>
   <td style="text-align:right;"> 26 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Male </td>
   <td style="text-align:right;"> 27 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Female </td>
   <td style="text-align:right;"> 19 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Female </td>
   <td style="text-align:right;"> 30 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 1 </td>
  </tr>
</tbody>
</table>

---

# Items

---

# Children

---

# Fit item parameters

## code

```r
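library(mirt)   # mirt() for fitting IRT models
library(dplyr)  # %>% and select()

# Fit a unidimensional Rasch model to the item columns only
irt_model_rasch <-
  mirt(
    data = english_words %>% select(-sex, -age),
    model = 1,
    itemtype = "Rasch",
    verbose = FALSE
  )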
```

---

## item curves

---

# Ability Estimates

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> sex </th>
   <th style="text-align:right;"> age </th>
   <th style="text-align:right;"> sum_score </th>
   <th style="text-align:right;"> theta_rasch </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Female </td>
   <td style="text-align:right;"> 27 </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> -0.8383632 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Female </td>
   <td style="text-align:right;"> 21 </td>
   <td style="text-align:right;"> 6 </td>
   <td style="text-align:right;"> 1.3984515 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Female </td>
   <td style="text-align:right;"> 26 </td>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> -0.1224427 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Male </td>
   <td style="text-align:right;"> 27 </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> -0.8383632 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Female </td>
   <td style="text-align:right;"> 19 </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> -0.8383632 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Female </td>
   <td style="text-align:right;"> 30 </td>
   <td style="text-align:right;"> 7 </td>
   <td style="text-align:right;"> 2.3204348 </td>
  </tr>
</tbody>
</table>
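
In the table above, the three children with a sum score of 3 all share the same `theta_rasch`. That is expected: under the Rasch model, when everyone answers the same items, the ability estimate depends on the responses only through the sum score. The sketch below checks this with a simple maximum likelihood search in base R. The easiness values and helper names are hypothetical, and mirt computes its ability estimates differently, so the numbers are not meant to match the table.

```r
logistic <- function(x) exp(x) / (1 + exp(x))

# Rasch log-likelihood of a response pattern as a function of ability
log_lik <- function(theta, easiness, responses) {
  p <- logistic(theta + easiness)
  sum(responses * log(p) + (1 - responses) * log(1 - p))
}

# Maximum likelihood ability estimate found by a one-dimensional search
estimate_theta <- function(easiness, responses) {
  optimize(log_lik, interval = c(-6, 6), easiness = easiness,
           responses = responses, maximum = TRUE)$maximum
}

# Hypothetical easiness values for four items
easiness <- c(1.5, 0.5, -0.5, -1.5)

# Two different patterns with the same sum score of 2
theta_a <- estimate_theta(easiness, c(1, 1, 0, 0))
theta_b <- estimate_theta(easiness, c(0, 1, 0, 1))

c(theta_a, theta_b)
```

The two estimates agree up to the tolerance of the search, even though the second child got a harder item right.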

---

# Ability estimates by sex

---

# Wait a second

---

# Moving from Rasch to 2PL

## Rasch

Each person has ability `\(\theta_p\)`. Each item has easiness `\(b_i\)`.

`\(P(y_{pi} = 1 | \theta_p, b_i) = \sigma(\theta_p + b_i)\)`

where

`\(\sigma(x) = \dfrac{\exp(x)}{1 + \exp(x)}\)`

## 2PL

Each person has ability `\(\theta_p\)`. Each item has easiness `\(b_i\)` and discrimination `\(a_i\)`.

`\(P(y_{pi} = 1 | \theta_p, b_i, a_i) = \sigma(a_i \cdot \theta_p + b_i)\)`

---

# Discrimination

The discrimination `\(a_i\)` describes the strength of the relationship between the item and ability
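
A small sketch of what that means in numbers, using two hypothetical items with easiness 0: the same gap in ability moves the probability of a correct response much more on the high-discrimination item.

```r
logistic <- function(x) exp(x) / (1 + exp(x))

# 2PL probability of a correct response
p_2pl <- function(theta, a, b) logistic(a * theta + b)

theta <- c(-1, 1)

# Low discrimination: probability changes little between the two abilities
low <- p_2pl(theta, a = 0.5, b = 0)

# High discrimination: probability changes a lot between the same abilities
high <- p_2pl(theta, a = 2, b = 0)

diff(low)
diff(high)
```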

---

# Question: Weighting

Which of the outcomes is more likely for a person with ability `\(\theta_p = 2\)`? (The easiness of each item is 0).

<table>
 <thead>
  <tr>
   <th style="text-align:right;"> item discrimination </th>
   <th style="text-align:left;"> outcome 1 </th>
   <th style="text-align:left;"> outcome 2 </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 0.5 </td>
   <td style="text-align:left;"> correct </td>
   <td style="text-align:left;"> correct </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1.0 </td>
   <td style="text-align:left;"> incorrect </td>
   <td style="text-align:left;"> correct </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 2.0 </td>
   <td style="text-align:left;"> incorrect </td>
   <td style="text-align:left;"> correct </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 3.0 </td>
   <td style="text-align:left;"> correct </td>
   <td style="text-align:left;"> incorrect </td>
  </tr>
</tbody>
</table>

---

# Answer: Weighting

Which of the outcomes is more likely for a person with ability `\(\theta_p = 2\)`? (The easiness of each item is 0).

Outcome 1

```r
logistic(0.5 * 2 + 0) * 
  (1 - logistic(1 * 2 + 0)) * 
  (1 - logistic(2 * 2 + 0)) * 
  logistic(3 * 2 + 0)
```

```
## [1] 0.00156352
```

Outcome 2

```r
logistic(0.5 * 2 + 0) * 
  logistic(1 * 2 + 0) * 
  logistic(2 * 2 + 0) * 
  (1 - logistic(3 * 2 + 0))
```

```
## [1] 0.00156352
```
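
The tie is not a coincidence. With every easiness equal to 0, the 2PL likelihood depends on the responses only through the discrimination-weighted sum of correct answers, and both outcomes have the same weighted sum. A sketch (the helper name `likelihood_2pl` is made up for illustration):

```r
logistic <- function(x) exp(x) / (1 + exp(x))

a <- c(0.5, 1, 2, 3)          # item discriminations from the table
outcome_1 <- c(1, 0, 0, 1)    # correct, incorrect, incorrect, correct
outcome_2 <- c(1, 1, 1, 0)    # correct, correct, correct, incorrect

# Discrimination-weighted sum scores
weighted_1 <- sum(a * outcome_1)
weighted_2 <- sum(a * outcome_2)

# Likelihood of a pattern under 2PL with all easiness values equal to 0
likelihood_2pl <- function(theta, a, responses) {
  p <- logistic(a * theta)
  prod(p^responses * (1 - p)^(1 - responses))
}

# The likelihoods agree at every ability, not just at theta = 2
thetas <- c(-2, 0, 2)
lik_1 <- sapply(thetas, likelihood_2pl, a = a, responses = outcome_1)
lik_2 <- sapply(thetas, likelihood_2pl, a = a, responses = outcome_2)
all.equal(lik_1, lik_2)
```

So 2PL scores people with a weighted sum, where the sum score weights every item equally.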

---

# Fit 2PL model

## code

```r
# Same data as the Rasch fit, but each item also gets its own discrimination
irt_model_2pl <-
  mirt(
    data = english_words %>% select(-sex, -age),
    model = 1,
    itemtype = "2PL",
    verbose = FALSE
  )
```

---

## item curves

---

# 2PL item parameters

---

# 2PL abilities

---

# Why stop at 2 item parameters?

## 2PL

Each person has ability `\(\theta_p\)`. Each item has easiness `\(b_i\)` and discrimination `\(a_i\)`.

`\(P(y_{pi} = 1 | \theta_p, b_i, a_i) = \sigma(a_i \cdot \theta_p + b_i)\)`

## What might a 3rd item parameter do?

---

# 3PL

Each person has ability `\(\theta_p\)`. Each item has easiness `\(b_i\)`, discrimination `\(a_i\)`, and guessability `\(g_i\)`.

`\(P(y_{pi} = 1 | \theta_p, a_i, b_i, g_i) = g_i + (1 - g_i) \cdot \sigma(a_i \cdot \theta_p + b_i)\)`

---

# Intuition behind each of the 3 parameters

- Easiness is horizontal translation

- Discrimination is slope

- Guessability is the lower asymptote: the probability of a correct response as ability goes to negative infinity
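
A sketch of the easiness and guessability intuitions with hypothetical parameter values, evaluating the 3PL curve at a very low, a middle, and a very high ability:

```r
logistic <- function(x) exp(x) / (1 + exp(x))

# 3PL probability of a correct response
p_3pl <- function(theta, a, b, g) g + (1 - g) * logistic(a * theta + b)

theta <- c(-10, 0, 10)

# Hypothetical baseline item: easiness 0, discrimination 1, no guessing
baseline <- p_3pl(theta, a = 1, b = 0, g = 0)

# Raising easiness shifts the curve left, so the same ability scores higher
easier <- p_3pl(theta, a = 1, b = 2, g = 0)

# Raising guessability lifts the floor that very low abilities approach
guessable <- p_3pl(theta, a = 1, b = 0, g = 0.25)

round(rbind(baseline, easier, guessable), 3)
```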

---

# Fit 3PL model

## code

```r
# Same data again, now adding a guessing (lower asymptote) parameter per item
irt_model_3pl <-
  mirt(
    data = english_words %>% select(-sex, -age),
    model = 1,
    itemtype = "3PL",
    verbose = FALSE
  )
```

---

## item curves

---

# 3PL item parameters

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> item </th>
   <th style="text-align:right;"> a1 </th>
   <th style="text-align:right;"> b </th>
   <th style="text-align:right;"> g </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> yum yum </td>
   <td style="text-align:right;"> 1.33 </td>
   <td style="text-align:right;"> 1.21 </td>
   <td style="text-align:right;"> 0.00 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> bee </td>
   <td style="text-align:right;"> 3.34 </td>
   <td style="text-align:right;"> 0.85 </td>
   <td style="text-align:right;"> 0.00 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> cockadoodledoo </td>
   <td style="text-align:right;"> 2.18 </td>
   <td style="text-align:right;"> -0.56 </td>
   <td style="text-align:right;"> 0.00 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> buy </td>
   <td style="text-align:right;"> 3.04 </td>
   <td style="text-align:right;"> -1.97 </td>
   <td style="text-align:right;"> 0.01 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> camping </td>
   <td style="text-align:right;"> 2.35 </td>
   <td style="text-align:right;"> -3.28 </td>
   <td style="text-align:right;"> 0.00 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> moo </td>
   <td style="text-align:right;"> 3.05 </td>
   <td style="text-align:right;"> 2.19 </td>
   <td style="text-align:right;"> 0.24 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ouch </td>
   <td style="text-align:right;"> 1.90 </td>
   <td style="text-align:right;"> 1.75 </td>
   <td style="text-align:right;"> 0.00 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> aunt </td>
   <td style="text-align:right;"> 2.81 </td>
   <td style="text-align:right;"> -1.11 </td>
   <td style="text-align:right;"> 0.04 </td>
  </tr>
</tbody>
</table>
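
Item parameters are what make predictions possible. As a sketch, the block below plugs the values from the table into the 3PL formula for a child with ability 0. It assumes the `b` column is the easiness in the parameterization used on these slides; the data frame and column names are only illustrative.

```r
logistic <- function(x) exp(x) / (1 + exp(x))

# Item parameters copied from the 3PL table
# (a = discrimination, b = easiness, g = guessability)
items <- data.frame(
  item = c("yum yum", "bee", "cockadoodledoo", "buy",
           "camping", "moo", "ouch", "aunt"),
  a = c(1.33, 3.34, 2.18, 3.04, 2.35, 3.05, 1.90, 2.81),
  b = c(1.21, 0.85, -0.56, -1.97, -3.28, 2.19, 1.75, -1.11),
  g = c(0.00, 0.00, 0.00, 0.01, 0.00, 0.24, 0.00, 0.04)
)

# Predicted probability of a 1 response on each word at ability 0
theta <- 0
items$p_at_0 <- with(items, g + (1 - g) * logistic(a * theta + b))

# Words ordered from most to least likely at this ability
items[order(-items$p_at_0), c("item", "p_at_0")]
```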

---

# 3PL abilities - compare to 2PL

---

# 3PL abilities - compare to sum score

---

# Comparing sexes

---

# Comparing ages

---

# Differential item functioning (DIF)

---

# Polytomous item response theory

---

# Multidimensional models

---

# A few examples of IRT

- The Programme for International Student Assessment (PISA)

- State tests

- GRE

---

# Summary

Item response theory (IRT) provides a parametric framework for people responding to items (which can be broadly defined!).

It has a few specific advantages:

- Putting students and items on the same scale

- Understanding items through item parameters

- Better measurement of the latent construct

- Better understanding of the relationship between the latent construct and the items

- Handling of missing data

- Ability to make predictions

- More complicated things like equating, testing for bias, comparisons with other models etc.

---

# Learning more

- The most popular way to estimate IRT models is the mirt R package, written by Phil Chalmers

- Phil Chalmers has some good workshop materials on [his GitHub](https://github.com/philchalmers/mirt/wiki)

- Mike Frank recommends the Embretson & Reise book [Item Response Theory for Psychologists](https://www.amazon.com/Response-Theory-Psychologists-Multivariate-Applications/dp/0805828192)

- Great resources on Bayesian Item Response Theory at education-stan.github.io

- [Exercise](https://github.com/stenhaug/irt-basics/blob/master/exercise.Rmd) associated with this presentation

- Denny Borsboom's article [The attack of the psychometricians](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2779444/) is fantastic (and Mike Frank wrote it [a love letter](http://babieslearninglanguage.blogspot.com/2019/11/letter-of-recommendation-attack-of.html))

---

# Moving forward

- Where might IRT be useful in your work?

- What would be helpful in getting started?

---

# Getting in touch

- Ben Stenhaug

- benstenhaug.org

- stenhaug@stanford.edu

- These slides, the code that produced them, and the option of opening that code in an RStudio Cloud environment are available at **tinyurl.com/irt-basics**, which redirects to **stenhaug.github.io/irt-basics**