Knowing whether software produces the right answer is a critical question in psychometrics. PACER addresses this through transparent benchmarking against mirt, a peer-reviewed R package that serves as a standard reference in the field. We document that comparison honestly — showing where results align closely, and explaining what differences mean.
Two IRT packages implementing the same model will rarely produce bit-for-bit identical results. This is not a defect — it reflects legitimate differences in implementation: convergence tolerance (when the EM algorithm stops), quadrature specification (number of points, adaptive vs. fixed Gauss-Hermite), and optimizer internals (line-search strategy, gradient approximation).
On scaling: PACER defaults to D = 1.702 (normal-ogive metric), while mirt defaults to D = 1.0 (logistic). For all benchmarks on this page, PACER was configured to use D = 1.0 to match mirt and ensure a direct, fair comparison. This is standard practice in cross-software validation.
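To make the scaling concrete, here is a minimal base-R sketch (independent of PACER and mirt; the item parameters are arbitrary illustrative values, not from the benchmarks) showing why D = 1.702 makes the logistic curve track the normal ogive, and how slopes convert between the two metrics:

```r
# Illustrative sketch only (base R; arbitrary example parameters).
# The 2PL model: P(theta) = 1 / (1 + exp(-D * a * (theta - b))).
theta <- seq(-3, 3, by = 0.1)
a <- 1.2
b <- -0.5

p_ogive <- pnorm(a * (theta - b))                    # normal-ogive model
p_logis <- 1 / (1 + exp(-1.702 * a * (theta - b)))   # logistic with D = 1.702

# Classic result: with D = 1.702 the two curves differ by less than 0.01
# at every theta.
max(abs(p_ogive - p_logis))

# Slopes convert between metrics by the factor 1.702
# (a reported under D = 1.702 equals the D = 1.0 slope divided by 1.702);
# difficulties b are unchanged.
```

This is why configuring PACER to D = 1.0 puts its slope estimates on the same metric as mirt's without affecting the difficulty comparison.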
The standard we hold ourselves to: parameter estimates should agree to within rounding error for well-identified models with adequate sample sizes. The benchmarks below document whether PACER meets that standard.
All benchmarks on this page were run using default quadrature and convergence settings in both PACER and mirt — specifically, Q = 21 fixed Gauss-Hermite quadrature points and standard EM convergence tolerance (max-change < 0.001). These defaults strike a practical balance between speed and precision for typical use cases.
It is worth noting that increasing the number of quadrature points (e.g., Q = 61) and tightening the convergence criterion would yield even closer agreement between the two packages, potentially reducing differences to the fifth or sixth decimal place. We chose default settings for this demonstration because they reflect real-world usage conditions and still produce results that agree to a degree that is entirely satisfactory for psychometric practice.
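For readers who want to verify that claim, both settings can be tightened directly in mirt via its `quadpts` and `TOL` arguments. This is a sketch, not a benchmark run on this page; it assumes the same `dat` object used in the benchmarks below is already loaded:

```r
# Sketch: re-running a calibration with denser quadrature and a stricter
# EM convergence criterion than the page defaults (Q = 21, max-change < 0.001).
library(mirt)

mod_hi <- mirt(dat, 1, itemtype = '2PL',
               quadpts = 61,   # denser fixed Gauss-Hermite grid
               TOL     = 1e-5) # tighter EM stopping criterion
coef(mod_hi, IRTpars = TRUE, simplify = TRUE)
```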
| Item | b (PACER) | b (mirt) | \|Δ\| |
|---|---|---|---|
| Item 1 | −2.7380 | −2.7310 | 0.0070 |
| Item 2 | −0.9986 | −0.9990 | 0.0004 |
| Item 3 | −0.2399 | −0.2400 | 0.0001 |
| Item 4 | −1.3064 | −1.3070 | 0.0006 |
| Item 5 | −2.0994 | −2.1000 | 0.0006 |
```r
# ── 1PL / Rasch Benchmark ──────────────────────────────────────────────
# Dataset: 5-item binary response matrix
# PACER default is D = 1.702; configured to D = 1.0 here to match mirt
# Quadrature: GH Q=21 (default); convergence: max-change < 0.001 (default)
library(mirt)

mod_1pl <- mirt(dat, 1, itemtype = 'Rasch')
coef(mod_1pl, IRTpars = TRUE, simplify = TRUE)
# $items
#        a      b g u
# Item.1 1 -2.731 0 1
# Item.2 1 -0.999 0 1
# Item.3 1 -0.240 0 1
# Item.4 1 -1.307 0 1
# Item.5 1 -2.100 0 1
# Note: latent variance freely estimated; $cov F1 = 0.572
```
| Item | a (PACER) | a (mirt) | \|Δ\| | b (PACER) | b (mirt) | \|Δ\| |
|---|---|---|---|---|---|---|
| Item 1 | 0.8330 | 0.8250 | 0.0080 | −3.3343 | −3.3610 | 0.0267 |
| Item 2 | 0.7217 | 0.7230 | 0.0013 | −1.3718 | −1.3700 | 0.0018 |
| Item 3 | 0.8911 | 0.8900 | 0.0011 | −0.2798 | −0.2800 | 0.0002 |
| Item 4 | 0.6877 | 0.6890 | 0.0013 | −1.8679 | −1.8660 | 0.0019 |
| Item 5 | 0.6530 | 0.6580 | 0.0050 | −3.1420 | −3.1230 | 0.0190 |
```r
# ── 2PL Benchmark ──────────────────────────────────────────────────────
# Same 5-item binary dataset as 1PL benchmark
# PACER default D = 1.702; configured to D = 1.0 to match mirt
# Quadrature: GH Q=21 (default); convergence: max-change < 0.001 (default)
library(mirt)

mod_2pl <- mirt(dat, 1, itemtype = '2PL')
coef(mod_2pl, IRTpars = TRUE, simplify = TRUE)
# $items
#            a      b g u
# Item.1 0.825 -3.361 0 1
# Item.2 0.723 -1.370 0 1
# Item.3 0.890 -0.280 0 1
# Item.4 0.689 -1.866 0 1
# Item.5 0.658 -3.123 0 1
```
| Item | a (PACER) | a (mirt) | \|Δ\| | b (PACER) | b (mirt) | \|Δ\| |
|---|---|---|---|---|---|---|
| V3 | 1.7965 | 1.7970 | 0.0005 | −0.9294 | −0.9300 | 0.0006 |
| V8 | 0.9132 | 0.9130 | 0.0002 | −0.7199 | −0.7210 | 0.0011 |
| V10 | 1.1654 | 1.1650 | 0.0004 | −0.5362 | −0.5370 | 0.0008 |
| V11 | 1.2487 | 1.2490 | 0.0003 | −0.4979 | −0.4980 | 0.0001 |
| V13 | 1.1769 | 1.1770 | 0.0001 | −0.4151 | −0.4150 | 0.0001 |
| V16 | 0.9643 | 0.9640 | 0.0003 | 0.0562 | 0.0560 | 0.0002 |
| V17 | 1.3456 | 1.3460 | 0.0004 | 0.1584 | 0.1580 | 0.0004 |
| V19 | 1.2047 | 1.2050 | 0.0003 | 0.1776 | 0.1770 | 0.0006 |
| Item | a (PACER) | a (mirt) | \|Δ\| | b1 (PACER) | b1 (mirt) | \|Δ\| | b2 (PACER) | b2 (mirt) | \|Δ\| | b3 (PACER) | b3 (mirt) | \|Δ\| |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| V1 | 1.2391 | 1.2390 | 0.0001 | −1.5207 | −1.5210 | 0.0003 | 0.1962 | 0.1960 | 0.0002 | — | — | — |
| V2 | 1.3285 | 1.3290 | 0.0005 | −1.2533 | −1.2540 | 0.0007 | −0.3240 | −0.3240 | 0.0000 | 2.0259 | 2.0250 | 0.0009 |
```r
# ── Mixed 2PL + GPCM Benchmark ─────────────────────────────────────────
# Dataset: combination.csv — 2,000 respondents, 10 items
# V1, V2 → polytomous (GPCM)
# V3, V8, V10, V11, V13, V16, V17, V19 → binary (2PL)
# PACER default D = 1.702; configured to D = 1.0 to match mirt
# Quadrature: GH Q=21 (default); convergence: max-change < 0.001 (default)
library(mirt)

itemtype <- c(rep("gpcm", 2), rep("2PL", 8))
mod <- mirt(dat, 1, itemtype = itemtype)
coef(mod, IRTpars = TRUE, simplify = TRUE)
# $items
#         a     b1     b2    b3      b  g  u
# V1  1.239 -1.521  0.196    NA     NA NA NA
# V2  1.329 -1.254 -0.324 2.025     NA NA NA
# V3  1.797     NA     NA    NA -0.930  0  1
# ...
# V19 1.205     NA     NA    NA  0.177  0  1
```
| Item | a (PACER) | a (mirt) | \|Δ\| | b (PACER) | b (mirt) | \|Δ\| |
|---|---|---|---|---|---|---|
| b1 | 1.3632 | 1.3660 | 0.0028 | −1.5169 | −1.5130 | 0.0039 |
| b2 | 1.5480 | 1.5520 | 0.0040 | −1.0416 | −1.0380 | 0.0036 |
| b3 | 2.0449 | 2.0520 | 0.0071 | −0.4947 | −0.4920 | 0.0027 |
| b4 | 1.4252 | 1.4280 | 0.0028 | 0.0855 | 0.0870 | 0.0015 |
| b5 | 2.2365 | 2.2470 | 0.0105 | 0.5387 | 0.5300 | 0.0087 |
| b6 | 1.7373 | 1.7440 | 0.0067 | 1.1069 | 1.1050 | 0.0019 |
| b7 | 1.0645 | 1.0680 | 0.0035 | 1.7211 | 1.7180 | 0.0031 |
| b8 | 2.4024 | 2.4010 | 0.0014 | −0.7840 | −0.7820 | 0.0020 |
| b9 | 1.6520 | 1.6560 | 0.0040 | 0.3798 | 0.3800 | 0.0002 |
| b10 | 1.4435 | 1.4480 | 0.0045 | 1.2655 | 1.2630 | 0.0025 |
```r
# ── Multigroup 2PL Benchmark ───────────────────────────────────────────
# Dataset: multigroup_responses.csv — 3,000 respondents, 2 groups (G1, G2)
# b1–b10 → binary items (2PL)
# Constraints: slopes and intercepts held equal across groups
#   G1 = reference (μ = 0, σ² = 1 fixed)
#   G2 = focal (μ and σ² freely estimated)
# PACER default D = 1.702; configured to D = 1.0 to match mirt
# Quadrature: GH Q=21 (default); convergence: max-change < 0.001 (default)
library(mirt)

mod <- multipleGroup(
  data       = dat[, 3:12],
  model      = 1,
  group      = dat$group,
  itemtype   = rep('2PL', 10),
  invariance = c('slopes', 'intercepts', 'free_means', 'free_vars')
)
coef(mod, IRTpars = TRUE, simplify = TRUE)
# $G1 $means: F1 = 0     (reference — fixed)
# $G1 $cov:   F1 = 1     (reference — fixed)
# $G2 $means: F1 = 0.57  (focal mean — estimated)
# $G2 $cov:   F1 = 1.697 (focal variance → σ = 1.303)
#
# Item parameters identical across groups (constrained)
# b1: a=1.366, b=-1.513 ... b10: a=1.448, b=1.263
```
The datasets used to produce all benchmarks on this page are available for download below. The R scripts are documented in each benchmark's code drawer above. Each script is self-contained and produces the mirt output shown when run against the corresponding dataset. To reproduce PACER results, load the same dataset in the calibration interface, select the matching model type, set D = 1.0 in the scaling options, and use default quadrature (GH Q=21) and convergence settings.
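As a quick sanity check after reproducing a run, the \|Δ\| columns can be recomputed in a few lines of base R. This sketch uses the difficulty estimates from the 1PL table above:

```r
# Recompute the |Δ| column of the 1PL benchmark table (base R only).
pacer_b <- c(-2.7380, -0.9986, -0.2399, -1.3064, -2.0994)
mirt_b  <- c(-2.731,  -0.999,  -0.240,  -1.307,  -2.100)

delta <- abs(pacer_b - mirt_b)
round(delta, 4)   # matches the table: 0.0070 0.0004 0.0001 0.0006 0.0006
max(delta)        # worst-case disagreement: 0.007
```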
We invite scrutiny. If you identify a discrepancy not accounted for by the factors described above, please contact us — we take reproducible differences seriously and investigate them.