Transparency & Validation

Building Trust
in Results

Knowing whether software produces the right answer is a critical question in psychometrics. PACER addresses this through transparent benchmarking against mirt, a peer-reviewed R package that serves as a standard reference in the field. We document that comparison honestly — showing where results align closely, and explaining what differences mean.

1PL · Rasch
0.003
RMSE on difficulty estimates across all items
✓ Excellent match
2PL · Binary
0.004
RMSE on discrimination parameters
✓ Excellent match
2PL + GPCM · Mixed
<0.001
Max absolute difference across all parameters
✓ Excellent match
Multigroup · 2PL
0.011
Max |Δa| across 10 constrained binary items
✓ Excellent match

Why exact agreement is not the goal

Two IRT packages implementing the same model will rarely produce bit-for-bit identical results. This is not a defect — it reflects legitimate differences in implementation: convergence tolerance (when the EM algorithm stops), quadrature specification (number of points, adaptive vs. fixed Gauss-Hermite), and optimizer internals (line-search strategy, gradient approximation).

On scaling: PACER defaults to D = 1.702 (normal-ogive metric), while mirt defaults to D = 1.0 (logistic). For all benchmarks on this page, PACER was configured to use D = 1.0 to match mirt and ensure a direct, fair comparison. This is standard practice in cross-software validation.

The standard we hold ourselves to: parameter estimates should agree to within rounding error for well-identified models with adequate sample sizes. The benchmarks below document whether PACER meets that standard.

A note on default estimation settings

All benchmarks on this page were run using default quadrature and convergence settings in both PACER and mirt — specifically, Q = 21 fixed Gauss-Hermite quadrature points and standard EM convergence tolerance (max-change < 0.001). These defaults strike a practical balance between speed and precision for typical use cases.

It is worth noting that increasing the number of quadrature points (e.g., Q = 61) and tightening the convergence criterion would yield even closer agreement between the two packages, potentially reducing differences to the fifth or sixth decimal place. We chose default settings for this demonstration because they reflect real-world usage conditions and still produce results that agree to a degree that is entirely satisfactory for psychometric practice.

Summary of Findings

01
Benchmark 01 · 1PL
Excellent agreement
All difficulty parameters match to 3–4 decimal places. Max |Δ| < 0.007. Log-likelihood difference of 0.002 is negligible.
✓ RMSE = 0.003
02
Benchmark 02 · 2PL
Excellent agreement
Discrimination and difficulty align closely across all items. Largest differences occur at extreme difficulty values, as expected from quadrature tail effects.
✓ Discrimination RMSE = 0.004
03
Benchmark 03 · 2PL + GPCM
Excellent agreement
Binary and polytomous items match to 4 decimal places across all slope and threshold parameters.
✓ Max |Δ| < 0.001
04
Benchmark 04 · Multigroup 2PL
Excellent agreement
10 binary items constrained across groups. Item parameters and latent group distribution estimates agree closely throughout.
✓ G2 μ and σ agree well
1PL difficulty RMSE = 0.003 2PL discrimination RMSE = 0.004 2PL difficulty RMSE = 0.012 Mixed 2PL+GPCM max |Δ| < 0.001 All four models converged |ΔLL| < 1.5 across all single-group runs

One-Parameter Logistic (1PL / Rasch)

1PLDifficulty Parameters — PACER vs mirt
D = 1.0 · GH Q=21 · EM defaults
PACER
mirt
|Δ| absolute difference
Item — B (DIFFICULTY) —
PACERmirt|Δ|
Item 1−2.7380−2.73100.0070
Item 2−0.9986−0.99900.0004
Item 3−0.2399−0.24000.0001
Item 4−1.3064−1.30700.0006
Item 5−2.0994−2.10000.0006
Max |Δb|
0.0070
Item 1
RMSE (b)
0.0032
all items
Pearson r
>0.9999
rank order preserved
LL · PACER
−2466.94
mirt: −2466.938
|ΔLL|
0.002
negligible
Interpretation: The 1PL model shows essentially perfect agreement. All five difficulty parameters match to 3–4 decimal places. The largest difference (Item 1, |Δ| = 0.0070) falls well below any threshold of practical concern. This is the level of agreement expected between two correctly implemented Rasch estimators using matched quadrature settings.
benchmark_1pl.Rmirt 1.41+
# ── 1PL / Rasch Benchmark ──────────────────────────────────────────────
# Dataset: 5-item binary response matrix
# PACER default is D = 1.702; configured to D = 1.0 here to match mirt
# Quadrature: GH Q=21 (default); convergence: max-change < 0.001 (default)

library(mirt)

mod_1pl <- mirt(dat, 1, itemtype = 'Rasch')
coef(mod_1pl, IRTpars = TRUE, simplify = TRUE)

# $items
#        a       b    g  u
# Item.1 1  -2.731   0  1
# Item.2 1  -0.999   0  1
# Item.3 1  -0.240   0  1
# Item.4 1  -1.307   0  1
# Item.5 1  -2.100   0  1
# Note: latent variance freely estimated; $cov F1 = 0.572

Two-Parameter Logistic (2PL)

2PLDiscrimination & Difficulty — PACER vs mirt
D = 1.0 · GH Q=21 · EM defaults
PACER
mirt
|Δ|
Item — A (DISCRIMINATION) — — B (DIFFICULTY) —
PACERmirt|Δ| PACERmirt|Δ|
Item 10.83300.82500.0080−3.3343−3.36100.0267
Item 20.72170.72300.0013−1.3718−1.37000.0018
Item 30.89110.89000.0011−0.2798−0.28000.0002
Item 40.68770.68900.0013−1.8679−1.86600.0019
Item 50.65300.65800.0050−3.1420−3.12300.0190
Max |Δa|
0.0080
Item 1
RMSE (a)
0.0042
discrimination
Max |Δb|
0.0267
Item 1 (extreme b)
RMSE (b)
0.0121
difficulty
|ΔLL|
0.29
−2466.65 vs −2466.94
Interpretation: Agreement across both discrimination and difficulty parameters is excellent. The largest differences appear for Items 1 and 5, which have extreme difficulty values near −3.1 to −3.4. Estimation near the tails of the ability distribution is more sensitive to quadrature boundary behavior, making slightly larger differences there expected. All differences have zero practical impact on scoring or reporting decisions.
benchmark_2pl.Rmirt 1.41+
# ── 2PL Benchmark ──────────────────────────────────────────────────────
# Same 5-item binary dataset as 1PL benchmark
# PACER default D = 1.702; configured to D = 1.0 to match mirt
# Quadrature: GH Q=21 (default); convergence: max-change < 0.001 (default)

library(mirt)

mod_2pl <- mirt(dat, 1, itemtype = '2PL')
coef(mod_2pl, IRTpars = TRUE, simplify = TRUE)

# $items
#            a       b    g  u
# Item.1  0.825  -3.361   0  1
# Item.2  0.723  -1.370   0  1
# Item.3  0.890  -0.280   0  1
# Item.4  0.689  -1.866   0  1
# Item.5  0.658  -3.123   0  1

Mixed Calibration — 2PL + GPCM

2PLGPCMMixed Calibration — PACER vs mirt
D = 1.0 · GH Q=21 · EM defaults · N = 2,000
Binary Items — 2PL (8 items)
PACER
mirt
|Δ|
Item — A (DISCRIMINATION) — — B (DIFFICULTY) —
PACERmirt|Δ| PACERmirt|Δ|
V3 1.79651.79700.0005−0.9294−0.93000.0006
V8 0.91320.91300.0002−0.7199−0.72100.0011
V101.16541.16500.0004−0.5362−0.53700.0008
V111.24871.24900.0003−0.4979−0.49800.0001
V131.17691.17700.0001−0.4151−0.41500.0001
V160.96430.96400.00030.05620.05600.0002
V171.34561.34600.00040.15840.15800.0004
V191.20471.20500.00030.17760.17700.0006
Polytomous Items — GPCM (2 items)
Item — A — — B1 — — B2 — — B3 —
PACERmirt|Δ| PACERmirt|Δ| PACERmirt|Δ| PACERmirt|Δ|
V1 1.23911.23900.0001 −1.5207−1.52100.0003 0.19620.19600.0002
V2 1.32851.32900.0005 −1.2533−1.25400.0007 −0.3240−0.32400.0000 2.02592.02500.0009
2PL Max |Δa|
0.0005
binary items
2PL Max |Δb|
0.0011
binary items
GPCM Max |Δa|
0.0005
polytomous
GPCM Max |Δb|
0.0009
all thresholds
|ΔLL|
1.17
−13651.58 vs −13650.41
Interpretation: Both binary and polytomous items show exceptional agreement. All 2PL parameters match to 4 decimal places. The GPCM items V1 and V2 agree on slope and all category thresholds with maximum differences under 0.001. The LL difference of 1.17 reflects accumulated quadrature rounding across 31 iterations — consistent with cross-software comparisons in the literature and well within the range expected at default settings.
benchmark_mixed.Rmirt 1.41+
# ── Mixed 2PL + GPCM Benchmark ─────────────────────────────────────────
# Dataset: combination.csv — 2,000 respondents, 10 items
#   V1, V2                               → polytomous (GPCM)
#   V3,V8,V10,V11,V13,V16,V17,V19        → binary (2PL)
# PACER default D = 1.702; configured to D = 1.0 to match mirt
# Quadrature: GH Q=21 (default); convergence: max-change < 0.001 (default)

library(mirt)

itemtype <- c(rep("gpcm", 2), rep("2PL", 8))
mod <- mirt(dat, 1, itemtype = itemtype)
coef(mod, IRTpars = TRUE, simplify = TRUE)

# $items
#         a     b1     b2    b3      b  g  u
# V1  1.239 -1.521  0.196    NA     NA NA NA
# V2  1.329 -1.254 -0.324 2.025     NA NA NA
# V3  1.797     NA     NA    NA -0.930  0  1
# ...
# V19 1.205     NA     NA    NA  0.177  0  1

Multigroup Calibration — 2PL

Multigroup2PLConstrained Multigroup — PACER vs mirt
D = 1.0 · GH Q=21 · EM defaults · 67 iterations
Log-likelihood · PACER
−9595.10
67 EM iterations · converged
Reference G1 σ
1.0000
μ = 0.000 (fixed)
G2 μ · PACER / mirt
+0.5704 / +0.570
|Δ| < 0.001 — exact
G2 σ · PACER / mirt
1.3066 / 1.303
|Δ| = 0.004 — excellent
Binary Items — 2PL (10 items, constrained equal across groups)
PACER
mirt
|Δ|
Item — A (DISCRIMINATION) — — B (DIFFICULTY) —
PACERmirt|Δ| PACERmirt|Δ|
b1 1.36321.36600.0028−1.5169−1.51300.0039
b2 1.54801.55200.0040−1.0416−1.03800.0036
b3 2.04492.05200.0071−0.4947−0.49200.0027
b4 1.42521.42800.00280.08550.08700.0015
b5 2.23652.24700.01050.53870.53000.0087
b6 1.73731.74400.00671.10691.10500.0019
b7 1.06451.06800.00351.72111.71800.0031
b8 2.40242.40100.0014−0.7840−0.78200.0020
b9 1.65201.65600.00400.37980.38000.0002
b101.44351.44800.00451.26551.26300.0025
Converged
Yes
67 EM iterations
Max |Δa|
0.0105
b5
Max |Δb|
0.0087
b5
G2 μ
+0.570
both agree exactly
G2 σ · |Δ|
0.004
1.307 vs 1.303
Interpretation: The multigroup benchmark shows excellent agreement across all 10 binary items and both latent group distribution estimates. Discrimination and difficulty parameters match to 3–4 decimal places throughout. The focal group mean (G2 μ = +0.5704) is in exact agreement between PACER and mirt. The focal group variance estimate differs by only 0.004 (PACER σ = 1.307 vs mirt σ = √1.697 = 1.303). This is the level of agreement expected from two correctly implemented multigroup EM estimators with matched constraints running at default settings.
benchmark_multigroup.Rmirt 1.41+
# ── Multigroup 2PL Benchmark ───────────────────────────────────────────
# Dataset: multigroup_responses.csv — 3,000 respondents, 2 groups (G1, G2)
#   b1–b10  → binary items (2PL)
# Constraints: slopes and intercepts held equal across groups
# G1 = reference (μ = 0, σ² = 1 fixed)
# G2 = focal     (μ and σ² freely estimated)
# PACER default D = 1.702; configured to D = 1.0 to match mirt
# Quadrature: GH Q=21 (default); convergence: max-change < 0.001 (default)

library(mirt)

mod <- multipleGroup(
  data      = dat[, 3:12],
  model     = 1,
  group     = dat$group,
  itemtype  = rep('2PL', 10),
  invariance = c('slopes', 'intercepts', 'free_means', 'free_vars')
)

coef(mod, IRTpars = TRUE, simplify = TRUE)

# $G1 $means: F1 = 0      (reference — fixed)
# $G1 $cov:   F1 = 1      (reference — fixed)
# $G2 $means: F1 = 0.57   (focal mean — estimated)
# $G2 $cov:   F1 = 1.697  (focal variance → σ = 1.303)
#
# Item parameters identical across groups (constrained)
# b1: a=1.366, b=-1.513 ... b10: a=1.448, b=1.263

Reproducibility

The datasets used to produce all benchmarks on this page are available for download below. The R scripts are documented in each benchmark's code drawer above. Each script is self-contained and produces the mirt output shown when run against the corresponding dataset. To reproduce PACER results, load the same dataset in the calibration interface, select the matching model type, set D = 1.0 in the scaling options, and use default quadrature (GH Q=21) and convergence settings.

We invite scrutiny. If you identify a discrepancy not accounted for by the factors described above, please contact us — reproducible differences are taken seriously and investigated.