
Statistical Methods Reference

PACER is built on a fully transparent statistical foundation. This reference documents every method — the probability models, estimation algorithms, and numerical procedures — exactly as they are implemented. Nothing is a black box. Throughout, PACER employs modern computational methods including parallel processing, numerical stabilization, and efficient algorithms designed to perform well at scale (Doran, 2023).

Item Analysis

Classical Test Theory

CTT characterizes items and tests using observed-score statistics from an \(N\times K\) response matrix (\(N\) respondents, \(K\) items). Items may be dichotomous (0/1) or polytomous (non-negative integers). All variance estimates use the sample \((n-1)\) denominator throughout. Missing data are handled pairwise (default) or listwise.

Item-Level Descriptive Statistics

Mean, SD, and Standard Error

Let \(\mathcal{V}_j\) be the set of respondents with non-missing responses to item \(j\) and \(n_j=|\mathcal{V}_j|\). Then

\[\bar{x}_j=\frac{1}{n_j}\sum_{i\in\mathcal{V}_j}x_{ij},\quad s_j=\sqrt{\frac{\sum_{i\in\mathcal{V}_j}(x_{ij}-\bar{x}_j)^2}{n_j-1}},\quad \widehat{SE}_j=\frac{s_j}{\sqrt{n_j}}\]

Relative Difficulty (p-value)

\[p_j=\frac{\bar{x}_j}{x_j^{\max}}\]

Items with \(p_j<0.15\) or \(p_j>0.85\) are flagged. For dichotomous items this equals the classical p-value.

Discrimination Indices

Adjusted Point-Biserial (Item-Rest Correlation)

Item \(j\) is correlated with the sum of the remaining \(K-1\) items, eliminating the inflation from item \(j\) appearing in its own total:

\[T_i^{(-j)}=\sum_{k\neq j}x_{ik},\qquad r_{pb,j}=\frac{\sum_i(x_{ij}-\bar{x}_j)(T_i^{(-j)}-\overline{T^{(-j)}})}{\sqrt{\sum_i(x_{ij}-\bar{x}_j)^2\cdot\sum_i(T_i^{(-j)}-\overline{T^{(-j)}})^2}}\]

This is Pearson's \(r\), computed pairwise over non-missing observations. Items with \(r_{pb,j}<0.20\) are flagged.
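As a concrete illustration, the item-level descriptives and the item-rest correlation can be sketched in a few lines of Python. This is not PACER's implementation (which handles pairwise missingness); it assumes a complete response matrix, and all function names are our own:

```python
import math

def item_stats(X, max_scores):
    """Per-item mean, SD ((n-1) denominator), SE, p-value, and item-rest
    correlation for a complete N x K response matrix (no missing data)."""
    N, K = len(X), len(X[0])
    stats = []
    for j in range(K):
        col = [row[j] for row in X]
        rest = [sum(row) - row[j] for row in X]          # total without item j
        mean_j = sum(col) / N
        sd_j = math.sqrt(sum((x - mean_j) ** 2 for x in col) / (N - 1))
        p_j = mean_j / max_scores[j]                     # relative difficulty
        # Pearson r between item j and the rest score (adjusted point-biserial)
        mean_r = sum(rest) / N
        num = sum((x - mean_j) * (t - mean_r) for x, t in zip(col, rest))
        den = math.sqrt(sum((x - mean_j) ** 2 for x in col)
                        * sum((t - mean_r) ** 2 for t in rest))
        r_pb = num / den if den > 0 else float("nan")
        stats.append({"mean": mean_j, "sd": sd_j, "se": sd_j / math.sqrt(N),
                      "p": p_j, "r_pb": r_pb,
                      "flag_p": p_j < 0.15 or p_j > 0.85,
                      "flag_disc": r_pb < 0.20})
    return stats
```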

Biserial and Polyserial Correlation

These indices estimate the latent linear association assuming the true variable is continuous and normal but observed only in coarsened categories. For an item with \(M\) categories, let \(\hat{F}_m\) be the cumulative proportion through category \(m\). The internal threshold is \(z_m=\Phi^{-1}(1-\hat{F}_m)\) and its standard normal ordinate is \(\varphi(z_m)\):

\[r_{\text{bis},j}=\frac{s_j\cdot r_{pb,j}}{\displaystyle\sum_{m=1}^{M-1}\varphi(z_m)}\]

The last category is excluded because its threshold is \(+\infty\). For \(M=2\) this reduces to the classical biserial formula. \(\Phi^{-1}\) is computed via the Beasley-Springer-Moro rational approximation (accuracy \(\approx10^{-9}\)).

Reliability and Standard Error of Measurement

Cronbach's \(\alpha\)

\[\alpha=\frac{K}{K-1}\!\left(1-\frac{\sum_{j=1}^K s_j^2}{s_T^2}\right)\]

where \(s_j^2\) is the sample variance of item \(j\) and \(s_T^2\) is the sample variance of total scores (Cronbach, 1951). Both use the \((n-1)\) denominator.

Alpha-if-Item-Deleted

For each item \(j\), \(\alpha\) is recomputed on the \((K-1)\)-item submatrix formed by removing column \(j\). An increase signals that item \(j\) is reducing internal consistency.

Standard Error of Measurement

\[\mathrm{SEM}=s_T\sqrt{1-\alpha}\]
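The \(\alpha\) and SEM formulas above can be sketched directly (illustrative Python for the complete-data case; names are our own, not PACER's):

```python
import math

def cronbach_alpha(X):
    """Cronbach's alpha and SEM from a complete N x K response matrix,
    using sample (n-1) variances throughout."""
    K = len(X[0])
    def var(v):
        m = sum(v) / len(v)
        return sum((x - m) ** 2 for x in v) / (len(v) - 1)
    item_vars = [var([row[j] for row in X]) for j in range(K)]
    total_var = var([sum(row) for row in X])
    alpha = K / (K - 1) * (1 - sum(item_vars) / total_var)
    sem = math.sqrt(total_var) * math.sqrt(1 - alpha)   # SEM = s_T * sqrt(1 - alpha)
    return alpha, sem
```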

IRT Tools — Models

Item Response Models

PACER supports binary and polytomous items in any mixture within a single test. The scaling constant \(D\) defaults to 1.7 (close to the normal-ogive metric; \(D=1\) gives the pure logistic metric). All probabilities are clamped to \([\varepsilon,\,1-\varepsilon]\) with \(\varepsilon\approx1.49\times10^{-8}\) to prevent \(\log 0\) in the likelihood.

Binary Item Response Models


Three-Parameter Logistic (3PL)

\[P(X=1\mid\theta,a,b,c)=c+(1-c)\,\frac{1}{1+e^{-Da(\theta-b)}}\]
\(a\): Discrimination — slope at the ICC inflection point; constrained \(a>0\).
\(b\): Difficulty — \(\theta\) at the inflection point; \(P(b)=(1+c)/2\).
\(c\): Pseudo-guessing — lower asymptote on the probability scale; estimated on the logit scale internally.
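A minimal sketch of the 3PL ICC with the probability clamp described earlier (illustrative Python; `EPS` mirrors the \(\varepsilon\) value above, and the function name is our own):

```python
import math

EPS = 1.49e-8  # probability clamp, matching the epsilon used in the likelihood

def p_3pl(theta, a, b, c, D=1.7):
    """3PL ICC: c + (1 - c) * logistic(D*a*(theta - b)), clamped away from 0/1
    so log-probabilities stay finite. c = 0 reduces to the 2PL."""
    p = c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))
    return min(max(p, EPS), 1.0 - EPS)
```

At \(\theta=b\) the value is \((1+c)/2\), as the parameter table states.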

2PL and 1PL / Rasch

Setting \(c=0\) gives the 2PL. The 1PL additionally fixes \(a=1\) and estimates a free population standard deviation \(\sigma\) that scales the quadrature nodes: \(\theta_k^*=\sigma z_k\), where \(\{z_k\}\) are the standard GH nodes.

Graded Response Model (GRM)


The GRM (Samejima, 1969) models items with \(K+1\) ordered categories \((0,1,\ldots,K)\). The boundary response function gives the cumulative probability of responding in category \(k\) or higher:

\[P^*(X\ge k\mid\theta)=\frac{1}{1+e^{-Da(\theta-d_k)}},\quad k=1,\ldots,K\]

with ordered thresholds \(d_1 < d_2 < \cdots < d_K\) and sentinels \(P^*(X\ge0)\equiv1\), \(P^*(X\ge K+1)\equiv0\). Category probabilities are differences of adjacent boundaries:

\[P(X=k\mid\theta)=P^*(X\ge k\mid\theta)-P^*(X\ge k+1\mid\theta)\]

Boundary probabilities are evaluated as \(p_{kq}=\sigma(Da(\theta_q-d_k))\), consistent with the boundary response function above: larger \(d_k\) lowers the exceedance probability at every \(\theta\), i.e., a harder threshold. Category probabilities are formed by differencing consecutive rows of the \(K\times Q\) boundary matrix.
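The boundary-differencing construction can be sketched as follows (illustrative Python using the boundary response function above, with the sentinels 1 and 0; names are our own):

```python
import math

def grm_probs(theta, a, d, D=1.7):
    """GRM category probabilities for ordered thresholds d[0] < ... < d[K-1].
    Boundary k is P*(X >= k) = logistic(D*a*(theta - d[k-1])); sentinel
    boundaries 1 and 0 are prepended/appended, then adjacent boundaries
    are differenced to give the K+1 category probabilities."""
    bounds = ([1.0]
              + [1.0 / (1.0 + math.exp(-D * a * (theta - dk))) for dk in d]
              + [0.0])
    return [bounds[k] - bounds[k + 1] for k in range(len(d) + 1)]
```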

Generalized Partial Credit Model (GPCM) and PCM


The GPCM (Muraki, 1992) uses adjacent-category transitions. For an item with \(K+1\) categories \((0,\ldots,K)\), define the log-numerator of category \(k\):

\[\eta_k(\theta)=Da\!\left(k\theta-\sum_{j=1}^{k}d_j\right),\quad k=0,1,\ldots,K\]

with \(\eta_0\equiv0\) and a leading \(d_0=0\) prepended to the step-parameter vector. Category probabilities follow the multinomial logit (softmax):

\[P(X=k\mid\theta)=\frac{e^{\eta_k(\theta)}}{\sum_{j=0}^{K}e^{\eta_j(\theta)}}\]

Evaluated via log-sum-exp: \(P(X=k)=\exp(\eta_k-M)/\sum_j\exp(\eta_j-M)\) where \(M=\max_j\eta_j\), preventing overflow at extreme \(\theta\). The PCM (Masters, 1982) is the GPCM with \(a\equiv1\).
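The log-sum-exp evaluation can be sketched as follows (illustrative Python, not the production code; names are our own):

```python
import math

def gpcm_probs(theta, a, d, D=1.7):
    """GPCM category probabilities via log-sum-exp. d holds the steps
    d_1..d_K (d_0 = 0 is implicit); category k has
    eta_k = D*a*(k*theta - sum(d[:k])). PCM is the special case a = 1."""
    K = len(d)
    etas, run = [0.0], 0.0          # eta_0 = 0
    for k in range(1, K + 1):
        run += d[k - 1]
        etas.append(D * a * (k * theta - run))
    m = max(etas)                   # subtract the max to prevent overflow
    exps = [math.exp(e - m) for e in etas]
    z = sum(exps)
    return [e / z for e in exps]
```

Even at extreme \(\theta\), where the raw numerators would overflow, the shifted exponentials stay bounded.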

Convention. The step-parameter vector \(\bm{d}\) always has length equal to the number of categories, with \(d_0=0\) prepended. Scores are 1-based internally; scores used in expected-value and information calculations are 0-based.

Rating Scale Model (RSM)


The RSM (Andrich, 1978) constrains the PCM so that all items share identical step-difficulty offsets \(\tau_1,\ldots,\tau_K\) (\(\tau_0=0\) prepended by convention). Each item has a unique location \(\delta_i\), and its step parameters are

\[d_{ij}=\delta_i+\tau_j,\quad j=0,1,\ldots,K\]

Response probabilities are evaluated with the PCM formula using these compound parameters. The free parameter vector is \([\delta_1,\ldots,\delta_I,\,\tau_1,\ldots,\tau_K]\).

IRT Tools — Calibration

Item Calibration: Marginal Maximum Likelihood via EM

PACER estimates item parameters by marginal maximum likelihood (MML), integrating the latent trait \(\theta\) out of the likelihood. Because \(\theta\) is unknown, direct maximization of the item parameters is not feasible; instead PACER forms the marginal likelihood and maximizes it using the Bock-Aitkin (1981) EM algorithm. Unique response patterns are identified before the E-step to eliminate redundant likelihood evaluations.

The Marginal Likelihood

The goal is to estimate the full vector of item parameters \(\bm{\gamma}\). Since the person parameters \(\theta_i\) are also unknown, they are integrated out under an assumed population distribution \(g(\theta\mid\mu,\sigma)\):

\[\mathcal{L}(\bm{\gamma}\mid\bm{X})=\prod_{i=1}^{N}\int_{-\infty}^{\infty}p(\bm{x}_i\mid\theta,\bm{\gamma})\,g(\theta\mid\mu,\sigma)\,d\theta\]

The integral has no closed form and is evaluated by Gauss-Hermite quadrature over \(Q\) nodes:

\[\mathcal{L}(\bm{\gamma}\mid\bm{X})=\prod_{i=1}^{N}\sum_{k=1}^{Q}p(\bm{x}_i\mid\theta_k,\bm{\gamma})\,w_k\]

In practice, logs of individual pattern likelihoods are accumulated rather than computing products, ensuring numerical stability. When the data contain both binary and polytomous items (\(J\) binary, \(M\) polytomous), the pattern likelihood factors as a product over both item types:

\[p(\bm{x}_i\mid\theta,\bm{\gamma})=\prod_{j=1}^{J}P_j(x_{ij}\mid\theta)\cdot\prod_{l=1}^{M}P_l(x_{il}\mid\theta)\]

where \(P_j\) is the binary ICC (1PL/2PL/3PL) and \(P_l\) is the polytomous category probability (GPCM or GRM). Both item types are estimated within the same EM loop.

Model Identification

For the 2PL and 3PL, the population distribution is fixed at \(N(0,1)\) to identify the latent scale. For the 1PL (Rasch model), only the mean need be fixed; PACER therefore estimates \(N(0,\sigma)\), with \(\sigma\) as a free parameter interpretable as the common discrimination across all items.

When anchor items are present in a single-group calibration, the scale is externally defined by the anchor constraints and the \(N(0,1)\) identification constraint is no longer required. In this case PACER automatically estimates \(\mu\) and \(\sigma\) for the single group using the same closed-form M-step update as the multigroup focal groups (see §Multigroup below). This behavior can be overridden by explicitly setting EstimatePopulation = false in the request.

Starting Values

Because the EM algorithm is iterative, starting values are required. Three strategies are available:

  1. Random: draws each parameter from a uniform distribution over the admissible parameter space. This is useful when the likelihood may be multimodal; running the analysis multiple times with different random starts is a recommended diagnostic.
  2. Fixed: sets \(a_j=1\), \(b_j=0\), \(c_j=0.25\) for all binary items and uses ordered random starting values for polytomous step parameters.
  3. Logistic: uses logistic regression intercepts to initialize the difficulty parameters (and proportional-odds intercepts for polytomous step parameters), with \(a\) and \(c\) initialized as in the fixed option. This is the default and generally performs best.

E-Step: Posterior Weights and Sufficient Statistics

Given current item parameters, the likelihood of response pattern \(\bm{x}_i\) at node \(\theta_k\) is

\[L_k(\bm{x}_i)=\prod_{j=1}^{J}P_j(X_j=x_{ij}\mid\theta_k)\]

The posterior weight is

\[h_{ik}=\frac{L_k(\bm{x}_i)\,w_k}{\displaystyle\sum_{k'=1}^{Q}L_{k'}(\bm{x}_i)\,w_{k'}}\]

The marginal log-likelihood used to monitor convergence is

\[\ell=\sum_{i=1}^{N}\log\sum_{k=1}^{Q}L_k(\bm{x}_i)\,w_k\]
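For a single response pattern, the posterior weights and the log-likelihood contribution can be computed together with a log-sum-exp, mirroring the stabilization described above (illustrative Python; names are our own):

```python
import math

def e_step(pattern_logliks, log_weights):
    """Posterior node weights h_k and marginal log-likelihood contribution
    for one response pattern, given per-node log-likelihoods log L_k and
    log quadrature weights log w_k. Log-sum-exp keeps the sum stable."""
    logs = [ll + lw for ll, lw in zip(pattern_logliks, log_weights)]
    m = max(logs)
    terms = [math.exp(x - m) for x in logs]
    z = sum(terms)
    h = [t / z for t in terms]          # posterior weights, sum to 1
    return h, m + math.log(z)           # log sum_k L_k w_k
```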

Binary Sufficient Statistics

For each binary item \(j\) and node \(k\), summed over respondents \(\mathcal{C}_j\) who answered item \(j\):

\[r_{jk}=\sum_{i\in\mathcal{C}_j}h_{ik}\,x_{ij},\qquad f_{jk}=\sum_{i\in\mathcal{C}_j}h_{ik}\]

\(r_{jk}\) is the expected number of correct responses at node \(k\); \(f_{jk}\) is the expected total count.

Polytomous Sufficient Statistics

For polytomous item \(j\), the sufficient statistic at node \(k\) for category \(s\) is

\[\tilde{n}_{jsk}=\sum_{i\in\mathcal{C}_j}h_{ik}\,\mathbf{1}[x_{ij}=s]\]

In the pattern-based implementation, the posterior weight for unique pattern \(p\) at node \(k\) is \(r_{p,k}=c_p\cdot w_k\cdot L_k/m_p\) where \(c_p\) is the pattern count and \(m_p=\sum_k L_k w_k\).

M-Step: Binary Items (2PL / 1PL / 3PL)

The M-step maximizes the expected complete-data log-likelihood for item \(j\):

\[Q_j(a_j,b_j,c_j)=\sum_{k=1}^{Q}\bigl[r_{jk}\ln P_{jk}+(f_{jk}-r_{jk})\ln(1-P_{jk})\bigr]\]

PACER computes analytical gradients for the 2PL. The gradients with respect to \(a_j\) and \(b_j\) are:

\[\frac{\partial Q_j}{\partial a_j}=\sum_k\frac{L_k\,w_k}{m}\cdot\frac{x_j-P_{jk}}{P_{jk}(1-P_{jk})}\cdot D(\theta_k-b_j)P_{jk}(1-P_{jk})\]
\[\frac{\partial Q_j}{\partial b_j}=\sum_k\frac{L_k\,w_k}{m}\cdot\frac{x_j-P_{jk}}{P_{jk}(1-P_{jk})}\cdot(-Da_j)P_{jk}(1-P_{jk})\]

In the multigroup path, the gradient accumulates across all groups \(g\) using group-scaled nodes \(\theta_{gk}=\mu_g+\sigma_g z_k\):

\[\frac{\partial(-\ell_j)}{\partial a_j}=-\sum_g\sum_k D(\theta_{gk}-b_j)\bigl(r_g[j,k]-f_g[j,k]\,P_{jk}^{(g)}\bigr)\]

For the 3PL, \(c_j\) is parameterized on the logit scale internally; its gradient is computed by forward differences.

M-Step: GPCM and PCM

The complete-data log-likelihood for GPCM item \(j\) is the multinomial log-likelihood in expected category counts:

\[Q_j^{\text{GPCM}}(a_j,\bm{d}_j)=\sum_{k=1}^{Q}\sum_{s=0}^{K_j}\tilde{n}_{jsk}\ln P(X_j=s\mid\theta_k;\,a_j,\bm{d}_j)\]

The analytical gradient with respect to \(a_j\) uses the conditional mean of the log-numerator:

\[\bar{\eta}_k=\sum_{s=0}^{K}P(X_j=s\mid\theta_k)\cdot\!\left(s\theta_k-\sum_{m=1}^{s}d_m\right)\]
\[\frac{\partial Q_j^{\text{GPCM}}}{\partial a_j}=D\sum_k\sum_{\text{pat}}r_{\text{pat},k}\Bigl[\Bigl(s_{\text{pat}}\theta_k-\sum_{m=1}^{s_{\text{pat}}}d_m\Bigr)-\bar{\eta}_k\Bigr]\]

For step parameter \(d_t\) (\(t=1,\ldots,K\)), let \(P_k^{\ge t}=\sum_{s\ge t}P(X=s\mid\theta_k)\) be the exceedance probability. Then:

\[\frac{\partial Q_j^{\text{GPCM}}}{\partial d_t}=Da_j\sum_k\sum_{\text{pat}}r_{\text{pat},k}\Bigl[P_k^{\ge t}-\mathbf{1}[s_{\text{pat}}\ge t]\Bigr]\]

This is a score-residual: model-expected exceedance minus the observed indicator. For the PCM, \(a_j\equiv1\) and the \(a\)-gradient is not computed.

M-Step: GRM

The complete-data log-likelihood for GRM item \(j\) is:

\[Q_j^{\text{GRM}}(a_j,\bm{d}_j)=\sum_k\sum_{\text{pat}}r_{\text{pat},k}\ln P(X_j=s_{\text{pat}}\mid\theta_k;\,a_j,\bm{d}_j)\]

Let \(f_k^{(m)}=P^*(X\ge m\mid\theta_k)\cdot[1-P^*(X\ge m\mid\theta_k)]\) be the logistic slope of the \(m\)-th boundary. The gradient with respect to \(a_j\) depends on which category was observed:

\[\frac{\partial Q_j^{\text{GRM}}}{\partial a_j}=\sum_k\frac{r_{\text{pat},k}}{P_{sk}}\cdot\Delta_k(s)\]

The term \(\Delta_k(s)\) depends on the observed category \(s\):

For \(s=0\): \(\Delta_k(s)=D(d_1-\theta_k)\,f_k^{(1)}\)
For \(0<s<K\): \(\Delta_k(s)=D(d_{s+1}-\theta_k)\,f_k^{(s+1)}-D(d_s-\theta_k)\,f_k^{(s)}\)
For \(s=K\): \(\Delta_k(s)=-D(d_K-\theta_k)\,f_k^{(K)}\)

The gradient with respect to threshold \(d_m\) is nonzero only for the two categories adjacent to it:

\[\frac{\partial Q_j^{\text{GRM}}}{\partial d_m}=\sum_k\frac{r_{\text{pat},k}}{P_{sk}}\Bigl[\mathbf{1}[s=m-1]\cdot Da_j f_k^{(m)}+\mathbf{1}[s=m]\cdot\bigl(-Da_j f_k^{(m)}\bigr)\Bigr]\]

GRM item gradients are parallelized over items at the M-step.

M-Step: RSM

The RSM free parameter vector is \(\bm{\xi}=[\delta_1,\ldots,\delta_I,\,\tau_1,\ldots,\tau_K]\). Within the optimizer, item \(i\)'s step vector is \(\bm{d}_i=[0,\,\delta_i+\tau_1,\ldots,\delta_i+\tau_K]\) and probabilities are evaluated with the PCM formula (\(a=1\)). The gradients follow from the GPCM step gradient with \(a=1\):

\[\frac{\partial Q^{\text{RSM}}}{\partial\delta_i}=D\sum_k\sum_{\text{pat}}r_{\text{pat},k}\sum_{j=1}^{K}\Bigl[P_{ik}^{\ge j}-\mathbf{1}[s_i\ge j]\Bigr]\]
\[\frac{\partial Q^{\text{RSM}}}{\partial\tau_t}=\sum_{i=1}^{I}D\sum_k\sum_{\text{pat}}r_{\text{pat},k}\Bigl[P_{ik}^{\ge t}-\mathbf{1}[s_i\ge t]\Bigr]\]

The \(\tau_t\) gradient is the sum of per-item step gradients across all \(I\) items, since \(\partial d_{it}/\partial\tau_t=1\) for all \(i\). Item gradients are computed in parallel; \(\tau\) gradients are accumulated sequentially to avoid data races.

Parameter Priors

Priors are incorporated by augmenting the marginal log-likelihood with log-prior terms, one per item parameter:

\[\ln p(\bm{\gamma}\mid\bm{X})=\ln\mathcal{L}(\bm{\gamma}\mid\bm{X})+\sum_{j=1}^{J}\bigl[\ln f(a_j)+\ln f(b_j)+\ln f(c_j)\bigr]\]
\(a\) — discrimination: Log-normal \(\mathcal{LN}(\mu_a,\sigma_a^2)\); log-density contribution \(-(\ln a-\mu_a)^2/(2\sigma_a^2)\) (default: \(\mu_a=0\), \(\sigma_a=1\)).
\(b\) — difficulty: Normal \(\mathcal{N}(\mu_b,\sigma_b^2)\); log-density contribution \(-(b-\mu_b)^2/(2\sigma_b^2)\) (default: \(\mu_b=0\), \(\sigma_b=1\)).
\(c\) — guessing (beta, default): Beta \(\mathcal{B}(\alpha,\beta)\); log-density contribution \((\alpha-1)\ln c+(\beta-1)\ln(1-c)\) (default: \(\alpha=4\), \(\beta=16\)).
\(c\) — guessing (logit-normal): Normal on \(\lambda=\operatorname{logit}(c)\); log-density contribution \(-\tfrac{1}{2}\bigl((\lambda-\mu_\lambda)/\sigma_\lambda\bigr)^2\) (default: \(\mu_\lambda=-1.386\), \(\sigma_\lambda=0.544\)).

Priors on \(a\) and \(b\) are optional. A prior on \(c\) is applied by default when estimating the 3PL and can be disabled for very large samples.

Beta Prior for the Guessing Parameter

The beta distribution is parameterized by shape parameters \(\alpha\) and \(\beta\), but it is more natural to specify a prior mean and standard deviation. PACER converts these to shape parameters via the method of moments. Starting from

\[\mu=\frac{\alpha}{\alpha+\beta},\qquad\sigma^2=\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}\]

solving for \(\alpha\) and \(\beta\) gives

\[\alpha=\left(\frac{1-\mu}{\sigma^2}-\frac{1}{\mu}\right)\mu^2,\qquad\beta=\alpha\!\left(\frac{1}{\mu}-1\right)\]

The defaults \(\mu_c=0.20\) and \(\sigma_c=0.087\) correspond to \(\mathcal{B}(4,16)\), which places most mass below 0.35 and is appropriate for items with three to five response options. The logit-normal prior is described in the next section.
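The moment conversion is short enough to show directly; this illustrative sketch (our own function name) reproduces the default mapping:

```python
def beta_from_moments(mu, sigma):
    """Convert a prior mean and SD on the guessing parameter into beta
    shape parameters by the method of moments."""
    var = sigma ** 2
    alpha = ((1 - mu) / var - 1 / mu) * mu ** 2
    beta = alpha * (1 / mu - 1)
    return alpha, beta
```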

Logit-Normal Prior for the Guessing Parameter

The logit-normal prior places a normal distribution directly on the internal logit-scale parameter \(\lambda_j = \operatorname{logit}(c_j) = \ln(c_j/(1-c_j))\), which is the parameterization PACER uses internally during optimization. Because the optimizer works on \(\lambda_j\), no Jacobian correction is required in the objective function — the log-prior contribution is simply the log-density of a normal:

\[\ln f(\lambda_j) = -\frac{(\lambda_j - \mu_\lambda)^2}{2\sigma_\lambda^2} - \ln(\sigma_\lambda\sqrt{2\pi})\]

The gradient of the log-prior with respect to \(\lambda_j\) is therefore

\[\frac{\partial\ln f}{\partial\lambda_j} = -\frac{\lambda_j - \mu_\lambda}{\sigma_\lambda^2}\]

and the corresponding contribution to the negative log-likelihood gradient used by the optimizer is \((\lambda_j - \mu_\lambda)/\sigma_\lambda^2\).

The default hyperparameters \(\mu_\lambda = -1.386\) and \(\sigma_\lambda = 0.544\) are chosen so that the prior mean on the probability scale is \(\operatorname{logit}^{-1}(-1.386) \approx 0.20\), matching the beta prior default mean. The prior is specified in the UI by entering the mean and standard deviation on the logit scale directly.

The logit-normal is preferable to the beta prior when software interoperability requires a prior specified on the logit scale (for example, mirt's parprior = list(c(i, "norm", mu, sd)) applies a normal prior to the internally stored logit of \(c\), which is exactly equivalent to PACER's logit-normal prior). It may also be preferred when the guessing parameter is expected to cluster near zero, since the logit transform spreads mass more evenly in the tails of the probability scale.

Implementation note. The function LogitNormLogPdf(x, μ, σ) — which evaluates \(\ln\mathcal{N}(\operatorname{logit}(x);\mu,\sigma^2) - \ln[x(1-x)]\) and includes the change-of-variables Jacobian — is available internally but is not used for the prior during optimization. Since optimization proceeds on \(\lambda\) directly, only NormLogPdf(λ, μ, σ) is evaluated. LogitNormLogPdf would be appropriate if one needed the prior density on the original \(c\in(0,1)\) scale, for example when reporting marginal likelihoods.

Per-Item Prior Overrides

In addition to the global priors described above, PACER supports per-item prior overrides that selectively replace or suppress the global prior for any individual item–parameter pair. The resolution rule is applied independently for each combination of item \(j\) and parameter \(\psi\in\{a,b,c,d\}\):

  1. If an explicit per-item prior is specified for item \(j\), parameter \(\psi\): use that prior family and hyperparameters.
  2. If an explicit None override is set for item \(j\), parameter \(\psi\): apply no prior for that parameter (pure MLE).
  3. Otherwise: inherit the global prior for \(\psi\) (if the global prior is enabled).
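The three-step rule can be sketched as follows (illustrative Python; the actual resolution is performed by ResolvePrior in ItemEstimation.cs, and the dictionary shape here simply mimics the two-level structure described in the implementation note below):

```python
def resolve_prior(item_priors, item, param, global_prior):
    """Per-item prior resolution: explicit override -> explicit None
    (no prior, pure MLE) -> inherit the global prior.
    item_priors: {item_name: {param_name: prior_spec_or_None}};
    global_prior may itself be None when the global prior is disabled."""
    overrides = item_priors.get(item)
    if overrides is not None and param in overrides:
        # A stored None is an explicit "no prior" override; any other
        # value is an explicit per-item prior specification.
        return overrides[param]
    return global_prior                 # inherit the global setting
```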

The table below lists the parameters that accept per-item priors and their associated prior families:

1PL: \(b\) (Normal \(\mathcal{N}(\mu_b,\sigma_b^2)\))
2PL: \(a\), \(b\) (Log-normal / Normal)
3PL: \(a\), \(b\), \(c\) (Log-normal / Normal / Beta or Logit-normal)
GRM, GPCM: \(a\), \(\mathbf{d}\) thresholds (Log-normal / Normal, applied per threshold)
PCM: \(\mathbf{d}\) steps (Normal, applied per step)

Augmented Log-Posterior with Per-Item Priors

Replacing the global prior family with the resolved per-item family \(f_j^\psi\) for each item \(j\) and parameter \(\psi\), the augmented log-posterior becomes

\[\ln p(\bm{\gamma}\mid\bm{X})=\ln\mathcal{L}(\bm{\gamma}\mid\bm{X})+\sum_{j=1}^{J}\Bigl[\ln f_j^a(a_j)+\ln f_j^b(b_j)+\ln f_j^c(c_j)\Bigr]\]

where \(f_j^\psi\) evaluates to the per-item prior if one is set, the global prior if the item has no override, or 1 (no contribution) if an explicit None override is in effect for that pair. The gradient contributions to the optimizer's negative log-posterior objective are identical in form to the global-prior case — only the hyperparameters differ:

\(a_j\) — discrimination: Log-normal \(\mathcal{LN}(\mu_a,\sigma_a^2)\); gradient added to \(-\ln p\): \(\dfrac{\ln a_j-\mu_a}{\sigma_a^2\,a_j}\)
\(b_j\) — difficulty: Normal \(\mathcal{N}(\mu_b,\sigma_b^2)\); gradient added to \(-\ln p\): \(\dfrac{b_j-\mu_b}{\sigma_b^2}\)
\(c_j\) — guessing (beta): Beta \(\mathcal{B}(\alpha,\beta)\); gradient added to \(-\ln p\): \(-\dfrac{\alpha-1}{c_j}+\dfrac{\beta-1}{1-c_j}\)
\(c_j\) — guessing (logit-normal): Normal on \(\lambda_j=\operatorname{logit}(c_j)\); gradient added to \(-\ln p\): \(\dfrac{\lambda_j-\mu_\lambda}{\sigma_\lambda^2}\)

The hyperparameters \((\mu_a,\sigma_a)\), \((\mu_b,\sigma_b)\), \((\alpha,\beta)\), or \((\mu_\lambda,\sigma_\lambda)\) are taken from the per-item override when present, otherwise from the global prior setting.

Prior on Polytomous Threshold Parameters

For polytomous items (GRM, GPCM, PCM), a single normal prior \(\mathcal{N}(\mu_d,\sigma_d^2)\) is applied independently to each threshold or step parameter \(d_{jk}\). The joint log-prior contribution for item \(j\) with \(K_j\) categories is

\[\ln f(\mathbf{d}_j)=\sum_{k=1}^{K_j}\ln\mathcal{N}(d_{jk};\,\mu_d,\sigma_d^2)=-\frac{1}{2\sigma_d^2}\sum_{k=1}^{K_j}(d_{jk}-\mu_d)^2+\text{const}\]

The gradient contribution to the optimizer objective (negative log-posterior) for threshold \(d_{jk}\) is

\[\frac{\partial}{\partial d_{jk}}\bigl[-\ln f(\mathbf{d}_j)\bigr]=\frac{d_{jk}-\mu_d}{\sigma_d^2}\]

This term is added to the EM sufficient-statistic gradient before passing to L-BFGS-B, analogously to the \(a\) and \(b\) prior gradient contributions for binary items.

Implementation note. Per-item overrides are stored in IrtOptions.ItemPriors as a two-level dictionary keyed first by column name, then by parameter name ("a", "b", "c", or "d"). A missing outer key means the global prior applies. A missing inner key means the global prior applies for that parameter. A null inner value is an explicit None override — it disables the prior for that item–parameter pair even when the global prior is on. Resolution is performed by ResolvePrior(itemPriors, colName, param, globalEnabled, globalP1, globalP2) in ItemEstimation.cs.

Multigroup Calibration (Bock-Zimowski)

For data from multiple groups, PACER implements the Bock-Zimowski (1997) multigroup MML-EM. Each group \(g\) has distribution \(\theta\sim N(\mu_g,\sigma_g^2)\); the reference group (\(g=0\)) is fixed at \(N(0,1)\) for identification.

Group-Scaled Nodes

\[\theta_{gk}=\mu_g+\sigma_g z_k\]

Weights \(\{w_k\}\) are invariant because the Jacobian and density ratio cancel exactly.

Multigroup E-Step

For respondent \(i\) in group \(g\), probabilities are evaluated at \(\theta_{gk}\):

\[h_{ik}=\frac{L_k(\bm{x}_i\mid\theta_{gk})\,w_k}{\sum_{k'}L_{k'}(\bm{x}_i\mid\theta_{gk'})\,w_{k'}}\]

Sufficient statistics \(r_g[j,k]\) and \(f_g[j,k]\) are accumulated per group on the same quadrature index \(k\).

Group Distribution Update (Closed Form)

Let \(W_{gk}=\sum_{i\in g}h_{ik}\). After the item M-step, the general update for a group with both \(\mu\) and \(\sigma\) free is:

\[\hat{\mu}_g=\frac{\sum_k W_{gk}\,\theta_{gk}}{\sum_k W_{gk}},\qquad\hat{\sigma}_g=\sqrt{\frac{\sum_k W_{gk}\,\theta_{gk}^2}{\sum_k W_{gk}}-\hat{\mu}_g^2}\]

For fixed-discrimination models (1PL, PCM, RSM) calibrated without anchors, the mean is identified by convention and held at \(\mu_g=0\); only the scale is free. In this case the second moment around zero is used directly:

\[\hat{\sigma}_g=\sqrt{\frac{\sum_k W_{gk}\,\theta_{gk}^2}{\sum_k W_{gk}}}\]

Both updates are clamped to \(\hat{\sigma}_g\in[0.05,\,10]\) to prevent degenerate quadrature grids. The outer loop iterates until

\[\frac{|\ell^{(t+1)}-\ell^{(t)}|}{1+|\ell^{(t)}|}<\tau_\ell\quad\text{and}\quad\max_{g}\,\bigl(\,|\Delta\mu_g|,\;|\Delta\sigma_g|\,\bigr)<10^{-4}\]

where \(\tau_\ell=\max(\text{Tol},\,10^{-5})\). The convergence condition on the group parameters requires that both the mean and standard deviation shift by less than \(10^{-4}\) for every non-reference group.
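The closed-form group-distribution update, including the \(\sigma\) clamp, can be sketched as (illustrative Python; names are our own):

```python
import math

def update_group(W, theta, fix_mean=False, lo=0.05, hi=10.0):
    """Closed-form M-step update of a group's (mu, sigma) from posterior
    node masses W_k at the group-scaled nodes theta_k. When fix_mean is
    True (1PL/PCM/RSM without anchors), mu stays 0 and the raw second
    moment is used. sigma is clamped to [lo, hi]."""
    total = sum(W)
    m1 = sum(w * t for w, t in zip(W, theta)) / total
    m2 = sum(w * t * t for w, t in zip(W, theta)) / total
    if fix_mean:
        mu, var = 0.0, m2
    else:
        mu, var = m1, m2 - m1 * m1
    sigma = min(max(math.sqrt(max(var, 0.0)), lo), hi)
    return mu, sigma
```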

Optimizer: L-BFGS-B

Item parameters are estimated with L-BFGS-B, a limited-memory quasi-Newton method with box constraints. The inverse Hessian approximation is built from the most recent \(m=10\) curvature pairs \(\{(\bm{s}_t,\bm{y}_t)\}\), where \(\bm{s}_t=\bm{x}_{t+1}-\bm{x}_t\) and \(\bm{y}_t=\nabla f_{t+1}-\nabla f_t\). The two-loop recursion computes the search direction \(\bm{p}_t=-H_t^{-1}\nabla f_t\) without explicitly forming \(H_t^{-1}\). Box constraints are enforced by projecting onto the active constraint set. Convergence is declared when \(|\Delta f|/(0.1+|f|)<\tau\) (default \(\tau=10^{-8}\)); measuring the change relative to the current function value makes termination scale-invariant.

IRT Tools — Scoring

Ability Estimation

Given calibrated item parameters, PACER estimates each respondent's latent trait \(\hat\theta\) from their observed response vector. Four methods are available. All operate on sparse response vectors so that respondents with missing items are scored without imputation.

Maximum Likelihood Estimation (MLE)

\[\hat\theta_{\text{MLE}}=\arg\max_\theta\sum_j\ln P_j(x_j\mid\theta)\]

A dense grid from \(-8\) to \(+8\) in steps of 0.05 locates the global basin; gradient-based refinement with backtracking then polishes the best grid point. Extreme scores: perfect and null raw scores produce unbounded ML estimates, so PACER subtracts (for a perfect score) or adds (for a null score) 0.5 to the response on the item with the smallest discrimination \(a_j\). The standard error comes from the observed information via a numerical second difference:

\[\widehat{SE}_{\text{MLE}}=\left[-\hat\ell''(\hat\theta)\right]^{-1/2},\quad\hat\ell''(\hat\theta)\approx\frac{\hat\ell(\hat\theta+h)-2\hat\ell(\hat\theta)+\hat\ell(\hat\theta-h)}{h^2},\quad h=10^{-5}\]

Maximum A Posteriori (MAP)

\[\hat\theta_{\text{MAP}}=\arg\max_\theta\left[\sum_j\ln P_j(x_j\mid\theta)+\ln\pi(\theta)\right],\quad\ln\pi(\theta)=-\tfrac{1}{2}\!\left(\tfrac{\theta-\mu}{\sigma}\right)^{\!2}-\ln(\sigma\sqrt{2\pi})\]

The normal prior \(N(\mu,\sigma^2)\) (defaults: \(\mu=0\), \(\sigma=1\)) is added to the log-likelihood and the same minimizer is used. MAP is always finite, including for extreme scores.

Expected A Posteriori (EAP)

EAP is the posterior mean under a normal prior, approximated on a uniform quadrature grid spanning \(\mu\pm4\sigma\) in \(Q\) steps with weights proportional to \(\phi((\theta_k-\mu)/\sigma)\) (normalized to sum to one):

\[\hat\theta_{\text{EAP}}=\frac{\sum_{k=1}^{Q}\theta_k\,L_k(\bm{x})\,w_k}{\sum_{k=1}^{Q}L_k(\bm{x})\,w_k}\]

The posterior standard deviation serves as the standard error:

\[\widehat{SE}_{\text{EAP}}=\sqrt{\frac{\sum_k(\theta_k-\hat\theta_{\text{EAP}})^2\,L_k(\bm{x})\,w_k}{\sum_k L_k(\bm{x})\,w_k}}\]
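An illustrative EAP sketch on the uniform grid described above (Python; the likelihood values at the nodes are taken as given, and names are our own):

```python
import math

def eap(likelihoods_at_nodes, nodes, mu=0.0, sigma=1.0):
    """EAP estimate and posterior-SD standard error on a grid of nodes,
    with normal-prior weights proportional to phi((theta_k - mu)/sigma).
    Normalization constants cancel in the ratios, so unnormalized
    weights suffice."""
    w = [math.exp(-0.5 * ((t - mu) / sigma) ** 2) for t in nodes]
    z = sum(wl * lk for wl, lk in zip(w, likelihoods_at_nodes))
    theta_hat = sum(t * wl * lk
                    for t, wl, lk in zip(nodes, w, likelihoods_at_nodes)) / z
    var = sum((t - theta_hat) ** 2 * wl * lk
              for t, wl, lk in zip(nodes, w, likelihoods_at_nodes)) / z
    return theta_hat, math.sqrt(var)
```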

TCC Scoring

TCC scoring (Lord, 1980) inverts the Test Characteristic Curve \(\tau(\theta)=\sum_j E[X_j\mid\theta]\) to find

\[\tau(\hat\theta_{\text{TCC}})=X_{\text{obs}}\]

solved on \([-8,8]\) by Brent's root-finding method. The SE is \(1/\sqrt{I(\hat\theta_{\text{TCC}})}\) from the test information function.

Boundary Condition

A unique TCC estimate exists only when the raw score exceeds the sum of the guessing parameters across all binary items:

\[\sum_{j=1}^{J}c_j < X_{\text{obs}}\]

When this condition is not met — that is, when the observed score is no higher than what could be achieved by guessing alone — TCC assigns the same lower-bound \(\hat\theta\) to all such examinees. Similarly, a perfect raw score produces the upper-bound value defined by the search interval. The default interval \([-8,8]\) covers the range used in practice; if scores fall outside this range the interval endpoints can be widened.
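A sketch of TCC inversion with this boundary behavior follows; for a self-contained example it uses simple bisection rather than Brent's method, and assumes a monotone increasing TCC (illustrative Python, names our own):

```python
def invert_tcc(tcc, x_obs, lo=-8.0, hi=8.0, tol=1e-10):
    """Invert a monotone increasing TCC on [lo, hi] by bisection.
    Observed scores at or below tcc(lo), or at or above tcc(hi),
    return the interval endpoint, mirroring the boundary rule."""
    if x_obs <= tcc(lo):
        return lo
    if x_obs >= tcc(hi):
        return hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if tcc(mid) < x_obs:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```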

Post-Calibration EAP Thetas

After calibration, PACER computes a \(\hat\theta\) for every respondent using 11-point GH quadrature with a diffuse prior \(N(0,10)\) (\(\sigma=\sqrt{10}\approx3.162\)). These thetas are used for DIF matching and item-fit analysis, not score reporting.

IRT Tools — Test Characteristics

Test Characteristic Curve, Information, and SEM

Test Characteristic Curve (TCC)

\[\text{TCC}(\theta)=\sum_{j=1}^{J}E[X_j\mid\theta]\]
Expected score \(E[X_j\mid\theta]\) by model:

1PL / 2PL / 3PL: \(P_j(\theta)\), the ICC.
GRM: \(\sum_{k=0}^{K}k\cdot P(X=k\mid\theta)\); closed form \(\sum_{k=1}^{K}P^*(X\ge k\mid\theta)\).
GPCM / PCM: \(\sum_{k=0}^{K}k\cdot P(X=k\mid\theta)\) with softmax probabilities.
RSM: same as PCM with \(a=1\) and \(d_{ij}=\delta_i+\tau_j\).

When proportional TCC is enabled, output is divided by the maximum possible total score \(X_{\max}=\sum_j K_j\) (the sum of each item's highest category score) to give a 0–1 scale.

Binary Item Information

For 1PL and 2PL:

\[I_j(\theta)=D^2a_j^2\,P_j(\theta)[1-P_j(\theta)]\]

For 3PL:

\[I_j(\theta)=D^2a_j^2\,\frac{[P_j(\theta)-c_j]^2[1-P_j(\theta)]}{P_j(\theta)(1-c_j)^2}\]

GRM Item Information

Samejima (1969):

\[I_j(\theta)=D^2a_j^2\sum_{k=0}^{K}\frac{[f_k(\theta)-f_{k+1}(\theta)]^2}{P(X=k\mid\theta)}\]

where \(f_k(\theta)=P^*(X\ge k\mid\theta)[1-P^*(X\ge k\mid\theta)]\) is the logistic slope of the \(k\)-th boundary, with sentinels \(f_0\equiv0\) and \(f_{K+1}\equiv0\). Concretely, consecutive logistic-slope values are differenced and each squared difference is divided by the corresponding category probability, then summed and scaled by \(D^2a_j^2\).

GPCM / PCM / RSM Item Information

Information equals \(D^2a^2\) times the conditional variance of the item score (Muraki, 1992):

\[I_j(\theta)=D^2a_j^2\,\text{Var}(X_j\mid\theta)=D^2a_j^2\Bigl[E[X_j^2\mid\theta]-\bigl(E[X_j\mid\theta]\bigr)^2\Bigr]\]

The test-level SEM at each \(\theta\) is

\[\text{SEM}(\theta)=\frac{1}{\sqrt{I(\theta)}}=\frac{1}{\sqrt{\sum_j I_j(\theta)}}\]

IRT Tools — Equating

IRT Equating and Linking

IRT linking places parameter estimates from a new test form onto the scale of an existing item bank using common anchor items. PACER supports six methods in three families. Mixed binary and polytomous items are supported; GRM uses the cumulative-logistic expected score; GPCM/PCM/RSM use the adjacent-category expected score.

The Linear Scale Transformation

All methods estimate \((A,B)\) mapping the new-form theta scale to the bank scale: \(\theta_{\text{bank}}=A\theta_{\text{form}}+B\). Item parameters transform as

\[a_{\text{linked}}=\frac{a_{\text{form}}}{A},\quad b_{\text{linked}}=Ab_{\text{form}}+B,\quad c_{\text{linked}}=c_{\text{form}},\quad d_{\text{linked},k}=Ad_{\text{form},k}+B\]

The reverse transformation is \(a_{\text{rev}}=Aa\), \(b_{\text{rev}}=(b-B)/A\), \(d_{\text{rev},k}=(d_k-B)/A\).
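The forward and reverse transformations for a binary item can be sketched in a few lines (illustrative Python; names are our own):

```python
def link_binary(a, b, c, A, B):
    """Map new-form binary item parameters onto the bank scale:
    a -> a/A, b -> A*b + B, c unchanged."""
    return a / A, A * b + B, c

def unlink_binary(a, b, c, A, B):
    """Reverse transformation, bank scale back to the form scale:
    a -> A*a, b -> (b - B)/A, c unchanged."""
    return a * A, (b - B) / A, c
```

The two functions are exact inverses, which is what makes the symmetric objectives below well-defined.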

Stocking-Lord Methods

Stocking and Lord (1983) minimize the weighted integral of squared TCC differences over \(Q\) GH nodes. Let \(\tau_1(\theta)\) be the bank TCC and \(\tau_2^*(A,B;\theta)\) the linked new-form TCC.

SL.uni — Unidirectional

\[F_{\text{uni}}(A,B)=\sum_k w_k\bigl[\tau_1(\theta_k)-\tau_2^*(A,B;\theta_k)\bigr]^2\]

SL.sym — Symmetric

\[F_{\text{sym}}(A,B)=F_{\text{uni}}(A,B)+\sum_k w_k\bigl[\tau_1^*(\theta_k)-\tau_2(\theta_k)\bigr]^2\]

where \(\tau_1^*\) is the bank TCC reverse-transformed to the form scale.

Haebara Methods

Haebara (1980) replaces TCC comparison with item-by-item ICC comparison. For binary items \(P_{jk}\) is the 3PL ICC; for polytomous items it is \(E[X_j\mid\theta_k]\).

HB.uni — Unidirectional

\[F_{\text{HB,uni}}(A,B)=\sum_k w_k\sum_j\bigl[P_{jk}^{\text{bank}}-P_{jk}^{*(A,B)}\bigr]^2\]

HB.sym — Symmetric

Adds the same sum in the reverse direction — bank ICC reverse-transformed to the form scale vs. the original form ICC.

All four characteristic-curve objectives are minimized by two-dimensional Nelder-Mead simplex (starting simplex side 0.5, up to 2,000 iterations, tolerance \(10^{-10}\)).

Mean/Sigma and Mean/Mean

Mean/Sigma (MS)

\[A=\frac{s_b^{\text{bank}}}{s_b^{\text{form}}},\qquad B=\bar{b}^{\text{bank}}-A\bar{b}^{\text{form}}\]

The location pool: \(b\)-parameters for binary items; all \(d\)-parameters pooled as a flat vector for GPCM/PCM (or averaged per item when the single-location option is selected); GRM boundary parameters; item location \(\delta_i\) only for RSM items.

Mean/Mean (MM)

\[A=\frac{\bar{a}^{\text{form}}}{\bar{a}^{\text{bank}}},\qquad B=\bar{b}^{\text{bank}}-A\bar{b}^{\text{form}}\]

Because \(a_{\text{linked}}=a_{\text{form}}/A\), matching the mean discriminations requires the slope to be the ratio of the form mean to the bank mean.
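Both closed-form methods can be sketched in a few lines. This is a minimal illustration; the arguments are the pooled location vectors (and, for Mean/Mean, the discrimination vectors), and the slope convention follows \(a_{\text{linked}}=a_{\text{form}}/A\) and \(b_{\text{linked}}=Ab_{\text{form}}+B\).

```python
def mean_sigma(b_bank, b_form):
    """Mean/Sigma: match mean and SD of the pooled location parameters."""
    mb = sum(b_bank) / len(b_bank)
    mf = sum(b_form) / len(b_form)
    sb = (sum((x - mb) ** 2 for x in b_bank) / (len(b_bank) - 1)) ** 0.5
    sf = (sum((x - mf) ** 2 for x in b_form) / (len(b_form) - 1)) ** 0.5
    A = sb / sf
    return A, mb - A * mf

def mean_mean(a_bank, a_form, b_bank, b_form):
    """Mean/Mean: slope from discrimination means (a_linked = a_form / A)."""
    A = (sum(a_form) / len(a_form)) / (sum(a_bank) / len(a_bank))
    mb = sum(b_bank) / len(b_bank)
    mf = sum(b_form) / len(b_form)
    return A, mb - A * mf
```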

dropD² Iterative Anchor Purification

After the initial equating, a per-item discrepancy statistic is computed:

\[D^2_j=\sum_{k=1}^{Q}w_k\bigl[P_j^{\text{bank}}(\theta_k)-P_j^{\text{linked}}(\theta_k)\bigr]^2\]

For polytomous items \(P_j\) is the expected score \(E[X_j\mid\theta_k]\). Values are standardized to

\[\tilde{D}^2_j=\frac{D^2_j-\bar{D}^2}{s_{D^2}}\]

The item with the largest \(|\tilde{D}^2_j|\) exceeding the cutoff (default 2.5) is removed; equating is re-run on the reduced anchor set. The process repeats until no item exceeds the cutoff or fewer than two anchors remain.
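The purification loop can be sketched generically. This is an illustration only: `equate` and `discrepancy` are hypothetical callables standing in for the chosen equating method and the \(D^2_j\) computation.

```python
def purify(anchors, equate, discrepancy, cutoff=2.5):
    """Iterative anchor purification: drop the worst item until all pass.
    `equate` maps an anchor list to (A, B); `discrepancy(j, A, B)` gives D^2_j."""
    anchors = list(anchors)
    while len(anchors) >= 2:
        A, B = equate(anchors)
        d2 = [discrepancy(j, A, B) for j in anchors]
        mean = sum(d2) / len(d2)
        sd = (sum((v - mean) ** 2 for v in d2) / (len(d2) - 1)) ** 0.5
        if sd == 0.0:                       # all discrepancies identical
            break
        z = [(v - mean) / sd for v in d2]   # standardized D^2
        worst = max(range(len(z)), key=lambda i: abs(z[i]))
        if abs(z[worst]) <= cutoff:
            break
        del anchors[worst]                  # remove the flagged anchor, re-run
    return anchors, equate(anchors)
```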

CDM Tools

Cognitive Diagnostic Models

CDMs classify examinees into mastery profiles over \(K\) binary latent attributes \(\bm{\alpha}_i=(\alpha_{i1},\ldots,\alpha_{iK})\in\{0,1\}^K\). The Q-matrix \(\bm{Q}\in\{0,1\}^{J\times K}\) specifies which attributes are required by each item.

DINA Model

The Deterministic Inputs, Noisy And-gate (DINA) model sets the ideal response indicator as \(\xi_{ij}=\prod_{k=1}^{K}\alpha_{ik}^{q_{jk}}\). A respondent achieves \(\xi_{ij}=1\) only if they have mastered all required attributes. The response probability accommodates slipping (\(s_j\)) and guessing (\(g_j\)):

\[P(X_{ij}=1\mid\bm{\alpha}_i)=(1-s_j)^{\xi_{ij}}\,g_j^{1-\xi_{ij}}\]

DINO Model

The Deterministic Inputs, Noisy Or-gate (DINO) model requires mastery of at least one required attribute:

\[\omega_{ij}=1-\prod_{k=1}^{K}(1-\alpha_{ik})^{q_{jk}},\qquad P(X_{ij}=1\mid\bm{\alpha}_i)=(1-s_j)^{\omega_{ij}}\,g_j^{1-\omega_{ij}}\]
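The two response-probability rules can be sketched directly from the definitions. Attribute profiles and Q-matrix rows are plain 0/1 sequences in this illustration.

```python
def dina_p(alpha, q, s, g):
    """P(X=1 | alpha) under DINA: all required attributes must be mastered.
    The boolean test reproduces xi = prod(alpha_k ^ q_k)."""
    xi = all(a >= qk for a, qk in zip(alpha, q))
    return (1.0 - s) if xi else g

def dino_p(alpha, q, s, g):
    """P(X=1 | alpha) under DINO: at least one required attribute suffices."""
    omega = any(a and qk for a, qk in zip(alpha, q))
    return (1.0 - s) if omega else g
```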

EM Estimation of Item Parameters

Item parameters (\(s_j\), \(g_j\)) and class probabilities (\(\pi_c\)) are estimated jointly by marginal maximum likelihood over the \(L=2^K\) latent classes \(\bm{\alpha}^{(c)}\) with prior class probabilities \(\pi_c\ge0\), \(\sum_c\pi_c=1\). The algorithm is initialized with equal class probabilities (\(\pi_c=1/L\)) and uniform starting values for \(g_j\) and \(s_j\). The population distribution is re-estimated at each step, so the posteriors are empirical Bayes estimates.

E-Step

Let \(\Pr(\bm{\alpha}_l\mid\bm{x}_i)\) be the posterior probability that examinee \(i\) has skill profile \(\bm{\alpha}_l\). The expected number of examinees with profile \(\bm{\alpha}_l\) is

\[I_l=\sum_{i=1}^{N}\Pr(\bm{\alpha}_l\mid\bm{x}_i)\]

and the expected number of those examinees who answered item \(j\) correctly is

\[R_{jl}=\sum_{i=1}^{N}\Pr(\bm{\alpha}_l\mid\bm{x}_i)\,x_{ij}\]

Aggregating over latent classes, let \(I^0_j=\sum_{l:\,\xi_{jl}=0}I_l\) be the expected number of examinees who lack at least one attribute required for item \(j\), and \(R^0_j=\sum_{l:\,\xi_{jl}=0}R_{jl}\) the expected number of those who nonetheless answered correctly. Analogously, \(I^1_j\) and \(R^1_j\) aggregate over classes with \(\xi_{jl}=1\) (all required attributes mastered).

M-Step

With these expected counts, the guessing and slip parameters are updated in closed form:

\[\hat{g}_j=\frac{R^0_j}{I^0_j},\qquad\hat{s}_j=\frac{I^1_j-R^1_j}{I^1_j}\]

Class probabilities are updated as \(\hat\pi_c=N^{-1}\sum_i\tau_{ic}\). The EM loop alternates between E-step and M-step until the change in parameters across iterations falls below the convergence tolerance.
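The full EM loop for the DINA model can be sketched compactly. This is an unoptimized illustration with a fixed iteration count in place of a convergence test; the starting values of 0.2 for \(g_j\) and \(s_j\) are an assumption made here.

```python
import itertools

def dina_em(X, Q, n_iter=50):
    """DINA EM sketch. X is an N x J 0/1 response matrix,
    Q a J x K binary Q-matrix. Returns (g, s, pi)."""
    N, J, K = len(X), len(X[0]), len(Q[0])
    profiles = list(itertools.product((0, 1), repeat=K))
    L = len(profiles)
    # ideal-response indicator xi for every (class, item) pair
    xi = [[all(a >= q for a, q in zip(alpha, Q[j])) for j in range(J)]
          for alpha in profiles]
    pi = [1.0 / L] * L     # equal class probabilities at start
    g = [0.2] * J          # uniform starting values (assumed)
    s = [0.2] * J
    for _ in range(n_iter):
        # E-step: posterior tau[i][l] over the 2^K latent classes
        tau = []
        for x in X:
            lik = []
            for l in range(L):
                p = pi[l]
                for j in range(J):
                    pj = (1 - s[j]) if xi[l][j] else g[j]
                    p *= pj if x[j] == 1 else 1 - pj
                lik.append(p)
            tot = sum(lik)
            tau.append([v / tot for v in lik])
        # M-step: closed-form updates from expected counts
        for j in range(J):
            I0 = R0 = I1 = R1 = 0.0
            for i, x in enumerate(X):
                for l in range(L):
                    if xi[l][j]:
                        I1 += tau[i][l]
                        R1 += tau[i][l] * x[j]
                    else:
                        I0 += tau[i][l]
                        R0 += tau[i][l] * x[j]
            if I0 > 0:
                g[j] = R0 / I0
            if I1 > 0:
                s[j] = (I1 - R1) / I1
        pi = [sum(tau[i][l] for i in range(N)) / N for l in range(L)]
    return g, s, pi
```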

Skill Profile Estimation

Once item parameters are calibrated, each examinee's skill profile is estimated from the likelihood of their response pattern \(\bm{x}_i\) given each of the \(L=2^K\) possible profiles:

\[\mathcal{L}(\bm{\alpha}_l;\bm{x}_i)=\prod_{j=1}^{J}\Pr(x_{ij}=1\mid\bm{\alpha}_l,g_j,s_j)^{x_{ij}}\left[1-\Pr(x_{ij}=1\mid\bm{\alpha}_l,g_j,s_j)\right]^{1-x_{ij}}\]

Three estimators are available.

Maximum Likelihood (MLE)

The MLE selects the profile that maximizes the likelihood across all \(L\) possible classifications:

\[\hat{\bm{\alpha}}^{\text{MLE}}_i=\underset{\bm{\alpha}_l}{\operatorname{argmax}}\;\mathcal{L}(\bm{\alpha}_l;\bm{x}_i)\]

Because skill profiles are discrete, this is computed by exhaustive evaluation over all \(L\) profiles. The MLE may be multimodal — two or more profiles can share the same maximum likelihood. PACER flags such cases and reports all profiles achieving the maximum.

Maximum A Posteriori (MAP)

The MAP incorporates a prior distribution \(\Pr(\bm{\alpha}_l)\) over skill profiles (supplied by the user), maximizing the posterior:

\[\hat{\bm{\alpha}}^{\text{MAP}}_i=\underset{\bm{\alpha}_l}{\operatorname{argmax}}\;\mathcal{L}(\bm{\alpha}_l;\bm{x}_i)\,\Pr(\bm{\alpha}_l)\]

where \(\Pr(\bm{\alpha}_l)\) is a discrete prior such that \(\sum_l\Pr(\bm{\alpha}_l)=1\). Like the MLE, the MAP can be multimodal.

Expected A Posteriori (EAP)

The EAP uses the full posterior distribution over profiles. The posterior for profile \(\bm{\alpha}_l\) given \(\bm{x}_i\) is

\[\Pr(\bm{\alpha}_l\mid\bm{x}_i)=\frac{\mathcal{L}(\bm{\alpha}_l;\bm{x}_i)\,\Pr(\bm{\alpha}_l)}{\displaystyle\sum_{l'=1}^{L}\mathcal{L}(\bm{\alpha}_{l'};\bm{x}_i)\,\Pr(\bm{\alpha}_{l'})}\]

Let \(\bm{\tau}_i = [\Pr(\bm{\alpha}_1\mid\bm{x}_i),\ldots,\Pr(\bm{\alpha}_L\mid\bm{x}_i)]\) be the \(1\times L\) posterior vector and \(\bm{P}\) the \(L\times K\) binary matrix of all possible skill profiles. The EAP estimate is

\[\hat{\bm{\alpha}}^{\text{EAP}}_i=\bm{\tau}_i\,\bm{P}\]

This is a \(1\times K\) vector where each element is the marginal posterior probability of mastery for the corresponding attribute — equivalent to \(\sum_{l:\,\alpha_k^{(l)}=1}\Pr(\bm{\alpha}_l\mid\bm{x}_i)\). Unlike the MLE and MAP, the EAP yields continuous mastery probabilities between 0 and 1 and is always unique.
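The EAP computation is a single vector-matrix product; a sketch without any array library:

```python
def eap_profile(tau, profiles):
    """Marginal mastery probabilities: posterior vector times the L x K
    binary profile matrix. tau[l] = P(alpha_l | x_i)."""
    K = len(profiles[0])
    return [sum(t * p[k] for t, p in zip(tau, profiles)) for k in range(K)]
```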

Examinee Skill Profile Estimation

Beyond the point estimators above, each examinee's mastery profile is also characterized by the full posterior distribution over the \(2^K\) latent classes. Three quantities are reported.

Modal Profile (MAP Classification)

The modal-class estimate assigns the examinee to the latent class with the highest posterior probability:

\[\hat{c}_i = \arg\max_{c\in\{1,\ldots,2^K\}}\;\tau_{ic}\]

The estimated attribute profile is \(\hat{\bm{\alpha}}_i = \bm{\alpha}^{(\hat{c}_i)}\), a binary vector of length \(K\) indicating mastery (1) or non-mastery (0) of each attribute. This is the primary classification output.

Marginal Attribute Mastery Probabilities

The marginal probability of mastering attribute \(k\) is obtained by summing the posterior over all latent classes in which attribute \(k\) is mastered:

\[P(\alpha_{ik}=1\mid\bm{x}_i)=\sum_{c:\,\alpha_k^{(c)}=1}\tau_{ic}\]

These \(K\) probabilities provide a continuous measure of mastery confidence. A value near 1 indicates strong evidence of mastery; near 0 indicates strong evidence of non-mastery; intermediate values indicate classification uncertainty for that attribute.

Classification Certainty

The posterior entropy of the class distribution for examinee \(i\) quantifies overall classification certainty:

\[H_i = -\sum_{c=1}^{2^K}\tau_{ic}\ln\tau_{ic}\]

Low entropy indicates that most posterior weight is concentrated on a single class (high certainty). High entropy indicates spread across multiple classes, flagging examinees whose profiles should be interpreted cautiously. The minimum possible entropy is 0 (perfect certainty); the maximum is \(\ln(2^K)=K\ln 2\) (uniform uncertainty).
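A sketch of the entropy computation; zero-probability classes contribute nothing, by the usual \(0\ln 0=0\) convention:

```python
import math

def posterior_entropy(tau):
    """Entropy of the class posterior: 0 = perfect certainty,
    K*ln(2) = uniform over all 2^K classes."""
    return -sum(t * math.log(t) for t in tau if t > 0.0)
```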

Item Analysis — DIF

Differential Item Functioning

DIF occurs when respondents from different groups who have the same ability have systematically different probabilities of answering an item correctly. PACER implements the Mantel-Haenszel procedure (Holland & Thayer, 1988) for binary items and the Liu-Agresti (1996) generalization for polytomous items. All procedures first stratify respondents into \(K\) equal-frequency bins (default \(K=10\)) using a matching variable — either total score or IRT theta. Binning uses linear interpolation on the empirical quantile function so that equal proportions of respondents fall in each stratum; the lowest observed score is included in the first bin.

Mantel-Haenszel Procedure (Binary Items)

Within each stratum \(s\), a \(2\times2\) contingency table is formed:

\[\begin{array}{c|cc|c}&\text{Correct}&\text{Incorrect}&\text{Total}\\\hline\text{Reference}&A_s&B_s&n_{R,s}\\\text{Focal}&C_s&D_s&n_{F,s}\\\hline&n_{1,s}&n_{2,s}&n_s\end{array}\]

The MH common odds ratio pools across strata:

\[\hat\alpha_{\text{MH}}=\frac{\displaystyle\sum_s A_s D_s/n_s}{\displaystyle\sum_s B_s C_s/n_s}\]

The ETS D-DIF statistic:

\[\text{MH D-DIF}=-2.35\ln\hat\alpha_{\text{MH}}\]

Negative D-DIF means the reference group has higher odds of answering correctly at the same ability — the item favors the reference group. Note: some published implementations invert this sign; PACER applies the convention stated above.

Chi-square with Yates continuity correction (1 df):

\[\chi^2_{\text{MH}}=\frac{\bigl(|\sum_s A_s-\sum_s E(A_s)|-0.5\bigr)^2}{\sum_s\mathrm{Var}(A_s)},\quad E(A_s)=\frac{n_{R,s}\,n_{1,s}}{n_s},\quad\mathrm{Var}(A_s)=\frac{n_{R,s}\,n_{F,s}\,n_{1,s}\,n_{2,s}}{n_s^2(n_s-1)}\]

The p-value is computed from the regularized incomplete gamma function via series and continued-fraction expansions (accurate to \(\approx10^{-10}\)).
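The pooled odds ratio, D-DIF, and continuity-corrected chi-square can be sketched from the per-stratum tables. This is a minimal illustration; the incomplete-gamma p-value step is omitted.

```python
import math

def mantel_haenszel(tables):
    """MH common odds ratio, ETS D-DIF, and continuity-corrected chi-square.
    `tables` is a list of per-stratum (a, b, c, d) counts:
    a/b = reference correct/incorrect, c/d = focal correct/incorrect."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    alpha = num / den
    ddif = -2.35 * math.log(alpha)
    obs = exp = var = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        n_r, n_f, n1, n2 = a + b, c + d, a + c, b + d
        obs += a
        exp += n_r * n1 / n
        var += n_r * n_f * n1 * n2 / (n * n * (n - 1))
    chi2 = (abs(obs - exp) - 0.5) ** 2 / var
    return alpha, ddif, chi2
```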

Generalized MH Procedure (Polytomous Items)

For polytomous items with score values \(v_1\lt\cdots\lt v_S\), let \(N_{g,k}(v)\) be the count of respondents in group \(g\) with score \(\le v\) in stratum \(k\). The Liu-Agresti (1996) one-df score test:

\[T=\sum_k\sum_v\frac{N_{R,k}(v)\,n_{F,k}-N_{F,k}(v)\,n_{R,k}}{n_k}\]
\[\mathrm{Var}(T)=\sum_k\frac{n_{R,k}\,n_{F,k}}{n_k^2(n_k-1)}\Bigl[n_k\!\sum_v n_{\cdot k}(v)\bigl(N_{\cdot k}(v)\bigr)^2-\Bigl(\sum_v n_{\cdot k}(v)\,N_{\cdot k}(v)\Bigr)^2\Bigr]\]

The test statistic \(T^2/\mathrm{Var}(T)\) is chi-squared with one degree of freedom.

Standardized Mean Difference (SMD)

\[\text{SMD}=\bar\mu_{\text{focal}}-\bar\mu_{\text{reference}}=\sum_k\frac{n_{F,k}}{N_F}\!\left(\bar X_{F,k}-\bar X_{R,k}\right)\]

Positive SMD means the focal group scores higher (item favors focal); negative means the item favors the reference group. Some published implementations use \(|\text{SMD}|\) for the direction determination; PACER uses the raw signed value as defined above. The standardized SMD is

\[\tilde\delta=\frac{|\text{SMD}|}{s_{\text{pooled}}},\qquad s_{\text{pooled}}=\sqrt{\frac{\sum_k(n_k-1)s_k^2}{N-K}}\]
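A sketch of the signed SMD from per-stratum summaries; the tuple layout is an assumption for illustration:

```python
def smd(strata):
    """Signed SMD: focal-weighted difference of within-stratum means.
    `strata` is a list of (n_F_k, xbar_F_k, xbar_R_k) tuples."""
    NF = sum(nf for nf, _, _ in strata)
    return sum(nf / NF * (xf - xr) for nf, xf, xr in strata)
```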

ETS DIF Classification

Class A (negligible DIF; item acceptable): binary, \(p>.05\) or \(|\text{MH D-DIF}|<1.0\); polytomous, \(p>.05\) or \(\tilde\delta\le0.17\).

Class B (moderate DIF; item warrants review): between the A and C criteria on both scales.

Class C (large DIF; item should likely be removed or revised): binary, \(p<.05\) and \(|\text{MH D-DIF}|\ge1.5\); polytomous, \(p<.05\) and \(\tilde\delta\ge0.25\).

Implementation Notes

PACER uses signed MH D-DIF (negative values favor the reference group), the raw signed SMD for direction determination, and infers item score coding from actual observed values, so items coded 1–4 or any other range are handled correctly without recoding.
IRT Tools — Item Fit

Item Fit: Yen's Q1

After calibration, PACER assesses item fit using Yen's (1981) Q1 statistic. Q1 partitions respondents into score groups ordered by estimated ability, then compares observed item responses within each group to the values expected under the fitted IRT model. Systematic discrepancies between observed and expected values indicate that the model does not adequately account for response behavior across the ability continuum.

Person ability estimates are EAP scores computed under a diffuse \(N(0,10)\) prior, which approximates maximum likelihood estimates and avoids boundary problems for extreme scores. Respondents with missing responses on a given item are excluded from that item's computation.

Binary Items

Score Group Formation

Let \(G\) denote the number of score groups (default \(G=10\), following Yen, 1981). Respondents with valid responses to item \(j\) are sorted by their EAP estimate \(\hat\theta_i\) and divided into \(G\) equal-frequency groups. Let \(n_g\) be the number of respondents in group \(g\) and \(\bar\theta_g\) be their mean EAP estimate.

Observed and Expected Proportions

Within group \(g\), the observed proportion correct is

\[\bar{x}_g = \frac{1}{n_g}\sum_{i \in g} x_{ij}\]

The expected proportion correct at the group mean ability is

\[E_g = P(X=1 \mid \bar\theta_g,\, a_j,\, b_j,\, c_j)\]

where \(P(\cdot)\) is the appropriate binary IRF — 3PL for items with a nonzero guessing parameter, 2PL or 1PL otherwise.

The Q1 Statistic

The statistic follows a weighted chi-square form, with each bin's contribution scaled by the variance of a Bernoulli variable with mean \(E_g\):

\[Q_1 = \sum_{g=1}^{G^*} \frac{n_g\,(\bar{x}_g - E_g)^2}{E_g(1-E_g)}\]

where \(G^*\) is the number of non-empty bins actually formed (which may be less than \(G\) when many respondents share the same EAP estimate). Under the null hypothesis that the item fits the model, \(Q_1\) is approximately chi-square distributed with degrees of freedom

\[\mathrm{df} = \begin{cases} G^* - 2 & \text{1PL or 2PL} \\ G^* - 3 & \text{3PL} \end{cases}\]

The reduction by one additional degree of freedom for the 3PL reflects the extra guessing parameter estimated from the data.
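The binary Q1 computation can be sketched end-to-end. This is a simplified illustration using contiguous equal-frequency slices of the sorted ability estimates and a 2PL/3PL IRF with an assumed \(D=1.702\); PACER's binning and IRF evaluation are shared with its calibration engine.

```python
import math

D = 1.702  # logistic scaling constant (assumed value for this sketch)

def q1_binary(theta, x, a, b, c=0.0, G=10):
    """Yen's Q1 for one binary item: sort by ability, form G equal-frequency
    groups, and compare observed to model-expected proportion correct."""
    pairs = sorted(zip(theta, x))
    n = len(pairs)
    edges = [round(g * n / G) for g in range(G + 1)]
    q1, used = 0.0, 0
    for g in range(G):
        grp = pairs[edges[g]:edges[g + 1]]
        if not grp:                                  # skip empty bins
            continue
        used += 1
        tbar = sum(t for t, _ in grp) / len(grp)     # group mean ability
        obs = sum(v for _, v in grp) / len(grp)      # observed proportion
        exp = c + (1 - c) / (1 + math.exp(-D * a * (tbar - b)))
        q1 += len(grp) * (obs - exp) ** 2 / (exp * (1 - exp))
    df = used - (3 if c > 0 else 2)                  # G* - 2, or G* - 3 for 3PL
    return q1, df
```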

Polytomous Items

Score Group Formation

Score group formation is identical to the binary case: respondents are sorted by EAP and divided into \(G\) equal-frequency bins. Let \(K\) denote the number of response categories (0 through \(K-1\)).

Observed and Expected Category Proportions

Within group \(g\), the observed proportion in category \(k\) is

\[\hat{O}_{gk} = \frac{1}{n_g}\sum_{i \in g} \mathbf{1}[x_{ij} = k]\]

The expected proportion in category \(k\) at the group mean ability is

\[E_{gk} = P(X = k \mid \bar\theta_g,\, \bm{\psi}_j)\]

where \(\bm{\psi}_j\) denotes the calibrated parameters for item \(j\) and \(P(\cdot)\) is evaluated using the model-appropriate category response function. For the GRM:

\[P(X=k\mid\theta) = P^*(X \ge k \mid\theta) - P^*(X \ge k+1 \mid\theta)\]

with sentinels \(P^*(X \ge 0)=1\) and \(P^*(X \ge K)=0\). For the GPCM and PCM:

\[P(X=k\mid\theta) = \frac{\exp\!\left(Da\sum_{m=1}^{k}(\theta - d_m)\right)}{\displaystyle\sum_{s=0}^{K-1}\exp\!\left(Da\sum_{m=1}^{s}(\theta - d_m)\right)}\]

with the convention that the empty sum (\(k=0\)) equals zero.

The Q1 Statistic for Polytomous Items

The statistic generalizes the binary form to a sum over both bins and categories:

\[Q_1 = \sum_{g=1}^{G^*} n_g \sum_{k=0}^{K-1} \frac{(\hat{O}_{gk} - E_{gk})^2}{E_{gk}}\]

Under the null hypothesis of model fit, \(Q_1\) is approximately chi-square distributed with

\[\mathrm{df} = G^*(K-1) - K\]

The \(-K\) correction accounts for the \(K\) item parameters estimated from the data.

Standardization and Flagging

Standardized Q1z

To place Q1 on a common scale across items with different numbers of categories and bins, the statistic is standardized using the mean and variance of the chi-square distribution under the null:

\[Q_{1z} = \frac{Q_1 - \mathrm{df}}{\sqrt{2\,\mathrm{df}}}\]

Under correct model specification, \(Q_{1z}\) is approximately standard normal. Large positive values indicate misfit.

Practical Flagging Criterion

PACER uses Yen's (1981) sample-size-adjusted practical cutoff rather than a fixed critical value. An item is flagged for review when

\[Q_{1z} > \frac{N}{1500} \times 4\]

where \(N\) is the number of respondents with valid responses on that item. This heuristic recognizes that with large samples even trivially small model-data discrepancies become statistically detectable. At \(N=1500\) the cutoff equals 4.0; at \(N=500\) it equals approximately 1.33.
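Standardization and the practical cutoff are two lines; a sketch:

```python
def q1_flag(q1, df, n):
    """Standardize Q1 by the null chi-square mean/variance and apply
    Yen's sample-size-adjusted cutoff (N/1500 * 4)."""
    q1z = (q1 - df) / (2 * df) ** 0.5
    cutoff = n / 1500 * 4
    return q1z, q1z > cutoff
```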

Note on Statistical Power

Q1 has high power to detect misfit at large sample sizes. Near-universal flagging on large datasets (\(N \gtrsim 1000\)) does not necessarily indicate a poorly fitting model — it often reflects the sensitivity of the chi-square statistic to minor, practically unimportant discrepancies. Visual inspection of the ICC with observed score-group proportions (available via the Fit button on each item) is recommended alongside Q1z when interpreting flagged items.
Implementation Details

The model type for df determination is taken from the calibrated item model (1PL, 2PL, or 3PL), not from the estimated guessing parameter value. Items with fewer than \(2G\) valid responses return missing fit statistics. All category probability computations use the same IRF implementations as the calibration engine, ensuring internal consistency between estimated parameters and fit evaluation.
Numerical Methods — Quadrature

Gauss-Hermite Quadrature

All integrals over the latent-trait distribution are approximated by Gauss-Hermite quadrature with \(Q\) nodes (default \(Q=21\)):

\[\int_{-\infty}^{\infty}f(\theta)\,\phi(\theta)\,d\theta\approx\sum_{k=1}^{Q}w_k\,f(z_k)\]

where \(\phi\) is the standard normal density. Nodes \(\{z_k\}\) are roots of the probabilists' Hermite polynomial \(\text{He}_Q(x)\). For all values of \(Q\), nodes and weights are computed on demand via the symmetric QL algorithm applied to the tridiagonal Jacobi matrix whose off-diagonal elements are \(\beta_i=\sqrt{i+1}\) for \(i=0,1,\ldots,Q-2\) (the probabilists' He-polynomial recurrence). Eigenvalues of this matrix are the GH nodes; the weight at node \(k\) is \(w_k=v_{k,1}^2\), the squared first component of the \(k\)-th normalized eigenvector. (Golub-Welsch yields \(\sqrt{2\pi}\,v_{k,1}^2\) for integration against the unnormalized weight \(e^{-\theta^2/2}\); the \(\sqrt{2\pi}\) cancels against the normal density's normalizing constant, so the weights above sum to 1.) For multigroup calibration, nodes are shifted and scaled: \(\theta_{gk}=\mu_g+\sigma_g z_k\). Weights are invariant because the Jacobian and density ratio cancel exactly.
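The same construction can be sketched with NumPy's symmetric eigensolver standing in for the QL routine. This is an illustration, not PACER's implementation.

```python
import numpy as np

def gh_nodes_weights(Q):
    """Probabilists' Gauss-Hermite rule via the Golub-Welsch eigenproblem.
    Returns nodes z_k and weights w_k summing to 1, so that E[f(Z)] for
    Z ~ N(0,1) is approximately sum(w_k * f(z_k))."""
    J = np.diag(np.sqrt(np.arange(1, Q)), 1)   # off-diagonals sqrt(1)..sqrt(Q-1)
    J = J + J.T
    nodes, vecs = np.linalg.eigh(J)            # eigenvalues are the nodes
    weights = vecs[0] ** 2                     # squared first components, sum to 1
    return nodes, weights
```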

Numerical Methods — Optimization

Optimization Routines

L-BFGS-B (Item Calibration)

Item parameters are estimated by L-BFGS-B. The two-loop recursion computes the quasi-Newton direction \(\bm{p}_t=-H_t^{-1}\nabla f_t\) from \(m=10\) curvature pairs without forming \(H_t^{-1}\) explicitly. Box constraints are enforced by projecting onto the active constraint set. Step length satisfies the strong Wolfe conditions via backtracking. Convergence: \(|\Delta f|/(0.1+|f|)<\tau\) (default \(\tau=10^{-8}\)).

Gradient Descent with Backtracking (Theta Estimation)

MLE and MAP theta estimation use gradient descent on the negative log-likelihood with backtracking. At each step \(\theta_{\text{new}}=\theta-\alpha[-\ell'(\theta)]\), where the derivative is approximated by central differences, \(\ell'(\theta)\approx[\ell(\theta+h)-\ell(\theta-h)]/(2h)\) with \(h=10^{-6}\). The step size \(\alpha\) is multiplied by 1.2 after an accepted step and halved after a rejected one; the iteration stops when \(\alpha<10^{-10}\).

Brent's Method (TCC Inversion)

TCC scoring inverts \(\tau(\theta)=X_{\text{obs}}\) on \([-8,8]\). Brent's method combines bisection (guaranteed progress), secant (linear interpolation), and inverse quadratic interpolation (parabolic fit through three points). The method falls back to bisection whenever the other two strategies would not make sufficient progress, giving guaranteed \(O(\log\varepsilon^{-1})\) worst-case convergence. Tolerance: \(10^{-10}\).
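For illustration, here is the inversion using only the guaranteed-progress bisection fallback (Brent's full method adds the secant and inverse-quadratic steps on top of this). A binary 2PL TCC with an assumed \(D=1.702\) stands in for the general mixed-format TCC.

```python
import math

D = 1.702  # logistic scaling constant (assumed value for this sketch)

def tcc(theta, items):
    """Monotone increasing TCC: sum of 2PL ICCs over (a, b) tuples."""
    return sum(1.0 / (1.0 + math.exp(-D * a * (theta - b))) for a, b in items)

def invert_tcc(x_obs, items, lo=-8.0, hi=8.0, tol=1e-10):
    """Solve tau(theta) = x_obs on [lo, hi] by bisection; monotonicity
    of the TCC guarantees a unique root for interior scores."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if tcc(mid, items) < x_obs:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```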

Nelder-Mead Simplex (Equating Objectives)

The two-parameter SL and HB equating objectives are minimized over \((A,B)\) by Nelder-Mead simplex. The simplex of three vertices is updated by reflection, expansion, contraction, or shrinkage at each iteration. Initial simplex side: 0.5; maximum iterations: 2,000; convergence tolerance on objective range: \(10^{-10}\).

Documentation

References

Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 561–573.

Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443–459.

Bock, R. D., & Zimowski, M. F. (1997). Multiple group IRT. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory. Springer.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297–334.

Doran, H. (2023). A collection of numerical recipes useful for building scalable psychometric applications. Journal of Educational and Behavioral Statistics. https://doi.org/10.3102/10769986221116905

Haebara, T. (1980). Equating logistic ability scales by a weighted least squares method. Japanese Psychological Research, 22, 144–149.

Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer & H. I. Braun (Eds.), Test validity. Lawrence Erlbaum.

Liu, I. M., & Agresti, A. (1996). Mantel-Haenszel-type inference for cumulative odds ratios with a stratified ordinal response. Biometrics, 52, 1223–1234.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Erlbaum.

Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149–174.

Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16, 159–176.

Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement, No. 17.

Stocking, M. L., & Lord, F. M. (1983). Developing a common metric in item response theory. Applied Psychological Measurement, 7, 201–210.

Yen, W. M. (1981). Using simulation results to choose a latent trait model. Applied Psychological Measurement, 5, 245–262.