
Statistical Methods Reference

PACER is built on a fully transparent statistical foundation. This reference documents every method — the probability models, estimation algorithms, and numerical procedures — exactly as they are implemented. Nothing is a black box. Throughout, PACER employs modern computational methods including parallel processing, numerical stabilization, and efficient algorithms designed to perform well at scale (Doran, 2023).

Item Analysis

Classical Test Theory

CTT characterizes items and tests using observed-score statistics from an \(N\times K\) response matrix (\(N\) respondents, \(K\) items). Items may be dichotomous (0/1) or polytomous (non-negative integers). All variance estimates use the sample \((n-1)\) denominator throughout. Missing data are handled pairwise (default) or listwise.

Item-Level Descriptive Statistics

Mean, SD, and Standard Error

Let \(\mathcal{V}_j\) be the set of respondents with non-missing responses to item \(j\) and \(n_j=|\mathcal{V}_j|\). Then

\[\bar{x}_j=\frac{1}{n_j}\sum_{i\in\mathcal{V}_j}x_{ij},\quad s_j=\sqrt{\frac{\sum_{i\in\mathcal{V}_j}(x_{ij}-\bar{x}_j)^2}{n_j-1}},\quad \widehat{SE}_j=\frac{s_j}{\sqrt{n_j}}\]

Relative Difficulty (p-value)

\[p_j=\frac{\bar{x}_j}{x_j^{\max}}\]

Items with \(p_j<0.15\) or \(p_j>0.85\) are flagged. For dichotomous items this equals the classical p-value.

Discrimination Indices

Adjusted Point-Biserial (Item-Rest Correlation)

Item \(j\) is correlated with the sum of the remaining \(K-1\) items, eliminating the inflation from item \(j\) appearing in its own total:

\[T_i^{(-j)}=\sum_{k\neq j}x_{ik},\qquad r_{pb,j}=\frac{\sum_i(x_{ij}-\bar{x}_j)(T_i^{(-j)}-\overline{T^{(-j)}})}{\sqrt{\sum_i(x_{ij}-\bar{x}_j)^2\cdot\sum_i(T_i^{(-j)}-\overline{T^{(-j)}})^2}}\]

This is Pearson's \(r\), computed pairwise over non-missing observations. Items with \(r_{pb,j}<0.20\) are flagged.
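As a concrete illustration, the item-level descriptives and the item-rest correlation can be sketched in a few lines of Python. This is not PACER's implementation (which handles pairwise missingness); it assumes a complete response matrix, and all function names are our own:

```python
import math

def item_stats(X, max_scores):
    """Per-item mean, SD ((n-1) denominator), SE, p-value, and item-rest
    correlation for a complete N x K response matrix (no missing data)."""
    N, K = len(X), len(X[0])
    stats = []
    for j in range(K):
        col = [row[j] for row in X]
        rest = [sum(row) - row[j] for row in X]          # total without item j
        mean_j = sum(col) / N
        sd_j = math.sqrt(sum((x - mean_j) ** 2 for x in col) / (N - 1))
        p_j = mean_j / max_scores[j]                     # relative difficulty
        # Pearson r between item j and the rest score (adjusted point-biserial)
        mean_r = sum(rest) / N
        num = sum((x - mean_j) * (t - mean_r) for x, t in zip(col, rest))
        den = math.sqrt(sum((x - mean_j) ** 2 for x in col)
                        * sum((t - mean_r) ** 2 for t in rest))
        r_pb = num / den if den > 0 else float("nan")
        stats.append({"mean": mean_j, "sd": sd_j, "se": sd_j / math.sqrt(N),
                      "p": p_j, "r_pb": r_pb,
                      "flag_p": p_j < 0.15 or p_j > 0.85,
                      "flag_disc": r_pb < 0.20})
    return stats
```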

Biserial and Polyserial Correlation

These indices estimate the latent linear association assuming the true variable is continuous and normal but observed only in coarsened categories. For an item with \(M\) categories, let \(\hat{F}_m\) be the cumulative proportion through category \(m\). The internal threshold is \(z_m=\Phi^{-1}(1-\hat{F}_m)\) and its standard normal ordinate is \(\varphi(z_m)\):

\[r_{\text{bis},j}=\frac{s_j\cdot r_{pb,j}}{\displaystyle\sum_{m=1}^{M-1}\varphi(z_m)}\]

The last category is excluded because its threshold is \(+\infty\). For \(M=2\) this reduces to the classical biserial formula. \(\Phi^{-1}\) is computed via the Beasley-Springer-Moro rational approximation (accuracy \(\approx10^{-9}\)).

Reliability and Standard Error of Measurement

Cronbach's \(\alpha\)

\[\alpha=\frac{K}{K-1}\!\left(1-\frac{\sum_{j=1}^K s_j^2}{s_T^2}\right)\]

where \(s_j^2\) is the sample variance of item \(j\) and \(s_T^2\) is the sample variance of total scores (Cronbach, 1951). Both use the \((n-1)\) denominator.

Alpha-if-Item-Deleted

For each item \(j\), \(\alpha\) is recomputed on the \((K-1)\)-item submatrix formed by removing column \(j\). An increase signals that item \(j\) is reducing internal consistency.

Standard Error of Measurement

\[\mathrm{SEM}=s_T\sqrt{1-\alpha}\]
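The \(\alpha\) and SEM formulas above can be sketched directly (illustrative Python for the complete-data case; names are our own, not PACER's):

```python
import math

def cronbach_alpha(X):
    """Cronbach's alpha and SEM from a complete N x K response matrix,
    using sample (n-1) variances throughout."""
    K = len(X[0])
    def var(v):
        m = sum(v) / len(v)
        return sum((x - m) ** 2 for x in v) / (len(v) - 1)
    item_vars = [var([row[j] for row in X]) for j in range(K)]
    total_var = var([sum(row) for row in X])
    alpha = K / (K - 1) * (1 - sum(item_vars) / total_var)
    sem = math.sqrt(total_var) * math.sqrt(1 - alpha)   # SEM = s_T * sqrt(1 - alpha)
    return alpha, sem
```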

IRT Tools — Models

Item Response Models

PACER supports binary and polytomous items in any mixture within a single test. The scaling constant \(D\) defaults to 1.7 (close to the normal-ogive metric; \(D=1\) gives the pure logistic metric). All probabilities are clamped to \([\varepsilon,\,1-\varepsilon]\) with \(\varepsilon\approx1.49\times10^{-8}\) to prevent \(\log 0\) in the likelihood.

Binary Item Response Models


Three-Parameter Logistic (3PL)

\[P(X=1\mid\theta,a,b,c)=c+(1-c)\,\frac{1}{1+e^{-Da(\theta-b)}}\]
\(a\): Discrimination — slope at the ICC inflection point; constrained \(a>0\).
\(b\): Difficulty — \(\theta\) at the inflection point; \(P(b)=(1+c)/2\).
\(c\): Pseudo-guessing — lower asymptote on the probability scale; estimated on the logit scale internally.
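A minimal sketch of the 3PL ICC with the probability clamp described earlier (illustrative Python; `EPS` mirrors the \(\varepsilon\) value above, and the function name is our own):

```python
import math

EPS = 1.49e-8  # probability clamp, matching the epsilon used in the likelihood

def p_3pl(theta, a, b, c, D=1.7):
    """3PL ICC: c + (1 - c) * logistic(D*a*(theta - b)), clamped away from 0/1
    so log-probabilities stay finite. c = 0 reduces to the 2PL."""
    p = c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))
    return min(max(p, EPS), 1.0 - EPS)
```

At \(\theta=b\) the value is \((1+c)/2\), as the parameter table states.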

2PL and 1PL / Rasch

Setting \(c=0\) gives the 2PL. The 1PL additionally fixes \(a=1\) and estimates a free population standard deviation \(\sigma\) that scales the quadrature nodes: \(\theta_k^*=\sigma z_k\), where \(\{z_k\}\) are the standard GH nodes.

Graded Response Model (GRM)


The GRM (Samejima, 1969) models items with \(K+1\) ordered categories \((0,1,\ldots,K)\). The boundary response function gives the cumulative probability of responding in category \(k\) or higher:

\[P^*(X\ge k\mid\theta)=\frac{1}{1+e^{-Da(\theta-d_k)}},\quad k=1,\ldots,K\]

with ordered thresholds \(d_1 < d_2 < \cdots < d_K\) and sentinels \(P^*(X\ge0)\equiv1\), \(P^*(X\ge K+1)\equiv0\). Category probabilities are differences of adjacent boundaries:

\[P(X=k\mid\theta)=P^*(X\ge k\mid\theta)-P^*(X\ge k+1\mid\theta)\]

Boundary probabilities are evaluated as \(p_{kq}=\sigma(Da(\theta_q-d_k))\), consistent with the boundary response function above: larger \(d_k\) lowers the exceedance probability at every \(\theta\), i.e., a harder threshold. Category probabilities are formed by differencing consecutive rows of the \(K\times Q\) boundary matrix.
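The boundary-differencing construction can be sketched as follows (illustrative Python using the boundary response function above, with the sentinels 1 and 0; names are our own):

```python
import math

def grm_probs(theta, a, d, D=1.7):
    """GRM category probabilities for ordered thresholds d[0] < ... < d[K-1].
    Boundary k is P*(X >= k) = logistic(D*a*(theta - d[k-1])); sentinel
    boundaries 1 and 0 are prepended/appended, then adjacent boundaries
    are differenced to give the K+1 category probabilities."""
    bounds = ([1.0]
              + [1.0 / (1.0 + math.exp(-D * a * (theta - dk))) for dk in d]
              + [0.0])
    return [bounds[k] - bounds[k + 1] for k in range(len(d) + 1)]
```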

Generalized Partial Credit Model (GPCM) and PCM


The GPCM (Muraki, 1992) uses adjacent-category transitions. For an item with \(K+1\) categories \((0,\ldots,K)\), define the log-numerator of category \(k\):

\[\eta_k(\theta)=Da\!\left(k\theta-\sum_{j=1}^{k}d_j\right),\quad k=0,1,\ldots,K\]

with \(\eta_0\equiv0\) and a leading \(d_0=0\) prepended to the step-parameter vector. Category probabilities follow the multinomial logit (softmax):

\[P(X=k\mid\theta)=\frac{e^{\eta_k(\theta)}}{\sum_{j=0}^{K}e^{\eta_j(\theta)}}\]

Evaluated via log-sum-exp: \(P(X=k)=\exp(\eta_k-M)/\sum_j\exp(\eta_j-M)\) where \(M=\max_j\eta_j\), preventing overflow at extreme \(\theta\). The PCM (Masters, 1982) is the GPCM with \(a\equiv1\).
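The log-sum-exp evaluation can be sketched as follows (illustrative Python, not the production code; names are our own):

```python
import math

def gpcm_probs(theta, a, d, D=1.7):
    """GPCM category probabilities via log-sum-exp. d holds the steps
    d_1..d_K (d_0 = 0 is implicit); category k has
    eta_k = D*a*(k*theta - sum(d[:k])). PCM is the special case a = 1."""
    K = len(d)
    etas, run = [0.0], 0.0          # eta_0 = 0
    for k in range(1, K + 1):
        run += d[k - 1]
        etas.append(D * a * (k * theta - run))
    m = max(etas)                   # subtract the max to prevent overflow
    exps = [math.exp(e - m) for e in etas]
    z = sum(exps)
    return [e / z for e in exps]
```

Even at extreme \(\theta\), where the raw numerators would overflow, the shifted exponentials stay bounded.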

Convention. The step-parameter vector \(\bm{d}\) always has length equal to the number of categories, with \(d_0=0\) prepended. Scores are 1-based internally; scores used in expected-value and information calculations are 0-based.

Rating Scale Model (RSM)


The RSM (Andrich, 1978) constrains the PCM so that all items share identical step-difficulty offsets \(\tau_1,\ldots,\tau_K\) (\(\tau_0=0\) prepended by convention). Each item has a unique location \(\delta_i\), and its step parameters are

\[d_{ij}=\delta_i+\tau_j,\quad j=0,1,\ldots,K\]

Response probabilities are evaluated with the PCM formula using these compound parameters. The free parameter vector is \([\delta_1,\ldots,\delta_I,\,\tau_1,\ldots,\tau_K]\).

IRT Tools — Calibration

Item Calibration: Marginal Maximum Likelihood via EM

PACER estimates item parameters by marginal maximum likelihood (MML), integrating the latent trait \(\theta\) out of the likelihood. Because \(\theta\) is unknown, direct maximization of the item parameters is not feasible; instead PACER forms the marginal likelihood and maximizes it using the Bock-Aitkin (1981) EM algorithm. Unique response patterns are identified before the E-step to eliminate redundant likelihood evaluations.

The Marginal Likelihood

The goal is to estimate the full vector of item parameters \(\bm{\gamma}\). Since the person parameters \(\theta_i\) are also unknown, they are integrated out under an assumed population distribution \(g(\theta\mid\mu,\sigma)\):

\[\mathcal{L}(\bm{\gamma}\mid\bm{X})=\prod_{i=1}^{N}\int_{-\infty}^{\infty}p(\bm{x}_i\mid\theta,\bm{\gamma})\,g(\theta\mid\mu,\sigma)\,d\theta\]

The integral has no closed form and is evaluated by Gauss-Hermite quadrature over \(Q\) nodes:

\[\mathcal{L}(\bm{\gamma}\mid\bm{X})=\prod_{i=1}^{N}\sum_{k=1}^{Q}p(\bm{x}_i\mid\theta_k,\bm{\gamma})\,w_k\]

In practice, logs of individual pattern likelihoods are accumulated rather than computing products, ensuring numerical stability. When the data contain both binary and polytomous items (\(J\) binary, \(M\) polytomous), the pattern likelihood factors as a product over both item types:

\[p(\bm{x}_i\mid\theta,\bm{\gamma})=\prod_{j=1}^{J}P_j(x_{ij}\mid\theta)\cdot\prod_{l=1}^{M}P_l(x_{il}\mid\theta)\]

where \(P_j\) is the binary ICC (1PL/2PL/3PL) and \(P_l\) is the polytomous category probability (GPCM or GRM). Both item types are estimated within the same EM loop.

Model Identification

For the 2PL and 3PL, the population distribution is fixed at \(N(0,1)\) to identify the latent scale. For the 1PL (Rasch model), only the mean need be fixed; PACER therefore estimates \(N(0,\sigma)\), with \(\sigma\) as a free parameter interpretable as the common discrimination across all items.

When anchor items are present in a single-group calibration, the scale is externally defined by the anchor constraints and the \(N(0,1)\) identification constraint is no longer required. In this case PACER automatically estimates \(\mu\) and \(\sigma\) for the single group using the same closed-form M-step update as the multigroup focal groups (see §Multigroup below). This behavior can be overridden by explicitly setting EstimatePopulation = false in the request.

Starting Values

Because the EM algorithm is iterative, starting values are required. Three strategies are available:

  1. Random: draws each parameter from a uniform distribution over the admissible parameter space. This is useful when the likelihood may be multimodal; running the analysis multiple times with different random starts is a recommended diagnostic.
  2. Fixed: sets \(a_j=1\), \(b_j=0\), \(c_j=0.25\) for all binary items and uses ordered random starting values for polytomous step parameters.
  3. Logistic: uses logistic regression intercepts to initialize the difficulty parameters (and proportional-odds intercepts for polytomous step parameters), with \(a\) and \(c\) initialized as in the fixed option. This is the default and generally performs best.

E-Step: Posterior Weights and Sufficient Statistics

Given current item parameters, the likelihood of response pattern \(\bm{x}_i\) at node \(\theta_k\) is

\[L_k(\bm{x}_i)=\prod_{j=1}^{J}P_j(X_j=x_{ij}\mid\theta_k)\]

The posterior weight is

\[h_{ik}=\frac{L_k(\bm{x}_i)\,w_k}{\displaystyle\sum_{k'=1}^{Q}L_{k'}(\bm{x}_i)\,w_{k'}}\]

The marginal log-likelihood used to monitor convergence is

\[\ell=\sum_{i=1}^{N}\log\sum_{k=1}^{Q}L_k(\bm{x}_i)\,w_k\]
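For a single response pattern, the posterior weights and the log-likelihood contribution can be computed together with a log-sum-exp, mirroring the stabilization described above (illustrative Python; names are our own):

```python
import math

def e_step(pattern_logliks, log_weights):
    """Posterior node weights h_k and marginal log-likelihood contribution
    for one response pattern, given per-node log-likelihoods log L_k and
    log quadrature weights log w_k. Log-sum-exp keeps the sum stable."""
    logs = [ll + lw for ll, lw in zip(pattern_logliks, log_weights)]
    m = max(logs)
    terms = [math.exp(x - m) for x in logs]
    z = sum(terms)
    h = [t / z for t in terms]          # posterior weights, sum to 1
    return h, m + math.log(z)           # log sum_k L_k w_k
```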

Binary Sufficient Statistics

For each binary item \(j\) and node \(k\), summed over respondents \(\mathcal{C}_j\) who answered item \(j\):

\[r_{jk}=\sum_{i\in\mathcal{C}_j}h_{ik}\,x_{ij},\qquad f_{jk}=\sum_{i\in\mathcal{C}_j}h_{ik}\]

\(r_{jk}\) is the expected number of correct responses at node \(k\); \(f_{jk}\) is the expected total count.

Polytomous Sufficient Statistics

For polytomous item \(j\), the sufficient statistic at node \(k\) for category \(s\) is

\[\tilde{n}_{jsk}=\sum_{i\in\mathcal{C}_j}h_{ik}\,\mathbf{1}[x_{ij}=s]\]

In the pattern-based implementation, the posterior weight for unique pattern \(p\) at node \(k\) is \(r_{p,k}=c_p\cdot w_k\cdot L_k/m_p\) where \(c_p\) is the pattern count and \(m_p=\sum_k L_k w_k\).

M-Step: Binary Items (2PL / 1PL / 3PL)

The M-step maximizes the expected complete-data log-likelihood for item \(j\):

\[Q_j(a_j,b_j,c_j)=\sum_{k=1}^{Q}\bigl[r_{jk}\ln P_{jk}+(f_{jk}-r_{jk})\ln(1-P_{jk})\bigr]\]

PACER computes analytical gradients for the 2PL. The gradients with respect to \(a_j\) and \(b_j\) are:

\[\frac{\partial Q_j}{\partial a_j}=\sum_k\frac{L_k\,w_k}{m}\cdot\frac{x_j-P_{jk}}{P_{jk}(1-P_{jk})}\cdot D(\theta_k-b_j)P_{jk}(1-P_{jk})\]
\[\frac{\partial Q_j}{\partial b_j}=\sum_k\frac{L_k\,w_k}{m}\cdot\frac{x_j-P_{jk}}{P_{jk}(1-P_{jk})}\cdot(-Da_j)P_{jk}(1-P_{jk})\]

In the multigroup path, the gradient accumulates across all groups \(g\) using group-scaled nodes \(\theta_{gk}=\mu_g+\sigma_g z_k\):

\[\frac{\partial(-\ell_j)}{\partial a_j}=-\sum_g\sum_k D(\theta_{gk}-b_j)\bigl(r_g[j,k]-f_g[j,k]\,P_{jk}^{(g)}\bigr)\]

For the 3PL, \(c_j\) is parameterized on the logit scale internally; its gradient is computed by forward differences.

M-Step: GPCM and PCM

The complete-data log-likelihood for GPCM item \(j\) is the multinomial log-likelihood in expected category counts:

\[Q_j^{\text{GPCM}}(a_j,\bm{d}_j)=\sum_{k=1}^{Q}\sum_{s=0}^{K_j}\tilde{n}_{jsk}\ln P(X_j=s\mid\theta_k;\,a_j,\bm{d}_j)\]

The analytical gradient with respect to \(a_j\) uses the conditional mean of the log-numerator:

\[\bar{\eta}_k=\sum_{s=0}^{K}P(X_j=s\mid\theta_k)\cdot\!\left(s\theta_k-\sum_{m=1}^{s}d_m\right)\]
\[\frac{\partial Q_j^{\text{GPCM}}}{\partial a_j}=D\sum_k\sum_{\text{pat}}r_{\text{pat},k}\Bigl[\Bigl(s_{\text{pat}}\theta_k-\sum_{m=1}^{s_{\text{pat}}}d_m\Bigr)-\bar{\eta}_k\Bigr]\]

For step parameter \(d_t\) (\(t=1,\ldots,K\)), let \(P_k^{\ge t}=\sum_{s\ge t}P(X=s\mid\theta_k)\) be the exceedance probability. Then:

\[\frac{\partial Q_j^{\text{GPCM}}}{\partial d_t}=Da_j\sum_k\sum_{\text{pat}}r_{\text{pat},k}\Bigl[P_k^{\ge t}-\mathbf{1}[s_{\text{pat}}\ge t]\Bigr]\]

This is a score-residual: model-expected exceedance minus the observed indicator. For the PCM, \(a_j\equiv1\) and the \(a\)-gradient is not computed.

M-Step: GRM

The complete-data log-likelihood for GRM item \(j\) is:

\[Q_j^{\text{GRM}}(a_j,\bm{d}_j)=\sum_k\sum_{\text{pat}}r_{\text{pat},k}\ln P(X_j=s_{\text{pat}}\mid\theta_k;\,a_j,\bm{d}_j)\]

Let \(f_k^{(m)}=P^*(X\ge m\mid\theta_k)\cdot[1-P^*(X\ge m\mid\theta_k)]\) be the logistic slope of the \(m\)-th boundary. The gradient with respect to \(a_j\) depends on which category was observed:

\[\frac{\partial Q_j^{\text{GRM}}}{\partial a_j}=\sum_k\frac{r_{\text{pat},k}}{P_{sk}}\cdot\Delta_k(s)\]

The term \(\Delta_k(s)\) depends on the observed category \(s\):

For \(s=0\): \(\Delta_k(s)=D(d_1-\theta_k)\,f_k^{(1)}\)
For \(0<s<K\): \(\Delta_k(s)=D(d_{s+1}-\theta_k)\,f_k^{(s+1)}-D(d_s-\theta_k)\,f_k^{(s)}\)
For \(s=K\): \(\Delta_k(s)=-D(d_K-\theta_k)\,f_k^{(K)}\)

The gradient with respect to threshold \(d_m\) is nonzero only for the two categories adjacent to it:

\[\frac{\partial Q_j^{\text{GRM}}}{\partial d_m}=\sum_k\frac{r_{\text{pat},k}}{P_{sk}}\Bigl[\mathbf{1}[s=m-1]\cdot Da_j f_k^{(m)}+\mathbf{1}[s=m]\cdot\bigl(-Da_j f_k^{(m)}\bigr)\Bigr]\]

GRM item gradients are parallelized over items at the M-step.

M-Step: RSM

The RSM free parameter vector is \(\bm{\xi}=[\delta_1,\ldots,\delta_I,\,\tau_1,\ldots,\tau_K]\). Within the optimizer, item \(i\)'s step vector is \(\bm{d}_i=[0,\,\delta_i+\tau_1,\ldots,\delta_i+\tau_K]\) and probabilities are evaluated with the PCM formula (\(a=1\)). The gradients follow from the GPCM step gradient with \(a=1\):

\[\frac{\partial Q^{\text{RSM}}}{\partial\delta_i}=D\sum_k\sum_{\text{pat}}r_{\text{pat},k}\sum_{j=1}^{K}\Bigl[P_{ik}^{\ge j}-\mathbf{1}[s_i\ge j]\Bigr]\]
\[\frac{\partial Q^{\text{RSM}}}{\partial\tau_t}=\sum_{i=1}^{I}D\sum_k\sum_{\text{pat}}r_{\text{pat},k}\Bigl[P_{ik}^{\ge t}-\mathbf{1}[s_i\ge t]\Bigr]\]

The \(\tau_t\) gradient is the sum of per-item step gradients across all \(I\) items, since \(\partial d_{it}/\partial\tau_t=1\) for all \(i\). Item gradients are computed in parallel; \(\tau\) gradients are accumulated sequentially to avoid data races.

Parameter Priors

Priors are incorporated by augmenting the marginal log-likelihood with log-prior terms, one per item parameter:

\[\ln p(\bm{\gamma}\mid\bm{X})=\ln\mathcal{L}(\bm{\gamma}\mid\bm{X})+\sum_{j=1}^{J}\bigl[\ln f(a_j)+\ln f(b_j)+\ln f(c_j)\bigr]\]
\(a\) — discrimination: Log-normal \(\mathcal{LN}(\mu_a,\sigma_a^2)\); log-density contribution \(-(\ln a-\mu_a)^2/(2\sigma_a^2)\) (default: \(\mu_a=0\), \(\sigma_a=1\)).
\(b\) — difficulty: Normal \(\mathcal{N}(\mu_b,\sigma_b^2)\); log-density contribution \(-(b-\mu_b)^2/(2\sigma_b^2)\) (default: \(\mu_b=0\), \(\sigma_b=1\)).
\(c\) — guessing (beta, default): Beta \(\mathcal{B}(\alpha,\beta)\); log-density contribution \((\alpha-1)\ln c+(\beta-1)\ln(1-c)\) (default: \(\alpha=4\), \(\beta=16\)).
\(c\) — guessing (logit-normal): Normal on \(\lambda=\operatorname{logit}(c)\); log-density contribution \(-\tfrac{1}{2}\bigl((\lambda-\mu_\lambda)/\sigma_\lambda\bigr)^2\) (default: \(\mu_\lambda=-1.386\), \(\sigma_\lambda=0.544\)).

Priors on \(a\) and \(b\) are optional. A prior on \(c\) is applied by default when estimating the 3PL and can be disabled for very large samples.

Beta Prior for the Guessing Parameter

The beta distribution is parameterized by shape parameters \(\alpha\) and \(\beta\), but it is more natural to specify a prior mean and standard deviation. PACER converts these to shape parameters via the method of moments. Starting from

\[\mu=\frac{\alpha}{\alpha+\beta},\qquad\sigma^2=\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}\]

solving for \(\alpha\) and \(\beta\) gives

\[\alpha=\left(\frac{1-\mu}{\sigma^2}-\frac{1}{\mu}\right)\mu^2,\qquad\beta=\alpha\!\left(\frac{1}{\mu}-1\right)\]

The defaults \(\mu_c=0.20\) and \(\sigma_c=0.087\) correspond to \(\mathcal{B}(4,16)\), which places most mass below 0.35 and is appropriate for items with three to five response options. The logit-normal prior is described in the next section.
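The moment conversion is short enough to show directly; this illustrative sketch (our own function name) reproduces the default mapping:

```python
def beta_from_moments(mu, sigma):
    """Convert a prior mean and SD on the guessing parameter into beta
    shape parameters by the method of moments."""
    var = sigma ** 2
    alpha = ((1 - mu) / var - 1 / mu) * mu ** 2
    beta = alpha * (1 / mu - 1)
    return alpha, beta
```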

Logit-Normal Prior for the Guessing Parameter

The logit-normal prior places a normal distribution directly on the internal logit-scale parameter \(\lambda_j = \operatorname{logit}(c_j) = \ln(c_j/(1-c_j))\), which is the parameterization PACER uses internally during optimization. Because the optimizer works on \(\lambda_j\), no Jacobian correction is required in the objective function — the log-prior contribution is simply the log-density of a normal:

\[\ln f(\lambda_j) = -\frac{(\lambda_j - \mu_\lambda)^2}{2\sigma_\lambda^2} - \ln(\sigma_\lambda\sqrt{2\pi})\]

The gradient of the log-prior with respect to \(\lambda_j\) is therefore

\[\frac{\partial\ln f}{\partial\lambda_j} = -\frac{\lambda_j - \mu_\lambda}{\sigma_\lambda^2}\]

and the corresponding contribution to the negative log-likelihood gradient used by the optimizer is \((\lambda_j - \mu_\lambda)/\sigma_\lambda^2\).

The default hyperparameters \(\mu_\lambda = -1.386\) and \(\sigma_\lambda = 0.544\) are chosen so that the prior mean on the probability scale is \(\operatorname{logit}^{-1}(-1.386) \approx 0.20\), matching the beta prior default mean. The prior is specified in the UI by entering the mean and standard deviation on the logit scale directly.

The logit-normal is preferable to the beta prior when software interoperability requires a prior specified on the logit scale (for example, mirt's parprior = list(c(i, "norm", mu, sd)) applies a normal prior to the internally stored logit of \(c\), which is exactly equivalent to PACER's logit-normal prior). It may also be preferred when the guessing parameter is expected to cluster near zero, since the logit transform spreads mass more evenly in the tails of the probability scale.

Implementation note. The function LogitNormLogPdf(x, μ, σ) — which evaluates \(\ln\mathcal{N}(\operatorname{logit}(x);\mu,\sigma^2) - \ln[x(1-x)]\) and includes the change-of-variables Jacobian — is available internally but is not used for the prior during optimization. Since optimization proceeds on \(\lambda\) directly, only NormLogPdf(λ, μ, σ) is evaluated. LogitNormLogPdf would be appropriate if one needed the prior density on the original \(c\in(0,1)\) scale, for example when reporting marginal likelihoods.

Per-Item Prior Overrides

In addition to the global priors described above, PACER supports per-item prior overrides that selectively replace or suppress the global prior for any individual item–parameter pair. The resolution rule is applied independently for each combination of item \(j\) and parameter \(\psi\in\{a,b,c,d\}\):

  1. If an explicit per-item prior is specified for item \(j\), parameter \(\psi\): use that prior family and hyperparameters.
  2. If an explicit None override is set for item \(j\), parameter \(\psi\): apply no prior for that parameter (pure MLE).
  3. Otherwise: inherit the global prior for \(\psi\) (if the global prior is enabled).
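The three-step rule can be sketched as follows (illustrative Python; the actual resolution is performed by ResolvePrior in ItemEstimation.cs, and the dictionary shape here simply mimics the two-level structure described in the implementation note below):

```python
def resolve_prior(item_priors, item, param, global_prior):
    """Per-item prior resolution: explicit override -> explicit None
    (no prior, pure MLE) -> inherit the global prior.
    item_priors: {item_name: {param_name: prior_spec_or_None}};
    global_prior may itself be None when the global prior is disabled."""
    overrides = item_priors.get(item)
    if overrides is not None and param in overrides:
        # A stored None is an explicit "no prior" override; any other
        # value is an explicit per-item prior specification.
        return overrides[param]
    return global_prior                 # inherit the global setting
```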

The table below lists the parameters that accept per-item priors and their associated prior families:

1PL: \(b\) (Normal \(\mathcal{N}(\mu_b,\sigma_b^2)\))
2PL: \(a\), \(b\) (Log-normal / Normal)
3PL: \(a\), \(b\), \(c\) (Log-normal / Normal / Beta or Logit-normal)
GRM, GPCM: \(a\), \(\mathbf{d}\) thresholds (Log-normal / Normal, applied per threshold)
PCM: \(\mathbf{d}\) steps (Normal, applied per step)

Augmented Log-Posterior with Per-Item Priors

Replacing the global prior family with the resolved per-item family \(f_j^\psi\) for each item \(j\) and parameter \(\psi\), the augmented log-posterior becomes

\[\ln p(\bm{\gamma}\mid\bm{X})=\ln\mathcal{L}(\bm{\gamma}\mid\bm{X})+\sum_{j=1}^{J}\Bigl[\ln f_j^a(a_j)+\ln f_j^b(b_j)+\ln f_j^c(c_j)\Bigr]\]

where \(f_j^\psi\) evaluates to the per-item prior if one is set, the global prior if the item has no override, or 1 (no contribution) if an explicit None override is in effect for that pair. The gradient contributions to the optimizer's negative log-posterior objective are identical in form to the global-prior case — only the hyperparameters differ:

\(a_j\) — discrimination: Log-normal \(\mathcal{LN}(\mu_a,\sigma_a^2)\); gradient added to \(-\ln p\): \(\dfrac{\ln a_j-\mu_a}{\sigma_a^2\,a_j}\)
\(b_j\) — difficulty: Normal \(\mathcal{N}(\mu_b,\sigma_b^2)\); gradient added to \(-\ln p\): \(\dfrac{b_j-\mu_b}{\sigma_b^2}\)
\(c_j\) — guessing (beta): Beta \(\mathcal{B}(\alpha,\beta)\); gradient added to \(-\ln p\): \(-\dfrac{\alpha-1}{c_j}+\dfrac{\beta-1}{1-c_j}\)
\(c_j\) — guessing (logit-normal): Normal on \(\lambda_j=\operatorname{logit}(c_j)\); gradient added to \(-\ln p\): \(\dfrac{\lambda_j-\mu_\lambda}{\sigma_\lambda^2}\)

The hyperparameters \((\mu_a,\sigma_a)\), \((\mu_b,\sigma_b)\), \((\alpha,\beta)\), or \((\mu_\lambda,\sigma_\lambda)\) are taken from the per-item override when present, otherwise from the global prior setting.

Prior on Polytomous Threshold Parameters

For polytomous items (GRM, GPCM, PCM), a single normal prior \(\mathcal{N}(\mu_d,\sigma_d^2)\) is applied independently to each threshold or step parameter \(d_{jk}\). The joint log-prior contribution for item \(j\) with \(K_j\) categories is

\[\ln f(\mathbf{d}_j)=\sum_{k=1}^{K_j}\ln\mathcal{N}(d_{jk};\,\mu_d,\sigma_d^2)=-\frac{1}{2\sigma_d^2}\sum_{k=1}^{K_j}(d_{jk}-\mu_d)^2+\text{const}\]

The gradient contribution to the optimizer objective (negative log-posterior) for threshold \(d_{jk}\) is

\[\frac{\partial}{\partial d_{jk}}\bigl[-\ln f(\mathbf{d}_j)\bigr]=\frac{d_{jk}-\mu_d}{\sigma_d^2}\]

This term is added to the EM sufficient-statistic gradient before passing to L-BFGS-B, analogously to the \(a\) and \(b\) prior gradient contributions for binary items.

Implementation note. Per-item overrides are stored in IrtOptions.ItemPriors as a two-level dictionary keyed first by column name, then by parameter name ("a", "b", "c", or "d"). A missing outer key means the global prior applies. A missing inner key means the global prior applies for that parameter. A null inner value is an explicit None override — it disables the prior for that item–parameter pair even when the global prior is on. Resolution is performed by ResolvePrior(itemPriors, colName, param, globalEnabled, globalP1, globalP2) in ItemEstimation.cs.

Multigroup Calibration (Bock-Zimowski)

For data from multiple groups, PACER implements the Bock-Zimowski (1997) multigroup MML-EM. Each group \(g\) has distribution \(\theta\sim N(\mu_g,\sigma_g^2)\); the reference group (\(g=0\)) is fixed at \(N(0,1)\) for identification.

Group-Scaled Nodes

\[\theta_{gk}=\mu_g+\sigma_g z_k\]

Weights \(\{w_k\}\) are invariant because the Jacobian and density ratio cancel exactly.

Multigroup E-Step

For respondent \(i\) in group \(g\), probabilities are evaluated at \(\theta_{gk}\):

\[h_{ik}=\frac{L_k(\bm{x}_i\mid\theta_{gk})\,w_k}{\sum_{k'}L_{k'}(\bm{x}_i\mid\theta_{gk'})\,w_{k'}}\]

Sufficient statistics \(r_g[j,k]\) and \(f_g[j,k]\) are accumulated per group on the same quadrature index \(k\).

Group Distribution Update (Closed Form)

Let \(W_{gk}=\sum_{i\in g}h_{ik}\). After the item M-step, the general update for a group with both \(\mu\) and \(\sigma\) free is:

\[\hat{\mu}_g=\frac{\sum_k W_{gk}\,\theta_{gk}}{\sum_k W_{gk}},\qquad\hat{\sigma}_g=\sqrt{\frac{\sum_k W_{gk}\,\theta_{gk}^2}{\sum_k W_{gk}}-\hat{\mu}_g^2}\]

For fixed-discrimination models (1PL, PCM, RSM) calibrated without anchors, the mean is identified by convention and held at \(\mu_g=0\); only the scale is free. In this case the second moment around zero is used directly:

\[\hat{\sigma}_g=\sqrt{\frac{\sum_k W_{gk}\,\theta_{gk}^2}{\sum_k W_{gk}}}\]

Both updates are clamped to \(\hat{\sigma}_g\in[0.05,\,10]\) to prevent degenerate quadrature grids. The outer loop iterates until

\[\frac{|\ell^{(t+1)}-\ell^{(t)}|}{1+|\ell^{(t)}|}<\tau_\ell\quad\text{and}\quad\max_{g}\,\bigl(\,|\Delta\mu_g|,\;|\Delta\sigma_g|\,\bigr)<10^{-4}\]

where \(\tau_\ell=\max(\text{Tol},\,10^{-5})\). The convergence condition on the group parameters requires that both the mean and standard deviation shift by less than \(10^{-4}\) for every non-reference group.
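The closed-form group-distribution update, including the \(\sigma\) clamp, can be sketched as (illustrative Python; names are our own):

```python
import math

def update_group(W, theta, fix_mean=False, lo=0.05, hi=10.0):
    """Closed-form M-step update of a group's (mu, sigma) from posterior
    node masses W_k at the group-scaled nodes theta_k. When fix_mean is
    True (1PL/PCM/RSM without anchors), mu stays 0 and the raw second
    moment is used. sigma is clamped to [lo, hi]."""
    total = sum(W)
    m1 = sum(w * t for w, t in zip(W, theta)) / total
    m2 = sum(w * t * t for w, t in zip(W, theta)) / total
    if fix_mean:
        mu, var = 0.0, m2
    else:
        mu, var = m1, m2 - m1 * m1
    sigma = min(max(math.sqrt(max(var, 0.0)), lo), hi)
    return mu, sigma
```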

Optimizer: L-BFGS-B

Item parameters are estimated with L-BFGS-B, a limited-memory quasi-Newton method with box constraints. The inverse Hessian approximation is built from the most recent \(m=10\) curvature pairs \(\{(\bm{s}_t,\bm{y}_t)\}\), where \(\bm{s}_t=\bm{x}_{t+1}-\bm{x}_t\) and \(\bm{y}_t=\nabla f_{t+1}-\nabla f_t\). The two-loop recursion computes the search direction \(\bm{p}_t=-H_t^{-1}\nabla f_t\) without explicitly forming \(H_t^{-1}\). Box constraints are enforced by projecting onto the active constraint set. Convergence is declared when \(|\Delta f|/(0.1+|f|)<\tau\) (default \(\tau=10^{-8}\)); measuring the change relative to the current function value makes termination scale-invariant.

IRT Tools — Scoring

Ability Estimation

Given calibrated item parameters, PACER estimates each respondent's latent trait \(\hat\theta\) from their observed response vector. Four methods are available. All operate on sparse response vectors so that respondents with missing items are scored without imputation.

Maximum Likelihood Estimation (MLE)

\[\hat\theta_{\text{MLE}}=\arg\max_\theta\sum_j\ln P_j(x_j\mid\theta)\]

A dense grid from \(-8\) to \(+8\) in steps of 0.05 locates the global basin; gradient-based refinement with backtracking then polishes the best grid point. Extreme scores: perfect and null raw scores produce unbounded ML estimates, so PACER subtracts (for a perfect score) or adds (for a null score) 0.5 to the response on the item with the smallest discrimination \(a_j\). The standard error comes from the observed information via a numerical second difference:

\[\widehat{SE}_{\text{MLE}}=\left[-\hat\ell''(\hat\theta)\right]^{-1/2},\quad\hat\ell''(\hat\theta)\approx\frac{\hat\ell(\hat\theta+h)-2\hat\ell(\hat\theta)+\hat\ell(\hat\theta-h)}{h^2},\quad h=10^{-5}\]

Maximum A Posteriori (MAP)

\[\hat\theta_{\text{MAP}}=\arg\max_\theta\left[\sum_j\ln P_j(x_j\mid\theta)+\ln\pi(\theta)\right],\quad\ln\pi(\theta)=-\tfrac{1}{2}\!\left(\tfrac{\theta-\mu}{\sigma}\right)^{\!2}-\ln(\sigma\sqrt{2\pi})\]

The normal prior \(N(\mu,\sigma^2)\) (defaults: \(\mu=0\), \(\sigma=1\)) is added to the log-likelihood and the same minimizer is used. MAP is always finite, including for extreme scores.

Expected A Posteriori (EAP)

EAP is the posterior mean under a normal prior, approximated on a uniform quadrature grid spanning \(\mu\pm4\sigma\) in \(Q\) steps with weights proportional to \(\phi((\theta_k-\mu)/\sigma)\) (normalized to sum to one):

\[\hat\theta_{\text{EAP}}=\frac{\sum_{k=1}^{Q}\theta_k\,L_k(\bm{x})\,w_k}{\sum_{k=1}^{Q}L_k(\bm{x})\,w_k}\]

The posterior standard deviation serves as the standard error:

\[\widehat{SE}_{\text{EAP}}=\sqrt{\frac{\sum_k(\theta_k-\hat\theta_{\text{EAP}})^2\,L_k(\bm{x})\,w_k}{\sum_k L_k(\bm{x})\,w_k}}\]
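An illustrative EAP sketch on the uniform grid described above (Python; the likelihood values at the nodes are taken as given, and names are our own):

```python
import math

def eap(likelihoods_at_nodes, nodes, mu=0.0, sigma=1.0):
    """EAP estimate and posterior-SD standard error on a grid of nodes,
    with normal-prior weights proportional to phi((theta_k - mu)/sigma).
    Normalization constants cancel in the ratios, so unnormalized
    weights suffice."""
    w = [math.exp(-0.5 * ((t - mu) / sigma) ** 2) for t in nodes]
    z = sum(wl * lk for wl, lk in zip(w, likelihoods_at_nodes))
    theta_hat = sum(t * wl * lk
                    for t, wl, lk in zip(nodes, w, likelihoods_at_nodes)) / z
    var = sum((t - theta_hat) ** 2 * wl * lk
              for t, wl, lk in zip(nodes, w, likelihoods_at_nodes)) / z
    return theta_hat, math.sqrt(var)
```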

TCC Scoring

TCC scoring (Lord, 1980) inverts the Test Characteristic Curve \(\tau(\theta)=\sum_j E[X_j\mid\theta]\) to find

\[\tau(\hat\theta_{\text{TCC}})=X_{\text{obs}}\]

solved on \([-8,8]\) by Brent's root-finding method. The SE is \(1/\sqrt{I(\hat\theta_{\text{TCC}})}\) from the test information function.

Boundary Condition

A unique TCC estimate exists only when the raw score exceeds the sum of the guessing parameters across all binary items:

\[\sum_{j=1}^{J}c_j < X_{\text{obs}}\]

When this condition is not met — that is, when the observed score is no higher than what could be achieved by guessing alone — TCC assigns the same lower-bound \(\hat\theta\) to all such examinees. Similarly, a perfect raw score produces the upper-bound value defined by the search interval. The default interval \([-8,8]\) covers the range used in practice; if scores fall outside this range the interval endpoints can be widened.
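A sketch of TCC inversion with this boundary behavior follows; for a self-contained example it uses simple bisection rather than Brent's method, and assumes a monotone increasing TCC (illustrative Python, names our own):

```python
def invert_tcc(tcc, x_obs, lo=-8.0, hi=8.0, tol=1e-10):
    """Invert a monotone increasing TCC on [lo, hi] by bisection.
    Observed scores at or below tcc(lo), or at or above tcc(hi),
    return the interval endpoint, mirroring the boundary rule."""
    if x_obs <= tcc(lo):
        return lo
    if x_obs >= tcc(hi):
        return hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if tcc(mid) < x_obs:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```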

Post-Calibration EAP Thetas

After calibration, PACER computes a \(\hat\theta\) for every respondent using 11-point GH quadrature with a diffuse prior \(N(0,10)\) (\(\sigma=\sqrt{10}\approx3.162\)). These thetas are used for DIF matching and item-fit analysis, not score reporting.

IRT Tools — Test Characteristics

Test Characteristic Curve, Information, and SEM

Test Characteristic Curve (TCC)

\[\text{TCC}(\theta)=\sum_{j=1}^{J}E[X_j\mid\theta]\]
Expected score \(E[X_j\mid\theta]\) by model:

1PL / 2PL / 3PL: \(P_j(\theta)\), the ICC.
GRM: \(\sum_{k=0}^{K}k\cdot P(X=k\mid\theta)\); closed form \(\sum_{k=1}^{K}P^*(X\ge k\mid\theta)\).
GPCM / PCM: \(\sum_{k=0}^{K}k\cdot P(X=k\mid\theta)\) with softmax probabilities.
RSM: same as PCM with \(a=1\) and \(d_{ij}=\delta_i+\tau_j\).

When proportional TCC is enabled, output is divided by the maximum possible total score \(X_{\max}=\sum_j K_j\) (the sum of each item's highest category score) to give a 0–1 scale.

Binary Item Information

For 1PL and 2PL:

\[I_j(\theta)=D^2a_j^2\,P_j(\theta)[1-P_j(\theta)]\]

For 3PL:

\[I_j(\theta)=D^2a_j^2\,\frac{[P_j(\theta)-c_j]^2[1-P_j(\theta)]}{P_j(\theta)(1-c_j)^2}\]

GRM Item Information

Samejima (1969):

\[I_j(\theta)=D^2a_j^2\sum_{k=0}^{K}\frac{[f_k(\theta)-f_{k+1}(\theta)]^2}{P(X=k\mid\theta)}\]

where \(f_k(\theta)=P^*(X\ge k\mid\theta)[1-P^*(X\ge k\mid\theta)]\) is the logistic slope of the \(k\)-th boundary, with sentinels \(f_0\equiv0\) and \(f_{K+1}\equiv0\). Concretely, consecutive logistic-slope values are differenced and each squared difference is divided by the corresponding category probability, then summed and scaled by \(D^2a_j^2\).

GPCM / PCM / RSM Item Information

Information equals \(D^2a^2\) times the conditional variance of the item score (Muraki, 1992):

\[I_j(\theta)=D^2a_j^2\,\text{Var}(X_j\mid\theta)=D^2a_j^2\Bigl[E[X_j^2\mid\theta]-\bigl(E[X_j\mid\theta]\bigr)^2\Bigr]\]

The test-level SEM at each \(\theta\) is

\[\text{SEM}(\theta)=\frac{1}{\sqrt{I(\theta)}}=\frac{1}{\sqrt{\sum_j I_j(\theta)}}\]

IRT Tools — Equating

IRT Equating and Linking

IRT linking places parameter estimates from a new test form onto the scale of an existing item bank using common anchor items. PACER supports six methods in three families. Mixed binary and polytomous items are supported; GRM uses the cumulative-logistic expected score; GPCM/PCM/RSM use the adjacent-category expected score.

The Linear Scale Transformation

All methods estimate \((A,B)\) mapping the new-form theta scale to the bank scale: \(\theta_{\text{bank}}=A\theta_{\text{form}}+B\). Item parameters transform as

\[a_{\text{linked}}=\frac{a_{\text{form}}}{A},\quad b_{\text{linked}}=Ab_{\text{form}}+B,\quad c_{\text{linked}}=c_{\text{form}},\quad d_{\text{linked},k}=Ad_{\text{form},k}+B\]

The reverse transformation is \(a_{\text{rev}}=Aa\), \(b_{\text{rev}}=(b-B)/A\), \(d_{\text{rev},k}=(d_k-B)/A\).
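The forward and reverse transformations for a binary item can be sketched in a few lines (illustrative Python; names are our own):

```python
def link_binary(a, b, c, A, B):
    """Map new-form binary item parameters onto the bank scale:
    a -> a/A, b -> A*b + B, c unchanged."""
    return a / A, A * b + B, c

def unlink_binary(a, b, c, A, B):
    """Reverse transformation, bank scale back to the form scale:
    a -> A*a, b -> (b - B)/A, c unchanged."""
    return a * A, (b - B) / A, c
```

The two functions are exact inverses, which is what makes the symmetric objectives below well-defined.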

Stocking-Lord Methods

Stocking and Lord (1983) minimize the weighted integral of squared TCC differences over \(Q\) GH nodes. Let \(\tau_1(\theta)\) be the bank TCC and \(\tau_2^*(A,B;\theta)\) the linked new-form TCC.

SL.uni — Unidirectional

\[F_{\text{uni}}(A,B)=\sum_k w_k\bigl[\tau_1(\theta_k)-\tau_2^*(A,B;\theta_k)\bigr]^2\]

SL.sym — Symmetric

\[F_{\text{sym}}(A,B)=F_{\text{uni}}(A,B)+\sum_k w_k\bigl[\tau_1^*(\theta_k)-\tau_2(\theta_k)\bigr]^2\]

where \(\tau_1^*\) is the bank TCC reverse-transformed to the form scale.

Haebara Methods

Haebara (1980) replaces TCC comparison with item-by-item ICC comparison. For binary items \(P_{jk}\) is the 3PL ICC; for polytomous items it is \(E[X_j\mid\theta_k]\).

HB.uni — Unidirectional

\[F_{\text{HB,uni}}(A,B)=\sum_k w_k\sum_j\bigl[P_{jk}^{\text{bank}}-P_{jk}^{*(A,B)}\bigr]^2\]

HB.sym — Symmetric

Adds the same sum in the reverse direction — bank ICC reverse-transformed to the form scale vs. the original form ICC.

All four characteristic-curve objectives are minimized by two-dimensional Nelder-Mead simplex (starting simplex side 0.5, up to 2,000 iterations, tolerance \(10^{-10}\)).

Mean/Sigma and Mean/Mean

Mean/Sigma (MS)

\[A=\frac{s_b^{\text{bank}}}{s_b^{\text{form}}},\qquad B=\bar{b}^{\text{bank}}-A\bar{b}^{\text{form}}\]

The location pool: \(b\)-parameters for binary items; all \(d\)-parameters pooled as a flat vector for GPCM/PCM (or averaged per item when the single-location option is selected); GRM boundary parameters; item location \(\delta_i\) only for RSM items.

Mean/Mean (MM)

\[A=\frac{\bar{a}^{\text{form}}}{\bar{a}^{\text{bank}}},\qquad B=\bar{b}^{\text{bank}}-A\bar{b}^{\text{form}}\]

Because \(a_{\text{linked}}=a_{\text{form}}/A\), matching the mean discriminations requires the slope to be the ratio of the form mean to the bank mean.
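Both closed-form methods can be sketched in a few lines. This is a minimal illustration; the arguments are the pooled location vectors (and, for Mean/Mean, the discrimination vectors), and the slope convention follows \(a_{\text{linked}}=a_{\text{form}}/A\) and \(b_{\text{linked}}=Ab_{\text{form}}+B\).

```python
def mean_sigma(b_bank, b_form):
    """Mean/Sigma: match mean and SD of the pooled location parameters."""
    mb = sum(b_bank) / len(b_bank)
    mf = sum(b_form) / len(b_form)
    sb = (sum((x - mb) ** 2 for x in b_bank) / (len(b_bank) - 1)) ** 0.5
    sf = (sum((x - mf) ** 2 for x in b_form) / (len(b_form) - 1)) ** 0.5
    A = sb / sf
    return A, mb - A * mf

def mean_mean(a_bank, a_form, b_bank, b_form):
    """Mean/Mean: slope from discrimination means (a_linked = a_form / A)."""
    A = (sum(a_form) / len(a_form)) / (sum(a_bank) / len(a_bank))
    mb = sum(b_bank) / len(b_bank)
    mf = sum(b_form) / len(b_form)
    return A, mb - A * mf
```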

dropD² Iterative Anchor Purification

After the initial equating, a per-item discrepancy statistic is computed:

\[D^2_j=\sum_{k=1}^{Q}w_k\bigl[P_j^{\text{bank}}(\theta_k)-P_j^{\text{linked}}(\theta_k)\bigr]^2\]

For polytomous items \(P_j\) is the expected score \(E[X_j\mid\theta_k]\). Values are standardized to

\[\tilde{D}^2_j=\frac{D^2_j-\bar{D}^2}{s_{D^2}}\]

The item with the largest \(|\tilde{D}^2_j|\) exceeding the cutoff (default 2.5) is removed; equating is re-run on the reduced anchor set. The process repeats until no item exceeds the cutoff or fewer than two anchors remain.
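The purification loop can be sketched generically. This is an illustration only: `equate` and `discrepancy` are hypothetical callables standing in for the chosen equating method and the \(D^2_j\) computation.

```python
def purify(anchors, equate, discrepancy, cutoff=2.5):
    """Iterative anchor purification: drop the worst item until all pass.
    `equate` maps an anchor list to (A, B); `discrepancy(j, A, B)` gives D^2_j."""
    anchors = list(anchors)
    while len(anchors) >= 2:
        A, B = equate(anchors)
        d2 = [discrepancy(j, A, B) for j in anchors]
        mean = sum(d2) / len(d2)
        sd = (sum((v - mean) ** 2 for v in d2) / (len(d2) - 1)) ** 0.5
        if sd == 0.0:                       # all discrepancies identical
            break
        z = [(v - mean) / sd for v in d2]   # standardized D^2
        worst = max(range(len(z)), key=lambda i: abs(z[i]))
        if abs(z[worst]) <= cutoff:
            break
        del anchors[worst]                  # remove the flagged anchor, re-run
    return anchors, equate(anchors)
```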

CDM Tools

Cognitive Diagnostic Models

CDMs classify examinees into mastery profiles over \(K\) binary latent attributes \(\bm{\alpha}_i=(\alpha_{i1},\ldots,\alpha_{iK})\in\{0,1\}^K\). The Q-matrix \(\bm{Q}\in\{0,1\}^{J\times K}\) specifies which attributes are required by each item.

DINA Model

The Deterministic Inputs, Noisy And-gate (DINA) model sets the ideal response indicator as \(\xi_{ij}=\prod_{k=1}^{K}\alpha_{ik}^{q_{jk}}\). A respondent achieves \(\xi_{ij}=1\) only if they have mastered all required attributes. The response probability accommodates slipping (\(s_j\)) and guessing (\(g_j\)):

\[P(X_{ij}=1\mid\bm{\alpha}_i)=(1-s_j)^{\xi_{ij}}\,g_j^{1-\xi_{ij}}\]

DINO Model

The Deterministic Inputs, Noisy Or-gate (DINO) model requires mastery of at least one required attribute:

\[\omega_{ij}=1-\prod_{k=1}^{K}(1-\alpha_{ik})^{q_{jk}},\qquad P(X_{ij}=1\mid\bm{\alpha}_i)=(1-s_j)^{\omega_{ij}}\,g_j^{1-\omega_{ij}}\]
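The two response-probability rules can be sketched directly from the definitions. Attribute profiles and Q-matrix rows are plain 0/1 sequences in this illustration.

```python
def dina_p(alpha, q, s, g):
    """P(X=1 | alpha) under DINA: all required attributes must be mastered.
    The boolean test reproduces xi = prod(alpha_k ^ q_k)."""
    xi = all(a >= qk for a, qk in zip(alpha, q))
    return (1.0 - s) if xi else g

def dino_p(alpha, q, s, g):
    """P(X=1 | alpha) under DINO: at least one required attribute suffices."""
    omega = any(a and qk for a, qk in zip(alpha, q))
    return (1.0 - s) if omega else g
```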

EM Estimation of Item Parameters

Item parameters (\(s_j\), \(g_j\)) and class probabilities (\(\pi_c\)) are estimated jointly by marginal maximum likelihood over the \(L=2^K\) latent classes \(\bm{\alpha}^{(c)}\) with prior class probabilities \(\pi_c\ge0\), \(\sum_c\pi_c=1\). The algorithm is initialized with equal class probabilities (\(\pi_c=1/L\)) and uniform starting values for \(g_j\) and \(s_j\). The population distribution is re-estimated at each step, so the posteriors are empirical Bayes estimates.

E-Step

Let \(\Pr(\bm{\alpha}_l\mid\bm{x}_i)\) be the posterior probability that examinee \(i\) has skill profile \(\bm{\alpha}_l\). The expected number of examinees with profile \(\bm{\alpha}_l\) is

\[I_l=\sum_{i=1}^{N}\Pr(\bm{\alpha}_l\mid\bm{x}_i)\]

and the expected number of those examinees who answered item \(j\) correctly is

\[R_{jl}=\sum_{i=1}^{N}\Pr(\bm{\alpha}_l\mid\bm{x}_i)\,x_{ij}\]

Aggregating over latent classes, let \(I^0_j=\sum_{l:\,\xi_{jl}=0}I_l\) be the expected number of examinees who lack at least one attribute required for item \(j\), and \(R^0_j=\sum_{l:\,\xi_{jl}=0}R_{jl}\) the expected number of those who nonetheless answered correctly. Analogously, \(I^1_j\) and \(R^1_j\) aggregate over classes with \(\xi_{jl}=1\) (all required attributes mastered).

M-Step

With these expected counts, the guessing and slip parameters are updated in closed form:

\[\hat{g}_j=\frac{R^0_j}{I^0_j},\qquad\hat{s}_j=\frac{I^1_j-R^1_j}{I^1_j}\]

Class probabilities are updated as \(\hat\pi_c=N^{-1}\sum_i\tau_{ic}\). The EM loop alternates between E-step and M-step until the change in parameters across iterations falls below the convergence tolerance.
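The full EM loop for the DINA model can be sketched compactly. This is an unoptimized illustration with a fixed iteration count in place of a convergence test; the starting values of 0.2 for \(g_j\) and \(s_j\) are an assumption made here.

```python
import itertools

def dina_em(X, Q, n_iter=50):
    """DINA EM sketch. X is an N x J 0/1 response matrix,
    Q a J x K binary Q-matrix. Returns (g, s, pi)."""
    N, J, K = len(X), len(X[0]), len(Q[0])
    profiles = list(itertools.product((0, 1), repeat=K))
    L = len(profiles)
    # ideal-response indicator xi for every (class, item) pair
    xi = [[all(a >= q for a, q in zip(alpha, Q[j])) for j in range(J)]
          for alpha in profiles]
    pi = [1.0 / L] * L     # equal class probabilities at start
    g = [0.2] * J          # uniform starting values (assumed)
    s = [0.2] * J
    for _ in range(n_iter):
        # E-step: posterior tau[i][l] over the 2^K latent classes
        tau = []
        for x in X:
            lik = []
            for l in range(L):
                p = pi[l]
                for j in range(J):
                    pj = (1 - s[j]) if xi[l][j] else g[j]
                    p *= pj if x[j] == 1 else 1 - pj
                lik.append(p)
            tot = sum(lik)
            tau.append([v / tot for v in lik])
        # M-step: closed-form updates from expected counts
        for j in range(J):
            I0 = R0 = I1 = R1 = 0.0
            for i, x in enumerate(X):
                for l in range(L):
                    if xi[l][j]:
                        I1 += tau[i][l]
                        R1 += tau[i][l] * x[j]
                    else:
                        I0 += tau[i][l]
                        R0 += tau[i][l] * x[j]
            if I0 > 0:
                g[j] = R0 / I0
            if I1 > 0:
                s[j] = (I1 - R1) / I1
        pi = [sum(tau[i][l] for i in range(N)) / N for l in range(L)]
    return g, s, pi
```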

Skill Profile Estimation

Once item parameters are calibrated, each examinee's skill profile is estimated from the likelihood of their response pattern \(\bm{x}_i\) given each of the \(L=2^K\) possible profiles:

\[\mathcal{L}(\bm{\alpha}_l;\bm{x}_i)=\prod_{j=1}^{J}\Pr(x_{ij}=1\mid\bm{\alpha}_l,g_j,s_j)^{x_{ij}}\left[1-\Pr(x_{ij}=1\mid\bm{\alpha}_l,g_j,s_j)\right]^{1-x_{ij}}\]

Three estimators are available.

Maximum Likelihood (MLE)

The MLE selects the profile that maximizes the likelihood across all \(L\) possible classifications:

\[\hat{\bm{\alpha}}^{\text{MLE}}_i=\underset{\bm{\alpha}_l}{\operatorname{argmax}}\;\mathcal{L}(\bm{\alpha}_l;\bm{x}_i)\]

Because skill profiles are discrete, this is computed by exhaustive evaluation over all \(L\) profiles. The MLE may be multimodal — two or more profiles can share the same maximum likelihood. PACER flags such cases and reports all profiles achieving the maximum.

Maximum A Posteriori (MAP)

The MAP incorporates a prior distribution \(\Pr(\bm{\alpha}_l)\) over skill profiles (supplied by the user), maximizing the posterior:

\[\hat{\bm{\alpha}}^{\text{MAP}}_i=\underset{\bm{\alpha}_l}{\operatorname{argmax}}\;\mathcal{L}(\bm{\alpha}_l;\bm{x}_i)\,\Pr(\bm{\alpha}_l)\]

where \(\Pr(\bm{\alpha}_l)\) is a discrete prior such that \(\sum_l\Pr(\bm{\alpha}_l)=1\). Like the MLE, the MAP can be multimodal.

Expected A Posteriori (EAP)

The EAP uses the full posterior distribution over profiles. The posterior for profile \(\bm{\alpha}_l\) given \(\bm{x}_i\) is

\[\Pr(\bm{\alpha}_l\mid\bm{x}_i)=\frac{\mathcal{L}(\bm{\alpha}_l;\bm{x}_i)\,\Pr(\bm{\alpha}_l)}{\displaystyle\sum_{l'=1}^{L}\mathcal{L}(\bm{\alpha}_{l'};\bm{x}_i)\,\Pr(\bm{\alpha}_{l'})}\]

Let \(\bm{\tau}_i = [\Pr(\bm{\alpha}_1\mid\bm{x}_i),\ldots,\Pr(\bm{\alpha}_L\mid\bm{x}_i)]\) be the \(1\times L\) posterior vector and \(\bm{P}\) the \(L\times K\) binary matrix of all possible skill profiles. The EAP estimate is

\[\hat{\bm{\alpha}}^{\text{EAP}}_i=\bm{\tau}_i\,\bm{P}\]

This is a \(1\times K\) vector where each element is the marginal posterior probability of mastery for the corresponding attribute — equivalent to \(\sum_{l:\,\alpha_k^{(l)}=1}\Pr(\bm{\alpha}_l\mid\bm{x}_i)\). Unlike the MLE and MAP, the EAP yields continuous mastery probabilities between 0 and 1 and is always unique.
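The EAP computation is a single vector-matrix product; a sketch without any array library:

```python
def eap_profile(tau, profiles):
    """Marginal mastery probabilities: posterior vector times the L x K
    binary profile matrix. tau[l] = P(alpha_l | x_i)."""
    K = len(profiles[0])
    return [sum(t * p[k] for t, p in zip(tau, profiles)) for k in range(K)]
```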

Examinee Skill Profile Estimation

Beyond the point estimators above, each examinee's mastery profile is also characterized by the full posterior distribution over the \(2^K\) latent classes. Three quantities are reported.

Modal Profile (MAP Classification)

The modal-class estimate assigns the examinee to the latent class with the highest posterior probability:

\[\hat{c}_i = \arg\max_{c\in\{1,\ldots,2^K\}}\;\tau_{ic}\]

The estimated attribute profile is \(\hat{\bm{\alpha}}_i = \bm{\alpha}^{(\hat{c}_i)}\), a binary vector of length \(K\) indicating mastery (1) or non-mastery (0) of each attribute. This is the primary classification output.

Marginal Attribute Mastery Probabilities

The marginal probability of mastering attribute \(k\) is obtained by summing the posterior over all latent classes in which attribute \(k\) is mastered:

\[P(\alpha_{ik}=1\mid\bm{x}_i)=\sum_{c:\,\alpha_k^{(c)}=1}\tau_{ic}\]

These \(K\) probabilities provide a continuous measure of mastery confidence. A value near 1 indicates strong evidence of mastery; near 0 indicates strong evidence of non-mastery; intermediate values indicate classification uncertainty for that attribute.

Classification Certainty

The posterior entropy of the class distribution for examinee \(i\) quantifies overall classification certainty:

\[H_i = -\sum_{c=1}^{2^K}\tau_{ic}\ln\tau_{ic}\]

Low entropy indicates that most posterior weight is concentrated on a single class (high certainty). High entropy indicates spread across multiple classes, flagging examinees whose profiles should be interpreted cautiously. The minimum possible entropy is 0 (perfect certainty); the maximum is \(\ln(2^K)=K\ln 2\) (uniform uncertainty).
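A sketch of the entropy computation; zero-probability classes contribute nothing, by the usual \(0\ln 0=0\) convention:

```python
import math

def posterior_entropy(tau):
    """Entropy of the class posterior: 0 = perfect certainty,
    K*ln(2) = uniform over all 2^K classes."""
    return -sum(t * math.log(t) for t in tau if t > 0.0)
```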

Item Analysis — DIF

Differential Item Functioning

DIF occurs when respondents from different groups who have the same ability have systematically different probabilities of answering an item correctly. PACER implements the Mantel-Haenszel procedure (Holland & Thayer, 1988) for binary items and the Liu-Agresti (1996) generalization for polytomous items. All procedures first stratify respondents into \(K\) equal-frequency bins (default \(K=10\)) using a matching variable — either total score or IRT theta. Binning uses linear interpolation on the empirical quantile function so that equal proportions of respondents fall in each stratum; the lowest observed score is included in the first bin.

Mantel-Haenszel Procedure (Binary Items)

Within each stratum \(s\), a \(2\times2\) contingency table is formed:

\[\begin{array}{c|cc|c}&\text{Correct}&\text{Incorrect}&\text{Total}\\\hline\text{Reference}&A_s&B_s&n_{R,s}\\\text{Focal}&C_s&D_s&n_{F,s}\\\hline&n_{1,s}&n_{2,s}&n_s\end{array}\]

The MH common odds ratio pools across strata:

\[\hat\alpha_{\text{MH}}=\frac{\displaystyle\sum_s A_s D_s/n_s}{\displaystyle\sum_s B_s C_s/n_s}\]

The ETS D-DIF statistic:

\[\text{MH D-DIF}=-2.35\ln\hat\alpha_{\text{MH}}\]

Negative D-DIF means the reference group has higher odds of answering correctly at the same ability — the item favors the reference group. Note: some published implementations invert this sign; PACER applies the convention stated above.

Chi-square with Yates continuity correction (1 df):

\[\chi^2_{\text{MH}}=\frac{\bigl(|\sum_s A_s-\sum_s E(A_s)|-0.5\bigr)^2}{\sum_s\mathrm{Var}(A_s)},\quad E(A_s)=\frac{n_{R,s}\,n_{1,s}}{n_s},\quad\mathrm{Var}(A_s)=\frac{n_{R,s}\,n_{F,s}\,n_{1,s}\,n_{2,s}}{n_s^2(n_s-1)}\]

The p-value is computed from the regularized incomplete gamma function via series and continued-fraction expansions (accurate to \(\approx10^{-10}\)).
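The pooled odds ratio, D-DIF, and continuity-corrected chi-square can be sketched from the per-stratum tables. This is a minimal illustration; the incomplete-gamma p-value step is omitted.

```python
import math

def mantel_haenszel(tables):
    """MH common odds ratio, ETS D-DIF, and continuity-corrected chi-square.
    `tables` is a list of per-stratum (a, b, c, d) counts:
    a/b = reference correct/incorrect, c/d = focal correct/incorrect."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    alpha = num / den
    ddif = -2.35 * math.log(alpha)
    obs = exp = var = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        n_r, n_f, n1, n2 = a + b, c + d, a + c, b + d
        obs += a
        exp += n_r * n1 / n
        var += n_r * n_f * n1 * n2 / (n * n * (n - 1))
    chi2 = (abs(obs - exp) - 0.5) ** 2 / var
    return alpha, ddif, chi2
```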

Generalized MH Procedure (Polytomous Items)

For polytomous items with score values \(v_1\lt\cdots\lt v_S\), let \(N_{g,k}(v)\) be the count of respondents in group \(g\) with score \(\le v\) in stratum \(k\). The Liu-Agresti (1996) one-df score test:

\[T=\sum_k\sum_v\frac{N_{R,k}(v)\,n_{F,k}-N_{F,k}(v)\,n_{R,k}}{n_k}\]
\[\mathrm{Var}(T)=\sum_k\frac{n_{R,k}\,n_{F,k}}{n_k^2(n_k-1)}\Bigl[n_k\!\sum_v n_{\cdot k}(v)\bigl(N_{\cdot k}(v)\bigr)^2-\Bigl(\sum_v n_{\cdot k}(v)\,N_{\cdot k}(v)\Bigr)^2\Bigr]\]

The test statistic \(T^2/\mathrm{Var}(T)\) is chi-squared with one degree of freedom.

Standardized Mean Difference (SMD)

\[\text{SMD}=\bar\mu_{\text{focal}}-\bar\mu_{\text{reference}}=\sum_k\frac{n_{F,k}}{N_F}\!\left(\bar X_{F,k}-\bar X_{R,k}\right)\]

Positive SMD means the focal group scores higher (item favors focal); negative means the item favors the reference group. Some published implementations use \(|\text{SMD}|\) for the direction determination; PACER uses the raw signed value as defined above. The standardized SMD is

\[\tilde\delta=\frac{|\text{SMD}|}{s_{\text{pooled}}},\qquad s_{\text{pooled}}=\sqrt{\frac{\sum_k(n_k-1)s_k^2}{N-K}}\]
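A sketch of the signed SMD from per-stratum summaries; the tuple layout is an assumption for illustration:

```python
def smd(strata):
    """Signed SMD: focal-weighted difference of within-stratum means.
    `strata` is a list of (n_F_k, xbar_F_k, xbar_R_k) tuples."""
    NF = sum(nf for nf, _, _ in strata)
    return sum(nf / NF * (xf - xr) for nf, xf, xr in strata)
```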

ETS DIF Classification

Class A (negligible DIF; item acceptable): binary, \(p>.05\) or \(|\text{MH D-DIF}|<1.0\); polytomous, \(p>.05\) or \(\tilde\delta\le0.17\).

Class B (moderate DIF; item warrants review): between the A and C criteria on both scales.

Class C (large DIF; item should likely be removed or revised): binary, \(p<.05\) and \(|\text{MH D-DIF}|\ge1.5\); polytomous, \(p<.05\) and \(\tilde\delta\ge0.25\).

Implementation Notes

PACER uses signed MH D-DIF (negative values favor the reference group), the raw signed SMD for direction determination, and infers item score coding from actual observed values, so items coded 1–4 or any other range are handled correctly without recoding.
IRT Tools — Item Fit

Item Fit: Yen's Q1

After calibration, PACER assesses item fit using Yen's (1981) Q1 statistic. Q1 partitions respondents into score groups ordered by estimated ability, then compares observed item responses within each group to the values expected under the fitted IRT model. Systematic discrepancies between observed and expected values indicate that the model does not adequately account for response behavior across the ability continuum.

Person ability estimates are EAP scores computed under a diffuse \(N(0,10)\) prior, which approximates maximum likelihood estimates and avoids boundary problems for extreme scores. Respondents with missing responses on a given item are excluded from that item's computation.

Binary Items

Score Group Formation

Let \(G\) denote the number of score groups (default \(G=10\), following Yen, 1981). Respondents with valid responses to item \(j\) are sorted by their EAP estimate \(\hat\theta_i\) and divided into \(G\) equal-frequency groups. Let \(n_g\) be the number of respondents in group \(g\) and \(\bar\theta_g\) be their mean EAP estimate.

Observed and Expected Proportions

Within group \(g\), the observed proportion correct is

\[\bar{x}_g = \frac{1}{n_g}\sum_{i \in g} x_{ij}\]

The expected proportion correct at the group mean ability is

\[E_g = P(X=1 \mid \bar\theta_g,\, a_j,\, b_j,\, c_j)\]

where \(P(\cdot)\) is the appropriate binary IRF — 3PL for items with a nonzero guessing parameter, 2PL or 1PL otherwise.

The Q1 Statistic

The statistic follows a weighted chi-square form, with each bin's contribution scaled by the variance of a Bernoulli variable with mean \(E_g\):

\[Q_1 = \sum_{g=1}^{G^*} \frac{n_g\,(\bar{x}_g - E_g)^2}{E_g(1-E_g)}\]

where \(G^*\) is the number of non-empty bins actually formed (which may be less than \(G\) when many respondents share the same EAP estimate). Under the null hypothesis that the item fits the model, \(Q_1\) is approximately chi-square distributed with degrees of freedom

\[\mathrm{df} = \begin{cases} G^* - 2 & \text{1PL or 2PL} \\ G^* - 3 & \text{3PL} \end{cases}\]

The reduction by one additional degree of freedom for the 3PL reflects the extra guessing parameter estimated from the data.
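The binary Q1 computation can be sketched end-to-end. This is a simplified illustration using contiguous equal-frequency slices of the sorted ability estimates and a 2PL/3PL IRF with an assumed \(D=1.702\); PACER's binning and IRF evaluation are shared with its calibration engine.

```python
import math

D = 1.702  # logistic scaling constant (assumed value for this sketch)

def q1_binary(theta, x, a, b, c=0.0, G=10):
    """Yen's Q1 for one binary item: sort by ability, form G equal-frequency
    groups, and compare observed to model-expected proportion correct."""
    pairs = sorted(zip(theta, x))
    n = len(pairs)
    edges = [round(g * n / G) for g in range(G + 1)]
    q1, used = 0.0, 0
    for g in range(G):
        grp = pairs[edges[g]:edges[g + 1]]
        if not grp:                                  # skip empty bins
            continue
        used += 1
        tbar = sum(t for t, _ in grp) / len(grp)     # group mean ability
        obs = sum(v for _, v in grp) / len(grp)      # observed proportion
        exp = c + (1 - c) / (1 + math.exp(-D * a * (tbar - b)))
        q1 += len(grp) * (obs - exp) ** 2 / (exp * (1 - exp))
    df = used - (3 if c > 0 else 2)                  # G* - 2, or G* - 3 for 3PL
    return q1, df
```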

Polytomous Items

Score Group Formation

Score group formation is identical to the binary case: respondents are sorted by EAP and divided into \(G\) equal-frequency bins. Let \(K\) denote the number of response categories (0 through \(K-1\)).

Observed and Expected Category Proportions

Within group \(g\), the observed proportion in category \(k\) is

\[\hat{O}_{gk} = \frac{1}{n_g}\sum_{i \in g} \mathbf{1}[x_{ij} = k]\]

The expected proportion in category \(k\) at the group mean ability is

\[E_{gk} = P(X = k \mid \bar\theta_g,\, \bm{\psi}_j)\]

where \(\bm{\psi}_j\) denotes the calibrated parameters for item \(j\) and \(P(\cdot)\) is evaluated using the model-appropriate category response function. For the GRM:

\[P(X=k\mid\theta) = P^*(X \ge k \mid\theta) - P^*(X \ge k+1 \mid\theta)\]

with sentinels \(P^*(X \ge 0)=1\) and \(P^*(X \ge K)=0\). For the GPCM and PCM:

\[P(X=k\mid\theta) = \frac{\exp\!\left(Da\sum_{m=1}^{k}(\theta - d_m)\right)}{\displaystyle\sum_{s=0}^{K-1}\exp\!\left(Da\sum_{m=1}^{s}(\theta - d_m)\right)}\]

with the convention that the empty sum (\(k=0\)) equals zero.

The Q1 Statistic for Polytomous Items

The statistic generalizes the binary form to a sum over both bins and categories:

\[Q_1 = \sum_{g=1}^{G^*} n_g \sum_{k=0}^{K-1} \frac{(\hat{O}_{gk} - E_{gk})^2}{E_{gk}}\]

Under the null hypothesis of model fit, \(Q_1\) is approximately chi-square distributed with

\[\mathrm{df} = G^*(K-1) - K\]

The \(-K\) correction accounts for the \(K\) item parameters estimated from the data.

Standardization and Flagging

Standardized Q1z

To place Q1 on a common scale across items with different numbers of categories and bins, the statistic is standardized using the mean and variance of the chi-square distribution under the null:

\[Q_{1z} = \frac{Q_1 - \mathrm{df}}{\sqrt{2\,\mathrm{df}}}\]

Under correct model specification, \(Q_{1z}\) is approximately standard normal. Large positive values indicate misfit.

Practical Flagging Criterion

PACER uses Yen's (1981) sample-size-adjusted practical cutoff rather than a fixed critical value. An item is flagged for review when

\[Q_{1z} > \frac{N}{1500} \times 4\]

where \(N\) is the number of respondents with valid responses on that item. This heuristic recognizes that with large samples even trivially small model-data discrepancies become statistically detectable. At \(N=1500\) the cutoff equals 4.0; at \(N=500\) it equals approximately 1.33.
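Standardization and the practical cutoff are two lines; a sketch:

```python
def q1_flag(q1, df, n):
    """Standardize Q1 by the null chi-square mean/variance and apply
    Yen's sample-size-adjusted cutoff (N/1500 * 4)."""
    q1z = (q1 - df) / (2 * df) ** 0.5
    cutoff = n / 1500 * 4
    return q1z, q1z > cutoff
```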

Note on Statistical Power

Q1 has high power to detect misfit at large sample sizes. Near-universal flagging on large datasets (\(N \gtrsim 1000\)) does not necessarily indicate a poorly fitting model — it often reflects the sensitivity of the chi-square statistic to minor, practically unimportant discrepancies. Visual inspection of the ICC with observed score-group proportions (available via the Fit button on each item) is recommended alongside Q1z when interpreting flagged items.
Implementation Details

The model type for df determination is taken from the calibrated item model (1PL, 2PL, or 3PL), not from the estimated guessing parameter value. Items with fewer than \(2G\) valid responses return missing fit statistics. All category probability computations use the same IRF implementations as the calibration engine, ensuring internal consistency between estimated parameters and fit evaluation.
Numerical Methods — Quadrature

Gauss-Hermite Quadrature

All integrals over the latent-trait distribution are approximated by Gauss-Hermite quadrature with \(Q\) nodes (default \(Q=21\)):

\[\int_{-\infty}^{\infty}f(\theta)\,\phi(\theta)\,d\theta\approx\sum_{k=1}^{Q}w_k\,f(z_k)\]

where \(\phi\) is the standard normal density. Nodes \(\{z_k\}\) are roots of the probabilists' Hermite polynomial \(\text{He}_Q(x)\). For all values of \(Q\), nodes and weights are computed on demand via the symmetric QL algorithm applied to the tridiagonal Jacobi matrix whose off-diagonal elements are \(\beta_i=\sqrt{i+1}\) for \(i=0,1,\ldots,Q-2\) (the probabilists' He-polynomial recurrence). Eigenvalues of this matrix are the GH nodes; the weight at node \(k\) is \(w_k=v_{k,1}^2\), the squared first component of the \(k\)-th normalized eigenvector. (Golub-Welsch yields \(\sqrt{2\pi}\,v_{k,1}^2\) for integration against the unnormalized weight \(e^{-\theta^2/2}\); the \(\sqrt{2\pi}\) cancels against the normal density's normalizing constant, so the weights above sum to 1.) For multigroup calibration, nodes are shifted and scaled: \(\theta_{gk}=\mu_g+\sigma_g z_k\). Weights are invariant because the Jacobian and density ratio cancel exactly.
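The same construction can be sketched with NumPy's symmetric eigensolver standing in for the QL routine. This is an illustration, not PACER's implementation.

```python
import numpy as np

def gh_nodes_weights(Q):
    """Probabilists' Gauss-Hermite rule via the Golub-Welsch eigenproblem.
    Returns nodes z_k and weights w_k summing to 1, so that E[f(Z)] for
    Z ~ N(0,1) is approximately sum(w_k * f(z_k))."""
    J = np.diag(np.sqrt(np.arange(1, Q)), 1)   # off-diagonals sqrt(1)..sqrt(Q-1)
    J = J + J.T
    nodes, vecs = np.linalg.eigh(J)            # eigenvalues are the nodes
    weights = vecs[0] ** 2                     # squared first components, sum to 1
    return nodes, weights
```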

Numerical Methods — Optimization

Optimization Routines

L-BFGS-B (Item Calibration)

Item parameters are estimated by L-BFGS-B. The two-loop recursion computes the quasi-Newton direction \(\bm{p}_t=-H_t^{-1}\nabla f_t\) from \(m=10\) curvature pairs without forming \(H_t^{-1}\) explicitly. Box constraints are enforced by projecting onto the active constraint set. Step length satisfies the strong Wolfe conditions via backtracking. Convergence: \(|\Delta f|/(0.1+|f|)<\tau\) (default \(\tau=10^{-8}\)).

Gradient Descent with Backtracking (Theta Estimation)

MLE and MAP theta estimation use gradient descent on the negative log-likelihood with backtracking. At each step \(\theta_{\text{new}}=\theta-\alpha[-\ell'(\theta)]\), where the derivative is approximated by central differences, \(\ell'(\theta)\approx[\ell(\theta+h)-\ell(\theta-h)]/(2h)\) with \(h=10^{-6}\). The step size \(\alpha\) is multiplied by 1.2 after an accepted step and halved after a rejected one; the iteration stops when \(\alpha<10^{-10}\).

Brent's Method (TCC Inversion)

TCC scoring inverts \(\tau(\theta)=X_{\text{obs}}\) on \([-8,8]\). Brent's method combines bisection (guaranteed progress), secant (linear interpolation), and inverse quadratic interpolation (parabolic fit through three points). The method falls back to bisection whenever the other two strategies would not make sufficient progress, giving guaranteed \(O(\log\varepsilon^{-1})\) worst-case convergence. Tolerance: \(10^{-10}\).
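For illustration, here is the inversion using only the guaranteed-progress bisection fallback (Brent's full method adds the secant and inverse-quadratic steps on top of this). A binary 2PL TCC with an assumed \(D=1.702\) stands in for the general mixed-format TCC.

```python
import math

D = 1.702  # logistic scaling constant (assumed value for this sketch)

def tcc(theta, items):
    """Monotone increasing TCC: sum of 2PL ICCs over (a, b) tuples."""
    return sum(1.0 / (1.0 + math.exp(-D * a * (theta - b))) for a, b in items)

def invert_tcc(x_obs, items, lo=-8.0, hi=8.0, tol=1e-10):
    """Solve tau(theta) = x_obs on [lo, hi] by bisection; monotonicity
    of the TCC guarantees a unique root for interior scores."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if tcc(mid, items) < x_obs:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```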

Nelder-Mead Simplex (Equating Objectives)

The two-parameter SL and HB equating objectives are minimized over \((A,B)\) by Nelder-Mead simplex. The simplex of three vertices is updated by reflection, expansion, contraction, or shrinkage at each iteration. Initial simplex side: 0.5; maximum iterations: 2,000; convergence tolerance on objective range: \(10^{-10}\).

Documentation

References

Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 561–573.

Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443–459.

Bock, R. D., & Zimowski, M. F. (1997). Multiple group IRT. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory. Springer.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297–334.

Doran, H. (2023). A collection of numerical recipes useful for building scalable psychometric applications. Journal of Educational and Behavioral Statistics. https://doi.org/10.3102/10769986221116905

Haebara, T. (1980). Equating logistic ability scales by a weighted least squares method. Japanese Psychological Research, 22, 144–149.

Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer & H. I. Braun (Eds.), Test validity. Lawrence Erlbaum.

Liu, I. M., & Agresti, A. (1996). Mantel-Haenszel-type inference for cumulative odds ratios with a stratified ordinal response. Biometrics, 52, 1223–1234.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Erlbaum.

Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149–174.

Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16, 159–176.

Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement, No. 17.

Stocking, M. L., & Lord, F. M. (1983). Developing a common metric in item response theory. Applied Psychological Measurement, 7, 201–210.

Yen, W. M. (1981). Using simulation results to choose a latent trait model. Applied Psychological Measurement, 5, 245–262.