Roovet Articles

Psychometrics

Psychometrics is the scientific field devoted to the measurement of psychological attributes (such as abilities, knowledge, skills, traits, attitudes, interests, symptoms, and quality of life) using formal models, carefully designed items, and statistical theory. In education, clinical practice, health outcomes, and personnel selection, psychometrics provides the tools to design tests and questionnaires, evaluate their reliability and validity, calibrate items on common scales, equate scores across forms, monitor fairness and bias, and report interpretable results with quantified uncertainty. Because it is applied across disciplines, the field integrates classical test theory, factor analysis, item response theory (IRT), generalizability theory, structural equation modeling (SEM), computerized adaptive testing (CAT), and modern Bayesian and computational methods.[1][2][3]

Psychometricians seek **construct-relevant** measurement: the evidence-based inference from observed responses to underlying attributes. Core concerns include: (a) **reliability**—the consistency or precision of scores; (b) **validity**—the appropriateness, meaningfulness, and consequences of score interpretations for specific uses; (c) **fairness** and **invariance**—whether scores carry the same meaning across groups or occasions; and (d) **utility**—how measurement supports decisions and improves outcomes.[4][5]

Psychometrics
Figure: Bell curve and test score illustration
Also called: Psychological measurement; educational measurement
Part of: Statistics • Psychology • Education • Health outcomes • Industrial and organizational psychology
Core aims: Design, analyze, and validate tests and scales; ensure reliability, validity, invariance, fairness; support decisions
Major models: Classical test theory • Factor analysis • Item response theory (1PL/Rasch, 2PL, 3PL, graded/partial credit, MIRT) • Generalizability theory • SEM
Common methods: Pilot testing • Item analysis • Differential item functioning • Equating • Standard setting • CAT • Bayesian calibration
Applications: Admissions & licensure • Clinical assessment • PROs/HRQoL • Employee selection • Survey research • Program evaluation
Key journals/societies: Psychometrika • Applied Psychological Measurement • Educational Measurement: Issues and Practice • Psychometric Society • NCME

Historical development

Psychometrics emerged from 19th–20th century efforts to quantify individual differences and mental abilities. **Galton** and **Pearson** introduced correlation, regression, and early scaling. **Spearman** proposed the general factor g and **factor analysis**, while **Thurstone** developed multiple-factor methods and attitude scaling. **Guttman** designed cumulative scaling; **Rasch** formalised the 1-parameter logistic model; **Birnbaum** generalized to 2PL/3PL families. Post-war work unified **classical test theory** (CTT) and factor analysis; later decades integrated **SEM**, **IRT**, and **generalizability theory**, as well as **equating**, **standard setting**, and **adaptive testing** for large-scale assessments.[6][7][8][9][10]

Classical test theory (CTT)

CTT models an **observed score** X as X = T + E, where T is the **true score** and E is random error, assumed to have zero mean and to be uncorrelated with T. Key indices include **reliability coefficients** (e.g., **Cronbach’s alpha** for internal consistency, **test–retest**, and **parallel forms**), the **standard error of measurement** (SEM, an abbreviation it shares with structural equation modeling), and **item difficulty/discrimination** computed from proportions and correlations.[11][12]

  • Alpha vs. omega: alpha assumes (essential) tau-equivalence; **McDonald’s omega** (hierarchical and total) gives more accurate reliability estimates under congeneric models.[13][14]
  • Score interpretation: CTT supports **norm-referenced** and **criterion-referenced** uses, with **confidence intervals** around scores built from the standard error of measurement.

CTT remains valuable for small-sample scale development and as a baseline descriptive framework, but item and sample dependence limit its generality.
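
As a rough illustration of the CTT indices above, the following Python sketch computes Cronbach's alpha, the standard error of measurement, and an approximate 95% band around an observed total score. The function names and the five-respondent data set are hypothetical, and the band assumes normally distributed errors with the same error variance at every score level.

```python
import math
from statistics import variance, stdev

def cronbach_alpha(scores):
    """scores: one list of k item scores per respondent."""
    k = len(scores[0])
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def standard_error_of_measurement(total_sd, reliability):
    """SEM = SD_X * sqrt(1 - reliability)."""
    return total_sd * math.sqrt(1 - reliability)

def score_band(observed, sem, z=1.96):
    """Approximate confidence band around an observed score."""
    return observed - z * sem, observed + z * sem

# Hypothetical responses: 5 respondents x 3 items.
DATA = [[2, 3, 3], [4, 4, 5], [1, 2, 2], [3, 3, 4], [5, 4, 5]]

alpha = cronbach_alpha(DATA)
sem = standard_error_of_measurement(stdev([sum(r) for r in DATA]), alpha)
low, high = score_band(10, sem)
print(f"alpha={alpha:.3f}  SEM={sem:.3f}  band=({low:.2f}, {high:.2f})")
```

With so few respondents the estimates are only illustrative; in practice alpha would be reported alongside a check of dimensionality.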

Factor analysis and structural equation modeling

  • **Exploratory factor analysis** (EFA) identifies latent dimensions underlying inter-item correlations; **confirmatory factor analysis** (CFA) tests prespecified factor structures with model fit indices (e.g., CFI, TLI, RMSEA, SRMR). **Bifactor** and **higher-order** models separate general and group factors; **ESEM** blends EFA flexibility with CFA constraints; **SEM** integrates measurement and structural relations among latent variables while modeling measurement error explicitly.[15][16][17]
  • **Measurement invariance** testing examines whether factor loadings, intercepts/thresholds, and residuals are equivalent across groups or over time. Establishing configural → metric → scalar invariance supports **fair comparisons** of latent means and growth.[18]

Item response theory (IRT)

IRT models the probability that a respondent with latent trait level θ selects/endorses an item response as a function of item parameters. For dichotomous items:

  • **Rasch/1PL**: difficulty b only; equal discrimination; specific objectivity and sufficiency of raw scores under model fit.
  • **2PL**: item discrimination a and difficulty b.
  • **3PL**: adds lower asymptote (guessing) c for multiple-choice.[19]
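
The three dichotomous models above can be written as one logistic function. The sketch below is a minimal illustration, assuming the plain logistic metric (no 1.7 scaling constant); the function name and default arguments are hypothetical.

```python
import math

def irt_probability(theta, b, a=1.0, c=0.0):
    """P(correct | theta) under the 3PL model.

    With c = 0 this reduces to the 2PL; with a fixed to a common
    value as well (here 1.0) it has the Rasch/1PL form.
    """
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# A respondent located exactly at the item's difficulty:
print(irt_probability(0.0, b=0.0))                # 1PL form
print(irt_probability(0.0, b=0.0, a=2.0))         # 2PL
print(irt_probability(0.0, b=0.0, a=2.0, c=0.2))  # 3PL with guessing
```

At θ = b the 1PL and 2PL probabilities are 0.5, while the 3PL probability is c + (1 − c)/2, which shows how the lower asymptote lifts the curve.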

For polytomous items:

  • **Graded response model (GRM)** for ordered categories.
  • **Partial credit model (PCM)** and **generalized PCM** for step-wise processes.
  • **Nominal response model (NRM)** for unordered categories.[20][21]
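
For ordered categories, the graded response model can be sketched as differences between adjacent cumulative logistic curves. The code below is an illustrative sketch only; the function name and the parameter values are hypothetical, and the thresholds are assumed to be strictly increasing.

```python
import math

def grm_category_probs(theta, a, thresholds):
    """Category probabilities under the graded response model.

    thresholds: increasing b_1 < ... < b_{K-1} for K ordered categories.
    P(X >= k) is a 2PL-type curve; category probabilities are
    differences of adjacent cumulative curves.
    """
    cumulative = [1.0]
    cumulative += [1 / (1 + math.exp(-a * (theta - b))) for b in thresholds]
    cumulative.append(0.0)
    return [cumulative[k] - cumulative[k + 1] for k in range(len(cumulative) - 1)]

probs = grm_category_probs(theta=0.0, a=1.0, thresholds=[-1.0, 0.0, 1.0])
print([round(p, 3) for p in probs])
```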

Key products include **item characteristic curves** (ICCs), **item/test information functions** (precision by θ), **standard errors of estimation**, and **ability estimation** via MLE, EAP, or MAP. **Multidimensional IRT (MIRT)** extends to multiple traits; **testlet** and **bi-factor IRT** address local dependence; **equating** links scales across forms; and **CAT** selects items adaptively to maximize information, shortening tests while maintaining precision.[22][23]
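
To make the link between information, standard errors, and ability estimation concrete, here is a small sketch for 2PL items: test information is the sum of a²P(1 − P) over items, and an EAP estimate is the posterior mean of θ on a grid. The standard normal prior, the grid from −4 to 4, the function names, and the item parameters are all assumptions chosen for illustration.

```python
import math

def p2pl(theta, a, b):
    return 1 / (1 + math.exp(-a * (theta - b)))

def information_at(theta, items):
    """Sum of 2PL item information a^2 * P * (1 - P) at theta."""
    return sum(a * a * p2pl(theta, a, b) * (1 - p2pl(theta, a, b))
               for a, b in items)

def eap_estimate(responses, items, grid_points=81):
    """Posterior mean and SD of theta under a N(0, 1) prior (grid quadrature)."""
    grid = [-4 + 8 * i / (grid_points - 1) for i in range(grid_points)]
    weights = []
    for t in grid:
        w = math.exp(-0.5 * t * t)  # unnormalized prior density
        for u, (a, b) in zip(responses, items):
            p = p2pl(t, a, b)
            w *= p if u == 1 else 1 - p
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(grid, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(grid, weights)) / total
    return mean, math.sqrt(var)

ITEMS = [(1.0, -1.0), (1.5, 0.0), (1.0, 1.0)]  # hypothetical (a, b) pairs
info = information_at(0.0, ITEMS)
print("information at 0:", round(info, 3), " SE:", round(1 / math.sqrt(info), 3))
print("EAP for pattern 1,1,0:", eap_estimate([1, 1, 0], ITEMS))
```

The quantity 1/√I(θ) approximates the standard error of a maximum-likelihood estimate; the posterior SD returned by the EAP routine is a related but distinct measure of precision.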

Generalizability theory

**G-theory** decomposes score variance into multiple facets (persons, items, raters, occasions) and their interactions, extending CTT beyond a single undifferentiated error term. **G-studies** estimate variance components; **D-studies** project reliability-like indexes (G and Φ) for alternative designs (e.g., more raters or items).[24]
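
A minimal sketch of a G-study and D-study for the simplest case, a fully crossed persons × items design with one score per cell, is shown below. It uses the usual expected-mean-square estimators and truncates negative estimates at zero; the function names and the data are hypothetical, and real designs with raters or occasions need more facets than this.

```python
def variance_components(scores):
    """G-study for a crossed persons x items design (one score per cell)."""
    n_p, n_i = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n_p * n_i)
    p_means = [sum(row) / n_i for row in scores]
    i_means = [sum(row[j] for row in scores) / n_p for j in range(n_i)]
    ms_p = n_i * sum((m - grand) ** 2 for m in p_means) / (n_p - 1)
    ms_i = n_p * sum((m - grand) ** 2 for m in i_means) / (n_i - 1)
    ss_res = sum((scores[p][j] - p_means[p] - i_means[j] + grand) ** 2
                 for p in range(n_p) for j in range(n_i))
    ms_res = ss_res / ((n_p - 1) * (n_i - 1))
    return {
        "p": max((ms_p - ms_res) / n_i, 0.0),  # persons (universe score)
        "i": max((ms_i - ms_res) / n_p, 0.0),  # items
        "pi,e": ms_res,                        # interaction + residual
    }

def d_study(vc, n_items):
    """Projected G (relative) and Phi (absolute) coefficients for n_items."""
    relative_error = vc["pi,e"] / n_items
    absolute_error = (vc["i"] + vc["pi,e"]) / n_items
    g = vc["p"] / (vc["p"] + relative_error)
    phi = vc["p"] / (vc["p"] + absolute_error)
    return g, phi

DATA = [[2, 3, 3], [4, 4, 5], [1, 2, 2], [3, 3, 4], [5, 4, 5]]
vc = variance_components(DATA)
for n in (3, 6, 12):
    g, phi = d_study(vc, n)
    print(n, "items:", round(g, 3), round(phi, 3))
```

For this design the G coefficient at the original number of items coincides with Cronbach's alpha, which is a convenient sanity check.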

Scale construction and item writing

Psychometric scale development typically follows:

  1. **Construct definition** grounded in theory and use-case (e.g., selection, diagnosis, progress monitoring).
  2. **Blueprint/specifications** mapping content to items and desired score precision across the trait range.
  3. **Item writing** using clear stems, plausible distractors, and bias/sensitivity reviews, with a response format (Likert, semantic differential, forced-choice) suited to the construct.
  4. **Pilot testing** for exploratory item analysis (difficulty, discrimination, option performance).
  5. **Dimensionality checks** (EFA/CFA) and **IRT calibration**.
  6. **Refinement** (remove misfitting or biased items; optimize information).
  7. **Validation** for the claim and use (see below).[25]

**Forced-choice** designs with IRT-based ranking models (e.g., Thurstonian IRT) mitigate social desirability and reference-group effects in noncognitive assessment.[26]

Reliability and precision

Reliability quantifies the proportion of observed variance attributable to true score variance.

  • **Internal consistency**: alpha, omega (total/hierarchical), greatest lower bound (GLB).
  • **Stability**: test–retest correlations with interval-appropriate modeling.
  • **Equivalence**: parallel/alternate-form correlations.
  • **Interrater**: intraclass correlations (ICCs) for ratings.
  • **Conditional precision**: in IRT, the **test information function** and conditional SEM vary by θ; reporting precision profiles improves interpretation.[27]
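
As one example of an internal-consistency index beyond alpha, omega total for a single-factor (congeneric) model can be computed from standardized loadings as (Σλ)² / ((Σλ)² + Σ(1 − λ²)). The sketch below assumes the loadings have already been estimated by a factor model with uncorrelated residuals; the function name and loading values are hypothetical.

```python
def omega_total(loadings):
    """Omega total from standardized single-factor loadings.

    Assumes uncorrelated residuals, so each uniqueness is 1 - loading**2.
    """
    common = sum(loadings) ** 2
    unique = sum(1 - l * l for l in loadings)
    return common / (common + unique)

print(round(omega_total([0.8, 0.7, 0.6, 0.5]), 3))
```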

Validity: evidence and argument

Contemporary validity theory is **argument-based**: the test developer or user makes a claim about what scores mean for a particular use, and multiple strands of evidence support or challenge that claim.[28]

  • **Content**: representativeness and relevance of items to the construct and use (blueprints, expert review).
  • **Internal structure**: dimensionality, factor loadings, local dependence.
  • **Relations to other variables**: convergent/discriminant (multitrait–multimethod), criterion-related (predictive/concurrent), **incremental validity** beyond existing measures.
  • **Response processes**: cognitive labs, think-aloud protocols, and DIF analyses that help identify sources of construct-irrelevant variance.
  • **Consequences**: intended/unintended effects (e.g., teaching to the test, access equity).[29]

Fairness, bias, and invariance

  • **Fairness** requires that individuals with the same standing on the construct have **comparable expected scores** regardless of group membership (e.g., gender, ethnicity, disability status).
  • **Differential item functioning (DIF)** detects items with group-specific parameters after conditioning on the trait (Mantel–Haenszel, logistic regression DIF, IRT-based likelihood-ratio tests).
  • **Measurement invariance** in CFA (configural/metric/scalar) supports unbiased latent mean comparisons.
  • **Accessibility and accommodations**: design for all, with evidence that accommodations remove construct-irrelevant barriers without altering the score meaning.[30][31]
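
The Mantel–Haenszel approach to DIF mentioned above can be sketched as a common odds ratio pooled over matched score strata. In the sketch below, each stratum is a hypothetical 2×2 table of (reference correct, reference incorrect, focal correct, focal incorrect); the conversion to the ETS delta metric (−2.35 ln α) is a widely used convention and is not taken from this article. No significance test is included.

```python
import math

def mantel_haenszel_odds_ratio(strata):
    """Common odds ratio across score strata.

    strata: iterable of (ref_correct, ref_incorrect, focal_correct, focal_incorrect).
    Values above 1 mean the item is easier for the reference group
    among examinees matched on total score.
    """
    numerator = denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator

def mh_delta(odds_ratio):
    """ETS delta metric; negative values indicate DIF against the focal group."""
    return -2.35 * math.log(odds_ratio)

STRATA = [(30, 20, 20, 30), (40, 10, 30, 20)]  # hypothetical counts
alpha_mh = mantel_haenszel_odds_ratio(STRATA)
print(round(alpha_mh, 3), round(mh_delta(alpha_mh), 3))
```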

Scaling, linking, and equating

When multiple forms or administrations must yield comparable scores, **equating** places them on a common scale using common-item or common-person designs. Methods include **linear**, **equipercentile**, and **IRT true-score** equating; **linking** is a looser family of transformations used when constructs or populations differ; **concordance** relates scores across non-equivalent tests but does not confer interchangeability.[32]
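
Linear equating, the simplest of the methods listed, matches the means and standard deviations of two forms. The sketch below assumes both forms were taken by the same or randomly equivalent groups, so no anchor-item adjustment is needed; the function name and score lists are hypothetical.

```python
from statistics import mean, pstdev

def linear_equate(x, form_x_scores, form_y_scores):
    """Map a Form X score onto the Form Y scale by matching mean and SD."""
    mx, sx = mean(form_x_scores), pstdev(form_x_scores)
    my, sy = mean(form_y_scores), pstdev(form_y_scores)
    return my + (sy / sx) * (x - mx)

FORM_X = [10, 12, 14, 16, 18]  # hypothetical raw scores
FORM_Y = [20, 25, 30, 35, 40]
print(linear_equate(16, FORM_X, FORM_Y))
```

Equipercentile and IRT true-score equating relax the assumption that the two score distributions differ only in location and scale.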

Standard setting and passing scores

For criterion-referenced interpretations (licensure, proficiency levels), **standard setting** converts test scores into performance levels via expert judgment: **Angoff**, **bookmark**, **Nedelsky**, **Hofstee**, and **body of work** methods, often with impact data and validity evidence to support policy decisions.[33]
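
In its basic form, the Angoff method asks each judge to estimate, item by item, the probability that a minimally competent candidate answers correctly; a judge's implied cut score is the sum of those estimates, and the panel cut score is the mean across judges. The sketch below illustrates only that arithmetic, with hypothetical ratings, and leaves out discussion rounds and impact data.

```python
from statistics import mean

def angoff_cut_score(ratings):
    """ratings[j][i]: judge j's probability estimate for item i."""
    judge_cuts = [sum(judge) for judge in ratings]
    return mean(judge_cuts), judge_cuts

RATINGS = [
    [0.6, 0.7, 0.5],  # judge 1 (hypothetical)
    [0.8, 0.6, 0.6],  # judge 2 (hypothetical)
]
panel_cut, per_judge = angoff_cut_score(RATINGS)
print(round(panel_cut, 2), [round(c, 2) for c in per_judge])
```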

Computerized adaptive testing (CAT) and automated scoring

CAT administers items dynamically based on provisional θ estimates, targeting information at the examinee’s ability and reducing test length and seat time. Operational CATs require secure calibrated banks, exposure control, and continual monitoring. Automated scoring (e.g., NLP-based essay scoring, automated short-answer scoring) must demonstrate reliability, validity, and fairness comparable to human raters, along with the ability to detect attempts to game the scoring engine.[34][35]
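
The item-selection step of a CAT can be sketched as choosing, among items not yet administered, the one with maximum information at the current θ estimate. The example below assumes a 2PL bank and omits exposure control, content balancing, and stopping rules, all of which an operational system would need; the function names and bank values are hypothetical.

```python
import math

def item_information(theta, a, b):
    """2PL item information: a^2 * P * (1 - P)."""
    p = 1 / (1 + math.exp(-a * (theta - b)))
    return a * a * p * (1 - p)

def next_item(theta, bank, administered):
    """Index of the most informative unused item at the provisional theta."""
    candidates = [i for i in range(len(bank)) if i not in administered]
    return max(candidates, key=lambda i: item_information(theta, *bank[i]))

BANK = [(1.0, -2.0), (1.0, 0.0), (1.0, 2.0)]  # hypothetical (a, b) pairs
first = next_item(0.1, BANK, administered=set())
second = next_item(0.1, BANK, administered={first})
print(first, second)
```

After each response, θ would be re-estimated (for example by EAP or MLE) before the next call to `next_item`.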

Patient-reported outcomes (PROs) and health measurement

Health psychometrics emphasizes **patient-reported outcomes** (symptoms, functioning, quality of life), often using IRT-calibrated banks (e.g., PROMIS) to deliver **CATs** with interpretable **T-scores** anchored to reference populations. Emphasis on **responsiveness**, **minimally important difference** (MID), and longitudinal invariance supports clinical decision-making and trials.[36]
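
T-scores are a linear rescaling of the latent metric so that the reference population has mean 50 and standard deviation 10. The sketch below assumes θ has already been standardized to mean 0 and SD 1 in that reference population; the function name is hypothetical.

```python
def to_t_score(theta, theta_se=None):
    """Convert a standardized theta (and optionally its SE) to the T metric."""
    t = 50 + 10 * theta
    if theta_se is None:
        return t
    return t, 10 * theta_se

print(to_t_score(0.0))         # reference-population average
print(to_t_score(-1.5, 0.3))   # 1.5 SD below average, with its SE
```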

Selection, utility, and legal context

In employment testing, **predictive validity**, **adverse impact** analyses, and **utility models** (e.g., Taylor–Russell, Brogden–Cronbach–Gleser) guide instrument selection and cut-scores; documentation must satisfy professional **Standards** and relevant laws/guidelines on fairness and validation. Adverse impact alone does not imply bias; evidence must address **job relatedness** and alternative measures.[37][38]
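
One common first step in an adverse impact analysis is the ratio of selection rates between a focal and a reference group. The sketch below computes that ratio and flags values under 0.8, the "four-fifths" rule of thumb from the US Uniform Guidelines on Employee Selection Procedures; the threshold is an assumption brought in for illustration and is not stated in this article, and, as noted above, a flagged ratio is a screening signal rather than proof of bias.

```python
def impact_ratio(focal_selected, focal_applicants, ref_selected, ref_applicants):
    """Selection-rate ratio (focal / reference)."""
    focal_rate = focal_selected / focal_applicants
    ref_rate = ref_selected / ref_applicants
    return focal_rate / ref_rate

def flags_adverse_impact(ratio, threshold=0.8):
    """True when the ratio falls below the chosen rule-of-thumb threshold."""
    return ratio < threshold

ratio = impact_ratio(30, 100, 50, 100)  # hypothetical counts
print(round(ratio, 2), flags_adverse_impact(ratio))
```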

Modern directions

  • **Bayesian psychometrics**: hierarchical models for small-sample calibration, complex response processes, and incorporating prior information.
  • **Network psychometrics**: symptoms/behaviors modeled as mutually interacting nodes rather than effects of a latent variable; complementary to latent approaches.[39]
  • **Nonparametric IRT and Mokken scaling** for ordinal data when parametric assumptions are questionable.[40]
  • **Response time modeling** and speed–accuracy tradeoffs.
  • **Automated item generation** combining cognitive models, templates, and NLP/LLMs (with rigorous post-hoc calibration).
  • **Fairness auditing** with multiple-criteria optimization (accuracy, equity, privacy).
  • **Learning analytics** and **cognitive diagnostic models** (CDMs) that classify mastery of fine-grained skills.[41]

Reporting and use of scores

Transparent reporting includes:

  • **Intended interpretation and use**; score scales and transformations (e.g., raw → scaled T-score).
  • **Precision**: reliability indices and conditional SEM bands; **confidence intervals** for individuals, **CIs** for group means; **classification accuracy/consistency** for cut-scores.
  • **Norms and comparators**: sample frames, recency, percentile ranks, and subgroup norms.
  • **Validity evidence** linked to the decision context (selection, diagnosis, monitoring).
  • **Accessibility and accommodations** policies with supporting evidence.
  • **Monitoring**: item drift, security, DIF surveillance, and re-calibration triggers.[42]

Common pitfalls and remedies

  • **Alpha misuse** as a sole index of quality → report omega/GLB and conditional precision; verify dimensionality.[43]
  • **Data-driven hunting** for factor structures → preregistered models; cross-validation; theory-driven items.
  • **Ignoring invariance** → routine DIF and multi-group CFA/IRT.
  • **Overfitting in SEM** → parsimony, cross-validated fit, theory constraints.
  • **Using total scores on multidimensional banks** → MIRT scoring or bifactor-appropriate composites.
  • **Unstable cut-scores** → robust standard setting with impact/validity evidence and periodic review.

Representative timeline

| Year | Milestone | Significance |
|---|---|---|
| 1904 | Spearman introduces g and factor analysis | Birth of latent variable modeling |
| 1931 | Thurstone develops multiple-factor analysis | Multidimensional constructs |
| 1951 | Cronbach’s alpha | Internal consistency index |
| 1960 | Rasch model | Specific objectivity; invariant measurement |
| 1968 | Lord & Novick unify test theory | CTT foundations formalized |
| 1970s | Generalizability theory; equating advances | Multi-facet reliability; population linking |
| 1980s–1990s | IRT widespread; CAT pilots | Item-level modeling; adaptive administration |
| 1990s–2000s | SEM mainstream; DIF/invariance standards | Valid comparisons across groups/time |
| 2007–present | PROMIS and health CATs | Clinical measurement at scale |
| 2010s–present | Network models; Bayesian/ML integration | New representations and computation |

Comparison of model families

| Family | Units | Strengths | Limitations | Typical uses |
|---|---|---|---|---|
| CTT | Total scores, items | Simple; small samples | Item/sample dependence | Classroom tests; early pilots |
| Factor/CFA/SEM | Latent variables | Theory testing; error modeled | Fit sensitivity; requires expertise | Scale validation; structural research |
| IRT/Rasch | Item–person on latent scale | Invariance; conditional precision; CAT | Model fit; larger samples | Large-scale tests; PROs; item banks |
| G-theory | Persons × facets | Multifacet reliability planning | Design/estimation complexity | Performance tasks; ratings |
| CDMs | Attribute mastery | Diagnostic feedback | Calibration, Q-matrix quality | Learning analytics; formative use |

Glossary

**Construct**
The theoretical attribute a test is intended to measure.
**Reliability**
Consistency/precision of scores; proportion of true variance.
**Validity**
Degree to which evidence and theory support score interpretations for intended uses.
**DIF**
Differential item functioning—items behave differently across groups at the same trait level.
**Equating**
Statistical process to place scores from different forms on a common scale.
**CAT**
Computerized adaptive testing—algorithm selects items to maximize information.
**SEM**
Standard error of measurement; uncertainty around an observed score.
**Omega**
Reliability coefficient under congeneric measurement models.
**Bifactor model**
General factor plus specific factors, allowing interpretation of total and subscale scores.
**MIRT**
Multidimensional item response theory; multiple latent traits.

References

  1. Statistical Theories of Mental Test Scores, Addison–Wesley, 1968
  2. Psychometric Theory (3rd ed.), McGraw–Hill, 1994
  3. Item Response Theory for Psychologists, Lawrence Erlbaum, 2000
  4. Validity of psychological assessment: Validation of inferences from persons' responses and performances as scientific inquiry into score meaning, American Psychologist, 1995
  5. Standards for Educational and Psychological Testing, AERA, 2014
  6. General intelligence, objectively determined and measured, American Journal of Psychology, 1904
  7. Multiple factor analysis, Psychological Review, 1931
  8. Probabilistic Models for Some Intelligence and Attainment Tests, University of Chicago Press, 1960
  9. Fundamentals of Item Response Theory, Sage, 1991
  10. Test Theory: A Unified Treatment, Lawrence Erlbaum, 1999
  11. Coefficient alpha and the internal structure of tests, Psychometrika, 1951
  12. Statistical Theories of Mental Test Scores, 1968
  13. Test Theory: A Unified Treatment, Lawrence Erlbaum, 1999
  14. Coefficient alpha, reliability, and psychometric theory, Psychometrika, 2009
  15. A general approach to confirmatory maximum likelihood factor analysis, Psychometrika, 1969
  16. Confirmatory Factor Analysis for Applied Research (2nd ed.), Guilford, 2015
  17. Bifactor models and rotations, Journal of Personality Assessment, 2010
  18. Measurement invariance, factor analysis and factorial invariance, Psychometrika, 1993
  19. Fundamentals of Item Response Theory, 1991
  20. Item Response Theory for Psychologists, 2000
  21. Rating Scale Analysis, MESA Press, 1982
  22. Multidimensional Item Response Theory, Springer, 2009
  23. Computerized Adaptive Testing: Theory and Practice, Kluwer, 2000
  24. Generalizability Theory, Sage, 1991
  25. Scale Development: Theory and Applications (4th ed.), Sage, 2017
  26. Item response modeling of forced-choice questionnaires, Educational and Psychological Measurement, 2011
  27. On the use, the misuse, and the very limited usefulness of Cronbach’s alpha, Psychometrika, 2009
  28. Validating the interpretations and uses of test scores, Journal of Educational Measurement, 2013
  29. Standards for Educational and Psychological Testing, 2014
  30. A handbook on the theory and methods of differential item functioning (DIF), National Defense Headquarters, 1999
  31. Statistical Approaches to Measurement Invariance, Routledge, 2011
  32. Test Equating, Scaling, and Linking (3rd ed.), Springer, 2014
  33. Standard Setting: A Guide to Establishing and Evaluating Performance Standards, Sage, 2007
  34. Computerized Adaptive Testing: Theory and Practice, 2000
  35. Automated scoring of complex tasks in computer-based testing, Lawrence Erlbaum, 2006
  36. The Patient-Reported Outcomes Measurement Information System (PROMIS), Medical Care, 2007
  37. The validity and utility of selection methods in personnel psychology, Psychological Bulletin, 1998
  38. Employee Selection (2nd ed.), Routledge, 2021
  39. The graphical LASSO and estimating psychological networks, Behavior Research Methods, 2018
  40. Introduction to Nonparametric Item Response Theory, Sage, 2002
  41. Toward an integration of item-response theory and cognitive error diagnosis, Applied Measurement in Education, 1990
  42. Standards for Educational and Psychological Testing, 2014
  43. On the use... of Cronbach’s alpha, Psychometrika, 2009

Further reading

  • Psychological Testing (7th ed.), Prentice Hall, 1997
  • Essentials of Psychological Testing (5th ed.), HarperCollins, 1990
  • Principles and Practice of Structural Equation Modeling (4th ed.), Guilford, 2016
  • The Theory and Practice of Item Response Theory, Guilford, 2009
  • Applying the Rasch Model (3rd ed.), Routledge, 2015
  • Test Equating, Scaling, and Linking (3rd ed.), Springer, 2014
  • PARSS: An IRT Approach to Polytomous Scoring, Scientific Software, 2004
  • Three generations of DIF analyses, Statistical Methods for Health Care Research, 2007

Cite this page Psychometrics. Roovet Articles. Retrieved from https://articles.roovet.com/Psychometrics