
Psychometrics

Psychometrics
Also called	Psychological measurement; educational measurement
Part of	Statistics • Psychology • Education • Health outcomes • Industrial and organizational psychology
Core aims	Design, analyse, and validate tests and scales; ensure reliability, validity, invariance, fairness; support decisions
Major models	Classical test theory • Factor analysis • Item response theory (1PL/Rasch, 2PL, 3PL, graded/partial credit, MIRT) • Generalizability theory • SEM
Common methods	Pilot testing • Item analysis • Differential item functioning • Equating • Standard setting • CAT • Bayesian calibration
Applications	Admissions & licensure • Clinical assessment • PROs/HRQoL • Employee selection • Survey research • Program evaluation
Key journals/societies	Psychometrika • Applied Psychological Measurement • Educational Measurement: Issues and Practice • Psychometric Society • NCME

Psychometrics is the scientific field devoted to the measurement of psychological attributes (such as abilities, knowledge, skills, traits, attitudes, interests, symptoms, and quality of life) using formal models, carefully designed items, and statistical theory. In education, clinical practice, health outcomes, and personnel selection, psychometrics provides the tools to design tests and questionnaires, evaluate their reliability and validity, calibrate items on common scales, equate scores across forms, monitor fairness and bias, and report interpretable results with quantified uncertainty. Because it is applied across disciplines, the field integrates classical test theory, factor analysis, item response theory (IRT), generalizability theory, structural equation modeling (SEM), computerized adaptive testing (CAT), and modern Bayesian and computational methods.^[1]^[2]^[3]

Psychometricians seek **construct-relevant** measurement: the evidence-based inference from observed responses to underlying attributes. Core concerns include: (a) **reliability**—the consistency or precision of scores; (b) **validity**—the appropriateness, meaningfulness, and consequences of score interpretations for specific uses; (c) **fairness** and **invariance**—whether scores carry the same meaning across groups or occasions; and (d) **utility**—how measurement supports decisions and improves outcomes.^[4]^[5]

Historical development

Psychometrics emerged from 19th–20th century efforts to quantify individual differences and mental abilities. **Galton** and **Pearson** introduced correlation, regression, and early scaling. **Spearman** proposed the general factor g and **factor analysis**, while **Thurstone** developed multiple-factor methods and attitude scaling. **Guttman** designed cumulative scaling; **Rasch** formalised the 1-parameter logistic model; **Birnbaum** generalized to 2PL/3PL families. Post-war work unified **classical test theory** (CTT) and factor analysis; later decades integrated **SEM**, **IRT**, and **generalizability theory**, as well as **equating**, **standard setting**, and **adaptive testing** for large-scale assessments.^[6]^[7]^[8]^[9]^[10]

Classical test theory (CTT)

CTT models an **observed score** X as X = T + E, where T is the **true score** and E is random error, assumed to have zero mean and to be uncorrelated with T. Key indices include **reliability coefficients** (e.g., **Cronbach’s alpha** for internal consistency, **test–retest**, and **parallel forms**), the **standard error of measurement** (SEM, an abbreviation it shares with structural equation modeling), and **item difficulty/discrimination** computed from proportions and correlations.^[11]^[12]

**Alpha vs. omega**: alpha equals reliability only under (essential) tau-equivalence with uncorrelated errors and is otherwise a lower bound; **McDonald’s omega** (hierarchical and total) gives a more accurate reliability estimate under congeneric models.^[13]^[14]
**Score interpretation**: CTT supports **norm-referenced** and **criterion-referenced** uses, with **confidence intervals** around scores built from the standard error of measurement.
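
The CTT indices above can be illustrated with a minimal sketch. The function names and the small response matrix are hypothetical, the omega function assumes a one-factor congeneric model whose loadings and uniquenesses have already been estimated elsewhere, and the score band assumes normally distributed errors.

```python
from math import sqrt
from statistics import variance


def cronbach_alpha(rows):
    """Coefficient alpha from a persons x items score matrix (list of lists)."""
    k = len(rows[0])
    item_vars = [variance([r[j] for r in rows]) for j in range(k)]
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def omega_total(loadings, uniquenesses):
    """Omega for a one-factor congeneric model (factor variance fixed to 1)."""
    common = sum(loadings) ** 2
    return common / (common + sum(uniquenesses))


def sem(total_sd, reliability):
    """CTT standard error of measurement: SD * sqrt(1 - reliability)."""
    return total_sd * sqrt(1 - reliability)


def score_band(observed, total_sd, reliability, z=1.96):
    """Approximate 95% band around an observed score (assumes normal errors)."""
    half = z * sem(total_sd, reliability)
    return observed - half, observed + half


# Hypothetical data: 5 respondents x 3 items
rows = [[2, 3, 3], [4, 4, 5], [1, 2, 2], [5, 4, 4], [3, 3, 4]]
print("alpha:", round(cronbach_alpha(rows), 3))
print("band:", score_band(observed=50, total_sd=10, reliability=0.91))
```

Reporting the band alongside the score, rather than the reliability coefficient alone, makes the uncertainty visible to the score user.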

CTT remains valuable for small-sample scale development and as a baseline descriptive framework, but item and sample dependence limit its generality.

Factor analysis and structural equation modeling

**Exploratory factor analysis** (EFA) identifies latent dimensions underlying inter-item correlations; **confirmatory factor analysis** (CFA) tests prespecified factor structures with model fit indices (e.g., CFI, TLI, RMSEA, SRMR). **Bifactor** and **higher-order** models separate general and group factors; **ESEM** blends EFA flexibility with CFA constraints; **SEM** integrates measurement and structural relations among latent variables while modeling measurement error explicitly.^[15]^[16]^[17]

**Measurement invariance** testing examines whether factor loadings, intercepts/thresholds, and residuals are equivalent across groups or over time. Establishing configural → metric → scalar invariance supports **fair comparisons** of latent means and growth.^[18]

Item response theory (IRT)

IRT models the probability that a respondent with latent trait level θ selects/endorses an item response as a function of item parameters. For dichotomous items:

**Rasch/1PL**: difficulty b only, with equal discrimination across items; when the model fits, raw scores are sufficient statistics for θ and item comparisons are specifically objective.
**2PL**: item discrimination a and difficulty b.
**3PL**: adds lower asymptote (guessing) c for multiple-choice.^[19]
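
The three dichotomous models can be written as one response function, sketched below under stated assumptions: the function name is hypothetical, the logistic metric is used without the 1.7 scaling constant, and the item parameters in the loop are illustrative only.

```python
from math import exp


def irf(theta, a=1.0, b=0.0, c=0.0):
    """P(correct | theta) under the 3PL.

    Setting c=0 gives the 2PL; additionally fixing a to a common value
    across items gives the 1PL/Rasch form.
    """
    return c + (1.0 - c) / (1.0 + exp(-a * (theta - b)))


# Hypothetical multiple-choice item: a=1.5, b=0.5, c=0.2
for theta in (-2, -1, 0, 1, 2):
    print(theta, round(irf(theta, a=1.5, b=0.5, c=0.2), 3))
```

At θ = b the probability is halfway between the lower asymptote c and 1, which is why b is read as the item's location on the trait scale.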

For polytomous items:

**Graded response model (GRM)** for ordered categories.
**Partial credit model (PCM)** and **generalized PCM** for step-wise processes.
**Nominal response model (NRM)** for unordered categories.^[20]^[21]

Key products include **item characteristic curves** (ICCs), **item/test information functions** (precision by θ), **standard errors of estimation**, and **ability estimation** via MLE, EAP, or MAP. **Multidimensional IRT (MIRT)** extends to multiple traits; **testlet** and **bi-factor IRT** address local dependence; **equating** links scales across forms; and **CAT** selects items adaptively to maximize information, shortening tests while maintaining precision.^[22]^[23]
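
As a rough illustration of information functions and conditional standard errors, the sketch below uses the 2PL, for which item information is a²·P·(1 − P). The function names and the three-item bank are hypothetical, and the standard error is the usual asymptotic approximation 1/√I(θ).

```python
from math import exp, sqrt


def p_2pl(theta, a, b):
    """2PL probability of a correct/endorsed response."""
    return 1.0 / (1.0 + exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Fisher information of one 2PL item at theta."""
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)


def total_information(theta, items):
    """Test information: sum of item informations (local independence assumed)."""
    return sum(item_information(theta, a, b) for a, b in items)


def conditional_se(theta, items):
    """Asymptotic standard error of the theta estimate at theta."""
    return 1.0 / sqrt(total_information(theta, items))


# Hypothetical bank of (a, b) pairs clustered around theta = 0
bank = [(1.2, -0.5), (1.0, 0.0), (1.5, 0.4)]
for theta in (-3, 0, 3):
    print(theta, round(conditional_se(theta, bank), 2))
```

Because the items sit near θ = 0, precision is best in the middle of the scale and degrades toward the extremes, which is the pattern a precision profile is meant to expose.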

Generalizability theory

**G-theory** decomposes score variance into multiple facets (persons, items, raters, occasions) and their interactions, extending CTT beyond a single undifferentiated error term. **G-studies** estimate variance components; **D-studies** project reliability-like indexes (G and Φ) for alternative designs (e.g., more raters or items).^[24]
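
A minimal sketch of a G-study and D-study for the simplest case, a fully crossed persons × items design with one score per cell, is given below. The function names and data are hypothetical; variance components are estimated from ANOVA mean squares, and negative estimates are truncated at zero, which is one common convention rather than the only option.

```python
def g_study(x):
    """Variance components for a crossed persons x items design (list of lists)."""
    n_p, n_i = len(x), len(x[0])
    grand = sum(map(sum, x)) / (n_p * n_i)
    p_means = [sum(row) / n_i for row in x]
    i_means = [sum(row[j] for row in x) / n_p for j in range(n_i)]
    ms_p = n_i * sum((m - grand) ** 2 for m in p_means) / (n_p - 1)
    ms_i = n_p * sum((m - grand) ** 2 for m in i_means) / (n_i - 1)
    ss_res = sum((x[p][j] - p_means[p] - i_means[j] + grand) ** 2
                 for p in range(n_p) for j in range(n_i))
    ms_res = ss_res / ((n_p - 1) * (n_i - 1))
    return {
        "p": max((ms_p - ms_res) / n_i, 0.0),   # persons (universe score)
        "i": max((ms_i - ms_res) / n_p, 0.0),   # items
        "pi,e": ms_res,                         # interaction confounded with error
    }


def d_study(vc, n_items):
    """Return (G, Phi) for a design with n_items items."""
    rel_err = vc["pi,e"] / n_items
    abs_err = (vc["i"] + vc["pi,e"]) / n_items
    return vc["p"] / (vc["p"] + rel_err), vc["p"] / (vc["p"] + abs_err)


# Hypothetical scores: 4 persons x 3 items
scores = [[3, 4, 5], [2, 2, 4], [5, 5, 5], [1, 3, 2]]
vc = g_study(scores)
for n in (3, 6, 12):
    print(n, [round(v, 3) for v in d_study(vc, n)])
```

Running the D-study over several item counts shows how the projected coefficients change as the design is lengthened, which is the planning use described above.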

Scale construction and item writing

Psychometric scale development typically follows:

**Construct definition** grounded in theory and use-case (e.g., selection, diagnosis, progress monitoring).
**Blueprint/specifications** mapping content to items and desired score precision across the trait range.
**Item writing** using clear stems, plausible distractors, and bias/sensitivity reviews; response formats (Likert, semantic differential, forced-choice).
**Pilot testing** for exploratory item analysis (difficulty, discrimination, option performance).
**Dimensionality checks** (EFA/CFA) and **IRT calibration**.
**Refinement** (remove misfitting or biased items; optimize information).
**Validation** for the claim and use (see below).^[25]

**Forced-choice** designs with IRT-based ranking models (e.g., Thurstonian IRT) mitigate social desirability and reference-group effects in noncognitive assessment.^[26]

Reliability and precision

Reliability quantifies the proportion of observed variance attributable to true score variance.

**Internal consistency**: alpha, omega (total/hierarchical), greatest lower bound (GLB).
**Stability**: test–retest correlations with interval-appropriate modeling.
**Equivalence**: parallel/alternate-form correlations.
**Interrater**: intraclass correlation coefficients for ratings (also abbreviated ICC, distinct from the item characteristic curves of IRT).
**Conditional precision**: in IRT, the **test information function** and conditional SEM vary by θ; reporting precision profiles improves interpretation.^[27]

Validity: evidence and argument

Contemporary validity is an **evidential–argument** framework: the test developer/user makes a claim about what scores mean for a use; multiple strands of evidence support or challenge that claim.^[28]

**Content**: representativeness and relevance of items to the construct and use (blueprints, expert review).
**Internal structure**: dimensionality, factor loadings, local dependence.
**Relations to other variables**: convergent/discriminant (multitrait–multimethod), criterion-related (predictive/concurrent), **incremental validity** beyond existing measures.
**Response processes**: cognitive labs, think-aloud protocols, and DIF analyses that help identify construct-irrelevant variance.
**Consequences**: intended/unintended effects (e.g., teaching to the test, access equity).^[29]

Fairness, bias, and invariance

**Fairness** requires that individuals with the same standing on the construct have **comparable expected scores** regardless of group membership (e.g., gender, ethnicity, disability status).

**Differential item functioning (DIF)** detects items with group-specific parameters after conditioning on the trait (Mantel–Haenszel, logistic regression DIF, IRT-based likelihood-ratio tests).
**Measurement invariance** in CFA (configural/metric/scalar) supports unbiased latent mean comparisons.
**Accessibility and accommodations**: design for all, with evidence that accommodations remove construct-irrelevant barriers without altering the score meaning.^[30]^[31]
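
Of the DIF procedures listed above, Mantel–Haenszel is the simplest to sketch. The code below assumes examinees have already been stratified on a matching variable (typically total score) and that each stratum is summarised as a 2×2 table; the function name and counts are hypothetical. The −2.35·ln(α) rescaling is the ETS delta metric, on which negative values indicate an item that is relatively harder for the focal group.

```python
from math import log


def mantel_haenszel(strata):
    """Common odds ratio across score strata.

    strata: list of (A, B, C, D) counts per stratum, where
      A = reference correct, B = reference incorrect,
      C = focal correct,     D = focal incorrect.
    Returns (alpha_MH, MH D-DIF on the ETS delta metric).
    """
    num = den = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # skip empty strata
        num += a * d / n
        den += b * c / n
    alpha = num / den
    return alpha, -2.35 * log(alpha)


# Hypothetical counts for one item in three total-score strata
strata = [(20, 30, 15, 35), (35, 15, 28, 22), (45, 5, 40, 10)]
alpha, delta = mantel_haenszel(strata)
print(round(alpha, 2), round(delta, 2))
```

An odds ratio near 1 (delta near 0) suggests the item functions similarly in both groups after matching; flagged items still need substantive review before being labelled biased.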

Scaling, linking, and equating

When multiple forms or administrations must yield comparable scores, **equating** places them on a common scale using common-item or common-person designs. Methods include **linear**, **equipercentile**, and **IRT true-score** equating; **linking** is a looser family of transformations used when constructs or populations differ; **concordance** relates scores across non-equivalent tests but does not confer interchangeability.^[32]
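
Linear equating, the simplest of the methods named above, matches means and standard deviations across forms. The sketch below assumes a random-groups design in which equivalent samples took Form X and Form Y; the function name and score lists are hypothetical.

```python
from statistics import mean, pstdev


def linear_equate(x, form_x_scores, form_y_scores):
    """Map a Form X score onto the Form Y scale by matching mean and SD."""
    mx, my = mean(form_x_scores), mean(form_y_scores)
    sx, sy = pstdev(form_x_scores), pstdev(form_y_scores)
    return my + (sy / sx) * (x - mx)


# Hypothetical score samples from two randomly equivalent groups
form_x = [12, 15, 18, 20, 25]
form_y = [14, 18, 21, 24, 28]
print(round(linear_equate(18, form_x, form_y), 2))
```

Equipercentile and IRT true-score equating relax the assumption that the two score distributions differ only in location and spread.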

Standard setting and passing scores

For criterion-referenced interpretations (licensure, proficiency levels), **standard setting** converts test scores into performance levels via expert judgment: **Angoff**, **bookmark**, **Nedelsky**, **Hofstee**, and **body of work** methods, often with impact data and validity evidence to support policy decisions.^[33]
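
As an illustration of how judgments become a cut score, the sketch below follows the common modified-Angoff computation: each judge estimates, per item, the probability that a minimally competent examinee answers correctly; a judge's ratings are summed, and the panel cut score is the mean of those sums. The function name and ratings are hypothetical, and operational panels usually add discussion rounds and impact data.

```python
from statistics import mean


def angoff_cut_score(ratings):
    """ratings[j][i]: judge j's probability estimate for item i."""
    return mean(sum(judge) for judge in ratings)


# Hypothetical panel: 3 judges x 4 items
ratings = [
    [0.6, 0.7, 0.5, 0.8],
    [0.5, 0.8, 0.4, 0.9],
    [0.7, 0.6, 0.6, 0.7],
]
print(round(angoff_cut_score(ratings), 2))
```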

Computerized adaptive testing (CAT) and automated scoring

CAT administers items dynamically based on provisional θ estimates, targeting information at the examinee’s ability and reducing test length/seat time. Operational CATs require secure calibrated banks, exposure control, and continual monitoring. Automated scoring (e.g., NLP-based essay scoring, automated short-answer scoring) must demonstrate reliability, validity, detectability of gaming, and fairness comparable to human raters.^[34]^[35]
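
The item-selection step of a CAT can be sketched as follows, assuming a 2PL-calibrated bank and the maximum-information rule. The function names and bank are hypothetical, and exposure control, content balancing, and the θ update itself are deliberately omitted.

```python
from math import exp


def info_2pl(theta, a, b):
    """Fisher information of a 2PL item at theta."""
    p = 1.0 / (1.0 + exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


def next_item(theta, bank, administered):
    """Index of the unused item with maximum information at the provisional theta."""
    candidates = [i for i in range(len(bank)) if i not in administered]
    return max(candidates, key=lambda i: info_2pl(theta, *bank[i]))


# Hypothetical bank of (a, b) pairs
bank = [(1.0, -2.0), (1.4, -0.5), (1.2, 0.3), (0.9, 1.5)]
administered = set()
first = next_item(0.0, bank, administered)
administered.add(first)
second = next_item(0.4, bank, administered)  # provisional theta after item 1
print(first, second)
```

Selecting purely on information tends to overuse the most discriminating items, which is one reason operational programs pair this rule with exposure control.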

Patient-reported outcomes (PROs) and health measurement

Health psychometrics emphasizes **patient-reported outcomes** (symptoms, functioning, quality of life), often using IRT-calibrated banks (e.g., PROMIS) to deliver **CATs** with interpretable **T-scores** anchored to reference populations. Emphasis on **responsiveness**, **minimally important difference** (MID), and longitudinal invariance supports clinical decision-making and trials.^[36]
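
The T-score metric is a linear rescaling of θ so that the reference population has mean 50 and standard deviation 10. The sketch below assumes θ is expressed relative to a reference population with known mean and SD; the function names and default values are illustrative, not taken from any particular instrument's scoring manual.

```python
def t_score(theta, ref_mean=0.0, ref_sd=1.0):
    """Convert a theta estimate to the T metric (mean 50, SD 10 in the reference group)."""
    return 50.0 + 10.0 * (theta - ref_mean) / ref_sd


def t_score_se(theta_se, ref_sd=1.0):
    """Standard errors rescale by the same factor of 10 / ref_sd."""
    return 10.0 * theta_se / ref_sd


# Hypothetical CAT result: theta = 0.8 with standard error 0.3
print(t_score(0.8), t_score_se(0.3))
```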

Selection, utility, and legal context

In employment testing, **predictive validity**, **adverse impact** analyses, and **utility models** (e.g., Taylor–Russell, Brogden–Cronbach–Gleser) guide instrument selection and cut-scores; documentation must satisfy professional **Standards** and relevant laws/guidelines on fairness and validation. Adverse impact alone does not imply bias; evidence must address **job relatedness** and alternative measures.^[37]^[38]
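
A basic adverse impact analysis compares selection rates between groups. The sketch below computes the impact ratio and applies the four-fifths (0.80) rule of thumb from the US Uniform Guidelines on Employee Selection Procedures; that threshold is an assumption introduced here for illustration, and the function name and counts are hypothetical.

```python
def impact_ratio(focal_selected, focal_applicants, ref_selected, ref_applicants):
    """Ratio of the focal group's selection rate to the reference group's."""
    focal_rate = focal_selected / focal_applicants
    ref_rate = ref_selected / ref_applicants
    return focal_rate / ref_rate


# Hypothetical applicant flow
ratio = impact_ratio(focal_selected=30, focal_applicants=100,
                     ref_selected=50, ref_applicants=100)
flagged = ratio < 0.8  # four-fifths rule of thumb (assumed threshold)
print(round(ratio, 2), flagged)
```

A flagged ratio is a prompt for further validation evidence, not a finding of bias in itself, consistent with the point above about job relatedness.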

Modern directions

**Bayesian psychometrics**: hierarchical models for small-sample calibration, complex response processes, and incorporating prior information.
**Network psychometrics**: symptoms/behaviors modeled as mutually interacting nodes rather than effects of a latent variable; complementary to latent approaches.^[39]
**Nonparametric IRT and Mokken scaling** for ordinal data when parametric assumptions are questionable.^[40]
**Response time modeling** and speed–accuracy tradeoffs.
**Automated item generation** combining cognitive models, templates, and NLP/LLMs (with rigorous post-hoc calibration).
**Fairness auditing** with multiple-criteria optimization (accuracy, equity, privacy).
**Learning analytics** and **cognitive diagnostic models** (CDMs) that classify mastery of fine-grained skills.^[41]

Reporting and use of scores

Transparent reporting includes:

**Intended interpretation and use**; score scales and transformations (e.g., raw → scaled T-score).
**Precision**: reliability indices and conditional SEM bands; **confidence intervals** for individuals, **CIs** for group means; **classification accuracy/consistency** for cut-scores.
**Norms and comparators**: sample frames, recency, percentile ranks, and subgroup norms.
**Validity evidence** linked to the decision context (selection, diagnosis, monitoring).
**Accessibility and accommodations** policies with supporting evidence.
**Monitoring**: item drift, security, DIF surveillance, and re-calibration triggers.^[42]

Common pitfalls and remedies

**Alpha misuse** as a sole index of quality → report omega/GLB and conditional precision; verify dimensionality.^[43]
**Data-driven hunting** for factor structures → preregistered models; cross-validation; theory-driven items.
**Ignoring invariance** → routine DIF and multi-group CFA/IRT.
**Overfitting in SEM** → parsimony, cross-validated fit, theory constraints.
**Using total scores on multidimensional banks** → MIRT scoring or bifactor-appropriate composites.
**Unstable cut-scores** → robust standard setting with impact/validity evidence and periodic review.

Representative timeline

Year	Milestone	Significance
1904	Spearman introduces g and factor analysis	Birth of latent variable modeling
1931	Thurstone develops multiple-factor analysis	Multidimensional constructs
1951	Cronbach’s alpha	Internal consistency index
1960	Rasch model	Specific objectivity; invariant measurement
1968	Lord & Novick unify test theory	CTT foundations formalized
1970s	Generalizability theory; equating advances	Multi-facet reliability; population linking
1980s–1990s	IRT widespread; CAT pilots	Item-level modeling; adaptive administration
1990s–2000s	SEM mainstream; DIF/invariance standards	Valid comparisons across groups/time
2007–present	PROMIS and health CATs	Clinical measurement at scale
2010s–present	Network models; Bayesian/ML integration	New representations and computation

Comparison of model families

Family	Units	Strengths	Limitations	Typical uses
CTT	Total scores, items	Simple; small samples	Item/sample dependence	Classroom tests; early pilots
Factor/CFA/SEM	Latent variables	Theory testing; error modeled	Fit sensitivity; requires expertise	Scale validation; structural research
IRT/Rasch	Item–person on latent scale	Invariance; conditional precision; CAT	Model fit; larger samples	Large-scale tests; PROs; item banks
G-theory	Persons × facets	Multifacet reliability planning	Design/estimation complexity	Performance tasks; ratings
CDMs	Attribute mastery	Diagnostic feedback	Calibration, Q-matrix quality	Learning analytics; formative use

Glossary

**Construct**: The theoretical attribute a test is intended to measure.
**Reliability**: Consistency/precision of scores; proportion of true variance.
**Validity**: Degree to which evidence and theory support score interpretations for intended uses.
**DIF**: Differential item functioning—items behave differently across groups at the same trait level.
**Equating**: Statistical process to place scores from different forms on a common scale.
**CAT**: Computerized adaptive testing—algorithm selects items to maximize information.
**SEM**: Standard error of measurement; uncertainty around an observed score. The same abbreviation is used for structural equation modeling.
**Omega**: Reliability coefficient under congeneric measurement models.
**Bifactor model**: General factor plus specific factors, allowing interpretation of total and subscale scores.
**MIRT**: Multidimensional item response theory; multiple latent traits.

References

[1] Statistical Theories of Mental Test Scores, Addison–Wesley, 1968
[2] Psychometric Theory (3rd ed.), McGraw–Hill, 1994
[3] Item Response Theory for Psychologists, Lawrence Erlbaum, 2000
[4] Validity of psychological assessment: Validation of inferences from persons' responses and performances as scientific inquiry into score meaning, American Psychologist, 1995
[5] Standards for Educational and Psychological Testing, AERA, 2014
[6] General intelligence, objectively determined and measured, American Journal of Psychology, 1904
[7] Multiple factor analysis, Psychological Review, 1931
[8] Probabilistic Models for Some Intelligence and Attainment Tests, University of Chicago Press, 1960
[9] Fundamentals of Item Response Theory, Sage, 1991
[10] Test Theory: A Unified Treatment, Lawrence Erlbaum, 1999
[11] Coefficient alpha and the internal structure of tests, Psychometrika, 1951
[12] Statistical Theories of Mental Test Scores, 1968
[13] Test Theory: A Unified Treatment, Lawrence Erlbaum, 1999
[14] Coefficient alpha, reliability, and psychometric theory, Psychometrika, 2009
[15] A general approach to confirmatory maximum likelihood factor analysis, Psychometrika, 1969
[16] Confirmatory Factor Analysis for Applied Research (2nd ed.), Guilford, 2015
[17] Bifactor models and rotations, Journal of Personality Assessment, 2010
[18] Measurement invariance, factor analysis and factorial invariance, Psychometrika, 1993
[19] Fundamentals of Item Response Theory, 1991
[20] Item Response Theory for Psychologists, 2000
[21] Rating Scale Analysis, MESA Press, 1982
[22] Multidimensional Item Response Theory, Springer, 2009
[23] Computerized Adaptive Testing: Theory and Practice, Kluwer, 2000
[24] Generalizability Theory, Sage, 1991
[25] Scale Development: Theory and Applications (4th ed.), Sage, 2017
[26] Item response modeling of forced-choice questionnaires, Educational and Psychological Measurement, 2011
[27] On the use, the misuse, and the very limited usefulness of Cronbach’s alpha, Psychometrika, 2009
[28] Validating the interpretations and uses of test scores, Journal of Educational Measurement, 2013
[29] Standards for Educational and Psychological Testing, 2014
[30] A handbook on the theory and methods of differential item functioning (DIF), National Defense Headquarters, 1999
[31] Statistical Approaches to Measurement Invariance, Routledge, 2011
[32] Test Equating, Scaling, and Linking (3rd ed.), Springer, 2014
[33] Standard Setting: A Guide to Establishing and Evaluating Performance Standards, Sage, 2007
[34] Computerized Adaptive Testing: Theory and Practice, 2000
[35] Automated scoring of complex tasks in computer-based testing, Lawrence Erlbaum, 2006
[36] The Patient-Reported Outcomes Measurement Information System (PROMIS), Medical Care, 2007
[37] The validity and utility of selection methods in personnel psychology, Psychological Bulletin, 1998
[38] Employee Selection (2nd ed.), Routledge, 2021
[39] The graphical LASSO and estimating psychological networks, Behavior Research Methods, 2018
[40] Introduction to Nonparametric Item Response Theory, Sage, 2002
[41] Toward an integration of item-response theory and cognitive error diagnosis, Applied Measurement in Education, 1990
[42] Standards for Educational and Psychological Testing, 2014
[43] On the use... of Cronbach’s alpha, Psychometrika, 2009


Cite this page Psychometrics. Roovet Articles. Retrieved from https://articles.roovet.com/Psychometrics

