Understanding Assessment
The Learning Journey
CTB supports educators and students throughout the learning process.
Psychometrics is the field of study concerned with the theory and methodology of educational and psychological measurement. At CTB, we focus mainly on measurement of the knowledge, skills, and abilities that comprise educational curricula. CTB's Research Department conducts psychometric research to support improved measurement. This work addresses a wide range of topics, including cheating detection, scaling and equating, computer-adaptive testing, validity studies, and other relevant issues.
Various reliability coefficients of internal consistency have been proposed that make different assumptions regarding the degree of test-part parallelism, with implications for mixed-format tests. This study compares the IRT model-derived coefficients and observed values for mixed-format tests.
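As a point of reference, coefficient alpha is one widely used internal-consistency estimate; it assumes the test parts are essentially tau-equivalent. The abstract above does not say which coefficients the study compares, so the Python sketch below is only illustrative. The function name and the small mixed-format data set (two dichotomous items and one item scored 0 to 4) are hypothetical.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Coefficient alpha for a list of examinee rows of item scores.

    scores[i][j] is the score of examinee i on item j. Items may be
    dichotomous or polytomous (as on a mixed-format test).
    """
    k = len(scores[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Variance of each item across examinees.
    item_vars = [pvariance([row[j] for row in scores]) for j in range(k)]
    # Variance of the total (summed) score across examinees.
    total_var = pvariance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical data: items 1-2 are scored 0/1, item 3 is scored 0-4.
example_scores = [
    [1, 0, 2],
    [1, 1, 3],
    [0, 0, 1],
    [1, 1, 4],
    [0, 1, 2],
]
print(cronbach_alpha(example_scores))
```

Because alpha weights every part equally, it can misstate reliability when parts differ in length or format, which is one reason alternative coefficients are of interest for mixed-format tests.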
Several screening methods have been proposed for identifying test items that should be retained in or removed from the anchor set using the Rasch model. This simulation study explores the sample size that is needed to detect anchor item outliers using two commonly used screening methods.
Vertical test scales, when properly constructed and maintained, represent units on a single, equal-interval scale applied across all grade levels. This study evaluates the performance of different vertical scale maintenance methods and their impact on scale properties.
The DIMTEST procedure is often used to test for multidimensionality. The procedure uses two disjoint and dimensionally distinct clusters of items (AT and PT), and forms a test statistic based on the conditional covariance of each item pair. This work investigates alternative methods of subtest selection for DIMTEST.
The best method for creating IRT vertical scales is an open, researchable question. This study uses test data to investigate various combinations of calibration methods and proficiency estimators, and the effect of these methods on the resulting vertical scales.
Quality initial estimates are crucial for the maximum likelihood estimation of item parameters. This study extends the well-known classical test theory-based initial estimation of the unidimensional two-parameter normal ogive item response theory (IRT) model to the three-parameter case.
The student growth percentile (SGP) model may be used as a normative approach to evaluate growth in student achievement. The effect of sample size on the calculation of SGP, and the implications of various sample sizes, are investigated.
Vertical scaling is dependent on several factors during the scale development process. This simulation study investigates the effect of different linking methods, proficiency estimators, estimation software, and base grades on the constructed vertical scale in various test conditions.
This study uses residual analysis to explore the comparability of test scores for students who take a test with an accommodation and those who do not. The analysis shows that the measurement model functions in a similar manner for both accommodated and non-accommodated students.
This study investigates the robustness of the Yao & Boughton (2005) MIRT model for measuring subscores, using a simulation study with varying sample size, correlation between subscales, number of items, and structure type.
This study evaluates the stability of item parameter estimates between field test and operational test administrations. It also examines differences in student ability estimates and performance level classifications under pre- and post-equating designs, with both number-correct and item-pattern scoring, using large-scale ELA and Mathematics assessment data.
As a first step toward developing an algorithm for Multidimensional Item Response Theory (MIRT) model analysis with guessing, this study investigated four unidimensional methods and a two-dimensional MCMC method for estimating the guessing parameter in the MIRT model.
Rasch scoring complications arise when calibrating a single-prompt writing assessment scored on six polytomous traits in which sparse data are present. This paper compares empirical calibration results from Andrich Rating Scale and Masters Partial Credit models on field test data from one state in three grades.
Paper presented at the 2005 annual meeting of the National Council on Measurement in Education, Montreal, Quebec, Canada.
Differential item functioning (DIF) analyses are performed to address bias in tests. Two commonly used procedures for detecting DIF are the simultaneous item bias test (SIBTEST) and Mantel-Haenszel (MH). A simulation study was conducted to examine the power rates of these two procedures under various conditions, including the proportion of items favoring the reference and focal groups.
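For readers unfamiliar with the Mantel-Haenszel procedure named above, the Python sketch below computes its common odds ratio for one item, with examinees stratified by matching score. The -2.35 rescaling is the widely used ETS delta metric (MH D-DIF). The function name and the counts are hypothetical, and this is not the simulation code used in the study.

```python
import math


def mantel_haenszel_dif(strata):
    """Return (common odds ratio, MH D-DIF) for a single item.

    strata is a list of 2x2 tables, one per matching-score level, each
    given as (ref_correct, ref_wrong, focal_correct, focal_wrong).
    """
    num = 0.0
    den = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # skip empty score levels
        num += a * d / n
        den += b * c / n
    if num == 0 or den == 0:
        raise ValueError("odds ratio is undefined for these counts")
    alpha = num / den
    # Negative values indicate the item favors the reference group.
    return alpha, -2.35 * math.log(alpha)


# Hypothetical counts at three score levels.
example_strata = [
    (30, 20, 20, 30),
    (40, 10, 30, 20),
    (45, 5, 40, 10),
]
alpha_mh, mh_d_dif = mantel_haenszel_dif(example_strata)
print(alpha_mh, mh_d_dif)
```

An odds ratio of 1 (MH D-DIF of 0) means that reference and focal examinees at the same score level have equal odds of answering the item correctly.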
One of the major issues in item equating is the pre-screening of items for item drift. This paper examines the efficacy of the Robust Z procedure compared to Lord’s chi-squared and the Mantel-Haenszel statistic through a simulation study, examining both the Type I error case (where none of the items suffer from drift) and the power case (where at least one item has drifted).
Paper presented at the American Educational Research Association (AERA) in San Francisco, California, April 2006.
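The Robust Z procedure mentioned in the item drift abstract above is commonly described as follows: take the difference in each anchor item's parameter estimate between two administrations, center the differences at their median, and scale them by 0.74 times their interquartile range. The Python sketch below follows that description. The function names, the difficulty values, the 1.645 cutoff, and the quartile convention are illustrative assumptions, not details taken from the paper.

```python
from statistics import median, quantiles


def robust_z(old_params, new_params):
    """Robust Z statistic for each anchor item.

    old_params and new_params hold the same items' parameter estimates
    (e.g. difficulties) from two administrations, in the same order.
    """
    diffs = [new - old for old, new in zip(old_params, new_params)]
    # Quartile convention is an assumption; others give slightly different IQRs.
    q1, _, q3 = quantiles(diffs, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("interquartile range is zero; Robust Z is undefined")
    med = median(diffs)
    return [(d - med) / (0.74 * iqr) for d in diffs]


def flag_drift(old_params, new_params, cutoff=1.645):
    """Indices of items whose |Robust Z| exceeds the (assumed) cutoff."""
    z_values = robust_z(old_params, new_params)
    return [i for i, z in enumerate(z_values) if abs(z) > cutoff]


# Hypothetical difficulty estimates; the last item has clearly shifted.
old_b = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
new_b = [-0.95, -0.52, 0.03, 0.49, 1.04, 2.6]
print(flag_drift(old_b, new_b))
```

Using the median and interquartile range rather than the mean and standard deviation keeps a few drifting items from distorting the yardstick used to judge all the others.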
Yearly ProgressPro provides teachers and administrators with tools to support student achievement by measuring the effectiveness of instruction, and tracking both mastery and retention of grade-level skills.