|
An aptitude test predicts achievement in academic pursuits. Ideally, in
constructing this type of test, the developer tries to minimize the effect of
exposure to specific materials or courses of study on the examinee's score.
An assessment that measures a student's acquired knowledge and skills in one
or more common content areas (for example, reading, mathematics, or
language).
An assessment intended primarily for individuals 18 years old or older who are
no longer attending elementary or secondary school.
An assessment that differs from traditional achievement tests. For example, an
alternative assessment may require a student to generate or produce responses
or products rather than answer only selected-response items. This type of
assessment may include constructed-response activities, essays, portfolios,
interviews, teacher observations, work samples, and/or group projects.
A scoring procedure in which a student's work is evaluated for selected traits
or dimensions, with each dimension receiving a separate score.
A test consisting of items selected and standardized so that the test predicts
a person's future performance on tasks not obviously similar to those in the
test. Aptitude tests may or may not differ in content from achievement tests,
but they do differ in purpose. Aptitude tests consist of items that predict
future learning or performance; achievement tests consist of items that sample
the adequacy of past learning.
An assessment that measures a student's performance on tasks and situations
that occur in real life. This type of assessment is closely aligned with, and
models, what students do in the classroom.
A test battery is a set of several tests designed to be administered as a
unit. Individual subject-area tests measure different areas of content and may
be scored separately; scores from the subtests may also be combined into a
single score.
A situation that occurs in testing when items systematically measure
differently for different ethnic, gender, or age groups. Test developers reduce
bias by analyzing item data separately for each group, then identifying and
discarding items that appear to be biased.

The upper limit of performance that can be measured effectively by a test.
Individuals are said to have reached the ceiling of a test when they perform at
the top of the range in which the test can make reliable discriminations. If an
individual or group scores at the ceiling of a test, the next higher level of
the test should be administered, if available.
An assessment that is based on the examiner observing an individual or group
and indicating whether or not the assessed behavior is demonstrated.
A single score used to express the combination, by averaging or summation, of
the scores on several different tests.
See "Equal-Interval Scale."
An assessment unit with directions, a question, or a problem that elicits a
written, pictorial, or graphic response from a student. Sometimes called an
"open-ended" item.
Content validity indicates the extent to which the content of the test samples
the subject matter or situation about which conclusions are to be drawn.
Methods used in determining content validity are textbook analysis, description
of the universe of items, adequacy of the sample, representativeness of the
test content, inter-correlations of subtest scores, and opinions of a jury of
experts.
Tables used to convert a student's test scores from scale score units to grade
equivalents, percentile ranks, and stanines.
A standard or judgment used as a basis for quantitative and qualitative
comparison; the variable against which a test is compared in order to gauge
the test's validity. For example, grade-point average and attainment of
curricular objectives are often used as criteria for judging the validity of an
academic aptitude test.
A test in which every item is directly identified with an explicitly stated
educational behavioral objective. The test is designed to determine which of
these objectives have been mastered by the examinee.
A test devised to exclude specific cultural stimuli so that persons from a
particular culture will not be penalized or rewarded on the basis of
differential familiarity with the stimuli.
A test score pertaining to a norm group (such as a percentile, stanine, or
grade equivalent) that is an outgrowth of the scale scores. Derived scores are
useful descriptors; however, they are not calibrated on an equal-interval
scale, so they cannot be added, subtracted, or averaged across test levels the
way scale scores can.
A test intended to locate learning difficulties or patterns of error. Such
tests yield measures of specific knowledge, skills, or abilities underlying
achievement within a broad subject. Thus, they provide a basis for remedial
instruction.
The property that indicates how accurately an item distinguishes between
examinees of high ability and those of low ability on the trait being measured.
An item that can be answered equally well by examinees of low and high ability
does not discriminate well and does not give any information about relative
levels of performance.

An assessment intended for students in kindergarten and grades 1 through
3.
A statement that defines an intended outcome of instruction. It describes what
a successful learner is able to do at the end of the lesson or course, defines
the conditions under which the behavior is to occur, and often specifies the
criterion or standard of acceptable performance.
A scale marked off in units of equal size that is applied to all groups taking
a given test, regardless of group characteristics or time of year. Each test
yields its own scale. On TABE, for example, scale scores are expressed in
numbers ranging from 0 to 999. The continuity of the scale among levels comes
from administering special test forms containing items from adjacent test
levels to random groups of students. This allows the TABE scales to be
calibrated so that a given adult learner is expected to obtain the same scale
score regardless of the form or level of the test he or she takes. However, the
standard error of measurement associated with that student's score will vary
systematically from level to level.
An evaluation of a test based on inspection only.
The opposite of ceiling, it is the lowest limit of performance that can be
measured effectively by a test. Individuals are said to have reached the floor
of a test when they perform at the bottom of the range in which the test can
make reliable discriminations. If an individual or group scores at the floor of
a test, the next lower level of the test, if available, should be
administered.
An ordered tabulation of individual scores (or groups of scores) showing the
number of persons who obtained each score or placed within each range of
scores.
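As a small illustration of the tabulation just described, the sketch below counts how many examinees obtained each score. The score list is hypothetical data made up for the example.

```python
from collections import Counter

# Hypothetical raw scores for ten examinees (illustrative data only).
scores = [12, 15, 12, 18, 15, 15, 20, 12, 18, 15]

# Count how many examinees obtained each score, then list the scores in order.
frequency = Counter(scores)
for score in sorted(frequency):
    print(score, frequency[score])
```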

A score on a scale developed to indicate the school grade (usually measured in
months) that corresponds to an average chronological age, mental age, test
score, or other characteristic of students. A grade equivalent of 6.4 is
interpreted as a score that is average for a group in the fourth month of Grade
6. Grade equivalents do not compose a scale of equal intervals and cannot be
added, subtracted, or averaged across test levels the way scale scores can.
The average test score obtained by students classified at a given grade
placement.
The probability that a student with very low ability on the trait being
measured will answer the item correctly. There is always some chance of
guessing the answer to a multiple-choice item, and this probability can vary
among items. The guessing parameter enables a model to account for these
factors.
A scoring procedure yielding a single score based on overall student
performance rather than on an accumulation of points. Holistic scoring uses
rubrics to evaluate student performance.

A test that measures the higher intellectual capacities of a person, such as
the ability to perceive and understand relationships and the ability to recall
associated meanings; in other words, a test that measures the ability to learn.
The act of explaining test scores to students so they understand exactly what
each type of score means. For example, a percentile rank refers to the
percentage of students in the norm group who fall below a particular point, not
the percentage of items answered correctly.
A question or problem on a test.
An item is biased when it systematically measures differently for different
ethnic, cultural, regional, or gender groups.
The basis of various statistical models for analyzing item and test data. In
TABE, the three-parameter model was used in the selection and scaling of items.
This model takes into account discrimination, difficulty, and chance level of
success (guessing) to describe each item's statistical characteristics.
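The three parameters named above can be combined in a single response function. The sketch below uses the common logistic form of the three-parameter model; the glossary does not state the exact functional form or any scaling constant used for TABE, so the formula, the function name `p_correct`, and the item values are assumptions for illustration only.

```python
import math

def p_correct(theta, a, b, c):
    """Probability of a correct response under a three-parameter logistic model.

    theta: examinee ability
    a: discrimination parameter
    b: difficulty (location) parameter
    c: guessing (chance-level) parameter
    The logistic form is an assumption; the source names only the parameters.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical item: moderate discrimination, average difficulty,
# and a chance level of 0.25 (for example, four answer choices).
item = {"a": 1.2, "b": 0.0, "c": 0.25}
for theta in (-3.0, 0.0, 3.0):
    print(theta, round(p_correct(theta, **item), 3))
```

In this form, an examinee of very low ability still answers correctly with probability close to `c`, and an examinee whose ability equals `b` answers correctly with probability halfway between `c` and 1.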

An assessment intended primarily for students in elementary and secondary
schools. CTB assessments may assess students in the entire K-12 range or just
in selected grades, e.g., Grades 2-12.
Norms that have been obtained from data collected in a limited locale, such as
a school system, county, or state. They may be used instead of, or along with,
national norms to evaluate student performance.
A statistic from item response theory that pinpoints the ability level at
which an item discriminates, or measures, best.

The quotient obtained by dividing the sum of a set of scores by the number of
scores; also called the "average." Mathematicians call it the "arithmetic
mean."
The middle score in a set of ranked scores. Equal numbers of ranked scores lie
above and below the median. It corresponds to the 50th percentile and the 5th
decile.
The score or value that occurs most frequently in a distribution.
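The three measures of central tendency defined above (the arithmetic mean, the middle ranked score, and the most frequent score) can be computed with Python's standard `statistics` module. The scores below are hypothetical.

```python
import statistics

# Hypothetical test scores for seven examinees (illustrative data only).
scores = [70, 75, 75, 80, 85, 90, 99]

mean = statistics.mean(scores)      # sum of the scores divided by their number
median = statistics.median(scores)  # middle score of the ranked set
mode = statistics.mode(scores)      # most frequently occurring score

print(mean, median, mode)
```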
Assessments that measure student performance in a variety of ways. Multiple
measures may include standardized tests, teacher observations, classroom
performance assessments, and portfolios.
A question, problem, or statement (called a "stem") that appears on a test,
followed by two or more answer choices, called alternatives or response
choices. The incorrect choices, called distractors, usually reflect common
errors. The examinee's task is to choose, from among the alternatives provided,
the best answer to the question posed in the stem. These are also called
"selected-response items."
A bell-shaped curve representing a theoretical distribution of measurements
that is often approximated by a wide variety of actual data. It is often used
as a basis for scaling and statistical hypothesis testing and estimation in
psychology and education because it approximates the frequency distributions of
sets of measurements of human characteristics.
A standardized assessment, in which all students perform under the same
conditions. This type of test compares a student or group of students with a
specified reference group, usually others of the same grade and age for K-12
students, or for adults, those with similar characteristics, such as those in
an adult basic education class.
The average or typical scores on a test for members of a specified group. They
are usually presented in tabular form for a series of different homogeneous
groups.

A desired educational outcome such as "constructing meaning" or "adding whole
numbers." Usually several different objectives are measured in one
subtest.
A test for which a list of correct answers, one for each test item, can be
provided so that subjective opinion or judgment is eliminated from the scoring
procedure. Multiple-choice, true/false, and matching-item tests are purely
objective, while short answer and completion-item tests are less so.
One of the 99 point scores that divide a ranked distribution into groups, each
of which contains 1/100 of the scores. The 73rd percentile denotes the score or
point below which 73 percent of the scores fall in a particular distribution of
scores. (See also the table under "stanine.")
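A minimal sketch of this "percent of scores falling below" idea follows. The function name `percentile_rank` and the norm-group scores are hypothetical, and the sketch counts only scores strictly below the given score; published norms may handle tied scores differently.

```python
def percentile_rank(score, distribution):
    """Percentage of scores in the distribution that fall below the given score."""
    below = sum(1 for s in distribution if s < score)
    return 100.0 * below / len(distribution)

# Hypothetical norm-group scores (illustrative data only).
norm_group = [35, 40, 42, 47, 50, 53, 55, 58, 61, 66]

print(percentile_rank(58, norm_group))
```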
An assessment activity that requires students to construct a response, create
a product, or perform a demonstration. Usually there are multiple ways that an
examinee can approach a performance assessment and more than one correct
answer.
A level of performance on a test, established by education experts, as a goal
of student attainment.
A test that samples the range of an examinee's capacity in particular skills
or abilities and that places minimal emphasis on time limits. A "pure" power
test is sometimes defined as one in which every examinee has sufficient time to
complete the test.
The ability of a score on one test to forecast a student's probable
performance on another test of similar skills. Predictive validity is
determined by mathematically relating scores on the two different tests.

The first score obtained in scoring a test, which is often the number of
correct answers. Sometimes it is the number right minus a fraction of the
number wrong, the time required to complete the test, the number of errors, or
some other number obtained directly from the test's administration.
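The "number right minus a fraction of the number wrong" variant can be sketched as below. The glossary does not say which fraction is used; the sketch assumes the conventional correction of 1 / (number of answer choices - 1), and the function name `formula_score` is hypothetical.

```python
def formula_score(num_right, num_wrong, choices_per_item):
    """Number right minus a fraction of the number wrong.

    The fraction 1 / (choices_per_item - 1) is an assumed, conventional
    choice; the source does not specify it.
    """
    return num_right - num_wrong / (choices_per_item - 1)

# Hypothetical examinee: 30 right and 8 wrong on five-choice items.
print(formula_score(num_right=30, num_wrong=8, choices_per_item=5))
```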
A test of ability to engage in a new type of specific learning. Level of
maturity, previous experience, and emotional and mental set are important
determinants of readiness.
The consistency of test scores obtained by the same individuals on different
occasions or with different sets of equivalent items; accuracy of scores.
A scoring tool, or set of criteria, used to evaluate a student's test
performance.

An organized set of measurements, all of which measure one property or
characteristic. Different types of test-score scales use different units, for
example, number correct, percentiles, or IRT scale scores.
Scores on a single scale with intervals of equal size. The scale can be
applied to all groups taking a given test, regardless of group characteristics
or time of year, making it possible to compare scores from different groups of
examinees. Scale scores are appropriate for various statistical purposes; for
example, they can be added, subtracted, and averaged across test levels. Such
computations permit educators to make direct comparisons among examinees,
compare individual scores to groups, or compare an individual's pre-test and
post-test scores in a way that is statistically valid. This cannot be done with
percentiles or grade level equivalents.
A question or incomplete statement that is followed by answer choices, one of
which is the correct or best answer. Also referred to as a "multiple-choice"
item.
A test of a student's ability to participate in special programs or advanced
learning situations. For example, an honors-level class or a magnet school may
require the attainment of high scores on an assessment for admission.
A test in which one aspect of performance is measured by the number of tasks
performed in a given time. A "pure" speed test is one in which examinees make
no errors and that cannot be completed by any examinee in the allotted
time.
A statistic used to express the extent of the divergence of a set of scores
from the average of all the scores in the group. In a normal distribution,
approximately two-thirds (68.3%) of the scores lie within the limits of one
standard deviation above and one standard deviation below the mean. Roughly
one-sixth (about 15.9%) of the scores lie more than one standard deviation
above the mean, and roughly one-sixth lie more than one standard deviation
below the mean.
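The proportions quoted for the normal distribution can be checked with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). This is only a numerical check of the figures above, not part of any scoring procedure.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Proportion of scores within one standard deviation of the mean.
within_one_sd = z.cdf(1) - z.cdf(-1)

# Proportion of scores more than one standard deviation above the mean
# (by symmetry, the same proportion lies more than one below it).
above_one_sd = 1 - z.cdf(1)

print(round(within_one_sd, 3), round(above_one_sd, 3))
```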
A measure of the amount of error to be expected in a score from a particular
test. The smaller the standard error of measurement, the greater the accuracy
of the test score. The standard error of measurement is the standard deviation
of a theoretical distribution of errors, each of which is the difference
between an obtained score and the true score. Thus, if the standard error of
measurement is 5, the chances are about two to one that an obtained score lies
within five units of the true score.
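The glossary does not say how the standard error of measurement is calculated. One widely used relation from classical test theory, assumed here, is the standard deviation of the scores multiplied by the square root of (1 - reliability). The function names and all numbers below are hypothetical.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """Classical test theory estimate: SD * sqrt(1 - reliability) (assumed formula)."""
    return sd * math.sqrt(1.0 - reliability)

def score_band(obtained, sem):
    """Range extending one standard error of measurement on either side of a score."""
    return (obtained - sem, obtained + sem)

# Hypothetical test: standard deviation of 10 and reliability of 0.75.
sem = standard_error_of_measurement(sd=10, reliability=0.75)
print(sem, score_band(50, sem))
```

With these made-up values the standard error of measurement is 5, matching the example in the definition.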
A derived score scaled to produce an arbitrarily assigned mean and standard
deviation. For example, deviation IQs are standard scores with a mean of 100
and a standard deviation of 15 or 16, depending on the test.
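A sketch of the rescaling involved: a score is first expressed in standard-deviation units relative to its group and then placed on the new scale. The function name `standard_score` and the group statistics are hypothetical; the defaults of 100 and 16 follow the deviation IQ example.

```python
def standard_score(raw, group_mean, group_sd, new_mean=100.0, new_sd=16.0):
    """Rescale a score to a chosen mean and standard deviation."""
    z = (raw - group_mean) / group_sd  # distance from the mean in SD units
    return new_mean + new_sd * z

# Hypothetical examinee: raw score 65 in a group with mean 50 and SD 10.
print(standard_score(65, group_mean=50, group_sd=10))
```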
The process of administering a test to a nationally representative sample of
examinees using carefully defined directions, time limits, materials, and
scoring procedures. The results produce norms to which the performance of other
examinees can be compared, provided they took the test under the same
conditions.
That part of the population that is used in the norming of a test, i.e., the
reference population. The sample should represent the population in essential
characteristics, some of which may be geographical location, age, or grade for
K-12 students, or, for adults, participation in a specific type of program (for
example, adult basic education).
A test constructed of items that are appropriate in level of difficulty and
discriminating power for the intended examinees, and that fit the pre-planned
table of content specifications. The test is administered in accordance with
explicit directions for uniform administration and is interpreted using a
manual that contains reliable norms for the defined reference groups.
A unit of a standard score scale that divides the norm population into nine
groups with the mean at stanine 5. The word stanine draws its name from the
fact that it is a STAndard score on a scale of NINE units.
| Stanine | Percentile rank range | Percentage of scores |
| 9 Highest Level | 96-99 | 4% |
| 8 High Level | 90-95 | 7% |
| 7 Well above average | 78-89 | 12% |
| 6 Slightly above average | 60-77 | 17% |
| 5 Average | 41-59 | 20% |
| 4 Slightly below average | 23-40 | 17% |
| 3 Well below average | 11-22 | 12% |
| 2 Low Level | 5-10 | 7% |
| 1 Lowest Level | 1-4 | 4% |
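The table above can be read as a lookup from percentile rank to stanine. The sketch below encodes the upper percentile bound of each stanine from the table; the function name `stanine` is hypothetical.

```python
from bisect import bisect_left

# Upper percentile-rank bound of stanines 1 through 9, taken from the table.
UPPER_BOUNDS = [4, 10, 22, 40, 59, 77, 89, 95, 99]

def stanine(percentile_rank):
    """Return the stanine (1-9) that contains a percentile rank from 1 to 99."""
    if not 1 <= percentile_rank <= 99:
        raise ValueError("percentile rank must be between 1 and 99")
    # Index of the first upper bound that is >= the percentile rank.
    return bisect_left(UPPER_BOUNDS, percentile_rank) + 1

for pr in (3, 41, 59, 60, 96):
    print(pr, stanine(pr))
```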
See "Battery."
See "Item."
See "Objective."

The capability of a test to measure what its authors or users intend it to
measure.
|