Test Components
Construction
Scale types
Item analysis
Selection tests
Traits
Latent traits
Cross-situational consistency
Tests to measure change
States
Aptitude-by-treatment interactions
Administering tests
Standardization
Response strategies
Rater errors
Administrator-respondent relationship
Scoring
Aggregation
Factor analysis
Interpretation
Norms
Measurement-related statistics
Criterion-referenced interpretations
Measurement and assessment can be considered in terms of four major components. Construction refers to how the test was created; administration, to how participants completed it; scoring, to how participants' responses were turned into data; and interpretation, to how the researcher made sense of those data. Each of these interrelated components has the potential to influence study results. How a particular measurement method is created (e.g., the selection of test items), how the method is administered, and how items are scored all influence the interpretation of measurement scores.
Construction --> Administration --> Scoring --> Interpretation
*Construction
Definition: Procedures employed to create a test or other measurement method.
Description. The rules employed to create a test have serious implications for the interpretation of any scores produced by that test. Given its importance, it is surprising that little consensus has developed over the best procedures for test construction. In this and the following sections we describe several sets of construction guidelines.
Gregory (1992) described five steps in test construction: (a) defining the test (e.g., purpose, content), (b) selecting a scaling method (i.e., rules by which numbers or categories are assigned to responses), (c) constructing the items (e.g., developing a table of specifications that crosses the test's content areas with the processes by which that content is measured), (d) testing the items (i.e., administering the items and then conducting an item analysis), and (e) revising the test (e.g., cross-validating it with another sample because validity shrinkage almost always occurs). A researcher evaluating a new mathematics curriculum, for example, might (a) desire a test that could show changes over time in mathematics skills, (b) assign a score of 1 to each math item correctly answered, (c) create a table of specifications indicating what kinds of skills would be expected to be acquired, (d) run the study and determine which items were sensitive to change, and (e) repeat the process with the selected items and a new group of students.
Similarly, Burisch (1984) described three approaches to personality test construction representative of many domains:
1. External approaches, which rely on criteria or empirical data to distinguish useful items. The content of the item is less important than its ability to meet a pre-established criterion. For example, Minnesota Multiphasic Personality Inventory (MMPI) items were chosen on the basis of their ability to distinguish between normal persons and those with a diagnosed psychopathology.
2. Inductive approaches, which require the generation of a large item pool that is then completed by a large number of subjects, with the resulting data subjected to a statistical procedure (such as factor analysis) designed to reveal an underlying structure. Many aptitude tests, such as the General Aptitude Test Battery (GATB), were constructed in this fashion.
3. Deductive approaches, which rely on a theory to generate items. Items that clearly convey the meaning of the trait to be measured and that measure specific (as opposed to global) traits are more likely to be useful. Items for the Myers-Briggs Type Indicator, for example, were originally derived from Jung's (1923) theory of types.
Burisch's (1984) review of the literature found no superiority for any of these approaches in producing reliable and valid scales. In fact, he suggested that it is often more useful simply to ask individuals to rate themselves on a trait that they understand and for tasks in which they possess high motivation.
Researchers frequently wrestle with the question of whether they need to create a new scale for a study. In the psychological arena alone, however, estimates are that 20,000 new psychological, behavioral, and cognitive measures are developed each year (American Psychological Association, 1992). It is quite likely that a self-report scale, interview, or other operation has already been developed in your research area.
The question then becomes finding that operation. Most disciplines have books or databases that are good places to start. In education and psychology, for example, sources of information about published tests include Tests in Print (Buros Institute for Mental Measurements), Mental Measurements Yearbook (Buros Institute for Mental Measurements), Tests (Pro-Ed, Inc.), and Test Critiques (Pro-Ed, Inc.). Sources of unpublished tests include the Directory of Unpublished Experimental Mental Measures (Wm. C. Brown), Measures for Psychological Assessment: A Guide to 3,000 Original Sources and Their Application (Institute for Social Research, University of Michigan), and Tests in Microfiche (Educational Testing Service). Some of these tests have been placed on computer; information about such applications can be found in Psychware Sourcebook (Pro-Ed, Inc.) and Computer Use in Psychology: A Directory of Software (American Psychological Association). Additionally, there is a database called Health and Psychosocial Instruments (HAPI) available through BRS Information Technologies which lists over 7,000 instruments. More information about HAPI can be found by contacting Evelyn Perloff, Behavior Measurement Database Services, PO Box 110287, Pittsburgh, PA, 15232-0787 (412 687-6850).
Example. Meier (1997, 1998) proposed an alternative set of rules for selecting change-sensitive items and tasks (see Table x below). Foremost among these guidelines is that an item should show change after an intervention but stability independent of an intervention (Tryon, 1991). Using archival data, Meier evaluated the items in an alcohol attitudes scale before and after an alcohol education intervention. Application of the traditional and intervention-based item selection rules resulted in scales with differing psychometric properties. Traditional items demonstrated greater internal consistency and variability, while intervention-based items detected pretest-posttest change. Intervention-based item selection procedures may provide a new method for investigating specific changes resulting from researcher-designed interventions.
Summary of Intervention Item Selection Rules

Rule  Content
1     Ground in theory
2     Aggregate items across individuals
3     Avoid ceiling or floor effects
4     Detect change in an item's score after an intervention
5     Change occurs in the expected direction
6     Change occurs relative to a comparison or control group
7     No treatment-comparison group differences before an intervention
8     No correlation with systematic error sources
9     Cross-validate
10    Conduct a power analysis
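To make these rules concrete, the following sketch (in Python) applies simplified versions of rules 4, 6, and 7 to simulated pretest-posttest data. The group names, effect sizes, and cutoff are invented for illustration and do not reproduce Meier's actual procedure.

```python
# Sketch only: flag items consistent with intervention item selection
# rules 4, 6, and 7 (change after intervention, change relative to a
# control group, no pretest group differences). Data and thresholds are
# hypothetical; Meier's procedures involve the full set of rules.
import numpy as np

rng = np.random.default_rng(0)
n_people, n_items = 50, 8

# Simulated 5-point item responses: persons x items.
pre_treat = rng.integers(1, 6, size=(n_people, n_items)).astype(float)
post_treat = pre_treat + rng.normal(0.8, 0.5, size=(n_people, n_items))  # treated group shifts
pre_ctrl = rng.integers(1, 6, size=(n_people, n_items)).astype(float)
post_ctrl = pre_ctrl + rng.normal(0.0, 0.5, size=(n_people, n_items))    # controls stay put

def keep_item(j, min_shift=0.5):
    """Apply three of the selection rules to item j (illustrative cutoffs)."""
    treat_change = post_treat[:, j].mean() - pre_treat[:, j].mean()
    ctrl_change = post_ctrl[:, j].mean() - pre_ctrl[:, j].mean()
    pre_diff = abs(pre_treat[:, j].mean() - pre_ctrl[:, j].mean())
    rule4 = treat_change > min_shift                  # change after intervention
    rule6 = (treat_change - ctrl_change) > min_shift  # change relative to controls
    rule7 = pre_diff < min_shift                      # no pretest group difference
    return rule4 and rule6 and rule7

selected = [j for j in range(n_items) if keep_item(j)]
print("Items retained under rules 4, 6, and 7:", selected)
```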
*Scale types
Definition: The type and amount of information contained in test scores.
Description. Traditionally, four types of measurement scales are described:
(a) nominal scales, which contain qualitative categories;
(b) ordinal scales, which add rank information;
(c) interval scales, which contain rank information with equal intervals; and
(d) ratio scales, which contain equal intervals plus a meaningful zero point.
Each successive type contains more information than the previous: ratio scales, for example, provide more information about a construct than interval, ordinal, or nominal scales. Ratio scales should be the most precise if they reflect the actual values present in a phenomenon.
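As a brief illustration of how scale type constrains analysis, the sketch below pairs each scale type with a summary statistic that is defensible for it; the data values are invented.

```python
# Illustration: which summary statistics are meaningful depends on the
# scale type. All values are invented.
import statistics

nominal = ["anxiety", "depression", "anxiety", "phobia"]  # categories only
ordinal = [1, 2, 2, 3, 4]                                 # ranks (e.g., class standing)
interval = [98.6, 99.1, 97.8, 100.2]                      # equal intervals, arbitrary zero (deg F)
ratio = [0.0, 12.5, 30.0, 45.5]                           # equal intervals, true zero (seconds)

print(statistics.mode(nominal))    # the mode is defensible for nominal data
print(statistics.median(ordinal))  # the median adds rank information
print(statistics.mean(interval))   # means require (at least) equal intervals
print(ratio[3] / ratio[1])         # ratios ("3.6 times longer") require a true zero
```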
Example. Diagnostic categories typically contain nominal information, that is, they distinguish between different types of phenomena but provide no information about differences within a particular phenomenon. Dihoff, Hetznecker, Brosvic, Carpenter, et al. (1993) developed ordinal diagnostic criteria with 20 autistic children aged 2-3 years. Subgroupings of the children were identified and found to differ on behavioral measures, standardized tests, and school achievement. Dihoff et al. (1993) reported that use of the ordinal criteria promoted diagnostic agreement among therapists.
*Item analysis
Definition: Methods for evaluating the usefulness of test items.
Description. Typically test developers perform item analysis during test construction to determine which items should be retained or dropped. Although the term items usually refers to questions or statements, here we use it to mean any distinct measurement element, including an observation or a behavioral performance.
As we noted earlier in this chapter, guidelines for item selection have been proposed by numerous authors (e.g., Burisch, 1984; Dawis, 1987; DeVellis, 1991; Epstein, 1979; Gronlund, 1988; Jackson, 1970). For example, Jackson (1970) proposed four general criteria, suggesting that scales: (a) be grounded in theory, (b) suppress response style variance, (c) demonstrate reliability, homogeneity, and generalizability, and (d) demonstrate convergent and discriminant validity. Criterion (a) can be evaluated by noting the degree to which the initial item pool was rationally constructed. The degree of response style or response set variance (b) could be assessed by correlating items with a measure of social desirability. Criterion (c) can be assessed by examining item-total correlations and by checking for ceiling and floor effects (i.e., participants' responses to an item cluster near the top or bottom of the possible range of scores). Correlations among scale items and related and different constructs can be computed to assess validity (d).
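The sketch below illustrates two of the criterion (c) checks, corrected item-total correlations and ceiling/floor proportions, on simulated five-point items. The data and the cutoffs (.30 and .80) are illustrative only.

```python
# Sketch: corrected item-total correlations and ceiling/floor checks on
# simulated 5-point items. Cutoffs are illustrative, not standard values.
import numpy as np

rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 1))                                # shared attitude component
responses = np.clip(np.rint(3 + latent + rng.normal(size=(200, 10))), 1, 5)

total = responses.sum(axis=1)
for j in range(responses.shape[1]):
    item = responses[:, j]
    rest = total - item                       # "corrected" total excludes the item itself
    r_it = np.corrcoef(item, rest)[0, 1]
    at_top = np.mean(item == 5)               # proportion at the ceiling
    at_bottom = np.mean(item == 1)            # proportion at the floor
    flag = "drop?" if (r_it < .30 or at_top > .80 or at_bottom > .80) else "keep"
    print(f"item {j}: r_it={r_it:.2f}, ceiling={at_top:.2f}, floor={at_bottom:.2f} -> {flag}")
```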
Test developers typically conduct test evaluations on the basis of aggregation across items and persons, but not occasions. Epstein (1979) found that aggregating items over occasions can dramatically increase reliability and validity coefficients for trait-based items. Aggregation increases test reliability (and subsequent validity analyses) because behavioral consistencies accumulate over multiple measurements while random errors do not (Rushton, Jackson, & Paunonen, 1981). Thus, prior to further analyses, trait test developers should aggregate across items, individuals, and occasions.
A story about how Thomas Edison invented the light bulb is illustrative of the item analysis and test construction process. Edison reportedly sorted through thousands of types of materials in the search for a filament that could conduct electricity, emit light and minimize heat, and endure for a long period of time. Similarly, test developers typically sort through dozens or hundreds of items in an attempt to find a number that exhibit the characteristics desired for that particular test.
Example. Musser and Malkus (1994) employed an item analysis to develop the Children's Attitudes Toward the Environment Scale (CATES), a measure designed to assess children's knowledge about the natural environment. They administered a pool of 90 items to 232 fourth and fifth grade students and subjected those items to analyses which evaluated their internal consistency (seeking items with high item-total correlation), mean level (with items showing ceiling or floor effects dropped), and variability (with items showing low variability dropped).
The 25 selected items were then administered to a new sample of 90 third, fourth, and fifth grade students; these items together displayed a coefficient alpha of .70. Finally, the 25 items were administered twice, from 4 to 8 weeks apart, to 171 third, fourth, and fifth grade students. Test-retest reliability was calculated at .68; coefficient alpha for the two administrations was .80 and .85.
These repeated waves of item administration, analysis, and item selection typify most item analyses. Also notice that the analyses Musser and Malkus employed, although standard, are best used to select items that measure stable constructs. The resulting items are likely to be less useful for studying constructs that change.
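For readers who want to compute the internal consistency figure reported above, here is a minimal sketch of coefficient alpha; the person-by-item matrix is simulated, not Musser and Malkus's data.

```python
# Sketch: coefficient alpha for a persons-by-items score matrix.
# The data are simulated to have one shared component plus noise.
import numpy as np

def coefficient_alpha(scores):
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of total)."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

rng = np.random.default_rng(2)
latent = rng.normal(size=(171, 1))                       # shared attitude component
scores = latent + rng.normal(scale=1.0, size=(171, 25))  # 25 items with noise
print(f"alpha = {coefficient_alpha(scores):.2f}")
```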
*Traits
Definition: Consistent personal characteristics often assumed to be of biological origin and resistant to environmental influences.
Description. Individual differences refers to the idea that individuals could behave differently on the same tasks or in the same situations (Dawis, 1992). Stable individual differences are traits. Theorists usually assume that traits are normally distributed in the population; that is, a frequency distribution of any trait should resemble a bell-shaped curve.
Selection testers typically treat measurement as nomothetic. That is, they are measuring traits--such as neuroticism, extraversion, openness to experience, agreeableness, and conscientiousness (McCrae & Costa, 1987)--presumed to be present in every person. In contrast, idiographic assessors believe that individuals are unique and that traits may or may not be present in different individuals.
Probably the most significant step you can take to improve the measurement of trait constructs in your study is to aggregate measurement. Again, aggregation refers to administering measures two or more times and then summing or averaging the multiple administrations.
Example. The most significant contemporary work in the area of traits has to do with research on the Big Five. The Big Five refer to the consensus reached by personality researchers about five traits considered the basic structure of personality. These traits are neuroticism, extraversion, openness to experience, agreeableness, and conscientiousness. Although this research remains open to alternative explanations (cf. Almagor, Tellegen, & Waller, 1995; Block, 1995), support for the Big Five interpretation (John, Angleitner & Ostendorf, 1988; McCrae and Costa, 1987) has been bolstered by factor analyses of trait descriptions produced by different methods (such as ratings by others and self-report) and different samples (i.e., cross-cultural).
*Latent traits
Definition: Unobservable characteristics that may be indicated by clusters of behaviors.
Description. If no single behavior can define a construct (i.e., no single operational definition exists), then clusters of behaviors may be able to do so. For example, no single behavior is assumed to be indicative of intelligence.
Perhaps the most prominent of contemporary measurement approaches, Item Response Theory (IRT), relies on the idea of latent traits (Hambleton, Swaminathan, & Rogers, 1991; Thissen & Steinberg, 1988). IRT proposes that item responses allow inferences about hidden or latent traits. IRT assumes that a relation exists between an observable item response and a single unobservable trait which underlies performance. This relation, at least in cognitive ability items, can be described by a mathematical function, a monotonically increasing curve.
For example, two different item characteristic curves can be expected for discriminating and non-discriminating verbal ability (e.g., spelling or reading comprehension) items. With a discriminating item, persons with good verbal skills should be more likely to answer the question correctly. A non-discriminating item, however, would show no difference between persons of high and low verbal ability. Similarly, persons at the same ability level should provide different answers to items of different difficulty levels; for example, two persons of moderate ability should both answer an easy item correctly and a difficult item incorrectly. Identification of poorly discriminating items allows their deletion with no loss of measurement precision. In addition, IRT analyses permit the development of a calibrated item bank. From this bank, subsets of items that yield comparable latent trait scores may be drawn, a useful benefit in many measurement applications.
The IRT test developer collects item data and compares it to the statistical model proposed for those items (Meier, 1994). If the fit is good, then: (a) the resulting ability estimates are not dependent upon a particular test (because they are based upon the underlying trait, not the test), and (b) the resulting item estimates are not dependent upon a particular sample (because the slope and intercept of the demonstrated item characteristic curve remain the same despite the value of a score contributed by any particular individual). Item and ability parameters produced through IRT analyses are invariant because they include information about items in the ability estimation process and about examinees in the item estimation process.
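A minimal sketch of the monotonically increasing item characteristic curve described above follows, using the common two-parameter logistic form; the discrimination (a) and difficulty (b) values are invented.

```python
# Sketch: item characteristic curves under a two-parameter logistic model.
# P(correct | theta) rises monotonically with the latent trait theta;
# 'a' controls discrimination and 'b' item difficulty. Values are invented.
import numpy as np

def icc(theta, a, b):
    """Probability of a correct response given latent trait level theta."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

thetas = np.linspace(-3, 3, 7)
discriminating = icc(thetas, a=2.0, b=0.0)     # steep curve: separates low from high ability
nondiscriminating = icc(thetas, a=0.2, b=0.0)  # flat curve: similar probabilities at all levels

for t, p1, p2 in zip(thetas, discriminating, nondiscriminating):
    print(f"theta={t:+.1f}  discriminating={p1:.2f}  non-discriminating={p2:.2f}")
```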
Example. Young (1991) employed IRT to investigate gender differences in predicting academic performance of college freshmen through the use of preadmission measures such as the Scholastic Aptitude Test and high school grade point average (GPA). Using 1,564 student participants, he found that women's cumulative GPA was significantly underpredicted by preadmission measures. Young developed an IRT-based measure of GPA which found no underprediction for men or women.
*Cross-situational consistency
Definition: Tendency of a person to behave consistently across situations or contexts.
Description. If traits are the dominant psychological phenomena, individuals should behave consistently across situations. In contrast, situational specificity refers to the tendency of individuals to behave according to the specific situation in which they find themselves.
Traits are assumed to be stable across situations. Thus, persons described as honest are expected to display honest behavior regardless of the situations in which they find themselves. For example, individuals who score low on a test of honesty may behave dishonestly in classrooms and stores, while more honest individuals behave honestly in those settings. In religious situations, however, both high and low honesty individuals may behave honestly. Honest behavior in this case is situation specific.
Use of the term trait implies that enough cross-situational stability occurs so that "useful statements about individual behavior can be made without having to specify the eliciting situations" (Epstein, 1979, p. 1122). Similarly, Campbell and Fiske (1959) stated that "any conceptual formulation of trait will usually include implicitly the proposition that this trait is a response tendency which can be observed under more than one experimental condition" (p. 100).
Magnusson and Endler (1977) discussed coherence, a type of consistency that results from the interaction between individuals' perception of a situation and their disposition to react consistently in situations so perceived. The factors that influence this interaction, such as intelligence, skills, learning history, interests, attitudes, needs, and values, may be quite stable within individuals.
Individuals C and D, who score highly on a test of honesty, may show more honest behavior across two situations than individuals A and B who obtain low scores. However, C and D may also display differences between themselves in honest behavior across situations--perhaps because of slight differences in their perceptions of those situations--even though their mean behavior score is the same across situations. From the perspective of the individual, the behavior appears coherent. From the perspective of the observer who looks only at group differences, the behavior appears consistent. From the perspective of the observer who looks at individuals across situations, the behavior appears inconsistent.
Tryon (1991) maintained that situational specificity simply involves different mean levels of behavior in one situation compared to another. As shown below, college students may demonstrate more anxiety in two testing situations (situations 2 and 4) than in two lectures (situations 1 and 3). Arguments for traits, Tryon suggested, involve persons maintaining their rank on a construct within the distribution. If Persons A, B, C, and D rank first, second, third, and fourth on the amount of anxiety they display in a lecture, and then maintain that ranking in other settings, trait arguments will be upheld even if overall levels of anxiety change.
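The sketch below illustrates Tryon's distinction with invented anxiety scores for Persons A through D across lecture and exam situations: the situation means differ (situational specificity), yet the rank order of persons is preserved (the trait argument).

```python
# Sketch of Tryon's distinction: situation means differ while persons
# keep their rank order across situations. Anxiety scores are invented.
import numpy as np

# rows = Persons A-D, columns = lecture, exam, lecture, exam
anxiety = np.array([
    [6, 9, 5, 8],   # A: most anxious everywhere
    [5, 8, 4, 7],   # B
    [3, 6, 2, 5],   # C
    [2, 5, 1, 4],   # D: least anxious everywhere
], dtype=float)

print("situation means:", anxiety.mean(axis=0))   # exams > lectures
for col in range(anxiety.shape[1]):
    order = np.argsort(-anxiety[:, col])          # descending rank order within each situation
    print("situation", col + 1, "rank order (0 = Person A):", order)
```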
Tryon (1991) believed that it is possible to hold both the situational specificity and trait positions since "activity is both very different across situations yet predictable from situation to situation" (p. 14). He concluded:
Situational differences are so large that they stand out immediately. Person consistency is more subtle and requires aggregation to reach substantial effect size. The Spearman-Brown prophecy formula indicates that either effect size can be made arbitrarily large depending upon the level of aggregation chosen. The implication for research and...practice is that one should choose the level of aggregation that provides the necessary effect size to achieve the stated purpose of the empirical inquiry at hand. (p. 14)
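The Spearman-Brown prophecy formula Tryon cites can be written as r_kk = k r / [1 + (k - 1) r], where r is the reliability of a single measurement and k is the number of measurements aggregated. The sketch below shows how projected reliability grows with the level of aggregation, starting from an illustrative single-unit reliability of .10.

```python
# Sketch: Spearman-Brown prophecy formula. Projected reliability of an
# aggregate grows with the number of units (items or occasions) combined.
# The single-unit reliability of .10 is illustrative.
def spearman_brown(r_single, k):
    """Projected reliability when k parallel measurements are aggregated."""
    return (k * r_single) / (1 + (k - 1) * r_single)

for k in (1, 2, 7, 14, 30):
    print(f"k={k:2d}  projected reliability = {spearman_brown(0.10, k):.2f}")
```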
Example. Lyytinen (1995) studied the effects of two different situations on children's pretend play. She placed 81 children aged 2-6 years in either a play-alone condition or with a same-gender, same-age peer. Children playing with the familiar peer displayed a significantly higher proportion of pretend play acts than when playing by themselves. Children playing with another child, however, displayed fewer play acts overall because of the time they spent looking at and talking about each other's play. Thus, situational specificity appears to be at work in the pretend play of children.
*States
Definition: Transitory psychological phenomena that change because of psychological, developmental, or situational causes.
Description. States are internal or external psychological characteristics that vary; Spielberger (1991) credited Cattell and Scheier (1961) with introducing the state-trait distinction. Even theorists interested in measuring traits acknowledge the presence of state effects in psychological testing. For example, many cognitive abilities such as reading and mathematics skills may have a genetic component, but we still expect some aspects of those skills to change as a result of development (e.g., improvement with age) and interventions (e.g., education).
As described previously, Collins (1991; Collins & Cliff, 1990) described a test construction method appropriate for measuring development. Here researchers were interested in predicting and measuring patterns of change in grade school students' acquisition of mathematical skills. They proposed that children first learn addition, then subtraction, multiplication, and division, in that order. Such a sequence can be characterized as cumulative (i.e., abilities are retained even as new abilities are gained), unitary (i.e., all individuals learn in the same sequence), and irreversible (i.e., development is always in one direction) (Collins, 1991). This sequence can be employed to search for items and tasks that do and do not display the expected sequence of mathematics performance over time.
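A simple way to see what a cumulative, unitary sequence implies for data is sketched below. It merely checks whether hypothetical mastery patterns are consistent with the addition-subtraction-multiplication-division order; it is not Collins's actual procedure.

```python
# Sketch: check whether mastery patterns are consistent with a cumulative,
# unitary sequence (addition before subtraction before multiplication
# before division). A simple consistency check, not Collins's method.
SKILLS = ["addition", "subtraction", "multiplication", "division"]

def fits_sequence(mastery):
    """A pattern fits if no later skill is mastered before all earlier ones."""
    mastered = [mastery[s] for s in SKILLS]
    # Allowed patterns look like [1, 1, 0, 0]: once a 0 appears, only 0s follow.
    first_gap = mastered.index(0) if 0 in mastered else len(mastered)
    return all(m == 0 for m in mastered[first_gap:])

students = [
    {"addition": 1, "subtraction": 1, "multiplication": 0, "division": 0},  # fits
    {"addition": 1, "subtraction": 0, "multiplication": 1, "division": 0},  # violates
]
for i, s in enumerate(students):
    print(f"student {i}: fits cumulative sequence = {fits_sequence(s)}")
```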
Example. The State-Trait Anxiety Inventory (STAI; Spielberger, Gorsuch, & Lushene, 1970) is one of the most widely used state-trait measures. The STAI consists of two 20-item Likert scales to measure state anxiety (i.e., situation-specific, temporary feelings of worry and tension) and trait anxiety (i.e., a more permanent and generalized feeling). Both scales contain items with similar and overlapping content: state scale items include "I am tense," "I feel upset," and "I feel content," while trait scale items include "I feel nervous and restless," "I feel secure," and "I am content." However, the state scale asks test-takers to rate the items according to how they feel "at this moment," while the trait scale asks for ratings of how the test-taker "generally" feels.
The instructions do seem to produce the desired difference: test-retest reliabilities for the state scale are considerably lower than for the trait (Spielberger, Gorsuch, & Lushene, 1970). The STAI typically correlates at moderate to high levels with other measures of anxiety (e.g., Bond, Shine, & Bruce, 1995; Kaplan, Smith, & Coons, 1995). For example, Bond et al. (1995) asked patients with anxiety disorders and normal controls to complete the STAI and a visual analogue scale rating of anxiety. In this approach participants mark along a 100 mm line to indicate their level of anxiety; such visual measures are useful when frequent measures of mood are necessary and when reading is a problem. Bond et al. (1995) found correlations in the .50s and .60s between the two scales, suggesting a modest degree of overlap.
*Aptitude-by-treatment interactions (ATIs)
Definition: Interaction of individuals' characteristics with interventions.
Description. Treatments can be conceptualized as types of situations (Cronbach, 1975; Cronbach & Snow, 1977). In a study where an experimental group is contrasted with a control group, both groups are experiencing different types of situations. As illustrated below, persons can also be conceptualized as having aptitudes, that is, individual characteristics that affect response to treatments (Cronbach, 1975).
In an ATI study researchers attempt to identify important individual differences that would facilitate or hinder the usefulness of various treatments (Snow, 1991). A computer-based mathematics course or any type of distance learning course, for example, would probably be most beneficial to students who are comfortable with and knowledgeable about technology.
From a common sense perspective, ATIs should be plentiful in the real world. That is, it seems reasonable to assume that persons with certain characteristics should benefit more from some treatments than others. From the perspective of selection, intervention, and theoretical research, finding ATIs would seem to be of the utmost importance. ATIs offer the possibility of increased efficiency in these applied areas.
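ATIs are commonly examined as the product term in a regression of the outcome on aptitude, treatment, and their interaction; the sketch below uses simulated data, not results from an actual ATI study.

```python
# Sketch: an aptitude-by-treatment interaction examined as the product
# term in a regression of outcome on aptitude, treatment, and their
# interaction. Data are simulated, not from an actual ATI study.
import numpy as np

rng = np.random.default_rng(3)
n = 200
aptitude = rng.normal(size=n)            # e.g., comfort with technology
treatment = rng.integers(0, 2, size=n)   # 0 = lecture course, 1 = computer-based course
outcome = 0.2 * aptitude + 0.3 * treatment + 0.8 * aptitude * treatment + rng.normal(size=n)

X = np.column_stack([np.ones(n), aptitude, treatment, aptitude * treatment])
coefs, *_ = np.linalg.lstsq(X, outcome, rcond=None)
labels = ["intercept", "aptitude", "treatment", "aptitude x treatment"]
for name, b in zip(labels, coefs):
    print(f"{name:22s} {b:+.2f}")   # a sizable interaction coefficient signals an ATI
```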
Example. Domino (1971) investigated the interaction between learning environment and student learning style. Domino hypothesized that independent learners, students who learn best by setting their own assignments and tasks, might show the best outcomes in a class when paired with teachers who provided considerable independence. Similarly, conforming students who learn best when provided with assignments by the teacher might perform better when paired with instructors who stressed their own requirements. Domino did find empirical support for this interaction.
*Change-based measurement
Definition: Tests whose primary purpose is to detect change in one or more constructs.
Description. These measurements are intended to detect not stable traits but states and other conditions, such as moods or skills, that change over time and across situations. As noted previously, testing traditionally has focused on measuring traits such as intelligence that were assumed to be largely a function of heredity and immune to situational, developmental, and intervention influences. Attempts to measure traits affected how tests were constructed; reliability and validity became the central criteria for evaluating a test's quality (Meier, 1997, 1998). Efforts to develop tests whose purpose is to be sensitive to intervention and developmental effects are relatively new.
In contrast, Meier (1997, 1998) drew on the concepts described by criterion-referenced and longitudinal test developers (Collins, 1991; Gronlund, 1988; Tryon, 1991) to develop test construction rules designed to select test items and tasks sensitive to intervention effects. Intervention items, like traditional items, should also be theoretically based, unrelated to systematic error sources, and avoid ceiling and floor effects. Because empirically-derived items may be capitalizing on sample-specific variance, items should be cross-validated on new samples drawn from the same population. Intervention-sensitive items, however, should possess several unique properties, foremost of which is that they should change in response to an intervention and remain stable over time when no intervention is present.
Example. Meier (1998) conducted a comparison of traditional and intervention item selection rules (IISR) with an alcohol attitudes scale completed by college students in an alcohol education group and a control group. The intervention and traditional item selection guidelines produced two different sets of items with differing psychometric properties. The intervention-sensitive items did detect pre-post change; these items also possessed lower test-retest reliability in intervention participants while demonstrating stability when completed by controls. In contrast, items evaluated with traditional criteria demonstrated greater internal consistency and variability, characteristics that enhance measurement of stable individual differences. In a study of a symptom checklist completed at intake and termination by students at a college counseling center, Weinstock (1999; cf. Kopta, Howard, Lowry, & Beutler, 1994) found similar differences between intervention-sensitive and traditionally-selected items.
*Administering Tests
Definition: Process in which a researcher administers a test which is completed by a test-taker.
Description. Many research situations include a test administrator and a test-taker. The administrator's primary job is to ensure the standardization (i.e., the establishment of similar test procedures; see below) of the testing environment. Worthen, Borg, and White (1993) suggested several guidelines for administering tests, including (a) checking the physical setting for appropriateness (e.g., adequate lighting, temperature); (b) ensuring that participants know what they are supposed to do; (c) monitoring the test administration; and (d) following any standardized instructions carefully (e.g., as provided with a published test). Test-takers, on the other hand, bring their unique individual differences with them to the testing situation--some of which complicate the standardization effort. We have previously noted the work of Kahn (1996, 1997), who found that how individuals defined the construct they were asked to report (i.e., power in a family) influenced the scores they actually reported.
Recall that we employ test generically, that is, to mean any type of measurement and assessment device. How the test is administered at least partially distinguishes among measurement and assessment types. In self-reports the participants themselves read and respond to researcher-generated items. In interviews the researcher reads items to participants. Requiring fewer resources is the advantage of self-reports (i.e., you do not need an interviewer), while greater depth of understanding (i.e., you can ask respondents to elaborate and they can ask you to clarify) is an advantage of interviewing.
Example. Using a within-subjects, counterbalanced design, Blais, Norman, Quintar, and Herzog (1995) compared two methods of administering the Rorschach projective test (i.e., the Rapaport and Exner systems). The Rorschach consists of administration of 10 inkblots designed to provide ambiguous stimuli. The Rapaport and Exner administrations, which differ mainly in the examiner-examinee seating arrangements and questioning instructions, were given in a randomly assigned order to 20 women with bulimia. Significant differences were found between the two administration systems, with the Exner system producing more color and shading responses. Interestingly, system differences were most prominent on the first presentation of the two administrations. Other research has also shown that Rorschach scores can be changed because of administrators' differing instructions (Exner, 1986).
*Standardization
Definition: Establishment of identical or similar test procedures for each respondent.
Description. Standardization is designed to reduce error by making the test conditions and environment as similar as possible for everyone who takes the test. Conditions could include such procedures as the time to complete the test, the readability of the test, and the order of administration of various subscales or tests.
When students take the GRE or LSAT, for example, no differences should exist in the testing environment. Lighting should be adequate, the temperature should be comfortable, the room should be quiet, and so forth. The use of computers with such tests, discussed further in a subsequent section, raises an interesting issue about standardization. While most test-takers are likely to be familiar with paper and pencil media, the introduction of computers into such testing may represent a significant change in testing conditions for a subgroup of students unfamiliar with computers.
Example. Gay (1990) investigated irregularities in the administration of standardized tests given to grade school and high school students. She surveyed 265 teachers and eight test coordinators and found irregularities in such areas as inaccurate timing, coaching, altering answer sheets, and student cheating. Gay recommended that test administrators review a testing code of ethics and be monitored for proper administration.
*Response strategies
Definition: Processes individuals use to complete test items, problems, and tasks.
Description. Meier (1994) classified the strategies test-takers use to respond to test items into two categories: retrieval and generative. Retrieval strategies involve the recall and reconstruction of information. For example, when you go to your physician for a physical examination, you may be asked whether or not you take any medications, have had previous surgeries, and so forth; to answer these questions, you must recall your past experiences.
When individuals cannot or will not employ retrieval strategies, they use generative strategies that involve the creation of information. Examples of generative strategies include random responding, dissimulation, malingering, and social desirability. When individuals randomly respond to a measurement device they enter answers by chance. With malingering, respondents simulate or exaggerate negative psychological conditions (e.g., anxiety, psychopathology). Respondents who dissimulate attempt to fake good or bad on tests. Socially desirable responses are those that are socially acceptable or present the respondent in a favorable light.
Response sets and response styles represent similar concepts that focus more on motivational than cognitive factors (Lanyon & Goodstein, 1982). With response sets the test-taker distorts answers in an attempt to generate a specific impression (e.g., "I have good work habits for this job"). With response styles there is a distortion in a particular direction regardless of item content. Examples of response styles are acquiescence (i.e., tendency to agree regardless of content) and criticalness (i.e., tendency to disagree regardless of content). Recall our previous discussion of the Career Maturity Inventory (CMI), where false responses to 43 of the 50 attitude items are scored as indicating career maturity.
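One simple screening idea, sketched below with invented item keys and responses, flags respondents who agree with both positively and reverse-keyed items, since such a pattern suggests agreement regardless of content.

```python
# Sketch: a crude acquiescence screen for a scale containing both
# positively and reverse-keyed items. A respondent who agrees (4 or 5 on
# a 5-point scale) with many items of *both* keys is agreeing regardless
# of content. Keys, responses, and cutoffs are invented.
import numpy as np

positive_keyed = [0, 2, 4]   # item indices worded toward the construct
reverse_keyed = [1, 3, 5]    # item indices worded against it

responses = np.array([
    [5, 4, 5, 5, 4, 5],      # agrees with everything: possible acquiescence
    [5, 1, 4, 2, 5, 1],      # agrees with positive items, disagrees with reversed ones
])

for i, row in enumerate(responses):
    agree_pos = np.mean(row[positive_keyed] >= 4)
    agree_rev = np.mean(row[reverse_keyed] >= 4)
    flag = agree_pos >= .5 and agree_rev >= .5
    print(f"respondent {i}: agree(pos)={agree_pos:.2f} agree(rev)={agree_rev:.2f} flag={flag}")
```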
As shown below, it is possible for multiple sources of error such as acquiescence and social desirability to be influencing a single measurement or assessment method. If the method is a test, for example, summing items that contain systematic error scores produces a total score reflecting the construct and error sources (i.e., invalidities).
Response sets partially result from the clarity of item content: the more transparent the item, the more likely that a response set such as social desirability will occur (Martin, 1988). For example, Murphy and Davidshofer (1994) suggest that a question like "I hate my mother" is very clear and invites a response based on its content. If the item is ambiguous, however, then the probability of a response style such as acquiescence increases. Martin (1988) noted that projective tests were partially constructed on the assumption that more ambiguous stimuli would lead to less faking and socially desirable responding. This assumption, however, has not received much empirical support (Lanyon & Goodstein, 1982). Similarly, test experts have debated the usefulness of more subtle but ambiguous items, whose intent may be less transparent to test-takers, but which may also invite acquiescence or criticalness because individuals have little basis on which to respond.
A question like "I think Lincoln was greater than Washington" is less transparent, but a respondent who must generate a response may simply agree because of the positive item wording. Such a respondent might also agree with the statement that "I think Washington was greater than Lincoln." Research tends to favor the validity of obvious items over subtle ones (Meier, 1994). Consequently, the use of subtle items to diminish response sets may increase the likelihood of a response style and thereby diminish test validity.
Generative responses would seem more likely with reactive or transparent tests. Reactivity refers to the possible distortion that may arise from individuals' awareness that they are being observed or are self-disclosing.
Example. Wetter and Deitsch (1996) investigated the consistency of response to the MMPI-2 by persons faking posttraumatic stress disorder (PTSD), persons faking closed-head injury (CHI), and controls. The researchers asked 118 undergraduate students to imagine they were part of a lawsuit in which their faking of psychological symptoms would increase the chances of a large financial award. After reading descriptions of the disorder they were told to fake, participants completed the MMPI-2 twice (at a 2-week interval). Significantly lower reliability coefficients were found for scales completed by individuals faking CHI than for controls or persons faking PTSD.
*Rater errors
Definition: Judgments produced by raters that are irrelevant to the purpose of the assessment.
Description. Given the prevalence of ratings in research, occupational, and educational settings, it is no surprise that investigators have studied a number of different types of rater errors. We summarize the most important types below.
Murphy and Davidshofer (1994) described (a) halo errors, when a rater's overall impressions about the ratee influence ratings of specific aspects of the person; (b) leniency errors, overestimates of ratee performance; and (c) criticalness errors, underestimates of ratee performance. To illustrate the latter two errors, suppose you are an employee who has two supervisors. The figure below displays a frequency count of your actual performance; that is, it summarizes the quality of a large number of your performances. You can see that you have relatively few low or high quality performances, but that most of your work would be rated as of moderate quality. In contrast, Supervisor A's ratings (in box A) are at or below your actual performances, while all of Supervisor B's ratings (in box B) are above your actual work quality. Your supervisors are displaying criticalness and leniency errors, respectively.
Hypothesis confirmation bias is a special type of error committed by researchers, practitioners, educators, and laypersons--in other words, everyone. It refers to the tendency to crystallize on early impressions and ignore later information that contradicts the initial hypothesis (Darley & Fazio, 1980; Jones, Rock, Shaver, Goethals & Ward, 1968). Overshadowing occurs when a rater focuses on a particularly salient aspect of a person or situation (e.g., mental retardation) while ignoring other aspects that may also be important (e.g., mental illness) (cf. Reiss, 1994).
Among the solutions to rater errors are to provide thorough training, calculate interrater reliability and redo ratings if reliability is low, and recheck raters' reliability randomly (cf. Paul, 1986; Meier, 1994).
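A minimal sketch of the interrater reliability check follows; the ratings, the Pearson correlation, and the .80 cutoff are illustrative, and many studies would instead use kappa or an intraclass correlation.

```python
# Sketch: a simple interrater reliability check between two raters,
# re-rating if the correlation falls below an (illustrative) threshold.
# Ratings are invented.
import numpy as np

rater_a = np.array([3, 4, 2, 5, 4, 3, 2, 4, 5, 3], dtype=float)
rater_b = np.array([3, 5, 2, 4, 4, 3, 1, 4, 5, 2], dtype=float)

r = np.corrcoef(rater_a, rater_b)[0, 1]
print(f"interrater r = {r:.2f}")
if r < .80:   # illustrative cutoff
    print("Reliability is low: retrain raters and redo the ratings.")
else:
    print("Reliability acceptable: proceed, rechecking a random subset later.")
```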
Example. Haverkamp's (1993) research provides an example of the hypothesis confirmation bias in a clinical context. She asked 65 counseling students to view a videotape of a counselor and client interaction. Students were provided with problem descriptions generated by the client and also asked to generate hypotheses themselves about the client's problem. After viewing the videotape students were presented with a series of tasks (e.g., what further questions would you ask) designed to determine the frequency of the type of information they were seeking (e.g., confirmatory, disconfirmatory, neutral, other). Haverkamp found that student counselors did not seek to confirm the hypotheses provided by the clients, but did attempt to confirm their own hypotheses about the client. Such an approach, Haverkamp maintained, means that counselors may ignore information that could support an equally plausible explanation and intervention for the client's problem.
*Administrator-respondent relationship
Definition: The degree of rapport and trust established between the test administrator/interviewer and the person taking the test.
Description. Traditionally, the relationship between the administrator and test-taker has been placed in the background. Test developers and publishers do urge administrators to establish rapport with test-takers, but seldom is the presence of this rapport assessed or monitored (cf. Worthen et al., 1993). Research has been conducted to examine the effects of administrator characteristics on respondents (Meier, 1994). Little attention has been paid to the relationship itself, however, because test theorists and developers usually do not consider the relationship an important factor.
In qualitative assessment, the relationship is assumed to influence the honesty and accuracy of information shared by the test-taker (Strauss & Corbin, 1990). That is, to the extent that the test-taker trusts the administrator, the test-taker is more likely to make the effort to produce reliable and valid information.
Example. One way of approaching the issue of administrator and interviewer effects is to compare traditional testing administration to situations where little or no administrator-test-taker interaction occurs. For example, are tests administered or introduced by humans equivalent to computer-administered tests and interviews? In other words, does the automation of test procedures affect the method's reliability and validity? Some researchers have found no differences between traditional and computer-administered versions of tests (e.g., Calvert & Waterfall, 1982). However, some who take computer-administered tests show more anxiety (cf. George, Lankford, & Wilson, 1990), alter their rate of omitting items (Mazzeo & Harvey, 1988), and increase their faking good responses (Davis & Cowles, 1989). Students who have recently taken the computer-administered version of the GRE or similar tests should compare their experiences to other testing situations. Given the equivocal research findings, the equivalence issue currently must be considered on a test-by-test, sample-by-sample basis.
*Scoring
Definition: Method by which test data are assigned to produce a score or category.
Description. Aggregating or summing individual test responses or items is the predominant method of scoring tests. For example, Luzzo (1995) summed college students' answers to the 50-item attitude scale of the Career Maturity Inventory (CMI) and found an average score of 36.84 for the 401 persons who completed it. This means that on average this group answered about 37 of the 50 items in a manner indicating a mature career attitude.
Items, tasks, and ratings can also be weighted (e.g., counted more or less in relation to other items) prior to aggregation. If you were creating a measure of aggression in children, for example, you might have a theoretical reason for assigning more weight to physical acts of violence (e.g., hitting, kicking) than to verbal acts (e.g., insults, threats).
Some test items are not scored per se but are employed as decision trees, where answers direct the tester toward some final decision, typically about diagnosis. Versions of the Diagnostic and Statistical Manual (e.g., American Psychiatric Association, 1980) contain decision trees in which diagnosticians can follow a set of branching questions that leads to a specific diagnosis. For example, the tree for differential diagnosis of Organic Brain Syndromes begins with the question "Disturbance of attention, memory and orientation developing over a short period of time and fluctuating over time?" A Yes answer leads to a possible diagnosis of Delirium, while No branches to the next question, and so forth through the set of possible related diagnoses.
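The sketch below shows scoring as a small decision tree rather than a summed scale; the questions and outcome labels are simplified placeholders, not the actual DSM tree.

```python
# Sketch: a branching decision tree of the kind used for differential
# diagnosis, rather than a summed score. Questions and outcomes are
# simplified placeholders, not the actual DSM decision tree.
def diagnose(answers):
    """Walk a small branching structure of yes/no questions."""
    if answers["acute_fluctuating_disturbance"]:   # first branching question
        return "consider Delirium"
    if answers["memory_impairment"]:               # next question on the 'No' branch
        return "consider a dementia-related syndrome"
    return "continue down the remaining branches"

print(diagnose({"acute_fluctuating_disturbance": True, "memory_impairment": False}))
print(diagnose({"acute_fluctuating_disturbance": False, "memory_impairment": True}))
```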
Example. Computer scoring of tests generally eliminates errors. However, some research procedures require the participant or experimenter to score a test, and here research suggests that a surprisingly high percentage of mistakes can be made. For example, Ryan, Prifitera, and Powers (1983) asked 19 psychologists and 20 graduate students to score WAIS-R (Wechsler Adult Intelligence Scale-Revised) information that had been administered to two vocational counseling clients. They found that regardless of professional experience, participants' scoring of the identical materials produced scores that varied by as much as 4 to 18 IQ points. Other examples of scoring errors with seemingly straightforward procedures abound (Worthen, Borg, & White, 1993). Scoring becomes even more problematic when human judgment is introduced into the scoring procedures, as with many projective tests and diagnostic tasks.
*Aggregation
Definition: Summing or averaging of measurements.
Description. Aggregation often improves the reliability and validity of measurements because random measurement errors cancel or balance each other (Rushton, Brainerd, & Pressley, 1983; Rushton, Jackson, & Paunonen, 1981). Even if systematic errors are present, if they are of a sufficiently different type, they may offset each other. In most instances, then, an aggregated score should reflect the construct of interest better than any single item.
For tests of traits, perhaps the single most effective step you could take to maximize measurement reliability and validity would be to administer your tests on as many occasions as resources allow and then aggregate the scores prior to subsequent analyses.
One problem with aggregation is that you may sum incompatible sources. For example, you may be interested in studying parents' ratings of their children's behavior. It may be that mothers, compared to fathers, have more experience with their children and thus can provide more valid data. Adding fathers' data to mothers' may therefore introduce a source of error.
Example. Epstein (1979; see also Martin, 1988) provided examples of the benefits of aggregation. Epstein asked 45 undergraduates to keep daily records, for 14 consecutive days, of such behaviors as number of social phone calls made, social contacts, headaches, hours of sleep, and similar constructs. Epstein found that the average correlation of these constructs for 1 day with data provided for the 13 other days was quite low (e.g., .09 for hours slept). That is, little relationship existed between behavior on any 1 day and behavior exhibited on the other 13 days. To demonstrate the effects of aggregation, Epstein summed scores for the even and odd days and correlated these groups. For every behavior measured, the aggregated correlations exceeded the 1-day correlations. For example, the correlation between even and odd days for hours of sleep was .84.
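The simulation below is in the spirit of Epstein's demonstration (the generating values are invented): any single day correlates only modestly with the remaining days, while even-day and odd-day aggregates correlate much more strongly.

```python
# Sketch in the spirit of Epstein's demonstration: daily scores are noisy,
# so any single day correlates weakly with the others, but aggregates of
# even and odd days correlate more strongly. Generating values are invented.
import numpy as np

rng = np.random.default_rng(4)
n_people, n_days = 45, 14
stable = rng.normal(8, 1, size=(n_people, 1))               # each person's typical hours of sleep
daily = stable + rng.normal(0, 3, size=(n_people, n_days))  # day-to-day fluctuation

day1_vs_rest = np.corrcoef(daily[:, 0], daily[:, 1:].mean(axis=1))[0, 1]
even = daily[:, 0::2].mean(axis=1)
odd = daily[:, 1::2].mean(axis=1)
even_vs_odd = np.corrcoef(even, odd)[0, 1]

print(f"single day vs. other days: r = {day1_vs_rest:.2f}")
print(f"even-day vs. odd-day aggregates: r = {even_vs_odd:.2f}")
```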
*Factor analysis
Definition: A statistical method for understanding the number and type of constructs influencing a test's score.
Description. Strictly speaking, factor analysis is a method for analysis of test data. Yet factor analysis has been such an important technique in the development of scoring templates for social science measures that we discuss it here.
Test developers assume that any large number of items or tests reflects a smaller number of more basic influences or factors. These factors consist of groups of highly intercorrelated variables (Vogt, 1993). Factor analysis refers to a set of statistical procedures used to examine the relations among items or tests and to produce an estimate of the smaller number of factors that account for those relations.
Two basic types of factor analysis are commonly employed: exploratory and confirmatory. In exploratory factor analysis little or no knowledge is available about the number and type of factors underlying a set of data. Researchers generally employ exploratory factor analysis when evaluating a new set of items. With confirmatory factor analysis knowledge of expected factors is available (e.g., from theory or a previous exploratory factor analysis) and used to compare factors found in a new dataset. A good way to begin learning about factor analytic techniques and their output is through statistical user's manuals as provided by companies like SPSSx and SAS.
Golden, Sawicki, and Franzen (1984) maintained that test developers must understand the theory employed to select items in a factor analysis "since the resulting factors can only be interpreted accurately within the context of a theoretical base" (p. 27). Nevertheless, many, if not most test developers base their item selection only loosely on theory. Gould (1981) similarly criticized the use of factor analysis in the creation of intelligence tests. Gould believes many social scientists have reified intelligence, treating it as a physical entity instead of as a construct. Gould maintained that "such a claim can never arise from the mathematics alone" (p. 250) and that no such evidence exists in the case of intelligence.
One decision that researchers must make during the course of a factor analysis is whether to rotate the factor loadings. If researchers desire their factors to be independent of one another (i.e., orthogonal), the analysis includes an orthogonal rotation (but see Pedhazur & Schmelkin, 1991, for a different perspective). Another issue is deciding how many factors should be extracted during an analysis. One approach is to examine the eigenvalues of the obtained factors; an eigenvalue, computed by summing the squared loadings on a factor, roughly corresponds to the amount of variance that factor explains. A general rule of thumb is that factors with eigenvalues of 1 or more be considered useful.
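The sketch below shows the eigenvalue step on a simulated two-factor item set; a full exploratory factor analysis would also estimate communalities and rotate the retained factors.

```python
# Sketch: eigenvalues of an item correlation matrix and the
# "eigenvalue greater than 1" retention rule. Data are simulated to have
# two underlying factors; a complete analysis would add communality
# estimation and rotation.
import numpy as np

rng = np.random.default_rng(5)
n = 300
f1 = rng.normal(size=(n, 1))                  # first latent factor
f2 = rng.normal(size=(n, 1))                  # second latent factor
items = np.hstack([
    f1 + rng.normal(scale=0.7, size=(n, 3)),  # three items loading on factor 1
    f2 + rng.normal(scale=0.7, size=(n, 3)),  # three items loading on factor 2
])

R = np.corrcoef(items, rowvar=False)          # 6 x 6 item correlation matrix
eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]
print("eigenvalues:", np.round(eigenvalues, 2))
print("factors retained (eigenvalue > 1):", int(np.sum(eigenvalues > 1)))
```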
Example. Blaha and Wallbrown (1996) conducted factor analyses on the Wechsler Intelligence Scale for Children (WISC-III) subtest intercorrelations. Subtests include arithmetic, vocabulary, picture completion, and mazes. Blaha and Wallbrown obtained two and four-factor solutions for four age levels (6-7, 8-10, 11-13, and 14-16 years old). The two-factor results supported a general g factor (defined as an overlap among different assessments of intelligence) as well as two major group factors of verbal-numerical-educational ability and spatial-mechanical-practical ability. The four-factor solution suggested factors of perceptual organization, verbal comprehension, freedom from distractibility, and perceptual speed. Blaha and Wallbrown concluded that these results support the construct validity of the Full Scale IQ of the WISC-III as a measure of general intelligence.
*Interpretation
Definition: Placing measurement data in a context, or making sense of test data.
Description. Test interpretation depends upon all the steps that came before it. That is, the test construction process must have produced a valid test if the interpretation is to be valid; the test must have been administered and scored with a minimum of error during those processes. Since tests are never perfectly valid, interpretation should include statements about the limits of the test as influenced by demonstrated and likely sources of error. Without such statements of limitations, researchers are likely to misinterpret the scores of the measurement methods they employ.
Test interpretation, particularly in educational settings, traditionally has focused on norms. As discussed below, in norm-referenced tests we interpret a test score by comparing it to a group of scores. We can say, for example, that a 3rd grade student's score on an achievement test places her or him at the 90th percentile of performance. Norm-referenced interpretations are typically contrasted with criterion-referenced test interpretations (i.e., comparison to a standard, instead of other persons). That same 3rd grade student may have correctly answered 35 of 40 test items that assessed previously taught material; the teacher may have set a criterion of 30 correct answers for students to pass the course.
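The sketch below contrasts the two interpretations of the same (invented) score: a percentile rank against a norm group versus a pass/fail judgment against a fixed criterion.

```python
# Sketch: the same score interpreted two ways. The norm group scores and
# the criterion cutoff are invented.
import numpy as np

norm_group = np.array([22, 25, 27, 28, 30, 31, 33, 34, 36, 38])  # scores from a norm sample
student_score = 35
criterion = 30                                                    # e.g., 30 of 40 items to pass

percentile = 100 * np.mean(norm_group < student_score)
print(f"norm-referenced: score {student_score} falls at the {percentile:.0f}th percentile")
print(f"criterion-referenced: pass = {student_score >= criterion}")
```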
Other types of interpretations are also useful. With formative tests, interpretation focuses on an individual's performance on the components of an intervention. In a mathematics course, a formative test might provide information about the particular types of addition and subtraction problems a particular student answered correctly and incorrectly. During an intervention, formative tests provide feedback to the intervenor and participant that reveal progress and guide adjustment of the intervention. In education, Cross and Angelo (1988) described this process as a loop "from teaching technique to feedback on student learning to revision of the technique" (p. 2).
Summative tests provide an overall evaluation of an individual's performance in an intervention (e.g., a course grade). Summative tests provide data convenient for administrative decision-making. Summative tests can suggest initial hypotheses relevant to interventions: for example, a standardized achievement test can describe a student's strengths and weaknesses (compared to other students) across subject areas, information that may be relevant to inclusion in or exclusion from an intervention (e.g., a remedial course or repeating a grade). More sensitive measures will be needed to develop and test those hypotheses, however, and it is here that formative tests can be useful (Bloom, Hastings, & Madaus, 1971; Cross & Angelo, 1988). The interpretation of summative tests focuses on an aggregate score (of items and components), while administrators of formative tests tend to examine item response patterns (Bloom et al., 1971).
Example. Much more attention has been paid in the literature to how the test administrator or researcher interprets test scores than how test-takers make sense of them. One exception to this is research on the Barnum effect. Gauging the accuracy of a particular test interpretation depends upon making comparisons with other types of test interpretation.
One standard against which to compare an interpretation is the Barnum effect (named for the famous showman P. T. Barnum, who took advantage of the gullibility of most people in his circus and freak shows). The Barnum effect occurs when individuals take a test and receive test interpretations based not on their test data but on simple generic statements that might apply to anyone, such as the statements that appear in horoscopes ("Work hard today and your efforts will pay off"). Test-takers usually find such bogus feedback as accurate as real interpretations. For example, Guastello and Rieke (1990) compared the rated accuracy of real computer-based test interpretations (CBTIs) based on 16PF scores (a personality inventory) with that of bogus reports. A sample of 54 college students rated the real reports as 76% accurate and the bogus reports as 71% accurate. Computer-based reports are likely to increase the Barnum effect because many people ascribe increased credibility to computer operations.
Integration: Can Researchers Use Tests Unethically?
Practitioners who use tests have ethical standards to follow. In individual testing situations, for example, clinical, counseling, and school psychologists are expected to possess knowledge about (a) basic psychometric principles of the type described in Chapter 3, (b) the substantive area of the assessment (e.g., vocational development in career assessment), (c) the characteristics of the specific test(s) chosen (e.g., reliability and validity estimates for the Career Maturity Inventory; CMI), and (d) the test-taker (e.g., a particular client's relevant history and characteristics). This knowledge is gained through a combination of coursework and practical experiences.
Researchers, however, may employ tests without this depth of knowledge. For example, researchers might wonder if adolescent girls who are vocationally immature are more likely to become pregnant; these researchers might include the Career Maturity Inventory in their study. Given typical training in research, they should possess at least a rudimentary understanding of the scale's basic psychometric properties, but they are likely to lack knowledge of vocational theory and, in quantitative research, of the background of individual research participants.
Should researchers be held to the same standards as practitioners? Practitioners' interpretations of test scores have potentially high stakes for the individuals involved (e.g., referral to remedial education, admission to treatment, loss of parental rights). The consequences for researchers are different. Quantitative researchers usually interpret scores at the level of the study sample, a group; individual participants typically do not receive feedback about their scores. However, if individual participants or a person who knows them (e.g., a family member or teacher) does receive information about an individual's test score, the potential consequences for the individual increase--and the researcher probably now has the same responsibility to the individual as does a practitioner. Similarly, if researchers misinterpret CMI scores and this information is disseminated in some fashion, public policy based on the study may be flawed--and these stakes may be high as well.
*Norms
Definition: Data describing the distribution of scores obtained by a reference group on a particular test.
Description. As we described above, in norm-referenced interpretations the purpose of testing is to compare scores among individuals. Thus, the test is intended to detect individual differences. Gronlund (1988) indicated that developers of norm-referenced tests seek items with the greatest possible variability. With achievement tests, such items are obtained through a selection process that retains items of average difficulty. Easy and difficult items, which nearly everyone passes or fails (cf. Collins, 1991), are likely to be discarded. Aggregating items of average difficulty increases the possibility of making valid distinctions among individuals.
Norm-referenced testing has been the predominant approach in selection testing (Murphy & Davidshofer, 1994). Besides their lower cost, norm-referenced tests also seem more applicable when the test administrator desires to select some portion of a group (e.g., the top 10% of applicants) rather than all applicants who could successfully perform a function. Thus, norm-referenced tests are useful in selection situations where individuals are chosen partly on the basis of scarce resources. Suppose you conduct a research study and find that 95% of all graduate students who score 600 or above on the GRE Verbal scale are able to pass all required graduate school courses. From the perspective of criterion-referenced testing, everyone scoring 600 or above should be admitted. In many graduate departments, however, that would mean admitting more students than available courses, instructors, or financial support could accommodate. Such situations certainly occur in other educational, occupational, and clinical settings with fixed quotas. Norm-referenced testing, then, provides a solution: select the top-scoring applicants, up to the number that available resources can accommodate.
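The difference between the two selection rules can be sketched in a few lines of Python; the applicant scores, the 600-point cutoff, and the three available slots below are hypothetical values chosen only for illustration.

# Criterion-referenced versus norm-referenced selection (hypothetical data).
scores = {"A": 620, "B": 540, "C": 710, "D": 600, "E": 580, "F": 660, "G": 605}

# Criterion-referenced rule: admit everyone at or above the cutoff.
cutoff = 600
admit_by_criterion = [name for name, s in scores.items() if s >= cutoff]

# Norm-referenced rule: admit only as many top scorers as resources allow.
slots = 3
admit_by_rank = sorted(scores, key=scores.get, reverse=True)[:slots]

print(admit_by_criterion)  # ['A', 'C', 'D', 'F', 'G'] -- may exceed capacity
print(admit_by_rank)       # ['C', 'F', 'A'] -- fits the available resources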
If a test is intended to function as a selection device, its items should be developed on a sample representative of the population for whom the test is intended. Thus, the selection of a norm group for test development has serious consequences for the interpretation of future scores compared to that group. Much controversy has occurred over the widespread use of intelligence tests or vocational interest inventories, for example, that were developed and normed on predominantly white, middle class persons.
Other frames of reference are available for interpreting test scores. Ipsative scoring, for example, involves the measurement of differences within individuals rather than between individuals. Ipsative scoring can occur, for example, when a test employs a forced-choice format whose response options are scored on different scales.
Example. Schneider, Parush, Katz, and Miller (1995) examined whether norms of the Miller Assessment for Preschoolers (MAP; Miller, 1982) applied to a Hebrew version of the test administered to 60 Israeli 3- to 5-year-olds. Schneider et al. found no differences on the MAP's total score, but did find that Israeli children scored lower on some subtests. Making sense of such differences is problematic because they may be due to real differences in performance or to differences in the test resulting from the language translation.
*Measurement-related statistics
Definition: Statistics employed to facilitate the interpretation of test scores.
Description. Making sense of test scores often depends at least partially on understanding a number of statistical indices normally computed with tests. For example, test developers usually examine (and present information about) the frequency distribution of all test scores to determine if it is normally distributed. Similarly, developers may present information about the range and standard deviation of scores to examine whether sufficient individual differences exist.
Below are three statistics commonly used during the test interpretation process. Before learning about them, however, we suggest you review basic procedures regarding measures of variability (e.g., range, standard deviation), measures of central tendency (e.g., mean, median, mode), and measures of association (e.g., correlation coefficient). Each of these areas is reviewed in Chapter 11, Introduction to Data Analysis.
A standard score or z score is a transformation of a raw score showing how many standard deviations from the mean that score lies. The formula is:
z = (Raw score - Mean) / Standard deviation
Thus z equals the person's raw score minus the mean of the group of scores, divided by the standard deviation of the group of scores. Frequently the best information that a test score can give us is the degree to which a person scores in the high or low portion of the distribution of scores. The z score is a quick summary of the person's standing: positive z scores indicate that the person was above the mean, while negative scores indicate the person scored below the mean.
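Expressed as a short Python function (the function name and the illustrative raw score, mean, and standard deviation are ours, not values from any particular test), the computation looks like this:

def z_score(raw, mean, sd):
    """How many standard deviations a raw score lies above or below the mean."""
    return (raw - mean) / sd

print(z_score(60, 50, 10))   # 1.0: one standard deviation above the mean
print(z_score(45, 50, 10))   # -0.5: half a standard deviation below the mean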
Other types of standard scores have also been developed, including stanines, deviation IQs, sten scores, and T-scores (Drummond, 1992). T-scores, for example, allow us to translate scores on a test to a distribution with a mean and standard deviation of our choice. T-scores use arbitrarily fixed means and standard deviations and eliminate decimal points and signs (Drummond, 1992). The formula is:
T = (SD * z) + X
where SD is the chosen standard deviation, X is the chosen mean, and z is the standard score for a person's score on a test. For example, we might find it simpler to give feedback using a distribution of scores whose mean is 50 and whose standard deviation is 10. If a person had a score on a test whose z equaled -.5, the T-score would be:
(10 * -.5) + 50 = 45
Tests such as the Analysis of Learning Potential use a fixed mean of 50 and a standard deviation of 20, while the SAT and GRE use 500 as the mean and 100 as the standard deviation (Drummond, 1992). Again, the T-score provides a convenient translation of scores so that they might be more understandable during test interpretation.
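A corresponding sketch of the T-score transformation follows; the default mean of 50 and standard deviation of 10, and the 500/100 scale, come from the passage above, while the function name is ours.

def t_score(z, mean=50.0, sd=10.0):
    """Rescale a z score to a distribution with the chosen mean and standard deviation."""
    return sd * z + mean

print(t_score(-0.5))            # 45, matching the computation above
print(t_score(-0.5, 500, 100))  # 450 on a scale like the SAT or GRE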
Acknowledging that error influences any particular testing occasion, the standard error of measurement (SEM) is the standard deviation of the scores that would be obtained from a series of measurements of the same individual, assuming the individual did not change on the measured construct over that period. For example, assume that we administer a test measuring a stable trait 10 times to a particular person. If that person received the same score on each testing occasion, there would be no error of measurement. In reality, however, the test score would vary from one testing to the next, and the SEM is a statistic designed to summarize the amount of variation. If you have an estimate of a test's reliability, SEM can be calculated as follows:
SEM = Standard deviation * SqRt (1 - r)
Thus, SEM equals the standard deviation of the group of scores times the square root of 1 minus the reliability estimate. SEMs help us know the extent to which an individual's particular test score can be trusted as indicative of the person's true score on the test.
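A minimal sketch of this computation, using the standard deviation of 15 and the reliability estimates that reappear in the worked example later in this section:

import math

def sem(sd, reliability):
    """Standard error of measurement: SD times the square root of (1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

print(sem(15, .90))  # about 4.7
print(sem(15, .70))  # about 8.2; lower reliability means a wider error band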
Finally, the standard error of estimate (SEE) helps us know how trustworthy a test score is for predicting a criterion of some sort. Just as no test produces exactly the same score when administered repeatedly to a person, no single test score corresponds to exactly one score on a criterion. Thus, the SEE refers to the spread of scores around a criterion, or more precisely, the standard deviation of criterion scores for individuals who all have the same score on the predictor test. The formula for SEE is:
SEE = Standard deviation * SqRt (1 - r^2)
SEE equals the standard deviation of the group of criterion scores times the square root of 1 minus the squared validity coefficient. The validity coefficient is simply the correlation between the predictor test and the criterion one is attempting to predict. For example, graduate schools frequently screen candidates on the basis of their GRE scores because GRE scores (the predictor test) have been shown to have a modest correlation with first-year GPA (the criterion). SEE helps us gain a sense of how large the variation around the criterion is likely to be, given an individual's particular test score.
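A parallel sketch of the SEE computation, again anticipating the figures used in the example below (a criterion standard deviation of 100 and validity coefficients of .61 and .30); the function name is ours.

import math

def see(criterion_sd, validity_r):
    """Standard error of estimate: criterion SD times the square root of (1 - r squared)."""
    return criterion_sd * math.sqrt(1 - validity_r ** 2)

print(see(100, .61))  # about 79
print(see(100, .30))  # about 95; a weaker validity coefficient widens the spread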
Example. Let's walk through simple computations of the standard score, SEM, and SEE.
Let's start with the z or standard score. Assume that the following represents a group of test scores. To compute a z score, we need the mean (which equals 87.95) and standard deviation (6.82) for this group of scores.
78 90 95 70 85
88 85 85 90 83
94 95 88 91 99
93 81 94 91 84
If your score on this test was 90, your z score would be:
(90-87.95)/6.82 = .30
A z of .30 indicates you scored slightly above the mean in this group of scores.
On the other hand, if your score was 70, your z score would be:
(70-87.95)/6.82 = -2.63
This z indicates your score was well below the mean.
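These figures can be reproduced with Python's statistics module; note that the 6.82 reported above is the sample standard deviation.

import statistics

scores = [78, 90, 95, 70, 85, 88, 85, 85, 90, 83,
          94, 95, 88, 91, 99, 93, 81, 94, 91, 84]

mean = statistics.mean(scores)   # 87.95
sd = statistics.stdev(scores)    # about 6.82 (sample standard deviation)

print((90 - mean) / sd)          # about 0.30
print((70 - mean) / sd)          # about -2.63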
SEM depends upon the standard deviation and the reliability of the particular test. If we have a test with a reliability estimate of .90 (high) and a standard deviation of 15, then SEM equals:
15 * SqRt (1-.9) = 4.7
Thus, 4.7 represents 1 standard deviation unit for the distribution of scores around the individual's true score. However, if the test's reliability estimate was .7, SEM increases:
15 * SqRt (1-.7) = 8.21
Thus, the lower the reliability of the test, the less confidence we have that an individual's true score is close to the actual score obtained.
Finally, with SEE, we need the correlation between the test and criterion as well as the standard deviation for the group of criterion scores. If the correlation between test and criterion equaled .61, and the standard deviation for the criterion scores equaled 100, then SEE would be:
100 * SqRt (1-[.61*.61]) = 79
Thus, 79 represents 1 standard deviation unit around the criterion score. However, if the correlation between predictor and criterion dropped to .30, the SEE would increase:
100 * SqRt (1-[.30*.30]) = 95
Thus, the lower the correlation, the less confidence we have that the criterion score we predict is the true score the individual would actually obtain.
*Criterion-referenced interpretations
Definition: Interpreting a test score in relation to a criterion or pre-established level instead of other persons.
Description. Suppose an individual received a score of 95% on a classroom test. What does that mean? A norm-referenced interpretation would describe the score relative to other test-takers--for example, that the student scored higher than 94% of the rest of the class. A criterion-referenced statement would be "correctly completed 95 of 100 questions." Criterion-referenced interpretations simply describe performance in relation to a standard other than other persons.
With criterion-referenced tests, items are retained during test development because of their relation to a criterion, regardless of the frequencies of correct or incorrect responses. Criterion-referenced tests, however, cost more than norm-referenced tests because the former (a) require considerable effort in the analysis and definition of the performance criteria to be measured and (b) may necessitate special facilities and equipment beyond self-report materials. If one is interested in predicting performance on a criterion--the major purpose of selection testing--then criterion-referenced approaches would seem a logical choice. If one is interested in knowing whether a person can shoot a basketball, it usually makes more sense to give her or him 20 shots than a test of eye-hand coordination.
During item development of criterion-referenced tests, Swezey (1981) emphasized the importance of precisely specifying test objectives. Criteria can be described in terms of variables such as product or process, quality, quantity, time to complete, number of errors, precision, and rate (Gronlund, 1988; Swezey, 1981). A criterion may be a product such as "student correctly completes 10 mathematics problems"; a process criterion would be "student completes division problems in the proper sequence." Process measurement is useful when diagnostic information is required, when the product always follows from the process, and when product data are difficult to obtain.
Criterion-referenced tests should be reliable and valid to the extent that performances, testing conditions, and standards are precisely specified in relation to the criteria. For example, Swezey (1981) preferred "within 5 minutes" to "under normal time conditions" as a precise testing standard. In some respects, the criterion-referenced approach represents a move away from a search for general laws and toward a specification of the meaning of test scores in terms of important measurement facets. Discussing test validity, Wiley (1991) presented a similar theme when he wrote that the labelling of a test ought to be "sufficiently precise to allow the separation of components of invalidity from valid variations in performance" (p. 86). Swezey's and Wiley's statements indicate the field's increasing emphasis on construct explication.
Example. Gentile and Murnyack (1989; see also Gentile, 1990) described a set of criteria for grading students' performance on art criticism assignments. They noted that art criticism is a complex analytic skill requiring students to evaluate and interpret their own and others' art work. Gentile and Murnyack suggested a 50-point rating system for evaluating students' assignments:
(a) applies critical thinking criteria (0-10), (b) employs technical vocabulary (0-10), (c) provides feedback according to criteria (0-10), and (d) presents the criticism (0-10).
Gentile and Murnyack (1989) suggested a possible passing grade of 35 points. Students who scored lower would revise and resubmit their paper based on the instructor's feedback on these criteria.
Suppose you are asked to develop a method for predicting individuals' performance in an applied setting such as a school or workplace. Which method would you use? If you decided to review the literature before making a decision, you would find that both human judges and tests have frequently been employed to make such decisions. For example, faculty in some graduate departments conduct interviews with student applicants before making judgments about admission; other departments rely exclusively on tests like the Graduate Record Examination to make these decisions.
The contrast between human judge and test is the essence of what has been termed the clinical versus statistical prediction debate (Dawes, Faust, & Meehl, 1989; Meehl, 1954, 1957, 1959; Tracey, 1991; Wiggins, 1973). Clinical prediction refers to the use of human judgment (which may include data from tests) to predict an event or behavior. Statistical or actuarial prediction involves combining test scores by a formula or equation, without human judgment, to make the prediction.
In psychological contexts, clinical-statistical prediction studies typically take the following form. For statistical prediction, one might obtain test scores (e.g., MMPI and WAIS scores) and diagnoses for a group of individuals. The researcher would then employ the scores in a multiple regression equation to predict diagnosis. Clinicians are provided the same test information employed in the statistical equations, perhaps along with case descriptions, and asked to provide diagnoses. The success rates of the two prediction methods are then compared. Most studies support the superiority of statistical prediction.
Note that clinical and statistical methods can be employed in two phases. First, one can collect data through clinical means (e.g., observation, interviews) as well as through statistical or mechanical procedures such as self-report tests. Second, one must choose how to combine the collected data, using either clinical judgment or statistical procedures. While no advantage can be claimed for either clinical or mechanical collection procedures, statistical combination of data does seem to outpredict human judges' assessments.
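The statistical combination step can be illustrated with a minimal Python sketch; the predictor scores and criterion values below are entirely hypothetical, and an ordinary least-squares equation stands in for whatever actuarial formula a given study might use.

import numpy as np

# Hypothetical past cases: two predictor scores per person and a criterion (e.g., GPA).
X = np.array([[600, 3.2], [520, 2.8], [680, 3.9], [550, 3.0], [630, 3.5]], dtype=float)
y = np.array([3.4, 2.9, 3.8, 3.1, 3.6])

# Statistical (mechanical) combination: fit a linear equation to the past cases...
A = np.column_stack([X, np.ones(len(X))])   # append an intercept term
weights, *_ = np.linalg.lstsq(A, y, rcond=None)

# ...then apply that same equation, unchanged, to each new case.
new_case = np.array([580.0, 3.1, 1.0])
print(new_case @ weights)                   # predicted criterion score

# Clinical prediction would instead ask a human judge to weigh the same
# information case by case; studies in this literature compare the accuracy
# of those judgments with predictions from equations like this one.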
Sarbin (1942) was one of the first researchers to directly compare clinical and statistical prediction procedures. Sarbin examined the relative accuracy of methods of predicting 162 freshmen's first-quarter GPAs. University counselors had access to interview data, personality, aptitude and career test scores, high school rank, and college aptitude test scores. The statistical prediction was made on the basis of the high school rank and college aptitude score. Sarbin calculated the following correlations, reported by student gender:
                         Men    Women
Clinical prediction      .35     .69
Statistical prediction   .45     .70
The statistical procedure shows a slight advantage. A university administrator interested in selecting students for admission from a large group of applicants would be wise to employ the statistical procedure. A clinician dealing with a single client would probably see little practical difference between the two. Even so, the statistical procedure would require less time than the clinical one, and for that reason alone it would likely be employed. Few contemporary clinicians would be interested in using clinical prediction to answer a question about GPA or other cognitive abilities where tests have demonstrated predictive validity (McArthur, 1968). Clinicians would, however, be interested in clinical procedures for the many personality and related areas (such as interpersonal skills) where such validity is lower or absent.