Promising Approaches
Traditional Approaches
Measurement of Personality and Temperament
Act frequency analysis
The Big Five
Predicting occupational criteria
Clinical measures
Cognitive abilities and g
Neuropsychological testing
Aggregation
Summary
Statistically-Oriented Approaches
Item Response Theory
Generalizability Theory
Does Measurement Equal Statistics?
Do the Measurement Properties of Objects Affect the Type of Statistical Procedure Employed?
Summary
Cognitive Approaches
Task Decomposition
Protocol Analysis
Item Characteristics
Item ambiguity
Response formats
The interaction of cognitive ability and item characteristics
Constructivist assessment
Summary
Behavioral Approaches
Assessment in Mental Health Institutions
Behavioral Physics
Simulations, Structured Assessments, and Work Assessment Centers
Summary
Computer-Based Approaches
Response Latency
Human Speech
Simulations
Summary
Summary and Implications
Note. Portions of Chapters 6 and 8 are reprinted from Meier (1993) by permission of the American Psychological Association.
This chapter contains descriptions of important existing and innovative approaches in psychological measurement and assessment. I discuss theory and methods in terms of five major categories: traditional approaches (e.g., MMPI and neuropsychological assessment), statistically-oriented approaches (e.g., item response theory), cognitive approaches, behavioral assessment, and computer-based methods.
Traditional Approaches
Traditional measurement devices such as the WAIS-R, Rorschach and MMPI-2 continue to be widely used by psychologists to assist in selection decisions in education, mental health, medicine, business and industry, and the legal system. As many authors have noted (Buros, 1970; Gynther & Green, 1982; Hathaway, 1972; Jackson, 1992; Martin, 1988; Murphy & Davidshofer, 1988), few innovations introduced since the 1940s have been powerful enough to alter the use of traditional tests and procedures employed by psychological practitioners and researchers.
Traditional tests share certain methods, concepts, and statistical assumptions. In classical test theory, an observed test score is composed of true score and error. The true score usually represents a trait, a relatively enduring personal characteristic that influences behavior across settings. The goal of classical test theory is to maximize the true score component and minimize error. During test construction and evaluation, classical measurement approaches attempt to identify and minimize error through statistical methods. Typically, self-report items completed by many individuals are aggregated to produce estimates intended to discriminate among traits and among individuals.
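The decomposition can be made concrete with a brief simulation. The sketch below (illustrative values only, assuming normally distributed true scores and errors) generates observed scores as true score plus error; the ratio of true-score variance to observed-score variance, and the correlation between two hypothetical parallel forms, both recover the test's reliability.

```python
# Minimal sketch of classical test theory's X = T + E (illustrative values only).
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
true_scores = rng.normal(50, 10, n)    # T: relatively enduring trait level
error_1 = rng.normal(0, 6, n)          # E: random error on form 1
error_2 = rng.normal(0, 6, n)          # independent error on a parallel form 2

observed_1 = true_scores + error_1     # X = T + E
observed_2 = true_scores + error_2

reliability = true_scores.var() / observed_1.var()       # var(T) / var(X)
parallel_r = np.corrcoef(observed_1, observed_2)[0, 1]   # approximates reliability

print(f"reliability as a variance ratio: {reliability:.2f}")
print(f"parallel-form correlation:       {parallel_r:.2f}")
```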
Measurement of Personality and Temperament
Act frequency analysis. A central problem in personality psychology has been the identification of important personality traits (e.g., Cattell, 1946, 1957; Eysenck, 1947; Fiske, 1949; McCrae & Costa, 1985). Buss and Craik (1983; see also Angleitner & Demtroder, 1988) described a procedure they termed an act frequency analysis. In contrast to factor analysis, which attempts to reduce and isolate traits by focusing on items that covary, an act frequency analysis involves a series of studies to identify prototypic acts characteristic of trait categories. Subjects first nominate acts (e.g., starting a fight) that are examples of a trait or disposition (e.g., aggressiveness). A group of judges evaluates these acts to produce a smaller set most representative of the specific disposition. Subjects then complete the item pool to determine if the items possess sufficient frequency (i.e., a base rate greater than zero) and variation (i.e., individual differences in frequency) in the sample to be meaningful.
Hogan and Nicholson (1988) suggested that a test of a disposition could be correlated with ratings of these prototypic acts to gauge the test's construct validity. Some evidence, however, suggests that act frequency data may be particularly susceptible to socially desirable responding (Block, 1989; Botwin, 1991). Act frequency analysis appears to be a type of construct explication based on natural language rules.
The Big Five. Given the desire for a taxonomy of important traits, personality psychologists have reached a consensus about what are termed the Big Five factors (Cattell, 1946; Goldberg, 1990; Digman, 1990; McCrae & Costa, 1989; Norman, 1963; Tupes & Christal, 1961; Wiggins & Pincus, 1989). These orthogonal or independent factors--neuroticism, extraversion, openness to experience, agreeableness, and conscientiousness--have been proposed as nomothetic structures to guide the measurement of personality and interpersonal behavior. Factor analyses of trait descriptors, produced by different methods (e.g., self-report and ratings by others) and with different samples (including cross-cultural), have resulted in the identification of five factors (Botwin & Buss, 1989; John, Angleitner & Ostendorf, 1988; McCrae & Costa, 1985, 1987b). Two major components of the circumplex model of interpersonal behavior, dominance and nurturance, have also been connected to the Big Five factors of extraversion and agreeableness (Trapnell & Wiggins, 1990).
Empirical results, however, provide some ambiguity about the Big Five model (Botwin & Buss, 1989; Block, 1995; Briggs, 1992). Botwin and Buss (1989), for example, instructed 59 couples to report self and other data about previously performed behaviors corresponding to the five-factor model. Factor analyses of self and partner data yielded similar results. Botwin and Buss concluded that the resulting factors, labelled responsible-stable, insecure, antagonistic-boorish, sociable, and culture, departed substantially from the five-factor model. When ratings data were adjusted for the frequency level of the behaviors, however, the resulting factors closely matched the five-factor model.
Predicting occupational criteria. As noted in Chapter 1, interest in the measurement of personality and temperament began when limits were found to intelligence tests' ability to predict school and vocational success. Although some research has suggested that affective variables may influence cognitive development (Anastasi, 1985), personality and temperament researchers have continued to develop such measures in the hope that they might complement intelligence in such predictive tasks. Given the discrepancies between constructs measured by personality tests and constructs represented in many occupational criteria, it is unsurprising that such efforts have often been unsuccessful (Guion & Gottier, 1965; Schmitt, Gooding, Noe, & Kirsch, 1984).
Hough et al. (1990; see also Hogan & Hogan, 1989; Lowman, 1991) developed an inventory to measure six constructs previously found to be useful predictors of job-related criteria. The six constructs--surgency, adjustment, agreeableness, dependability, intellectance, and affiliation--are closely aligned with the Big Five personality factors described above. Hough et al. administered this instrument to over 9,000 military personnel to obtain correlations between scale scores and criteria which included knowledge tests, supervisory ratings, letters of recommendation, and disciplinary actions. Temperament scales correlated in the .15-.25 range with such criteria as effort, leadership, personal discipline, and physical fitness. Correlations with technical knowledge criteria were near zero.
Hough et al. (1990) also examined self-report distortion, a problem thought to interfere with real-world personnel selection. Hough et al. constructed scales to measure social desirability, poor impression, self-knowledge, and random responding. These validity scales did significantly correlate with job criteria, although at lower levels than the temperament scales. Hough et al. also found that random responding reduced predictor-criteria correlations, with random responders averaging approximately .07 lower than careful responders. The social desirability and poor impression scales also demonstrated some moderation of predictor-criteria correlations.
Clinical measures. Graham (1990) listed a number of problems that led to the recent revision of the MMPI, including (a) a small, relatively homogeneous initial norm group of 724 persons, and (b) items that contained archaic, inappropriate or sexist language. The revised test, the MMPI-2, contains a more representative normative sample of 2,600 subjects and a relatively small number of altered items.
Butcher (1990) described the MMPI-2 as a useful screening device to provide information about an individual's strengths and weaknesses. Only weak evidence is available, however, supporting use of the MMPI in decisions about assignment to psychological treatments (e.g., Alker, 1978). Treatment-related data that could be provided by the MMPI-2, Butcher (1990) indicated, include symptoms of which an individual was unaware as well as motivations, fears, and defensive styles. As an example, Butcher (1990) cited a psychotherapy client who failed to disclose significant problems with substance abuse that were revealed by an MMPI-2. Butcher (1990) also suggested that the MMPI-2 could be employed during therapy for the purposes of accountability and monitoring of progress. He referred to a 29-year-old man who presented with problems of occupational stress. His initial MMPI-2 scores showed elevations on scales 2 (depression) and 7 (psychasthenia). After eight sessions of supportive and problem-solving therapy, scale 7 showed a significant decline.
Butcher (1990) maintained that "clinicians and researchers using the MMPI in treatment evaluation have long been aware of the stability of MMPI profiles over time" (p. 10). Graham (1990) reviewed data pertaining to stability and to other psychometric properties of the MMPI-2. One-week test-retest correlations for 13 MMPI-2 scales, as cited in the MMPI-2 manual (Hathaway et al., 1989), ranged from .58 to .92 with a median value of .81. In contrast, the stability of a more frequently used MMPI score, the two-point code (i.e., the highest two scales), was fairly low: only about one-fourth to one-third of subjects in research studies show the same two-point code. Most MMPI-2 scales also appear to be multidimensional given that a sample of coefficient alphas for 13 MMPI-2 scales ranged from .34 to .87 with a median of .625. Two major factors that emerge from factor analyses of the MMPI-2 are maladjustment-psychoticism and neurotic characteristics.
Another traditional instrument, the Rorschach, has enjoyed a resurgence as a result of efforts by Exner (1978, 1986) and colleagues to establish more standardized procedures for administering and scoring the instrument. Published in 1921, the Rorschach was also developed to assist in differentiating between normal and clinical groups (Groth-Marnat, 1990). During the test, examinees provide a description of what they see in the inkblots and then clarify their responses (i.e., explain why the inkblot looks like a certain object). The major assumption is that how test-takers organize and respond to ambiguous stimuli in the testing situation reflects similar processes in nontest environments.
The clinician must attend to both the object description and the clarification, which Exner (1978) termed the articulation of the response, to understand the individual response process. Exner (1978) emphasized the multiple factors that influence examinee responses:
Parker, Hanson, and Hunsley (1988) compared the published psychometric properties of the MMPI and Rorschach. Using a meta-analytic approach, they collected data about test reliability (including internal consistency and rater agreement estimates), stability (test-retest), and convergent validity (correlations with relevant criteria). Interestingly, Parker et al. found an insufficient number of discriminant validity reports to be able to report a comparison of the two instruments in this category. Parker et al. (1988) combined test subscales to produce the following psychometric estimates: (a) for reliability, an overall r of .84 for the MMPI and .86 for the Rorschach; (b) for stability, an overall r of .74 for the MMPI and .85 for the Rorschach; and (c) for convergent validity, an overall r of .46 for the MMPI and .41 for the Rorschach. Parker et al. concluded that despite the MMPI's reputation as the superior instrument, both the MMPI and Rorschach appear to possess comparable psychometric values.
Cognitive abilities and g. Although many psychologists "act as though 'intelligence is what intelligence tests measure'...few of us believe it" (Sternberg, 1984, p. 307). The construct validity of cognitive traits and abilities remains a topic of interest in the seminal measurement area of intelligence testing (Lohman, 1989). While measures of cognitive abilities can predict educational and occupational performance (e.g., Austin & Hanisch, 1990; Hunter & Hunter, 1984), what these tests actually measure remains in some doubt. On the basis of strong positive correlations among intelligence measures, Spearman introduced g as a general factor of intelligence (Nichols, 1980). Thurstone (1938) and Guilford (1967) believed that more specific group factors accounted for the operations of cognitive abilities, a conceptualization that some researchers believe is supported by the findings of contemporary cognitive psychology. Ascertaining the structure of intellect remains an important but elusive goal for those who desire to improve the measurement of cognitive abilities and skills.
Gould (1981) claimed that intelligence testing is a misnomer because no one has yet demonstrated that what intelligence scales measure is a unilinear, single phenomenon. Despite this fact, Gould suggested that Terman's original work became the standard against which test developers compared new scales which they then labelled intelligence tests. Thus, "much of the elaborate statistical work performed by testers during the past fifty years provides no independent confirmation for the proposition that tests measure intelligence, but merely establishes correlation with a preconceived and unquestioned standard" (Gould, 1981, p. 177).
What cognitive ability tests measure has also been central to more recent controversies about the roles of race and ethnicity in ability scores. Humphreys (1992) summarized research on differences between Blacks and Whites which indicated that: (a) Whites as a group score about 1 standard deviation higher, (b) Blacks as a group have demonstrated improvement on national educational measures in such areas as reading, mathematics and science, and (c) ability tests equally predict Black and White success in education, industry and military service. Dana (1993) noted, however, that group differences tend to be reduced when the tested sample is matched on sociodemographic variables and that factor analyses of intelligence tests tend not to produce the same number of factors or factor structures across cultures. Dana (1993) indicated that changes in the use of traditional cognitive ability tests are more likely to result from political than psychometric considerations:
(a) such tests do make excellent predictions in many domains, including school and occupational performance;
(b) such tests are correlated with socioeconomic and familial factors;
(c) SES and background factors are associated with different opportunities to learn material measured by such tests, thus handicapping some individuals' scores on the tests;
(d) work in cognitive psychology has improved theory about such tests but has yet to improve substantially the tests themselves.
Cognitive psychologists have applied their theory and experimental methods to ability testing (e.g., Carroll, 1992; Sternberg, 1988), although this merging is still in its initial stages (Hunt, 1987). As described later in this chapter, cognitive psychological theory and methods also appear to have influenced other types of measurement.
Neuropsychological testing. Goldstein (1990) cited Reitan as describing neuropsychological tests as those that are "sensitive to the condition of the brain" (p. 197). Neuropsychological testing refers to a wide range of measures employed to discover brain damage. Neuropsychology became particularly important during World War II when thousands of brain-injured soldiers required assessment and rehabilitation (Gregory, 1992). When it became apparent that brain injury was related to performance on psychological tests, neuropsychologists became successful heirs to the tradition started by early psychologists who sought relations between physiology and psychology.
Because neuropsychology has roots in a variety of disciplines, many different techniques are employed as neuropsychological tests (Franzen, 1989). These tests evaluate individuals' capacity for sensory input, attention and concentration, memory, learning, language, spatial ability, reasoning and logical analysis, and motor skills (Bennett, 1988; Gregory, 1992). While brain injuries can affect all of these areas, a one-to-one relation between an injury and a dysfunction is seldom apparent (Lynch, 1985). Two important consequences are that (a) diagnosis often involves score profiles, with their accompanying psychometric difficulties (Murphy & Davidshofer, 1988), and (b) a thorough patient history and interview are still required to make sense of the neurological information provided by different tasks and sources (Lynch, 1985).
The most common change after brain injury is a general intellectual impairment, that is, the patient seems less bright or less capable of abstract thinking (Goldstein, 1990). Consequently, cognitive tasks and tests such as the WAIS-R and Wechsler Memory Scale are frequently employed for screening, as are perceptual and motor skills tests such as the Bender-Gestalt. The Luria-Nebraska Neuropsychological Battery (Golden, Hammeke, & Purisch, 1980) and the Halstead-Reitan Battery (Reitan & Wolfson, 1985) present a more extensive set of tasks assessing a larger domain of neuropsychological categories.
Given the strong links between neuropsychological and cognitive ability testing, it is not surprising that questions have also arisen about the construct validity of neuropsychological tests. While reliability estimates are as high as those exhibited by cognitive ability tests (Franzen, 1989) and neuropsychological tests do discriminate between brain-injured and other individuals, what the neuropsychological tests actually measure remains in doubt (Gregory, 1992; Kolb & Whishaw, 1990). This situation may be alleviated more quickly in neuropsychological testing than in other measurement areas because of the sophisticated validation criteria, such as brain imaging techniques, that are becoming increasingly available (Goldstein, 1990). It is also likely that theoretical progress in cognition and neuropsychology will continue to benefit measurement and assessment in both areas (Goldstein, 1990).
Aggregation. Aggregation refers to the summing or averaging of multiple measurements. Spearman (1910, cited in Rushton et al., 1983) appears to have been the first psychologist to note the advantages of obtaining multiple measures, although Rushton et al. (1983) observed that averaging of multiple observations was the solution adopted by astronomers to handle Maskelyne's problem of individual differences in observing star transits.
The chief contemporary architect of this work has been Epstein (1979, 1980, 1983, 1990). His research has focused on the effects of aggregating measurements of such variables as unpleasant emotions, social behavior and impulsivity. While acknowledging evidence that behavior varies as a result of situational variables, his studies indicate that averaging measurements of behavior, self-reports, and ratings by others over time dramatically decreases measurement error and increases validity coefficients well above the average .30 ceiling. Measurements can also be aggregated over sources and methods (Martin, 1988). In terms of classical test theory, aggregation works because behavioral consistencies accumulate over multiple measurements while random errors do not (Rushton et al., 1981).
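A short simulation illustrates why aggregation works under these assumptions. In this hypothetical sketch, each measurement occasion adds a large random "situational" component to a stable trait; a single occasion correlates only modestly with the trait, while the mean of ten occasions correlates much more strongly, as classical test theory and the Spearman-Brown logic predict.

```python
# Hypothetical sketch: aggregating measurements over occasions reduces random error.
import numpy as np

rng = np.random.default_rng(1)
n_people, n_occasions = 500, 10
trait = rng.normal(0, 1, n_people)    # stable individual differences

# Each occasion = trait + large situational/error component.
occasions = trait[:, None] + rng.normal(0, 2, (n_people, n_occasions))

single = occasions[:, 0]              # one measurement
aggregated = occasions.mean(axis=1)   # mean of ten measurements

print(f"single occasion vs. trait:   r = {np.corrcoef(single, trait)[0, 1]:.2f}")
print(f"ten-occasion mean vs. trait: r = {np.corrcoef(aggregated, trait)[0, 1]:.2f}")
```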
Although Epstein has presented his research as demonstrating renewed support for the existence of psychological traits, others have not fully accepted his arguments (Day, Marshall, Hamilton, & Christy, 1983; McFall & McDonel, 1986; Mischel & Peake, 1982). McFall and McDonel (1986), for example, criticized Epstein for failing to: (a) control for method variance, that is, Epstein's measures were primarily self-reports; (b) demonstrate discriminant validity for personality measures; and (c) notice, in one study, that the best predictor (r = .80) of actual amount of time spent studying was not aggregated personality items, but a single item asking students to rate (from 1 to 5) how much time they typically study.
Rushton et al. (1983) suggested that insignificant correlations in psychological research may partially result from a failure to aggregate measurements. They reviewed research in such areas as judges' ratings, cross-situational consistency, personality stability, and cognition-behavior relations which indicated that aggregation improved validity estimates. Rushton et al. noted that researchers typically find low correlations between measures of attitudes and behavior. However, Fishbein and Ajzen (1974) observed that such correlations may be increased by aggregating different measures of behaviors: attitude scales and single behaviors correlated around .15, while aggregated measures correlated in the .70-.90 range.
Aggregation, however, may be misleading when components of the aggregated data differ substantially in reliability and validity. Martin (1988; see also Epstein, 1981) provided an example where a mother may accurately rate a child as moderately emotionally expressive, whereas the father, because of his emotional difficulties, rates the child as extremely expressive. The aggregation of mother's and father's score may be less correlated with other variables (such as teachers' ratings) than the mother's score alone. In addition, the discrepancy score between mother's and father's ratings may contain useful clinical information about the family system. Martin (1988) suggested that when discrepancies occur--whether across source, time, setting, or methods--the assessor must produce an explanation.
Summary. Many of the traditional approaches described above grew out of a history of testing aimed at identifying stable traits for selection purposes. Except when political and legal forces intervene, these traditional approaches, given their demonstrated effectiveness and fairness, are likely to be increasingly employed in the future for such decisions. For example, if pressure to reduce health care costs continues to shrink mental health benefits, it would not be surprising to see traditional tests such as the MMPI used to screen individuals to determine the degree of psychological disturbance (cf. Goldberg, 1965, 1968) and thus decide the amount or type of treatment they subsequently receive. As will be discussed in more detail in the next chapter, however, I expect traditional tests to be employed less frequently in the future for guiding psychological interventions.
Statistically-Oriented Approaches
Glass (1986) suggested that psychometric investigations began to move away from the mainstream of psychology around 1940. To put it simply, statistically-oriented approaches attempt to make sense of existing measurement data through statistical analyses and transformations, while more theoretical approaches are concerned with understanding how the data came to be created in the first place (i.e., a search for causes). Statistically-oriented approaches such as item response theory (IRT) do tend to assume that the psychological entities being measured are trait-like. IRT, confirmatory factor analysis and structural equation modeling (Joreskog, 1974), and to some extent, generalizability theory (GT), can be included in this category.
Item Response Theory (IRT)
Of all the measurement approaches described in this chapter, IRT has generated the most interest among test developers. Given this enthusiasm, it is not surprising that a number of descriptions of various IRT theories and procedures exist (e.g., Hambleton, Swaminathan, & Rogers, 1991; Lord, 1980; Lord & Novick, 1968; Rasch, 1980; Weiss, 1983; Wright & Stone, 1979). Hambleton & Swaminathan (1985) traced initial IRT work through the 1930s and 1940s, but noted that publications by Lord (1952, 1953a, 1953b) are generally credited with providing the modern impetus for IRT development.
IRT proposes that item responses allow inferences about hidden or latent traits. IRT assumes that a relation exists between an observable item response and the unobservable trait which underlies performance and that this relation, in cognitive ability items, can be described by a mathematical function, a monotonically increasing curve (see Figure 24). Analyses based upon this ideal curve have demonstrated considerable utility in test development and item analysis (Cronbach, 1991a).
----------------------------------------
Insert Figure 24 About Here
----------------------------------------
For example, two different item characteristic curves (ICCs) can be expected for discriminating and non-discriminating verbal ability items. With a discriminating item, persons with good verbal skills should be more likely to correctly answer the question. A non-discriminating item, however, would show no difference between persons of high and low verbal ability. Similarly, persons at the same ability level should provide different answers to items of different difficulty levels; for example, two persons of moderate ability should both answer an easy item correctly and a difficult item incorrectly. Identification of poorly discriminating items allows their deletion with no loss of measurement precision. In addition, IRT analyses permit the development of a calibrated item bank. From this bank may be drawn subsets of items which yield comparable latent trait scores, a useful benefit in many measurement applications.
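The ideal curves in Figure 24 can be written as a simple function of ability. The sketch below uses the two-parameter logistic model, one common IRT formulation; the model choice and parameter values are assumptions made for illustration, not the chapter's. A steeply rising curve separates low- from high-ability examinees, whereas a nearly flat curve does not.

```python
# Illustrative item characteristic curves under an assumed two-parameter logistic model:
# P(correct | theta) = 1 / (1 + exp(-a * (theta - b))), a = discrimination, b = difficulty.
import math

def icc(theta, a, b):
    """Probability of a correct response at ability level theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

for theta in (-2, 0, 2):  # low, moderate, and high ability
    steep = icc(theta, a=2.0, b=0.0)   # discriminating item: probability climbs with ability
    flat = icc(theta, a=0.1, b=0.0)    # poorly discriminating item: nearly the same for everyone
    print(f"theta = {theta:+d}   discriminating item: {steep:.2f}   poor item: {flat:.2f}")
```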
The IRT test developer collects item data and compares them to the statistical model proposed for those items. If the fit is good, then: (a) the resulting ability estimates are not dependent upon a particular test (because they are based upon the underlying trait, not the test), and (b) the resulting item estimates are not dependent upon a particular sample (because the slope and intercept of the item characteristic curve remain the same regardless of which particular individuals contribute scores). Item and ability parameters produced through IRT analyses are invariant because they include information about items in the ability estimation process and about examinees in the item estimation process.
Item subsets with comparable scores are necessary for another IRT application, computer adaptive testing (CAT). IRT indicates that a test most accurately measures ability when test difficulty level and examinee ability level are closely matched; easier items do not provide useful information about high ability examinees and difficult items do not provide useful information about low ability examinees. With CAT, an individual first completes a subset of items. Based upon those responses, the testing program selects more difficult or easier items to better fit the examinee's abilities. The program administers items with known difficulty and discrimination levels until the standard error of the examinee's ability score reaches a specified level or stops decreasing by a predetermined amount. CAT produces an estimate of individuals' ability with fewer items, providing a more efficient measurement method, particularly for examinees of very low or high ability. Thus, CATs produce shorter tests of equal measurement precision as well as greater test security, a testing pace set for the examinee that may minimize frustration, and reduced time for test supervision (Hambleton et al., 1991). Adaptive testing is not intrinsically dependent upon automation--Binet introduced the procedure (Dawis, 1992)--but with CAT, computers facilitate the administration and storage of test data.
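The adaptive logic can be sketched schematically. In the toy example below, the program repeatedly administers the unused item whose difficulty is closest to the current ability estimate and nudges the estimate up or down after each response. Operational CAT systems use maximum-likelihood or Bayesian scoring and information-based item selection; the simple staircase and the small item bank here are assumptions made only to show the flow.

```python
# Schematic (not operational) sketch of computer adaptive testing.
import math
import random

random.seed(0)
item_bank = [{"difficulty": d} for d in (-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0)]
true_ability = 0.8        # hypothetical examinee
theta, step = 0.0, 1.0    # starting ability estimate and adjustment size

for _ in range(6):
    # Administer the remaining item whose difficulty best matches the current estimate.
    item = min(item_bank, key=lambda i: abs(i["difficulty"] - theta))
    item_bank.remove(item)
    # Simulate a response: correct answers are more likely when ability exceeds difficulty.
    p_correct = 1.0 / (1.0 + math.exp(-(true_ability - item["difficulty"])))
    correct = random.random() < p_correct
    theta += step if correct else -step
    step *= 0.6           # shrink adjustments as information accumulates
    print(f"difficulty {item['difficulty']:+.1f}  correct={correct}  theta -> {theta:+.2f}")
```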
Although many IRT examples can be found in educational measurement, a few applications of the theory with attitude and personality measurement are scattered throughout the psychological and educational measurement literatures (e.g., Gibbons, Clark, Cavanaugh, & Davis, 1985; Hoijtink, 1990; Koch, Dodd, & Fitzpatrick, 1990; Kunce, 1980, cited in Reckase, 1990; Reise & Waller, 1990; Thissen & Steinberg, 1988; Waller & Reise, 1989). Waller and Reise (1989), for example, employed CAT and IRT procedures to identify extreme responders on a personality scale with an average 25% of the available items and a 100% accuracy rate.
Generalizability Theory (GT)
Shavelson, Webb, and Rowley (1989; Crocker & Algina, 1986; Brennan, 1983; Cronbach et al., 1972; Shavelson & Webb, 1991) describe GT as a framework for examining the dependability of psychological measurement. GT approaches differ from classical test theory in several respects. The concept of reliability is replaced by generalizability, the ability to generalize a score from one testing context to others. GT recognizes that error is not always random; systematic error can be identified by examining variability of measurement data over conditions, called facets in GT. GT suggests that multiple sources, including persons, content, occasions, and observers, can affect measurement devices; GT provides a theoretical and statistical framework for identifying and estimating the magnitude of those sources. Given this information, researchers can then compare the relative effects of different measurement conditions. By knowing the most important sources of measurement variability, measurement devices and procedures can be designed to minimize undesired sources.
Shavelson et al. (1989) described an application of GT in a study of disaster survivors' psychological impairment (Gleser, Green, & Winget, 1978). Two interviewers assessed 20 survivors; two raters then used the resulting interview to determine survivors' impairment. Gleser et al. (1978) estimated variance components for the effects of survivors, raters, interviewers, and their interactions. Two components produced the largest variance: survivors (indicating individual differences among subjects) and the survivor by interviewer interaction (indicating measurement error, i.e., the interviewers produced inconsistent information from their interviews with different survivors). These results suggest that to improve the generalizability of measurement, the researchers should standardize the interviewers' procedures. Other potential sources of variation that might have been investigated include the different occasions of interviewing and the interview questions.
Unlike IRT, GT proposes no particular model of item response. Instead, GT suggests that for any particular test, a universe of conditions exists which may affect test scores. Investigations of the relative effects of such conditions are called G studies. The results of G studies can then be used to design studies to identify sources of variability and error in situations where test scores will be employed to make real decisions about individuals. These latter studies, termed D studies, can help modify some aspect of the testing conditions to provide the most precise data possible (e.g., standardize interviewer training, as noted in the previous example).
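For a fully crossed persons-by-raters design, the variance components at the heart of a G study can be estimated from mean squares. The sketch below uses simulated data (not Gleser et al.'s) to separate person, rater, and residual variance and to compute a generalizability coefficient for the mean of four raters; the design and values are assumptions for illustration.

```python
# Illustrative G-study sketch: variance components for a persons x raters design.
import numpy as np

rng = np.random.default_rng(2)
n_persons, n_raters = 50, 4
person_effect = rng.normal(0, 1.0, n_persons)   # wanted signal: true differences among persons
rater_effect = rng.normal(0, 0.5, n_raters)     # rater leniency/severity (a facet of error)
residual = rng.normal(0, 0.8, (n_persons, n_raters))
scores = person_effect[:, None] + rater_effect[None, :] + residual

grand = scores.mean()
ms_p = n_raters * ((scores.mean(axis=1) - grand) ** 2).sum() / (n_persons - 1)
ms_r = n_persons * ((scores.mean(axis=0) - grand) ** 2).sum() / (n_raters - 1)
resid = scores - scores.mean(axis=1, keepdims=True) - scores.mean(axis=0, keepdims=True) + grand
ms_res = (resid ** 2).sum() / ((n_persons - 1) * (n_raters - 1))

var_person = (ms_p - ms_res) / n_raters
var_rater = (ms_r - ms_res) / n_persons
g_coefficient = var_person / (var_person + ms_res / n_raters)  # generalizability of a 4-rater mean

print(f"person variance:   {var_person:.2f}")
print(f"rater variance:    {var_rater:.2f}")
print(f"residual variance: {ms_res:.2f}")
print(f"G coefficient:     {g_coefficient:.2f}")
```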
Does Measurement Equal Statistics?
Ellis (1966) described measurement as "the link between mathematics and science" (p. 1). Many good reasons exist for the close relations between measurement and statistics. With few measurement theories available, statistical methods played an important role in helping early psychologists efficiently describe the distributions and relations they found in their measurement data. Galton's work around 1870, for example, provided the basis for the standard score and the correlation (Dawis, 1992). Statistical procedures such as aggregation and regression toward the mean helped cope with and explain the large amount of error produced by the new science's measurement procedures. Quantifying general theoretical propositions often forces theorists to be more specific and concrete, thereby positioning the theory for better tests of confirmation and refutation. Lord and Novick (1968) observed that "it is a truism of mathematics that strong assumptions lead to strong results" (p. 25); once psychological theories have been placed in mathematical form, manipulation of the elementary constructs can produce new deductions (Lord & Novick, 1968). Quantification in measurement offers a vehicle for translating different experiences into a universal language of numbers. And given that modern sciences are quantitative, adoption of statistical and mathematical descriptions and procedures enabled a new field like psychology to appear more scientific (Heidbreder, 1933).
But the merging of statistics and measurement also produced difficulties. In much of psychological measurement, the link between measurement and statistics became a chain. Statistics and measurement have become functionally equivalent in the minds of many graduate students and psychologists. Psychometric approaches focus more on statistical procedures than on psychological theory (cf. Glass, 1986); the teachers of measurement courses, more likely than not, are statisticians. Davison et al. (1986) wondered "how many students fully understand what psychometrics entails other than statistics" (p. 586). Over time the line between statistical description and psychological explanation has been erased: psychologists have often failed to distinguish between psychological and statistical reality and to note that statistical models implied certain psychological models of reality (Danziger, 1990). Many proponents of statistical and empirical approaches have been resistant to incorporating theories about causality into measurement (Wiggins, 1973). Contemporary personality researchers, for example, continue to employ statistically-based descriptions that they acknowledge still require causal explanations (Lamiell, 1990; McCrae & Costa, 1987b; Wiggins, 1979).
As noted in Chapter 2, psychometricians describe regression toward the mean (RTM) as a statistical phenomenon in which scores that fall at the extremes of a scale on one occasion tend to move toward the mean at a second testing. Psychologists often have no idea about the cause of RTM other than that it is proposed to be a result of measurement error (Kazdin, 1980). Measurement error, however, refers to unknown factors which randomly or systematically influence scores on a measurement procedure. We cannot explain RTM very deeply.
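A brief simulation shows how much of RTM can be produced by measurement error alone. In the hypothetical sketch below, the same true scores are measured twice with independent random error; people selected for extreme observed scores at the first testing score noticeably closer to the mean at the second testing, even though nothing about them has changed.

```python
# Hypothetical sketch: regression toward the mean arising purely from random error.
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
true = rng.normal(100, 10, n)
time1 = true + rng.normal(0, 10, n)   # observed = true + error
time2 = true + rng.normal(0, 10, n)   # same true scores, new independent error

extreme = time1 > np.percentile(time1, 90)   # selected for high time-1 scores
print(f"time-1 mean of the extreme group: {time1[extreme].mean():.1f}")
print(f"time-2 mean of the same people:   {time2[extreme].mean():.1f}")  # closer to 100
```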
----------------------------------------
Insert Figure 25 About Here
----------------------------------------
Figure 25 displays a graphic representation of RTM and the Law of Initial Values (LIV). The LIV indicates that the greater the initial value of a psychophysiological measure, the smaller the response will be to a stimulus (Wilder, 1967). When the initial value is extreme, no response or even a paradoxical response can occur (Surwillo, 1986). For example, an individual with a resting heart rate of 100, when shown a threatening photograph, is less likely to increase heart rate significantly compared to an individual whose initial rate is 70. Many psychophysiologists believe negative feedback provided by homeostasis--the process of maintaining an organism's equilibrium--is responsible for the LIV (Stern, Ray, & Davis, 1980). These feedback mechanisms set limits on the degree to which the organism can move away from its homeostatic set point. The question remains: is the LIV a theoretical explanation for some RTM effects?
If the answer is yes, then the LIV has important implications for measurement of constructs--such as stress, anxiety, and depression--that may be influenced by such mechanisms. On the basis of a homeostatic model, we might assume that individuals who score high on a measure of these constructs may be motivated to decrease their affect without further intervention or stimuli. Such spontaneous remission--improvement without treatment--is a frequent finding of psychotherapy outcome studies (Bergin & Lambert, 1978).
Long ago Boring (1919) criticized psychologists' "blind reliance on statistical samples, statistical conventions, and statistical assumptions for drawing substantial psychological conclusions" (p. 338), declaring that "inaccuracy of definition will never yield precision of result" (Boring, 1920, p. 33; see also Allport, 1937). Tukey (1950, cited in Loevinger, 1957) suggested that psychologists tend to rely on statisticians, instead of theory, to find significant relations between psychological variables. Cronbach (1991b) aptly described the tension between statistical and measurement concerns:
While the ability to translate qualitative statements into mathematic terms is often a sign of theoretical precision, complex statistical analyses of psychological data have often led to an appearance of precision. Using statistical significance testing, it is possible in any study to find significant results given a sufficiently large number of subjects and variables; this process is known as fishing (Cook & Campbell, 1979). At first glance the investigator appears to be detecting small nuggets of knowledge from a morass of data; such nuggets, however, often change or disappear upon replication. The finding of statistical significance in support of proposed hypotheses is the goal of most psychological investigations, and once found, often signals the end of analysis. Psychologists appear to share the judgment of lay persons that information that confirms propositions is more relevant than disconfirming information (Crocker, 1981).
Do the Measurement Properties of Objects Affect the Type of Statistical Procedure Employed?
Another controversy involves the relation between type of measurement scale and statistical procedures. Stevens (1951) indicated, for example, that parametric techniques such as correlations and F tests require at least interval level data. Others (e.g., Gaito, 1980) have argued that no relation exists between type of scale and the statistical technique used. Statisticians who support this perspective frequently cite Lord's (1953c) quote that "the numbers do not know where they come from" (p. 751).
Only the investigator who has explicated the link between the phenomena being measured and the numbers produced by the measurement procedures knows where the numbers come from. Strictly speaking, the results of statistical analyses apply only to the numbers produced by the measurement procedure, not the phenomena themselves (Murphy & Davidshofer, 1988). If the procedure does not reflect the phenomena adequately, subsequent analyses are not generalizable to the real world. In a similar vein, contemporary statistical approaches to measurement, such as confirmatory factor analysis (CFA) and structural equation modeling, uniformly work with latent variables. That is, observed variables are assumed to contain substantial amounts of error, and transformations of some sort are necessary to produce latent variables that presumably reflect real-world variables more accurately. However, the connections between observed variables and the latent ones are often weak (Cliff, 1992). Attempts to transform the results of analyses based on transformed data back to the real world will be successful only to the extent that we understand the factors in the measurement procedure that lead to the initial data alteration. Gould (1981) made a similar criticism of the use of factor analysis in research on intelligence. He indicated that intelligence had been reified to the status of a physical entity but that "such a claim can never arise from the mathematics alone, only from additional knowledge of the physical nature of the measures themselves" (p. 250; also see Murphy & Davidshofer, 1988, pp. 32-33).
If measurement procedures cause real-world data to collapse from interval to ordinal level, results of statistical tests of the transformed data may be invalid when applied back to the real world. For example, suppose we are examining ratings on a 7-point scale made by managers of prospective employees completing tasks at a work assessment center. As shown in Figure 26, a substantial set of these raters may make a central tendency or range restriction error. Although the employees' performances might range from very poor to very good, most raters employ only the mid-points of the scale. This systematic error by raters results in the collapse of real-world interval level data to ordinal data. Real differences between individuals are lost during the measurement process. Even the latent variables produced by statistical procedures may fail to fully reflect the actual phenomena.
----------------------------------------
Insert Figure 26 About Here
----------------------------------------
Statistics computed on the observed measurement data are likely to produce misleading results. In the example above, the standard deviation will have shrunk from its theoretical true value, as will correlations with other variables. Townsend and Ashby (1984) provided a similar example to demonstrate that statistical tests applied to ordinal level data produce meaningless results.
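The shrinkage is easy to demonstrate with hypothetical data. In the sketch below, one set of ratings spreads examinee performance across the full 7-point scale while another commits the central tendency error of Figure 26, using only the mid-points; the restricted ratings show a smaller standard deviation and a weaker correlation with an external criterion. The specific numbers are illustrative only.

```python
# Hypothetical sketch: a central-tendency rating error shrinks variability and
# attenuates the correlation between ratings and a criterion.
import numpy as np

rng = np.random.default_rng(4)
n = 2_000
performance = rng.normal(0, 1, n)                      # interval-level "real" differences
criterion = 0.6 * performance + rng.normal(0, 0.8, n)  # an external criterion

full_range = np.clip(np.round(4 + 1.5 * performance), 1, 7)   # raters use the whole scale
restricted = np.clip(np.round(4 + 0.3 * performance), 3, 5)   # raters cluster at mid-points

for label, ratings in (("full-range ratings", full_range), ("central-tendency ratings", restricted)):
    r = np.corrcoef(ratings, criterion)[0, 1]
    print(f"{label:25s} SD = {ratings.std():.2f}   r with criterion = {r:.2f}")
```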
Advanced statistical procedures which transform manifest measurement data into latent variables may or may not accurately model reality (cf. Cliff, 1992). A correction for attenuation might produce useful data, but in a real world situation, multiple errors may intrude, depending, for example, upon the interaction of the sample and the measurement procedure. Given the knowledge of the desirable properties of interval level data, transformations of raw data raise the question of whether structure is being recognized in items or imposed upon them (Loevinger, 1957). Statistical tests may certainly be applied to observed data regardless of measurement scale type. But it would also appear that: (a) those results would apply only to the observed data, not the real-world phenomena, or (b) results based on latent variables and transformed data apply to the real-world only to the extent that these transformations accurately account for errors that occur during the measurement process.
Summary. Procedures developed in the popular IRT approach aid large scale selection testing because they improve testing efficiency (i.e., they require fewer items without loss of test reliability). IRT procedures do not, however, tend to improve validity. To do so, I suspect that much more must be known about the processes of measurement, and it is here that GT, with its ability to study systematic error, offers useful analytic tools.
Cognitive Approaches
Scaling refers to "the assignment of meaning to a set of numbers derived from an assessment instrument" (Reckase, 1990, p. 41). The important issue here is who assigns the meaning to the numbers. Traditionally, the chief decision maker is the test developer who transforms the raw data provided by test-takers using statistical methods. For example, the developer may: (a) sum all item responses to produce a total score, (b) examine correlations between individual items and summed scores to determine which items should be deleted from a scale because of low correlations, or (c) factor analyze the data to determine which items converge to form separate constructs. In each case, the developer assumes that the transformed data have more meaning than any particular response generated by the test-taker.
In addition to the test developer, the test-taker or observer assigns meaning to numbers. Test items and tasks are cognitively and affectively processed for meaning and subsequent response in a different manner by different individuals (cf. Cronbach, 1946). As Loevinger (1957) stated, "the naive assumption that the trait measured by a personality test item is always related in a simple, direct way to the apparent content of the item has long since been disproven" (p. 651). This approach represents a movement away from viewing questions and answers in a stimulus-response mode toward viewing them in terms of meaning and context (Mishler, 1986). The importance of understanding the processes employed by individuals when responding to tests has long been understood; for example, Cronbach (1946) cited work by Seashore (1939) and others which indicated that test results are not simply a product of individuals' ability (or personality), but also of the methods subjects employ when completing test items. This knowledge, however, has had relatively little impact on psychological testing (cf. Burisch, 1984).
Many contemporary psychologists have returned their attention to response processes (Guion, 1977). Experimental methodologies and theoretical constructs from cognitive psychology have begun to provide paradigms for investigating such processes in cognitive ability and other tests (Embretson, 1985). Aptitude researchers, Embretson, Schneider, and Roth (1986) suggested, have moved their focus from traits to the cognitive components underlying test performance.
Task Decomposition
Perhaps the most promising aspect of cognitive models for measurement is that these approaches seem capable of offering methods for examining the underpinnings of construct validity, particularly for the cognitive skills employed in intelligence tests (Embretson, 1985; also see Sternberg, 1977, 1988). Embretson (1985) suggested that understanding item or task responses requires a theory of the process factors that underlie performance along with research designed to explore those factors. From the standpoint of cognitive psychology, experiments must be designed to decompose those variables and determine their relative weights. The test developer becomes a researcher who creates tests on the basis of studies which examine the stimulus content of test items and the cognitive processes (such as strategies and level of knowledge) individuals use in response to those items. Researchers have employed this approach to examine the cognitive processes that underlie performance on tests of verbal ability, inductive reasoning and spatial ability (Embretson, 1985).
Butterfield, Nielson, Tangen and Richardson (1985) provided an example of decomposition with letter series tasks, a measure of inductive reasoning common to intelligence tests. Letter series can be used to specify such relations as identity (AAAAA), next (ABCDE), and back (ZYXWV). Butterfield et al. conducted a series of theory-based decomposition studies in which they found that performance on letter series tasks could be explained by the examinees' level of knowledge and ability to perform multiple operations. The resulting ability test score, then, summarizes the particular weighting of these two cognitive variables for a particular set of letter series tasks. This score will correlate with other tests or criteria to the extent that they similarly weight the underlying cognitive processes. Using this type of research program, test developers should be able to select or construct items that measure specified components and combinations of components, thereby increasing the possibility of designing new tests with greater validity.
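In simplified form, the decomposition logic amounts to predicting item difficulty from coded cognitive components. The sketch below regresses hypothetical letter-series difficulties on two assumed component codes ("operations required" and "knowledge demands"); the data and weights are invented for illustration and are not Butterfield et al.'s.

```python
# Illustrative sketch: regressing item difficulty on coded cognitive components (hypothetical data).
import numpy as np

# Each row codes one letter-series item: [operations required, knowledge demands]
components = np.array([[1, 1], [2, 1], [2, 2], [3, 2], [3, 3], [4, 3]], dtype=float)
difficulty = np.array([0.10, 0.25, 0.40, 0.55, 0.70, 0.90])  # e.g., proportion of examinees failing

X = np.column_stack([np.ones(len(components)), components])  # add an intercept column
weights, *_ = np.linalg.lstsq(X, difficulty, rcond=None)

print(f"intercept:                         {weights[0]:+.2f}")
print(f"weight per additional operation:   {weights[1]:+.2f}")
print(f"weight per added knowledge demand: {weights[2]:+.2f}")
```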
Task decomposition approaches may be limited, however, by the complexity of the response processes required by the test item or produced by the test-taker (cf. Lohman, 1989). If individuals employ substantially different processes to perform components of a task, attempts to isolate common processes will fail.
Protocol Analysis
Ericsson and Simon's (1984) research with protocol analysis exemplifies contemporary work on item and task response. In this structured, qualitative procedure, subjects are instructed to "think aloud" as they complete a task, and their verbal reports are recorded verbatim. In contrast to most measurement procedures, persons providing the reports receive a brief training period consisting of practice exercises (Ericsson & Simon, 1984). Protocol analysis is premised on the belief that information attended to and held in short-term memory can be accurately reported (Martin & Klimoski, 1990). Ericsson (1987; Ericsson & Simon, 1984) maintained that sufficient research has documented that the verbal reports produced through protocol analysis are reliable and valid. Essentially, this research has indicated that individuals' reports about how they solve problems or tasks predict how well they are able to solve those problems. Research in protocol analysis, for example, has revealed differences between high and low aptitude subjects' ability to represent problems. Unskilled subjects experienced difficulty in representing problems in a way that enabled them to access past knowledge.
Without verbal reports of how people solve problems or answer test items, Ericsson (1987) suggested that we have no way of knowing whether the items evoke the same response process: "Understanding what individual tests measure is a prerequisite for understanding the observed correlation between scores on two different tests. Protocol analysis would allow us to evaluate the importance of two different sources of correlation" (p. 222). In contrast, traditional intellectual and personality tests sum the products of those processes.
Jobe and his colleagues have applied protocol analysis methods to test and survey problems. They found that variables such as recall strategies, instructions, mood, time elapsed since target event, and response formats affected the accuracy of recall of medical visits, smoking levels and dietary habits (e.g., Jobe & Mingay, 1991; Jobe et al., 1990; Lessler, Tourangeau, & Salter, 1989; Means, Swan, Jobe, Esposito, & Loftus, 1989; Salovey et al., 1989; Smith et al., 1991; Willis, Royston & Bercini, 1991).
Jobe et al. (1990) investigated the effects of different types of recall strategies on the accuracy of reported health care visits. Some theorists had suggested that forward (chronological) recall would be more accurate because earlier events could guide the recall of subsequent events; other theorists proposed that recent events would be more easily remembered and thus could prompt better recall of earlier events (backward or reverse chronological recall). Subjects were randomly assigned to recall visits using one of three strategies: forward recall, backward recall, or free (uninstructed) recall. Using medical records from physicians, Jobe et al. then matched recalled with actual visits and found that the overall sample of 337 subjects underreported visits by 20 percent. Recall order was compared for the 75 respondents who reported two or more visits during the six-month reference period. Free recall subjects were found to be more accurate (67% match) than forward (47%) or backward (42%) recall subjects. However, gender interacted with recall condition: women in the free recall condition correctly recalled 60% of their visits while men in the backward condition recalled 61%. Self-reported health status also interacted with recall condition: subjects in good health tended to recall best (85%) in the free recall condition, while those in poor health recalled equally in the backward (50%) and free recall (50%) conditions. Jobe et al. found no main effects or interactions with race, education or income.
Martin and Klimoski (1990) employed protocol analysis to test propositions of attribution theory and cognitive appraisal models with 36 managers who made self- and employee performance evaluations. They asked the managers to think aloud while providing four "'thorough, complete' open-ended performance evaluations" (p. 139) of themselves and three subordinates. Managers then provided summary ratings of the subordinates on a 1-5 scale (1 equalled poor, 5 equalled excellent). These evaluations were tape-recorded and then parsed into single ideas or thoughts so they could be coded by trained raters into categories such as positive evaluation, negative evaluation, external attribution and trait attribution. Martin and Klimoski found that the number of positive evaluations found in the verbal reports correlated .49, .69, and .59 with summary ratings of the three employees.
Actor-observer theory indicates that observers are more likely to attribute others' actions to internal traits than situational factors (Ross, 1977). In contrast, actors are more likely to attribute their actions to situational than trait factors. Martin and Klimoski (1990) did find that managers made significantly more internal attributions in employee evaluations (10% of the phrases) than in self-evaluations (5%). Managers also made more external attributions (12%) in their self-evaluations than in employee evaluations (2%). Martin and Klimoski found that managers remembered negative behavioral episodes for good and bad employees, but that managers dismissed the negative episodes for employees receiving positive evaluations. During self-evaluations, managers tended to use external attributions to enhance their appraisal of their ability to perform despite environmental constraints.
Contrary to attribution theory and appraisal models, managers produced an evaluation immediately, that is, within one or two seconds following a request to do so. Martin and Klimoski (1990) maintained that theory suggests that considerable information processing should occur before the rating is reported. They proposed, however, that the managers retrieved the evaluations from memory rather than generated them as an end product of an employee evaluation task. Martin and Klimoski suggested that managers may have retrieved their most recent evaluation from memory and then attempted to confirm or disconfirm by retrieving relevant performance information about the particular employee.
Ericsson's (1987) work indicates that individuals produce a diversity of mediating thoughts when confronted with a simple task like memorizing individual nonsense syllables. Given the increased complexity of answering personality test items, one would expect an even greater diversity of cognitive processes (but see Ursic & Helgeson, 1989, for an exception). This diversity--the individual differences exhibited when humans construct meaning from seemingly simple stimuli--may dictate limits to the possible understanding of item response processes. Similarly, practice on a test and learning of shortcuts and strategies by a test-taker may complicate research aimed at understanding item response. And like much qualitative research, protocol analysis is labor intensive, requiring transcripts to be produced and reviewed.
Item Characteristics
The issue of how respondents construe the meaning of the items extends beyond simple distortion. As Kagan (1988) noted, every question forces the respondent to decide on the meaning of the terms, and "the investigator cannot assume similar meanings" (p. 617). But that is an assumption shared by many test developers. Walsh and Betz (1985; see also Wiggins, 1973) stated that "It is assumed (at least to some extent) that each item on a test and all the words in that item have some similar meaning for different people" (p. 17). Test developers assume similar meaning chiefly for convenience: until recently, few methods have been available for studying differences in item meaning.
Item ambiguity. As noted previously, respondents may generate information if they do not understand the psychological constructs being measured. Some evidence suggests that many items on psychological tests engender subtle differences among respondents in comprehension and recall of information (cf. Baker & Brandon, 1990; Watson, 1988). Such individual differences in item processing are likely to be compounded by the ambiguity found in self-report scales. Gynther and Green (1982) suggested that many self-report items are "stated in such a general fashion that test takers can only respond on the basis of their implicit personality theories rather than how they actually behave and feel in specific situations" (p. 358). Test-takers will interpret such questions and provide answers as best they can by relying on relevant cues, including information provided by adjacent questions (Schwarz, 1999).
Helfrich (1986) found that item understanding may be influenced by the presence of negatives in the item, passive tense, ambiguity, and respondent age. Angleitner, John, and Lohr (1986) reviewed research that found (a) a negative correlation between item length and test-retest consistency and (b) a negative correlation between ratings of item comprehensibility and the number of letters and sentence clauses in items. They also found that (c) according to subject ratings of scales such as the MMPI and 16PF, 25% of the items were ambiguous and 50% difficult to understand, and (d) in one study, almost 20% of subjects changed their item response over a 2-week test-retest interval. Angleitner et al. concluded that "one cannot help but be impressed by the degree of response inconsistency elicited by most personality questionnaire items" (p. 97).
Response formats. Ratings by self and observers are frequently employed to evaluate psychological dimensions such as job performance or attitudes (Landy & Farr, 1980, 1983). With ratings by others, observers evaluate individuals using a particular response format. As shown in Figure 27, raters may be asked to mark or circle a number along a Likert scale containing descriptive terms. Self-reports typically require test-takers to respond to true-false formats or Likert scales containing three or more alternatives. The second Likert scale in Figure 27 contains a mid-point to allow a neutral or uncertain answer.
----------------------------------------
Insert Figure 27 About Here
----------------------------------------
Formats can be described along a variety of dimensions, all of which potentially provide information to the test-taker about how to respond (Schwarz, 1999). Among the important dimensions is the number of response alternatives. A test such as the MMPI offers two possibilities to the test-taker (true or false) and an additional category to the test scorer (items left blank). Because Likert scales theoretically should increase the variability of responding, most contemporary tests employ such scales, using between 5 and 9 possible responses (cf. Dawis, 1987; Murphy & Davidshofer, 1988). Increasing the number of response alternatives, however, may increase the cognitive processing demand placed on the respondent. Two likely results of such increased demand are (a) a lengthening of item completion times and total test times when large numbers of items are involved (cf. Lohman, 1989) and (b) an increased likelihood that unmotivated or unskilled respondents will generate, as opposed to retrieve, responses.
Concreteness or ambiguity is a second dimension along which rating scales can be described. Thus, rating scales may be anchored simply by two global descriptors (e.g., poor or excellent) or they may include a number of explicit behavioral descriptions (e.g., to receive a rating of 1 on a dimension of promptness, an employee must always meet the required deadline). Considerable ambiguity seems to exist in the content and format of most rating scales (Murphy & Davidshofer, 1988). In response, researchers have attempted to create rating scales with descriptor terms that are as explicit as possible. Reviews of studies using these Behaviorally Anchored Ratings Scales (BARS), however, have not demonstrated their superiority over other types of scales in terms of validity (Murphy & Davidshofer, 1988).
BARS are likely to be subject to the same sorts of production strategies described for self-reports in Chapter 2. In addition, some studies have asked raters to assess expected behaviors rather than observed behaviors (Latham & Wexley, 1977). Latham and Wexley (1977) proposed a substitute for rating scales in the form of the Behavioral Observation Scale (BOS). Rather than rate behavior along a scale, observers simply count the frequency of important job behaviors. Particularly when counts are recorded immediately after a behavior is observed, BOSs should be more valid than BARS (cf. Paul et al., 1986a). Under certain conditions, however, BOSs may be functionally equivalent to BARS. If supervisors must recall worker behavior over long periods, they may generate such data on the basis of their impressions of workers (Murphy, Martin & Garcia, 1982). Similarly, raters who possess strong beliefs about workers may be motivated to notice positive behaviors in individuals they perceive as "good" and to ignore negative behaviors in "bad" workers (Murphy & Balzer, 1981, cited in Murphy & Davidshofer, 1988).
Some surveys provide respondents with open response formats, as compared to the closed formats described above. With open formats respondents are likely to omit information they believe is obvious to the assessor. Schwarz (1999) noted research by Schuman and Presser (1981) which found that when the choice "To think for themselves" is offered on a list of alternatives, 61.5% of individuals choose it as "the most important thing for children to prepare them for life." In an open format, however, only 4.6% offer this answer spontaneously.
Schwarz (1999) also observed that respondents may use response formats as guides for answering ambiguous questions. Questions which request frequency responses can provide such information. Response scales which include only low-frequency categories (e.g., "less than once a year" to "more than once a month") may lead respondents to conclude that the questioner is seeking information about relatively rare events. For example, respondents reporting on the frequency of irritating experiences with a low-frequency scale believed that the question referred to major annoyances, while respondents given a high-frequency response format believed the question concerned minor annoyances (Schwarz, Strack, Muller, & Chassein, 1988).
Similarly, the numbers on the response format may provide clues respondents employ to answer questions. Schwarz, Knauper, Hippler, Noelle-Neumann, and Clark (1991) asked individuals to rate their success in life on 11-point scales (with identical descriptors) that ranged from 0 to 10 or from -5 to 5. Thirty-four percent provided a value between -5 and 0 on the -5 to 5 scale, while only 13% endorsed equivalent values between 0 and 5 on the 0-10 scale. Further investigation indicated that respondents interpreted a scale value of 0 on the 0-10 scale as reflecting an absence of achievements. When 0 was the midpoint on the -5 to 5 scale, however, values of 0 and below were interpreted to mean the presence of explicit failures. Schwarz (1999) concluded that:
A format that ranges from negative to positive numbers conveys that the researcher has a bipolar dimension in mind, where the two poles refer to the presence of opposite attributes. In contrast, a format that uses only positive numbers conveys that the researcher has a unipolar dimension in mind, referring to different degrees of the same attribute. (p. 96)
-----------------------------------------------------------
Not at all Successful Extremely Successful
Unipolar: 0 1 2 3 4 5 6 7 8 9 10
Bipolar: -5 -4 -3 -2 -1 0 1 2 3 4 5
-----------------------------------------------------------
Respondents also tend to assume that mid-range values reflect average behaviors, while the extremes of the scale reflect extremes of the phenomena in question (Schwarz, 1999). The range of the response format has been shown to alter reports about television viewing, psychological symptoms, sexual behaviors, and consumer behaviors (Schwarz, 1999), with the effect strongest for ambiguous questions. The type of scale can also influence subsequent judgments. Schwarz (1990) indicated that when a frequency scale implies that the respondent's behavior is above or below average, subsequent reports of satisfaction can be influenced in predictable directions.
Schwarz (1999) recommended that respondents be asked to provide frequency information with an explicit unit. For example, one could ask "How many hours a day do you watch television? ___ hours per day." Using response alternatives such as "sometimes" or "frequently" with such an item can be problematic because respondents differ in their understanding of those terms. Even the recommended question, however, can be interpreted differently by respondents (e.g., does having the TV on in the background while studying count?).
The interaction of cognitive ability and item characteristics. Stone, Stone, and Gueutal (1990) noted that test researchers rarely study the ability of test-takers to understand test instructions, item content, or response alternatives. They proposed that if respondents lack the cognitive ability to read and interpret questionnaires, their motivation and ability to complete the questionnaire will be impaired. Stone et al. (1990) suggested that such effects could be detected by comparing the psychometric properties of questionnaires completed by groups with different levels of cognitive ability.
Stone et al. (1990) used the Wonderlic Personnel Test to classify 347 Army Reserve members into low, medium and high cognitive ability groups. Subjects also completed an additional 203 items in a test battery of 27 measures that included the Job Diagnostic Survey, which contains scales to measure such constructs as task identity, autonomy, extrinsic feedback, satisfaction with job duties, and organizational commitment. Stone et al. found significant differences in coefficient alpha for 14 of the 27 constructs. In 12 of those cases, alpha rankings were as predicted: reliability estimates increased from the low to the high cognitive ability group. Stone et al. also found a significant correlation (r = -.23) between cognitive ability and the number of missing questionnaire responses. Finally, they observed that the scales most adversely affected by low cognitive ability were composed of only three or four items.
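Stone et al.'s (1990) comparison strategy can be illustrated with a brief sketch. The code below is a hypothetical illustration with simulated data, not their analysis: it computes coefficient alpha for the same five-item scale within low-, medium-, and high-ability groups, with noisier responding standing in for less able respondents' misreading of items.

```python
# Hypothetical illustration of comparing coefficient alpha across ability groups
# (simulated data; not Stone et al.'s 1990 analysis).
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: respondents x items matrix of scale responses."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

rng = np.random.default_rng(0)

def simulate_group(n: int, noise_sd: float) -> np.ndarray:
    """Five-item Likert scale driven by one latent trait; larger noise_sd stands in
    for lower-ability respondents misreading or misinterpreting items."""
    trait = rng.normal(size=(n, 1))
    raw = trait + rng.normal(scale=noise_sd, size=(n, 5))
    return np.clip(np.round(raw + 3), 1, 5)

for label, noise in [("low ability", 2.0), ("medium ability", 1.2), ("high ability", 0.6)]:
    print(f"{label}: alpha = {cronbach_alpha(simulate_group(200, noise)):.2f}")
```

In this toy simulation alpha rises with the simulated ability level, the pattern Stone et al. predicted and largely observed.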
Stone et al. (1990) recommended that test developers devise items capable of being understood by respondents at all ability levels. This may be particularly important in areas such as industrial-organizational psychology where questionnaires may be completed by subgroups with moderate to low cognitive ability. Stone et al. noted that estimates suggest one-third of the U.S. work force is functionally illiterate and that job titles may provide rough estimates of cognitive ability.
Berk and Fekken (1990) reported a similar finding about the relation between cognitive ability and scale properties. They investigated a person reliability index computed with the scales of the Jackson Vocational Interest Survey. This index, which results from producing two scores per scale (each based on half of the items) and then correlating scores across pairs of scales, was significantly correlated across two administrations of the scale (r = .60). This result indicates that the person index can function as a reliable measure of whether a profile will remain stable over a brief time period. Interestingly, a measure of verbal ability was significantly correlated (r = .28) with a measure of scale stability, suggesting that verbally more skilled subjects possessed more stable scores.
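Berk and Fekken's (1990) index is easier to grasp when written out. The sketch below reflects one reading of their description--each scale is split into two half-scores, and, for a single respondent, first-half scores are correlated with second-half scores across scales; the scale names and responses are invented and are not items from the Jackson Vocational Interest Survey.

```python
# One reading of a person reliability index: correlate half-scale scores across scales
# for a single respondent (illustrative data only).
import numpy as np

def person_reliability(responses: dict[str, np.ndarray]) -> float:
    """responses maps scale name -> one respondent's item responses on that scale."""
    first_halves, second_halves = [], []
    for items in responses.values():
        mid = len(items) // 2
        first_halves.append(items[:mid].mean())
        second_halves.append(items[mid:].mean())
    return float(np.corrcoef(first_halves, second_halves)[0, 1])

one_respondent = {
    "mechanical": np.array([4, 5, 4, 3, 5, 4]),
    "artistic":   np.array([2, 1, 2, 2, 1, 3]),
    "social":     np.array([3, 4, 3, 4, 4, 3]),
    "scientific": np.array([5, 5, 4, 5, 4, 5]),
}
print(f"person reliability index = {person_reliability(one_respondent):.2f}")
```

A high value indicates that the respondent's profile of half-scale scores is internally consistent; a low value flags a profile that may not replicate.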
Constructivist assessment. Test-takers may perceive measurement devices as ambiguous because it is usually the test developer who creates and selects test instructions, item content, and response alternatives. Respondents who differ from the test developer on such variables as culture, gender, age, and cognitive ability may represent the testing task in unintended ways. Proponents of constructivist assessment believe that humans construct their psychological realities and that it is the linguistic constructions of individual persons--instead of the test developer--that should be measured (Neimeyer & Neimeyer, 1993).
These assessment approaches are descendants of the work of George Kelly (1955) who developed a theory of personal constructs. Kelly proposed that constructs are bipolar (i.e., expressions of opposites) distinctions that enable the perceiver to construct discrete meanings out of the vast amount of perceivable stimuli (Neimeyer & Neimeyer, 1993). Constructivist proponents believe that meaning is constructed through language organized into narratives, metaphors, and stories (Sarbin, 1986). Reactivity is not viewed as a problem in constructivist approaches as much as an inevitability (Neimeyer & Neimeyer, 1993). That is, assessments are perceived as interventions that cause the individual to reconsider the constructs being assessed. In contrast to cognitive behavioral assessments which tend to focus on isolated negative self-statements, constructivists examine interconnected constructs through repertory grid techniques (Beail, 1985; Kelly, 1955; Neimeyer, 1993).
Neimeyer (1993) called the repertory grid "the Rorschach or MMPI of constructivist assessment" (p. 72). The grid is typically administered in an interview: individuals are asked to select elements from a personal domain (e.g., potential careers such as construction worker, park ranger and electrical engineer) and then rate those elements on personally selected constructs (e.g., low or high starting salary, indoor or outdoor work, low or high opportunity for advancement). The resulting content and numeric ratings provide qualitative and quantitative information. Qualitative data such as elicited constructs can be coded on such factors as interpersonal content or level of abstractness. Quantitative data are typically analyzed to determine (a) differentiation, the number of different dimensions of judgment employed; (b) integration, the organization or correlation among dimensions; and (c) conflict, the amount of negative correlation among the dimensions (Neimeyer, 1988, 1989a, 1989b). In vocational psychology, differentiation, integration, and conflict constructs have been related to vocational choice, vocational identity development, and quality of career decision-making skills (Neimeyer et al., 1989).
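A short sketch may make the quantitative side of the grid more concrete. The indices below are simplified stand-ins for differentiation, integration, and conflict rather than Neimeyer's exact formulas, and the careers, constructs, and ratings are hypothetical.

```python
# Simplified repertory grid summary: rows are elements (careers), columns are personally
# elicited constructs, cells are 1-7 ratings. Indices are illustrative stand-ins.
import numpy as np

grid = np.array([
    # salary  outdoor  advancement  autonomy
    [2,       6,       3,           5],   # construction worker
    [3,       7,       2,           6],   # park ranger
    [6,       1,       6,           4],   # electrical engineer
    [5,       2,       5,           3],   # accountant
])

r = np.corrcoef(grid, rowvar=False)            # construct-by-construct correlations
pairs = r[np.triu_indices_from(r, k=1)]        # unique construct pairs

integration = np.abs(pairs).mean()             # how tightly constructs hang together
differentiation = 1 - integration              # rough inverse: more independence, more dimensions
conflict = np.abs(pairs[pairs < 0]).sum()      # total negative association among constructs

print(f"integration = {integration:.2f}, differentiation = {differentiation:.2f}, conflict = {conflict:.2f}")
```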
Neimeyer, Brown, Metzler, Hagans, and Tanguy (1989; see also Borgen & Seling, 1978) maintained that research has demonstrated that individuals process experimenter-provided constructs (as in standard vocational interest items) and personally elicited constructs in fundamentally different ways. As idiographic proponents would argue, constructs created by individuals carry greater personal meaning and thereby facilitate greater understanding and more accurate recall of vocational information. Referring specifically to the vocational research literature, Neimeyer et al. (1989) suggested that if differences between information generated through experimenter-provided and personally elicited constructs are sufficiently large, many research findings may be undependable (see also Schulenberg, Vondracek & Nesselroade, 1988).
Summary. The great promise of cognitive approaches would seem to lie in their potential for illuminating important cognitive components that partially form the basis for construct validity. That is, these approaches offer tools for investigating such problems as how test-takers construe the meaning of items and how individuals' cognitive abilities interact with item characteristics; indeed, some major survey centers have established cognitive laboratories to pretest questionnaires (Schwarz, 1999). Historically, however, motivational and affective factors have also been recognized as important influences on testing processes, and it is here that cognitive psychology may have less to offer in the way of theory and methods.
Behavioral Approaches
As described in Chapter 4, behavioral assessment focuses on environmental stimuli, behavioral responses, and the consequences of those responses. Such assessment focuses on obtaining repeated samples of observable behavior, often in conjunction with an intervention.
Assessment in Mental Health Institutions
Decision-makers in settings such as mental hospitals, community mental health centers and residential facilities need information to make rational decisions (Paul et al., 1986a). Given the range of measurement and assessment procedures available--interviews, intelligence tests, behavioral observation, personality tests, vocational tests, progress notes--the question becomes: What is the best method or combination of methods? When these decisions are made by behaviorally-oriented assessors, it should come as no surprise that direct observation of behavior, not trait-based approaches, is heavily emphasized.
Paul (1986, 1987a, 1987b) described a comprehensive assessment system designed to produce clinical and administrative data in a residential treatment setting. Paul et al. (1986a) suggested that information should be assessed in three domains:
(a) clients, including problem behaviors, stable personal-social characteristics (e.g., age, gender, education), and physical-social life environments (i.e., the context in which problems occur);
(b) staff, including therapeutic techniques, stable personal-social characteristics, and physical-social life environment (i.e., the context in which treatments occur);
(c) time, the moment when an assessment is obtained and the period of time to which it applies.
Given these dimensions, who might best provide this information? Paul, Mariotto and Redfield (1986b) listed six potential sources:
(a) clients and staff who can provide information about themselves;
(b) significant others;
(c) residential clients who could provide information about other clients and staff;
(d) clinical staff who could provide information about all three domains;
(e) records and archives;
and (f) trained staff whose only responsibilities are to function as observers. Such staff can be interactive (i.e., they interact with target individuals) or noninteractive.
When should information be obtained and recorded? Paul et al. (1986b) classified observational schedules into programmed (i.e., scheduled or unscheduled) and discrete or continuous. Recording can be accomplished immediately after behavior or delayed, on single or multiple occasions, and with stable or transitory phenomena. Actions or interactions can be monitored as well as individuals or aggregations of individuals. They recommended that observations be recorded as soon as possible because accuracy and precision tend to decrease as the time period between event and recording increases. Although measurement error increases with sampling, continuous recording of data in all three domains is usually impractical.
Paul et al. (1986b) also suggested that the units of observation be established before the observation period so that observers are able to focus on important elements. Such units should be discrete samples of behavior, as opposed to global signs, because ratings that require greater interpretation are more likely to reflect characteristics of the observer. Error arising from such factors as rater carelessness or fatigue will be minimized when measurement data can be aggregated across multiple occasions. Observation of clients and staff may be reactive, but ratings by independent observers should be less reactive than those made by clients or staff, since the observers' ratings carry less personal significance and evaluative potential. Paul et al. (1986b) concluded that the accuracy and relevance of observations can be maximized using multiple, discrete, and scheduled observations made by trained observers as soon as possible following a behavioral event.
Paul et al. (1986b) described their chief assessment tools as Direct Observational Coding (DOC) procedures. DOCs require explicit sampling of individuals and occasions by trained observers. Mariotto & Licht (1986) suggested that such training include:
(a) an orientation stressing the purposes and confidentiality of measurement,
(b) technical manuals which describe coding content and procedures,
(c) practice coding behavior,
(d) objective feedback to coders,
(e) in vivo coding practice,
(f) certification, through a work sample, of the observer's readiness,
and (g) procedures to maintain observer skills.
Paul et al. (1986b) noted two important sources of error that should be monitored with observers: (a) decay, random changes in the observer's reliability or consistency of observation, and (b) drift, systematic changes in the definition or interpretation of coding categories. Paul et al. (1986b) maintained that such errors could be minimized by obtaining converging data from different assessment procedures, conditions, and operations.
Licht, Paul and Power (1986) reported that such DOC systems have been implemented with more than 600 clinical staff in 36 different treatment programs in 17 different institutions. The resulting flood of data has produced results of interest to researchers as well as to clinicians and administrators in the studied agencies. Data from DOC systems have produced evidence of substantial differences in the behavior of different clinical staff and treatment programs. For example, staff-client interactions in 30 studied agencies ranged from 43 to 459 interactions per hour; over a full week, staff members were responsible for as few as four clients or as many as 33. DOC data also demonstrated changes in staff behavior resulting from training and development procedures and the maintenance of such behavior. Licht et al. (1986) reported that how staff interact with clients--that is, specific intervention programs--was highly correlated with client functioning and improvement (r's range from .5 to .9 on different variables). In addition, the quality of staff-client interaction was more important than the quantity of that interaction. Licht et al. (1986) noted that DOC information may not only aid in the monitoring of treatment implementation but may be employed as feedback to adapt treatment for improved effectiveness.
Behavioral Physics
Observing that "we tolerate substantial measurement error in psychological tests and rating scales" (p. 11), Tryon (1991) attempted to reduce such error by selecting highly reliable and accurate measurement devices. Tryon (1991) coined the term behavioral physics to describe how important aspects of psychological behavior can be studied through measures of activity, time, and space. Monitoring devices can record the frequency, intensity, and duration of exercise and other activities, thus providing information for clinical and research purposes. Such instruments have the ability to provide longitudinal, continuous measures of individuals in their natural environment. Examples of activity measurement devices include:
(a) actometers, mechanical wrist watches which record kinetic energy produced by body movements;
(b) heart rate recorders, portable devices which can measure heart rate over days;
(c) body temperature sensors that transmit temperature data to nearby receivers.
Tryon reviewed research examining the relations between activity and such problems as mood disorders, hyperactivity, eating disorders, sleep, substance abuse, and disease. For example, he found that exercise alleviated reactive depression in over a dozen studies, with samples that included post-myocardial infarction patients (Stern & Cleary, 1982), college students (Greist, Klein, Eischens, & Faris, 1979), and psychotherapy clients diagnosed as depressed (Klein et al., 1985). In the Klein et al. (1985) study, 60 depressed individuals were randomly assigned to exercise, cognitive-interpersonal, and meditation-relaxation therapies. Although completion rates differed (56%, 67%, and 48%, respectively), all three groups demonstrated significant reduction in depression at treatment conclusion. Although these studies did not do so, Tryon (1991) suggested that activity monitors could document the presence and intensity of therapeutic exercise. Tryon also proposed that further research be conducted to examine the relation of depression and activity over long periods.
Simulations, Structured Assessments, and Work Assessment Centers
With cognitive ability tests, respondents perform tasks that often resemble the criteria the test is designed to predict. Personality and temperament tests, in contrast, require respondents to report about relevant behaviors rather than perform them. In addition, personality tests have been characterized as artificial and restricted in range (Hilgard, 1987). Given the substantial differences in reliability and validity between cognitive ability and personality tests (e.g., Murphy & Davidshofer, 1988; Parker, Hanson & Hunsley, 1988), it would appear logical to develop personality tests that require respondents to perform relevant behaviors in naturalistic settings.
The idea of simulating aspects of real life as a measurement procedure is as old as psychological science. Allport (1921) cited Galton as advocating the representation of "certain problems of actual life, and of observing the individual's adjustment to these situations" (p. 451). Theory and research do support the idea that tests that function as simulations of criteria may maximize predictive validity (Cronbach & Gleser, 1965; Paul et al., 1986b; Wiggins, 1973). Asher and Sciarrino's (1974, cited in Murphy & Davidshofer, 1988) review of research on work sample tests found that the greater the similarity between test content and job, the higher the predictive validity. Wernimont and Campbell (1968) maintained that "an implicit or explicit insistence on the predictor being 'different' seems self-defeating" (p. 373); they described procedures for detailing critical job duties and assessing applicants' history relative to those behaviors. Gronlund (1988) advised developers of achievement tests to "use the item types that provide the most direct measures of student performance specified by the intended learning outcome" (p. 27). Danziger (1990) observed that in general, psychological tests best predict performance on other psychological tests:
When one applies intelligence- or aptitude-test results to the prediction of future performance in the appropriate settings, academic or otherwise, one is essentially using a simulation technique. The more effectively the investigative context simulates the context of application, the better the prediction will be. (p. 188)

Contemporary reviewers of cognitive ability tests such as Lohman (1989) have noted trends toward increased study of simulations (Frederiksen, 1986), performance assessments (Gronlund, 1988; Shavelson, Baxter, & Pine, 1991), and the criteria such tests are attempting to predict (Cronbach, 1984; Glaser, Lesgold, & Lajoie, 1987). Cronbach (1992) emphasized that a major focus of measurement work in the 1990s will be improving the psychometric properties of test criteria.
Hilgard (1987) reported that the potential of structured assessments, seen in Murray's work and that of other psychologists, led to efforts during World War II to create extended observations in naturalistic situations. Murphy and Davidshofer (1988) described the program developed by the Office of Strategic Services (the CIA's precursor) to select and train intelligence agents. Recruits performed in lifelike situations over a three-day period so that observers could record their responses to stress. Recruits completed such tasks as (a) building a simple structure with the help of two uncooperative assistants who were assessment confederates; (b) functioning as a recruiter who interviewed an applicant/confederate; and (c) improvising during role-plays of various interpersonal situations.
Murphy and Davidshofer (1988) described similar methods employed by the Peace Corps. Suitability screenings consisted of questionnaires and tests assessing the applicant's previous training, experience, and language aptitudes as well as academic transcripts and letters of reference. These screenings were followed by field selection procedures consisting of two to three months of training. During this period applicants completed interviews and psychological tests such as the MMPI and were observed and rated by staff. At the end of training a selection board reviewed all information and decided whom to accept or reject for the Corps.
Murphy and Davidshofer (1988) noted that it is difficult to evaluate the relative effectiveness of the OSS and Peace Corps assessment programs, partially because of a lack of suitable criteria. They reported that the OSS concluded that little evidence existed to support the predictive validity of their assessments (OSS Staff, 1948). However, later analyses of OSS data (Wiggins, 1973) indicated that the assessments produced a modest increase in the number of correct selection decisions made. With the Peace Corps, only 9% of those selected returned prematurely from their assignments, with 1% of the total returning because of psychiatric reasons. This result compared favorably to the 10-15% of volunteers' age cohort expected to experience some type of emotional impairment that would have interfered with assignment completion. Noting the results of the statistical versus clinical debate, Murphy and Davidshofer (1988) concluded that little evidence supports the use of diagnostic committees to integrate data and make selections over statistical methods.
The OSS and Peace Corps programs provided a foundation for contemporary work assessment centers. Gaugler, Rosenthal, Thornton and Bentson (1987) estimated that since AT&T began the first center in 1956 (Bray & Grant, 1966), more than 2,000 organizations have utilized such facilities. Such centers typically assess individuals in small groups, utilizing multiple methods, including situational tests (such as the in-basket work sample and the leaderless group discussion), interviews, and personality tests, on multiple dimensions, such as leadership and resistance to stress.
Reviews of the reliability and validity of work assessment centers are generally positive (Gaugler et al., 1987; Klimoski & Brickner, 1987; Murphy & Davidshofer, 1988). Gaugler et al.'s (1987) meta-analysis of 50 assessment center studies revealed a corrected mean of .37 for 107 validity coefficients. Gaugler et al. (1987) found that validities were higher when (a) multiple evaluations were used, (b) assessees were female, (c) assessors were psychologists rather than managers, (d) the study was methodologically sound, and (e) peer evaluation was used. Murphy and Davidshofer (1988) noted that predictors and criteria in assessment center studies frequently are observer ratings; Gaugler et al. (1987; also see Klimoski & Strickland, 1981, cited in Hunter & Hunter, 1984) found that assessment centers better predict ratings of work potential than performance or ratings of performance. Sackett and Dreher (1982; Klimoski & Brickner, 1987), however, questioned the construct validity of assessment center measures, noting that factor analyses indicate that ratings relate more closely to distinct exercises than to the dimensions or constructs they are intended to measure.
Another significant limitation of assessment centers and other performance measures is their failure to exceed the predictive validity of cognitive ability tests for occupational criteria. Hunter and Hunter's (1984) meta-analysis, for example, found comparable validity estimates (.43 to .54) for predictors such as work sample tests, ability tests, peer ratings, behavioral consistency experience ratings, job knowledge tests, and assessment center ratings. The greater cost of assessment centers and simulations, compared to ability tests, sets limits on their utility in educational and occupational domains. Assessment center costs, however, are not fixed, as it is unclear just how much of the criterion must be simulated to obtain valid predictions (cf. Motowidlo, Dunnette, & Carter, 1990). Frederiksen (1962) proposed a set of six such distinctions that gauge the fidelity of psychological tests in comparison with actual behaviors:
(a) The first category, opinion, requires the test-taker to state an opinion about an individual's performance. Given the multiple factors that can influence such opinions, this level represents the lowest fidelity relative to actual performance.
(b) Attitude scales presumably correlate with performance. These correlations, however, are often low to moderate.
(c) Knowledge measurement is similar to Bandura's (1977) outcome expectation variable. Both reflect an individual's knowledge of the information and skills necessary to perform a behavior. The correlation between such knowledge and actual performance is often low.
(d) Related behaviors are concomitants that are presumed to covary with performance and which are often employed because of practical considerations.
(e) Simulations are constructed situations where respondents indicate what they would do if they were actually in the performance situation.
(f) Finally, lifelike behavior refers to respondents' performance under conditions similar or identical to the situation in question.
Summary. These approaches offer well-developed methods for observing psychological phenomena. To the extent that behavioral assessment retains its radical roots, however, it is likely to resist exploring the usefulness of well-developed traditional concepts such as constructs, reliability, and validity. Such concepts seem indispensable for theory development and measurement evaluation (cf. Silva, 1993).
Computer-Based Approaches
Scientific progress is inseparably linked to the state of measurement theory and procedures. Measurement, in turn, is limited by the technologies available to gather data. Throughout most of psychology's history, the predominant measurement technologies have been printed materials and pencils. While Danziger (1990) maintained that other sciences came to rely on reliable witnesses as the key to credible knowledge, technology can also increase observer reliability (cf. Rosenthal, 1976). The moons of Jupiter, for example, are invisible to all except individuals with exceptional eyesight in excellent atmospheric conditions. A telescope, however, allows any sighted person to easily view those moons at leisure.
The introduction of microcomputers has resulted in widespread interest in measurement uses of this technology. With the exception of IRT applications, most contemporary developers of computer-based testing procedures have focused on adapting traditional tests so that one or all test components--administration, response recording, scoring and data analysis, and interpretation--are performed by computer (Hedlund, Vieweg & Cho, 1984; Butcher, 1987a). Many of these applications were published in the 1980s when it appeared that testing software would be an economic boon for developers and publishers. Automated procedures were created, for example, for the MMPI (Anderson, 1987; Butcher, 1987b; Honaker, 1988), 16PF (Harrell & Lombardo, 1984; Karson & O'Dell, 1987), Rorschach (Exner, 1987), Strong Interest Inventory (Hansen, 1987; Vansickle, Kimmel & Kapes, 1989), Self-Directed Search (Reardon & Loughead, 1988), neuropsychological tests (Adams & Heaton, 1987; Golden, 1987; Heaton, Grant, Anthony, & Lehman, 1981), interviews (Erdman, Klein & Greist, 1985; Fowler, Finkelstein, Penk, Bell & Itzig, 1987; Giannetti, 1987), intelligence and aptitude tests (Elwood, 1972a, 1972b, 1972c, 1972d; Harrell, Honaker, Hetu, & Oberwager, 1987), behavior rating systems (Thomas, 1990), psychophysiological research (Blumenthal & Cooper, 1990; McArthur, Schandler, & Cohen, 1988), attention deficits and hyperactivity (McClure & Gordon, 1984; Post, Burko & Gordon, 1990), and diagnostic procedures (Stein, 1987).
Given that the basic objective of much of this work has been the transfer of paper-and-pencil tests to computer, a logical research question to ask concerns the equivalence of procedures, particularly with computer administration of test material (Skinner & Pakula, 1986). That is, does the automation of test procedures affect the instrument's reliability and validity? Given the economic potential of testing software (Meier & Geiger, 1986), these questions have tended to be investigated after software has been developed and marketed. Some studies have found no differences between traditional and computer-administered versions of tests (e.g., Calvert & Waterfall, 1982; Elwood, 1972a; Guarnaccia, Daniels, & Sefick, 1975; Hitti, Riffer, & Stuckless, 1971). However, some who take computer-administered tests show elevated negative affect scores (George, Lankford, & Wilson, 1990), indicate more anxiety with computer-based procedures (Hedl, O'Neil, & Hansen, 1973), alter their rate of omitting items (Mazzeo & Harvey, 1988), and increase their "faking good" responses (Davis & Cowles, 1989). Given the equivocal findings, the equivalence issue currently must be addressed by test administrators on a test-by-test, sample-by-sample basis.
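One way a test administrator might begin such a test-by-test check is sketched below with simulated data: compare total-score means for paper and computer administrations of the same instrument. A fuller equivalence check would also compare reliabilities, validity coefficients, and item-level statistics.

```python
# Minimal equivalence check between administration modes (simulated total scores).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
paper = rng.normal(loc=50, scale=10, size=120)       # paper-and-pencil group totals
computer = rng.normal(loc=52, scale=10, size=120)    # computer-administered group totals

t, p = stats.ttest_ind(paper, computer)
print(f"mean difference = {computer.mean() - paper.mean():.1f}, t = {t:.2f}, p = {p:.3f}")
```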
While straightforward automation of traditional tests often improves the reliability and efficiency of testing procedures and scoring, computerization has yet to advance basic measurement theory and technology. The existing and growing base of microcomputers, however, offers a platform from which to support a second phase of new measurement procedures that more fully utilize computer capabilities. Experimental procedures and measurements which have previously been laboratory-based can now be economically transported to microcomputers for use in applied settings. As Embretson (1992) noted, tightly controlled experimental tasks may be implemented as test items. Computer-based measurement can blur the distinction between experiments and tests, thus facilitating the unification of correlational and experimental psychology suggested by Cronbach (1957). Part of the success of neuropsychological testing results from the fact that many of the tasks contained in these tests have been derived from laboratory procedures (Goldstein, 1990). Automation may make such derivations possible in other domains, several of which are described below.
Response latency. A variable usually associated with cognitive investigations in laboratory settings (Welford, 1980a), reaction time or response latency can easily be measured in computer-based tests and tasks (Ryman, Naitoh & Englund, 1988). Brebner and Welford (1980) observed that early psychologists hoped to use reaction time as a physical measure of mental processes. Contemporary psychologists have employed latency as an indicator of cognitive ability (Jensen, 1982; Lohman, 1989), stress and fatigue (Nettelbeck, 1973; Welford, 1980b), and psychopathology (Nettelbeck, 1980).
Utilizing latency as a key component, Holden, Kroner, Fekken, and Popham (1992) described a model to predict faking on personality test item response. In this model test-takers respond to items by comparing test item content with self-information contained in a schema (Holden, Fekken & Cotton, 1991). Because schemas expedite the search for information, Holden et al. (1992) proposed that responses should be faster for schema-congruent test answers than incongruent responses. Previous research has found that individuals who possess high total scores on an anxiety scale respond more quickly when agreeing with anxiety-relevant items (Holden et al., 1991; Popham & Holden, 1990). Given the historical emphasis on distortion in self-report, Holden et al. (1992) extended this model to include dissimulation on personality test items. They reasoned that persons faking good would respond more quickly to socially desirable items (i.e., congruent schema) than undesirable items (incongruent). Conversely, persons faking bad should respond more quickly to undesirable items than desirable ones. Using microcomputer-presented items, Holden et al. (1992; also see Holden & Kroner, 1992) found support for both hypotheses in a series of studies utilizing the MMPI and Basic Personality Inventory with college students and maximum security prisoners. Other researchers (George & Skinner, 1990; Tetrick, 1989) have also described studies using subjects' response latency to individual questionnaire items to detect inaccurate responding.
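The logic of the latency model can be sketched in a few lines. The following is a hypothetical illustration, not Holden et al.'s software: true-false items are presented one at a time, response times are recorded, and a pattern consistent with faking good is inferred when endorsements of socially desirable items come faster than denials of undesirable ones. The item wording is invented.

```python
# Hypothetical latency-based check for "faking good" (not Holden et al.'s procedure).
import time
from statistics import mean

def administer(item: str) -> tuple[str, float]:
    """Present one true-false item and return the response and its latency in seconds."""
    start = time.perf_counter()
    response = input(f"{item} (T/F): ").strip().upper()
    return response, time.perf_counter() - start

desirable_items = ["I am always courteous.", "I never lose my temper."]
undesirable_items = ["I sometimes gossip.", "I have told a lie."]

desirable_latencies, undesirable_latencies = [], []
for item in desirable_items:
    response, latency = administer(item)
    if response == "T":                    # schema-congruent answer for a fake-good respondent
        desirable_latencies.append(latency)
for item in undesirable_items:
    response, latency = administer(item)
    if response == "F":
        undesirable_latencies.append(latency)

if desirable_latencies and undesirable_latencies:
    print("Pattern consistent with faking good:",
          mean(desirable_latencies) < mean(undesirable_latencies))
```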
Human speech. As applications to record and transcribe human speech become available and economical in this decade, they have the potential to revolutionize interviewing and measurement procedures. A program that could recognize disruptions of normal speech patterns and relate that information to anxiety (cf. Mahl, 1987) would certainly be of interest to research and applied psychologists. Most theories of counseling and psychotherapy view language as crucial to understanding and intervening with clients (Meier & Davis, 1993). Researchers and practitioners have considerable interest in computer programs that could transcribe psychotherapy sessions and produce or assist in the production of qualitative and quantitative measures of that communication.
Research on speech applications appears to be in its initial stages. Friedman and Sanders (1992) described a microcomputer system, coupled with a telephone, designed to monitor pauses in speech. They analyzed long speech pauses (defined as greater than or equal to one second) in relation to mood disorders. Friedman and Sanders (1992) maintained that pause measurement can be useful in studying and identifying such problems as depression, mania, dementia, and coronary-prone behavior. Canfield, Walker, and Brown (1991) described a microcomputer-based coding system to analyze sequential interactions that occur in psychotherapy. Using the Gloria films of Ellis, Perls and Rogers (Shostrom, 1966), Canfield et al. (1991) explored whether this coding system could demonstrate differences among therapists with distinct therapeutic styles. Canfield et al. (1991) analyzed transcripts of the Gloria psychotherapy sessions for positive and negative emotion, cognition, and contracts (i.e., promises and commitments). As expected, they found that the three therapists' use of these categories differed in frequency. They also found that Gloria differed in her frequency of these categories across the three therapists and that therapists employed different sequences of categories when responding to client statements. For example, one therapist would respond to a client's positive emotion statement with a positive emotion, while another therapist would respond with a positive cognition. Canfield et al. (1991) noted that previous studies of these films have found differences in use of predicates (Meara, Shannon, & Pepinsky, 1979), reflection and direction (Hill, 1978), and language structure (Zimmer & Cowles, 1972).
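Pause detection of the sort Friedman and Sanders (1992) describe can be approximated simply once speech has been reduced to an amplitude envelope. The sketch below is not their system; the sampling rate, silence threshold, and signal are assumptions chosen only to show how silent runs of one second or longer might be flagged.

```python
# Illustrative long-pause detector operating on an amplitude envelope (assumed parameters).
import numpy as np

FRAME_RATE = 100            # envelope frames per second (assumption)
SILENCE_THRESHOLD = 0.05    # amplitudes below this count as silence (assumption)
MIN_PAUSE_SECONDS = 1.0     # "long" pause, following Friedman and Sanders' definition

def long_pauses(envelope: np.ndarray) -> list[float]:
    """Return durations (seconds) of silent runs lasting at least MIN_PAUSE_SECONDS."""
    silent = envelope < SILENCE_THRESHOLD
    pauses, run = [], 0
    for frame_is_silent in silent:
        if frame_is_silent:
            run += 1
            continue
        if run / FRAME_RATE >= MIN_PAUSE_SECONDS:
            pauses.append(run / FRAME_RATE)
        run = 0
    if run / FRAME_RATE >= MIN_PAUSE_SECONDS:
        pauses.append(run / FRAME_RATE)
    return pauses

# two seconds of speech, a 1.5-second pause, then one second of speech
envelope = np.concatenate([np.full(200, 0.3), np.full(150, 0.01), np.full(100, 0.3)])
print(long_pauses(envelope))    # -> [1.5]
```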
Simulations. Computers make increasingly realistic simulations of the type discussed above an economically viable possibility now and in the near future. Multimedia programs that utilize audio and visual material in addition to text may be used to create assessment simulations for use in business, industrial, educational and clinical settings. Such simulations can also function as unobtrusive measures to supplement reactive self-report scales (Johnson, Hickson, Fetter, & Reichenbach, 1987). Computer-assisted instruction (CAI) programs can employ simulations to perform the dual functions of teaching and assessment (Fulton, Larson, & Worthy, 1983; Meier & Wick, 1991). Meier and Wick (1991) described a simulation designed to demonstrate blood alcohol levels for subject-selected drinking experiences. Unobtrusively recorded reports of subjects' alcohol consumption in the simulation were (a) significantly correlated with self-reports of recent drinking behavior, drinking intentions, and attitudes toward alcohol, and (b) uncorrelated with a measure of social desirability. Similarly, Worthen, Borg, and White (1993) discussed the use of computers for continuous measurement in educational settings. If a particular curriculum were computer-based, testing could be embedded in the instructional material and thus be relatively unobtrusive. Worthen et al. noted that such an approach fits very well with mastery learning, where progression to the next level of instruction depends upon demonstration of successful performance on the current material.
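To make the Meier and Wick (1991) example more concrete, the sketch below shows the kind of computation such a drinking simulation might display and log. It uses the textbook Widmark approximation rather than Meier and Wick's actual formula, and the consumption log is a stand-in for their unobtrusive recording of subject-selected drinking choices; the weights, hours, and drink counts are illustrative.

```python
# Illustrative blood alcohol estimate (Widmark approximation; not Meier & Wick's formula).
def estimated_bac(standard_drinks: float, weight_lb: float, hours: float, male: bool = True) -> float:
    """Rough percent blood alcohol concentration."""
    ounces_ethanol = standard_drinks * 0.6        # one standard drink is roughly 0.6 oz ethanol
    r = 0.73 if male else 0.66                    # Widmark body-water constant
    bac = (ounces_ethanol * 5.14) / (weight_lb * r) - 0.015 * hours
    return max(bac, 0.0)

consumption_log = []                              # unobtrusive record of choices in the simulation
for drinks in (1, 3, 5):
    bac = estimated_bac(drinks, weight_lb=160, hours=2)
    consumption_log.append((drinks, round(bac, 3)))
    print(f"{drinks} drinks over 2 hours -> estimated BAC {bac:.3f}%")
```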
Being relatively resource poor, psychology usually must wait for new technology to become available in the mass marketplace before such devices can be applied to measurement and assessment problems. Such is the case with virtual reality, a set of computer-based devices that allow simulations of visual, auditory, and tactile phenomena. According to Potts (1992), the devices typically include: (a) a helmet for projecting three-dimensional views and stereo sound; (b) a joystick for controlling the user's movement in the virtual world; (c) a glove which allows the user to manipulate objects in the virtual world; (d) a Polhemus sensor, suspended above the user, that tracks the positions of the helmet, joystick, and glove, and relays that information to the computer; and (e) a computer to create sensory input for the user and track the user's actions. If the validity of simulations depends upon the closeness of their match to real situations (Motowidlo, Dunnette & Carter, 1990), then virtual reality holds great potential for psychological measurement. Like much of the technology described in this section, however, the cost of a virtual reality system is high (down from $200,000 a few years ago to $20,000 currently, according to Potts [1992]) and availability is low.
Summary. Computer-based approaches will be increasingly employed in the future, if for no other reason than that they offer increased efficiency in test development, administration, scoring, and interpretation. Automation in psychological testing has occurred, however, with relatively little attention to theoretical considerations such as human factors issues (cf. Meier, 1988; Meier & Lambert, 1991; Rosen, Sears & Weil, 1987) or to fully employing computer capabilities in the testing process.
Summary and Implications
This chapter has described current work in traditional measurement, statistically-oriented approaches, cognitive approaches, behavioral assessment, and computer-based approaches. Much of the work in traditional measurement appears to center on confirming the Big Five factors of personality and extending this model to other areas such as clinical assessment (e.g., McCrae & Costa, 1987a; Wiggins & Pincus, 1989). Behavioral assessment continues to thrive even with a greater focus on its relations to traditional psychometric concepts (e.g., Silva, 1993). As described in this chapter and in Chapters 2 and 3, cognitive theory and procedures hold considerable promise for the investigation and explanation of measurement processes such as item response; cognitive models, however, tend to neglect motivational and affective influences on measurement processes. Computer-based approaches would seem important if for no other reason than that technological innovations tend to reduce the amount of interpretation in measurement and assessment. Content analysis and qualitative research, for example, are facilitated by the tape recording of conversations, which allows the listener to replay phrases and sentences for coding instead of performing the task as the activity occurs.
Theoretical and statistical approaches need greater reintegration, but that may not happen until the former has grown in strength sufficient to function as a full partner in such a merger. Brennan (1983) noted that IRT is a scaling theory, that is, useful for determining which items fit a theoretical model. With ability tests, good items should fit the shape of the appropriate ICC. GT, in contrast, is a sampling theory, useful for investigating the multiple factors that influence item and test scores. Brennan summarized these differences by observing that IRT attends to "individual items as fixed entities without specific consideration of other conditions of measurement," (p. 122), while GT's emphasis "is placed on viewing an item as a sampled condition of one facet in a (usually) larger universe of conditions of measurement" (p. 122). It makes sense that IRT's major applications have been with cognitive ability tests that measure a single dominant factor (Hambleton et al., 1991). Non-cognitive tests with multiple validities, resulting from method and construct-related influences, would seem more amenable to study utilizing GT procedures. Nevertheless, both approaches seek increased precision in psychological tests. While IRT and GT may eventually be merged in a more complete measurement theory, GT deserves more attention than it has received. Loevinger's (1957) comment remains apropos: "There is extraordinarily little empirical evidence for raising validity by improving scalability, considering the amount of interest in scale analysis" (p. 663).
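For readers less familiar with the item characteristic curve mentioned above, the brief sketch below computes points along a two-parameter logistic ICC. The discrimination and difficulty values are arbitrary; the point is simply that an IRT analysis asks whether observed responses to an item fit a curve of this general shape.

```python
# Two-parameter logistic item characteristic curve (illustrative parameters).
import numpy as np

def icc_2pl(theta: np.ndarray, a: float, b: float) -> np.ndarray:
    """Probability of a keyed response given ability theta, discrimination a, difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)
print(np.round(icc_2pl(theta, a=1.2, b=0.0), 2))
```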
This chapter contains an important omission that deserves at least a brief emphasis. Work on fundamental measurement theory (Krantz et al., 1971; Luce, Krantz, Suppes, & Tversky, 1990; Suppes, Krantz, Luce, & Tversky, 1989) attempts to understand the foundations of types of measurement across the sciences through an analysis of measurement axioms (i.e., self-evident principles). Such principles include the rules for: (a) assigning numbers to two manifestations of a phenomenon that differ in some respect and (b) performing mathematical operations on those numbers, such as addition and division. Knowledge of the properties of these numbers may help evaluate the usefulness of the scales and actual objects they are intended to represent. For example, Cliff (1992) noted that Luce and Tukey (1964) demonstrated that interval scales could be defined when ordinal consistency appeared among three or more variables. Such a conclusion, Cliff (1992) observed, "seemed to open the way to define the truly psychological nature of many variables" (p. 186). However, psychologists' knowledge of work in this area has been hampered by its mathematical complexities, lack of demonstrated empirical usefulness, and difficulties in coping with error (Cliff, 1992).
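One of the ordinal conditions underlying the Luce and Tukey (1964) result can be illustrated concretely. The sketch below checks independence (single cancellation) in a two-factor table: for an additive representation to exist, the ordering of levels on one factor must not reverse across levels of the other. The factors, labels, and values are hypothetical.

```python
# Checking the independence (single cancellation) axiom in a hypothetical two-factor table.
import numpy as np

# rows: levels of factor A (e.g., ability); columns: levels of factor X (e.g., motivation);
# cells: observed performance, treated as ordinal information only
table = np.array([
    [10, 14, 19],
    [13, 17, 22],
    [15, 20, 26],
])

def independence_holds(t: np.ndarray) -> bool:
    """Row order must be identical in every column, and column order identical in every row."""
    row_orders = {tuple(np.argsort(t[:, j])) for j in range(t.shape[1])}
    col_orders = {tuple(np.argsort(t[i, :])) for i in range(t.shape[0])}
    return len(row_orders) == 1 and len(col_orders) == 1

print("independence (single cancellation) satisfied:", independence_holds(table))
```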