Consistency across Raters and Rating Procedures
Inconsistencies Across Raters
Employment interviews
Clinical interviews
Rater Errors
Halo errors
Leniency and criticalness errors
Range restriction errors
Improving interviews and raters
Whose rating is most valid?
Inconsistency Across Rating Procedures
Statistical versus clinical prediction
History
Current status
Clinical judgment
Selection and psychotherapy
The negligible effects of the debate
Clinical Observation of Test Behavior
Scientist as Observer
Summary
If one does not entirely trust psychological self-reports, what then? A logical direction is to employ raters who do not share the biases of subjects and who have some experience or training in gathering information from individuals. In the history of measurement, the most commonly employed method of gathering psychological information has been the interview.
The interview's greatest advantage is also its greatest disadvantage: face validity. That is, the data produced in an interview appear credible to both the interviewer and the interviewee, a point Fremer (1992, p. 4) made in quoting Thorndike (1969).
As described in Chapter 2, other ratings are qualitative or quantitative assessments made by individuals about others along a psychological dimension. As was the case with self-reports, the search for consistency in data produced by interviews and other rating methodologies has proven problematic.
Inconsistencies Across Raters
Employment interviews. Interviews are often conducted in industrial-organizational (I/O) settings for the purpose of personnel selection. Studies show that over 95% of all employers use interviews, and most see the interview as the most influential part of the hiring process (Guion, 1976; Miner & Miner, 1979; Murphy & Davidshofer, 1988). Similar ratings are also employed to measure job performance and to evaluate leadership.
Murphy and Davidshofer's (1988) review of the research literature examining the effectiveness of the employment interview suggested that no empirical basis exists for its popularity. Interviews have often been found to be unreliable, that is, interviewers exhibit little consistency in their judgments of applicants. Interviews also appear to lack high validity: many systematic sources of error, including the applicant's gender, age, ethnicity, and physical attractiveness, affect interview outcome. Little evidence exists to suggest that the interview adds to the effectiveness of selection tests. In other words, research results suggest employment decisions are best made on tests alone rather than tests and interviews.
Why do interviewing practices persist in the face of this evidence? Murphy and Davidshofer (1988) offered two reasons. Interviewers rarely receive feedback about the validity of their decisions and so may be likely to overestimate their judgment's effectiveness. Also, interviewers may feel more confident in their ability to conduct an interview than to employ more difficult techniques such as psychological tests.
Clinical interviews. Murphy and Davidshofer (1988) noted that clinical and employment interviews are fairly similar. Both are usually less structured than tests and are often intended to obtain information unavailable from tests. During both types of interviews, interviewers pay attention to the interviewees' answers as well as their behavior. Like employers, clinicians rely heavily on interviews and typically place more weight on interview data than other sources. Yet empirical research does not support high validity for the clinical interview (Murphy & Davidshofer, 1988; Wiggins, 1973). Why not?
The hypothesis confirmation bias suggests that clinicians, like laypersons, may inappropriately crystallize on early impressions of other people (Darley & Fazio, 1980; Jones, Rock, Shaver, Goethals, & Ward, 1968). If clinicians' initial impressions of their clients are correct, interviewers will pursue useful lines of inquiry. But to the extent that hypotheses are misleading or incorrect, interviewers are likely to ignore important information contrary to their initial impression. As Murphy & Davidshofer (1988) observed, "there is a pervasive tendency to overestimate the accuracy of one's judgments, and clinicians do not seem to be immune from this tendency" (p. 374). Long-standing criticisms of clinical interviews and diagnostic techniques, however, have had little impact on the behavior of practicing clinicians (Peterson, 1968).
Rater Errors
Raters can make errors; that is, they may be influenced by systematic factors other than those the rating is intended to capture. Murphy and Davidshofer (1988) classified rater errors into three categories: halo errors, leniency errors, and range restriction errors. All three have been recognized at least since the 1920s (Saal, Downey, & Lahey, 1980).
Halo errors. One of the first rater errors to be studied (e.g., Thorndike, 1920), halo errors are those in which a rater's overall impressions about the ratee influence ratings about specific aspects of the person. In other words, a rater holds a stereotype about a ratee, and that global impression hinders a more precise or valid rating in specific domains. As shown in Figure 11, if a supervisor believes a particular employee is "good," then that supervisor may rate all instances of the employee's performance as good, regardless of the actual performance. The rating process is influenced by the supervisor's schema instead of data observed, stored, and retrieved from memory about specific behaviors. Nisbett and Wilson (1977) further distinguished between halo errors, in which a global impression overrides specific information the rater actually possesses, and errors made when only ambiguous data are available.
Saal et al. (1980) suggested that halo errors can be detected by: (a) high correlations between performance dimensions that should not be highly related, indicating too little discrimination among different aspects of behavior; (b) factor analysis of the ratings, in which fewer factors indicate greater halo; (c) smaller variation across dimensions by a particular rater; and (d) a significant Rater X Ratee interaction in a Rater X Ratee X Dimensions ANOVA (two of these indicators are illustrated in the sketch following Figure 11). While such evidence of halo effects is widespread, data demonstrating raters' ability to discriminate are also available. For example, James and White (1983) found that some of the 377 Navy managers they studied were able to detect differences in employees' performances across situations.
----------------------------------------
Insert Figure 11 About Here
----------------------------------------
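A minimal illustration, assuming Python with NumPy and an entirely hypothetical array of ratings (raters by ratees by dimensions), of how two of Saal et al.'s (1980) indicators might be computed; high inter-dimension correlations and small within-rater variation across dimensions would both be consistent with halo.

```python
import numpy as np

# Hypothetical ratings: 10 raters x 20 ratees x 5 performance dimensions on a 1-7 scale.
rng = np.random.default_rng(0)
ratings = rng.integers(1, 8, size=(10, 20, 5)).astype(float)

# Indicator (a): correlations among dimensions that should be distinct.
mean_over_raters = ratings.mean(axis=0)               # 20 ratees x 5 dimensions
dim_corr = np.corrcoef(mean_over_raters, rowvar=False)
print("Inter-dimension correlations:\n", np.round(dim_corr, 2))

# Indicator (c): spread across dimensions within each rater-ratee pair.
# Near-zero spread suggests the rater is applying one global impression.
within_pair_sd = ratings.std(axis=2)                  # 10 raters x 20 ratees
print("Mean within-pair SD, by rater:", np.round(within_pair_sd.mean(axis=1), 2))
```

With real performance ratings, the random numbers above would simply be replaced by the observed scores.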
Leniency and criticalness errors. In this class of errors, the rater either under- or over-estimates the performance of the ratee. Leniency errors may be detected by a mean that deviates considerably from the scale mid-point or by a positively or negatively skewed distribution of ratings (Murphy & Davidshofer, 1988; Saal et al., 1980); a brief illustration follows Figure 12. Leniency errors may be a cause of ceiling or floor effects (Kazdin, 1980). In a ceiling effect, all ratings cluster near the top of the scale; with a floor effect, they cluster at the bottom. When an intervention is implemented, ceiling and floor effects can hinder the detection of the intervention's impact. For example, suppose a researcher designs a pretest-posttest study examining the effects of a stress reduction program on air traffic controllers. As shown in Figure 12, an observer's judgments of controllers' stress significantly underestimate those levels at the pretest. How could a decrease resulting from the intervention be detected at post-test?
----------------------------------------
Insert Figure 12 About Here
----------------------------------------
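The two leniency indicators just described can be sketched in a few lines of Python; the ratings and the 1-7 scale are hypothetical, and any cutoffs for flagging a rater would have to be chosen for the particular application.

```python
import numpy as np

def leniency_check(ratings, scale_min=1, scale_max=7):
    """Return (mean shift from the scale midpoint, skewness) for one rater's ratings."""
    x = np.asarray(ratings, dtype=float)
    midpoint = (scale_min + scale_max) / 2
    mean_shift = x.mean() - midpoint                         # large positive shift suggests leniency
    skewness = np.mean((x - x.mean()) ** 3) / x.std() ** 3   # negative skew plus a high mean
    return mean_shift, skewness                              # suggests a ceiling effect

# Hypothetical lenient rater: ratings pile up near the top of the scale.
print(leniency_check([6, 7, 7, 6, 7, 5, 7, 6, 7, 7]))
```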
Range restriction errors. Raters' inability or unwillingness to distinguish the range of ratees' performance is reflected in range restriction errors. Raters who make this error fail to discriminate ratees' level of performance, instead consistently choosing a rating near the scale mid-point. For example, a supervisor might rate all workers as average. A standard deviation near zero for a group of ratings would indicate the possibility of range restriction errors.
Halo and leniency-criticalness errors also produce range restriction. All three types of errors result in scores which do not reflect actual variation in ratees' performance. In these cases, specific raters show too much consistency and not enough valid variation relevant to the assessment's intended purpose. Scores which have restricted ranges attenuate correlations. For example, an employment test might actually correlate .70 with job performance, but range restriction in the test (and possibly, the criterion) would shrink the obtained correlation to .35. Statistical corrections for this attenuation have been developed, but the corrected correlation can overestimate the actual relation (Murphy & Davidshofer, 1988).
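One common correction of this kind is Thorndike's Case 2 formula for direct range restriction, sketched below in Python. The standard deviation ratio is an assumed, purely illustrative value; as noted above, the corrected value can overestimate the true relation.

```python
import numpy as np

def correct_range_restriction(r_restricted, sd_unrestricted, sd_restricted):
    """Thorndike Case 2 correction for direct range restriction on the predictor."""
    u = sd_unrestricted / sd_restricted
    return r_restricted * u / np.sqrt(1 + r_restricted**2 * (u**2 - 1))

# Observed r of .35 in a restricted sample, assuming the unrestricted SD is
# twice the restricted SD (illustrative numbers only); prints roughly .60.
print(round(correct_range_restriction(0.35, sd_unrestricted=2.0, sd_restricted=1.0), 2))
```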
Improving Interviews and Raters
Helzer (1983) defined a structured interview as one which describes: (a) the clinical information to be obtained; (b) the order of questions; (c) the coding and definitions of symptom questions; and (d) the guidelines for probing responses so as to obtain a codable answer.
In a number of areas clinicians have employed such procedures to develop explicit diagnostic criteria and a structured format for assessing them. Resulting interviews such as the Schedule for Affective Disorders and Schizophrenia (SADS; Endicott & Spitzer, 1978) and the Diagnostic Interview Schedule (DIS; Robins, Helzer, Croughan, & Ratcliff, 1981) have demonstrated significant increases in reliability over previous interviews (Endicott & Spitzer, 1978).
Wiesner and Cronshaw (1988) and Wright, Lichtenfels, and Pursell (1989) conducted meta-analyses that demonstrated that the addition of structure to employment interviews significantly increased validity estimates. Wright et al. (1989) found a validity estimate of .39 for structured interviews and suggested that this number "approaches that found for many cognitive ability tests" (p. 197) as reported by Hunter and Hunter (1984). Structured interviews work, Wright et al. (1989) maintained, because they: (a) are closely based on a job analysis of the employment position, thus reducing error from information irrelevant to the specific job; (b) assess individuals' work intentions, which are often linked to work behavior; and (c) use the same set of questions and standards for scoring answers, thereby increasing reliability.
In the clinical area, however, little consensus has been reached about the validity of many structured interviews partially because the criteria for diagnoses continue to be refined (Morrison, 1988). Nevertheless, structured interviews and relatively brief rating scales completed by the clinician, such as the Global Assessment of Functioning Scale (GAF; American Psychiatric Association, 1987; see also Endicott, Spitzer, Fleiss, & Cohen, 1976), appear to be multiplying in the same manner as self-report scales.
In such disparate areas as behavioral assessment (Hartman, 1984; Paul, 1986), performance appraisal (Gronlund, 1988) and process research (Hill, 1982), psychologists have increasingly focused on rater training as a method of decreasing rater error. As Paul et al. (1986a) observed, "The schema employed by untrained or minimally trained observers are generally loose, with fuzzy category boundaries based on prototypes" (p. 1-50). Rater training is typically designed to reduce schema-produced biases, increase rater motivation, and improve observational skills (McIntyre, Smith & Hassett, 1984). Behavioral assessors, for example, record the overt behavior of staff and clients (Paul, 1986). Observers are trained to record specific behaviors and then assigned to make those observations in specified settings. Describing the elements of one type of behavioral observational training, Hartman (1984) indicated that observers should:
(a) complete a general orientation. In research studies, observers should be provided a rationale explaining that they should remain blind to study hypotheses, avoid generating hypotheses, and avoid private discussions of rating problems;
(b) memorize verbatim such information as the coding procedures and examples as contained in the observation training manual;
(c) be trained to criterion accuracy, first through written tests, and then with increasingly complex practice observations, each followed by feedback and discussion of rating problems;
(d) practice observations in the actual setting, with an emphasis on maintaining high observer motivation;
(e) receive periodic evaluation and retraining.
Elements (d) and (e) seem particularly important for maintaining the reliability of behavioral observers. Research has indicated that reliability (i.e., inter-observer agreement) typically drops from a .70-.80 range under evaluation conditions to .30-.40 under non-evaluated conditions (Nay, 1979). In other words, when observers know they are being evaluated, they provide reliable observations; under other conditions, they do not.
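Inter-observer agreement of the kind cited by Nay (1979) is commonly indexed by percent agreement or by Cohen's kappa, which corrects for chance agreement. A minimal sketch with hypothetical behavior codes from two observers:

```python
from collections import Counter

def percent_agreement(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Chance-corrected agreement between two observers' categorical codes."""
    n = len(a)
    p_obs = percent_agreement(a, b)
    ca, cb = Counter(a), Counter(b)
    p_chance = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
    return (p_obs - p_chance) / (1 - p_chance)

# Hypothetical codes from two observers watching the same session.
obs1 = ["on-task", "off-task", "on-task", "on-task", "off-task", "on-task"]
obs2 = ["on-task", "on-task",  "on-task", "on-task", "off-task", "off-task"]
print(percent_agreement(obs1, obs2), round(cohens_kappa(obs1, obs2), 2))
```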
What types of rater training work best? Dickinson and Baker's (1989) meta-analysis of 15 training studies found some support for rater discussion of material during training, rating practice, feedback, and discussion of feedback. Bowers (1973) found that some studies show greater consistency between self and other ratings as raters' familiarity with the target person increases (Norman, 1969), while other research has demonstrated an increase in the judges' rating accuracy when they receive correct feedback about the target (Fancher, 1966, 1967). However, Fancher (1967) also found that judges with greater conceptual sophistication were less accurate in predicting behavior. Bowers (1973) interpreted this finding as indicating that when experienced judges encounter data discrepant from their schemas, they modify the information rather than their schemas. This would seem to be an instance of the hypothesis confirmation bias.
One of the major problems in research on rater error is knowing exactly when ratings are in error. In a work setting, for example, against what can a supervisor's ratings be compared? Laboratory studies in which actual performance is videotaped, however, allow ratings to be compared with that performance (Murphy & Davidshofer, 1988). These studies indicate that valid ratings are related to a variety of factors, including supervisors' motivation to rate validly, variability of performance, memory aids, and the social context in which rating occurs (Murphy & Davidshofer, 1988). As displayed in Table 6, this list of factors is very similar to those provided by Boice (1983), Allport (1937), Taft (1955), Hartman (1984), and Hill (1982). Other authors (e.g., Groth-Marnat, 1990; Vernon, 1964), however, have suggested that differences on variables such as age and gender are subtle or non-existent.
----------------------------------------
Insert Table 6 About Here
----------------------------------------
Whose Rating Is Most Valid?
One of the basic questions still faced by psychological assessors is whether to employ the client's or the assessor's definition of the problem as the focus of assessment and intervention (Howard, 1981; Martin, 1988). In other words, whose rating is most valid? A similar issue centers on whether initial assessments should be geared to the presenting complaint only or made broader, perhaps including intelligence and personality testing. Martin (1988, p. 70) refers to these issues as requiring "artful decisions" based upon the "best guesses" of the assessor and client. Error is handled by the assessor.
Some evidence indicates that others see an individual more accurately than she or he sees her- or himself (e.g., Hollingworth, 1916, cited in Allport, 1937). Yet other researchers have shown the extent to which one person can hold a distorted view of another (Brody & Forehand, 1986; Martin, 1988; Vernon, 1964). Regardless of relative accuracy, social psychological researchers have documented differing perceptions about causality between persons who perform behaviors (called actors) and persons who observe those actors (Jones & Nisbett, 1971; Ross, 1977; Weiner, 1985). The basic finding, which Ross termed the "fundamental attribution error" (p. 183), is that actors attribute their actions to environmental factors such as task difficulty, while observers tend to view behavior as resulting from stable internal traits. Jones and Nisbett (1971) attributed these differing perceptions to the different salience of information available to actors and observers. Actors who perform what Jones and Nisbett (1971) termed "preprogrammed and prepackaged" (p. 85) response sequences tend to monitor the environmental stimuli that initiated the sequence, while observers attend primarily to the actor and the behavior itself.
Research comparing multiple raters often reveals at least some degree of inconsistency (Lambert & Hill, 1994). Botwin and Buss (1989; also see Cheek, 1982, and Quarm, 1981) instructed 59 couples to rate self and other behaviors along 22 personality dimensions such as responsible, secure and extraverted. The self-other correlations of these sets of ratings ranged from .14 (secure) to .64 (emotional instability) with a mean of .43. Christensen, Margolin and Sullaway (1992) found considerable differences in mothers' and fathers' reporting on the Child Behavior Checklist about their children ages 3-13. Mothers reported more negative behaviors than did fathers, and parents disagreed about the occurrence of a behavior twice as often as they agreed. Christensen et al. (1992) found more consistency with behaviors described as more disturbed, overt, and specific. Powell and Vacha-Haase (1994) reported that children experience difficulty accurately reporting their behavior, while Kazdin (1988) noted low correlations between parent and child reports. Finally, Dix (1993) found that parents who were experiencing a negative emotional state (because of divorce) were more likely to see their children's age-appropriate behavior as maladjusted.
Similarly, Heppner et al.'s (1992) review of process research--which examines factors relating to client change from the perspective of the client, counselor and outside observers--indicated inconsistencies across multiple perspectives. Tichenor and Hill (1989), for example, found near zero correlations among ratings by clients, counselors, and observers; other researchers have found that different types of observers assess different constructs, all of which relate to psychotherapy outcome (Heppner et al., 1992; Horvath & Greenberg, 1989; Marmar, Marziali, Horowitz, & Weiss, 1986). Studies of ratings by counselors in training, peer observers, and supervisors display similar inconsistency (Fuqua, Johnson, Newman, Anderson & Gade, 1984; Fuqua, Newman, Scott & Gade, 1986). Like Christensen et al. (1992), Heppner et al. (1992) suggested that agreement among observers should increase with more overt and concrete behaviors.
Inconsistency Across Rating Procedures
Statistical Versus Clinical Prediction
The question of valid perspectives has been superseded in the research literature by a slightly more complicated issue. The statistical versus clinical prediction debate refers to whether clinicians (intuitively using data from interviews and tests) or test scores alone (combined statistically) make better predictions or diagnoses. This debate, then, concerns not just the differences between self-reports and clinical judgment, but the differences between the processes clinicians use to make predictions and the statistical methods used to make them. In a sense, the question may be reformulated as follows: Which is the better method for prediction, psychological measurement or clinical assessment? We could, of course, substitute an I/O psychologist or personnel manager in the place of the clinician.
Clinical-statistical prediction studies typically take the following form. For statistical prediction, one might obtain test scores (e.g., MMPI and WAIS scores) and diagnoses for a group of individuals. The researcher would then employ the scores in a multiple regression equation to predict diagnosis. Clinicians are provided the same test information employed in the statistical equations, perhaps with case descriptions, and asked to provide diagnoses. The success rates of the two prediction methods are then compared.
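A minimal sketch of such a comparison, assuming Python with NumPy and entirely simulated data: test scores are combined by an ordinary least-squares equation to predict a binary diagnosis, and the equation's hit rate is compared with that of a stand-in set of clinician judgments. None of the numbers reflect any actual study.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Simulated predictor scores (e.g., two test scales) and true diagnoses.
scores = rng.normal(size=(n, 2))
diagnosis = (scores @ np.array([0.8, 0.5]) + rng.normal(size=n) > 0).astype(int)

# Statistical prediction: multiple regression on the scores, thresholded at .5.
X = np.column_stack([np.ones(n), scores])
beta, *_ = np.linalg.lstsq(X, diagnosis, rcond=None)
statistical_pred = (X @ beta > 0.5).astype(int)

# Stand-in "clinical" judgments that agree with the true diagnosis 70% of the time.
flip = rng.random(n) < 0.30
clinical_pred = np.where(flip, 1 - diagnosis, diagnosis)

print("statistical hit rate:", (statistical_pred == diagnosis).mean())
print("clinical hit rate:   ", (clinical_pred == diagnosis).mean())
```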
History. Hough (1962) presented a detailed history of the development of the statistical-clinical prediction debate. He noted that some of the earliest publications came in the form of exchanges between Viteles (1925) and Freyd (1925). Observing the tremendous growth in personnel selection testing after World War I, Viteles maintained that such tests should not be employed as stand-alone devices and that their scores should not be accepted at face value. In the tradition of Binet, Viteles maintained that tests required a psychologist for their proper interpretation. Freyd replied that such judgment should be left to employers who had more experience with their particular jobs. Clinical judgment, Freyd maintained, needed to be compared to statistical predictions to settle the matter.
Lundberg (1926) then published a classic paper in the area. Lundberg first noted the potential of statistical methods and the aversion to such methods by practitioners. Lundberg then proposed that the two methods were not in opposition, but that the clinical method was "merely the first step in the scientific method" (1926, p. 61). Lundberg (1929) later suggested that the clinical method represented only a crude and unsystematic form of the statistical method.
Hough (1962) described Allport as the first strong proponent of clinical prediction. Allport (1937) classified prediction into three types. The first was actuarial and consisted of finding statistical regularities (e.g., a mean number of deaths per city which is relatively stable by year). The second type of prediction was based on knowledge of general principles (e.g., normal individuals show increased skill on learning tasks after practice). Both of these types of prediction, it should be noted, were situation-free. The third type of prediction, Allport (1937) maintained, "forecasts what one individual man (and perhaps no one else) will do in a situation of a certain type" (p. 352). For this type of prediction Allport maintained that only clinical, idiographic methods would suffice. Nomothetic, trait-based measurement might predict psychological phenomena in large groups across situations, but change the scale of study--to one person, in different situations--and idiographic methods must be used. Allport (1942) also criticized statistical methods for (a) failing to distinguish between an event's frequency and its cause; (b) assuming that an event has the same meaning for all individuals; and (c) being unable to deal with latent or unmanifested causes. He maintained that predictions must be made on the basis of perceived relations, the factors that give rise to change from the present situation, and recognition of contingent factors that cause exceptions to the predictive rules.
One of the first integrations of the two positions was offered by Horst (1941). While acknowledging the usefulness of statistical prediction, Horst maintained an important role for clinical methods. That is, case studies provide a thorough understanding of individuals; they allow prediction in the absence of general knowledge about relationships necessary for statistical prediction; and they are invaluable to hypothesis formation.
Meehl's (1954, 1957, 1965) reviews are frequently cited as providing the most important impetus for consideration of this issue. In Meehl's (1954) initial review, 19 of 20 studies provided evidence that actuarial prediction equalled or exceeded clinical prediction. These conclusions have been confirmed and extended by subsequent research and reviews (e.g., Carroll, Wiener, Coates, Galegher, & Alibrio, 1982; Dawes, Faust, & Meehl, 1989; Holt, 1986; Sarbin, 1986a; Sawyer, 1966; Wedding, 1983; Wedding & Faust, 1989; Wormith & Goldstone, 1984). As Hough (1962) noted, Meehl also contributed to the debate by distinguishing between methods of data collection (by test or clinician interview) and methods of data combination and prediction (statistical and clinical). While clinicians can be valuable data collectors, they rarely combine data in an optimal fashion.
Current status. The current resolution is to call for some combination of statistical and clinical procedures (cf. Murphy & Davidshofer, 1988). The best strategy involves collecting information through clinical interviews and tests and then combining data statistically to predict a criterion.
In the tradition of Binet, Hough (1962) and others have noted that clinical prediction might be used to supplement actuarial prediction. That is, a test is administered, scored and predictions made; the clinician also interviews the test-taker to gain supplemental information and to be certain test results match the individual's history (which will not occur, for example, if the test-taker responds randomly). While tests such as the MMPI provide validity indices designed to identify misleading responses, it is commonly considered the clinician's responsibility to interpret the truthfulness of the information provided (Ben-Porath & Waller, 1992). In actual practice, then, this combination of actuarial prediction and clinical confirmation is usually considered the best method of psychological testing and assessment.
Recognizing the pragmatic limitations faced by most clinicians, Wiggins (1973; see also Goldberg, 1970) suggested another approach to this issue. He proposed that (a) if data about a predictor-criterion relation exist, employ statistical prediction combining clinical and test data, or (b) if no data exist, combine the judgments of clinical raters, create a model based on the best raters, or use the model of the best rater instead of the rater her- or himself (Wiggins, 1973). While the statistical approach may be the ideal, in the real world of clinical practice actuarial data relevant to the case at hand are often unavailable. Even if the relevant trait and environmental dimensions were known, it may also not be feasible to quantify this amount of information (Walsh & Betz, 1985).
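The "model of the rater" idea (Goldberg, 1970), sometimes called judgmental bootstrapping, can be sketched as follows, assuming Python with NumPy and hypothetical cue and judgment data: the clinician's own judgments are regressed on the case cues, and the resulting equation is then applied in place of the clinician, reproducing the judgment policy without its inconsistency.

```python
import numpy as np

rng = np.random.default_rng(2)
n_cases, n_cues = 100, 4

# Hypothetical case cues (e.g., test scale scores) seen by the clinician.
cues = rng.normal(size=(n_cases, n_cues))

# Hypothetical clinician judgments: a consistent weighting policy plus random inconsistency.
policy = np.array([0.6, 0.3, 0.0, -0.2])
judgments = cues @ policy + rng.normal(scale=0.8, size=n_cases)

# "Bootstrap" the judge: fit a linear model of the clinician's judgments.
X = np.column_stack([np.ones(n_cases), cues])
weights, *_ = np.linalg.lstsq(X, judgments, rcond=None)

# The fitted model applies the inferred policy to new cases without the inconsistency.
model_judgments = X @ weights
print("inferred cue weights:", np.round(weights[1:], 2))
```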
Clinical judgment. Why do clinicians finish second? Clinicians in these studies often appeared unreliable; that is, when given the same information on two different occasions, they reached different conclusions. Considerable evidence points to an association between observer error and personal characteristics of the clinician (see Rosenthal, 1976, p. 7). Evidence also suggests that observers' reports of traits better reflect the observers' personalities than those of the persons observed (e.g., Schneider, 1973). In addition, statistical methods are designed to maximize prediction. Clinical judgment could never exceed statistical prediction unless (a) the statistical model was inaccurate (e.g., the relationship being studied was nonlinear and the statistics employed to predict it were linear), or (b) the clinician had access to information not quantified by the tests. The latter could occur if clinicians interviewed clients in these studies, but typically clinicians only read case studies or test results to make their judgments.
Meehl's initial judgment of statistical superiority has not been overturned, but the debate has often been reframed. In contrast to the nomothetic approach of most psychological tests, clinical assessment is often idiographic. A test like the MMPI or the 16PF measures traits assumed to exist in all individuals. Measurement dimensions selected by a clinician with one client, however, may be shared with none, some, or all of the clinician's other clients. Skilled clinicians will also be open to modifying or changing their measurement dimensions after repeated meetings with clients. McArthur (1956, cited in Hough, 1962) suggested that clinicians attempt to create a unique model of each client they see. These conditions are not those typically found in the statistical-clinical prediction studies. Also, nomothetic measures typically assess traits, not psychological states. To the extent that the predictors and criteria reflect traits, actuarial predictions based on nomothetic measures should exceed clinical judgment. Clinicians, however, conceivably could be more sensitive to psychological phenomena which vary over time.
Clinicians' skills lie in the selection of variables to be measured and included in causal models to guide intervention (Meehl, 1954). Which tests to use and which variables to measure are decisions best made by clinicians. As noted in Chapter 1, tests provide descriptions, not causal explanations; clinicians and theorists provide the latter. Clinicians' weaknesses typically involve the lack of structure they employ when gathering and combining data. Tests (and presumably, structured interviews) should provide a better method of gathering data and statistical analysis a better method of combining data.
Clinicians would also enjoy an advantage over tests if they could detect information that tests could not. Shedler, Mayman, and Manis (1993) recently provided just such an example. They suggested that a group of individuals exists whom they labelled defensive deniers, that is, persons who deny and repress personal psychological distress. Shedler et al. proposed two hypotheses about this group: (a) they could be identified through interviews with another person, and (b) their defensiveness would have a physiological cost, that is, it would be associated with autonomic reactivity.
In a series of studies, they first instructed research subjects to complete standard mental health scales (e.g., Beck Depression Inventory and the Eysenck Neuroticism scale). Subjects also completed a written version of the Early Memory Test (EMT); the EMT requested accounts of subjects' earliest childhood memories as well as their impressions of themselves, other people, and the mood in the memory. An experienced clinician then evaluated this material to determine each subject's mental health or distress. Finally, the experimenters exposed subjects to laboratory stressors and recorded their changes in heart rate and blood pressure.
Shedler et al. (1993) found one group of persons who reported themselves as distressed on the self-report scales and who were also rated as distressed by the clinician. However, another group who self-reported mental health were rated as distressed by the clinician. As hypothesized, this second group of defensive deniers did demonstrate greater reactivity on the physiological measures. Shedler et al. thus concluded that human judges could detect defensiveness that standard mental health scales could not. However, it is worth noting that of the 58 subjects who completed the EMT, 29 were judged distressed, 12 were judged healthy, and 17 were left unclassified "because their written responses to the Early Memory Test were too sparse for analysis" (Shedler et al., 1993, p. 1121). This unclassified group, nearly one-third of the total, may represent another subtype who were unwilling or unable to provide data that the clinician was able to process.
Selection and psychotherapy. In the statistical versus clinical prediction debate, the purpose of testing is usually unstated, although it appears that the purpose usually is selection. That is, if one computes correlations between tests and clinicians' judgments of school or work performance and then the actual performance, the implied purpose of such testing in actual situations would be to assist in the admission or rejection of applicants into the performance situation.
While administrators have had moderate success when using tests for selection, clinicians have experienced more difficulty. Tests are relevant for clinical selection in the sense that one could administer a test for the purpose of selecting an individual for a treatment or for one among many treatments. In regard to the latter, the question becomes: what individual characteristics (i.e., traits) will interact with which treatment environments to produce the best outcome? This is a long-standing intervention issue (e.g., Paul, 1969) that fits within the person by situation interaction debate discussed in Chapter 4. As is the case in other areas, research has yet to provide viable methods for matching clients and therapies (Hayes, Nelson & Jarrett, 1987).
Selection, however, is only one reason clinicians may use tests. Clinicians (and many other types of psychologists) also desire tests that could assist them during the intervention process. That is, tests could assist clinicians to gauge progress during the course of therapy, and ultimately, assess therapeutic outcome. This use of testing, however, implies that one is administering tests repeatedly and for the purpose of detecting psychological phenomena that change.
The issues of using tests for selection and intervention may be combined under the rubric of the treatment utility of assessment (Hayes et al., 1987). The question here is: do tests contribute to treatment outcome? Since outcome provides the ultimate justification for test use, tests should provide the intervenor with information which improves the intervention (Korchin & Schuldberg, 1981; Meehl, 1959). Unfortunately, what little research has been done suggests that traditional tests have a negligible impact on treatment outcome (Hayes et al., 1987).
Measurement and assessment may be considered treatment interventions in and of themselves. For example, clients' self-monitoring (i.e., recording) of such behavioral problems as smoking has been shown to decrease that behavior independent of other "interventions" (e.g., Nelson, 1977a, 1977b). Interestingly, Hayes et al. (1987) noted that assessment procedures could have treatment utility without possessing any amount of reliability or validity. This could occur, for example, when clinicians' observations of clients' test-taking behavior lead to an improvement in intervention or when the test-taking procedure increases clients' expectations that the intervention will indeed work. In both cases, the changes could occur even if the tests were psychometrically useless. It is a small step, then, to suggest that the widespread use of tests and assessments may be partially a result of their beneficial impact on clinicians' expectations for improvement. Tests may be clinicians' placebos.
The negligible effects of the debate. Given the results of the typical clinical-statistical review, one might expect clinicians to change their practices. Like employment interviewers, however, most clinicians have yet to suspend their judgment in favor of statistical predictions (e.g., Meehl, 1965, 1986). Proposed explanations for this phenomenon include the face validity of interviews, clinician overconfidence, and the need for the clinician to integrate test data and data not available from tests (Murphy & Davidshofer, 1988). Clinicians' overconfidence suggests that they do not differ from laypersons, who have been shown to possess exaggerated confidence in what they know (e.g., Lichtenstein & Fischhoff, 1977; MacGregor et al., 1988). Professional reasons may also motivate psychologists to add their professional judgments to test data. Hilgard (1987) noted that in hospital settings where psychologists work with psychiatrists and social workers, psychological tests give clinicians something unique to say. He suggested that tests such as the Rorschach, which require considerable clinical judgment, allow clinicians to be more than simple technicians who report test results in case conferences and legitimize their participation in group speculation about clinical hypotheses.
While clinicians may acknowledge the superiority of statistical prediction, the difference between statistical and clinical prediction may be slight enough to preclude acceptance of the former. Sechrest (1963) observed that a test may lack incremental validity, that is, its use does not improve the validity of a prediction. Peterson (1968) suggested that clinicians continue to employ familiar methods because no more effective methods have appeared to take their place.
For example, Hough (1962; see also Meehl, 1954) cited a study by Sarbin (1942), one of the first researchers to directly compare clinical and statistical prediction procedures. Sarbin examined the relative accuracy of methods of predicting 162 freshmen's first-quarter GPAs. Clinicians worked at a university counseling center and had access to interview data, personality, aptitude and career test scores, high school rank, and college aptitude test scores. The statistical prediction was made on the basis of the high school rank and college aptitude score. Sarbin calculated the following correlations, reported by student gender:
                         Men   Women
Clinical prediction      .35    .69
Statistical prediction   .45    .70
The statistical procedure shows a slight advantage. A university administrator interested in selecting students for admission among a large group of applicants would be wise to employ the statistical procedure. A clinician dealing with a single client would probably see little practical difference between the two. Nevertheless, the statistical procedure would require less time than the clinical one, and for that reason alone the former would likely be employed. Few contemporary clinicians would be interested in using clinical prediction for answering a question about GPA or other cognitive abilities where tests have demonstrated predictive validity (McArthur, 1968). Clinicians would, however, be interested in clinical procedures for the many personality and related areas (such as interpersonal skills) where such validity is lower or absent.
Hough (1962) observed that "although statistical modes of prediction at the present time seem to have surpassed the clinical ones in accuracy, neither procedure has done very well" (p. 573; for a more contemporary example of low validity with statistical prediction, see Carbonell, Moorhead, & Megargee, 1984). Hough suggested this may partially be due to the fact that for events with low occurrence (e.g., suicide or homicide), it is difficult to increase the success of predictions above the base rate. If suicide occurs, for example, in 3 of 100 persons of a certain age range and gender, then the base rate is 3 per cent. Predicting that no one in this population will commit suicide will result in a successful prediction rate of 97 per cent. However, psychotherapists (as well as family members, researchers, and attorneys, for example) are very interested in knowing which three persons will kill themselves. As indicated elsewhere, psychologists are far from that level of prediction. A good psychological test might identify 15 of the 100 who would be likely candidates to commit suicide, but it is unlikely that a test or clinician could differentiate the eventual three out of the pool of 15.
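The base rate arithmetic in this example can be made concrete with a short Python calculation using the chapter's own figures: a 3% base rate, a blanket prediction that no one is at risk, and a hypothetical test that flags 15 of 100 people and (generously) is assumed to catch all three eventual cases.

```python
population = 100
cases = 3               # base rate of 3 per 100
flagged = 15            # people the hypothetical test identifies as likely candidates
true_positives = 3      # assume, generously, that all eventual cases are among the 15

# Blanket "no one is at risk" prediction: correct for everyone except the cases.
blanket_accuracy = (population - cases) / population   # 0.97

# Positive predictive value of the test: of those flagged, how many are cases?
ppv = true_positives / flagged                         # 0.20

print(f"blanket 'no' accuracy: {blanket_accuracy:.2f}")
print(f"test PPV:              {ppv:.2f}")
```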
Clinical Observation of Test Behavior
As noted in Chapter 1, Binet found it useful to think of the intelligence test as a clinical interview during which the examiners could observe test-takers as they completed the test. These observations formed an important component of the process of psychological assessment, of which completion of the IQ test was only a part. Examinees' activity level, concentration, responses to stress and failure, persistence and interpersonal behaviors with the examiner could be recorded (Kaplan & Owen, 1991). Thus, clinicians' failure to outperform tests alone might partially be a result of statistical-clinical studies' failure to allow clinicians to observe test-takers' behaviors. Is there any evidence to suggest that such observations are valid?
Kaplan and Owen (1991) found only three studies which examined the relations between test-takers' testing behavior, test performance, and behavior in non-test situations. Two studies (Glutting, Oakland & McDermott, 1989; Oakland & Glutting, 1990) found test behavior during completion of an intelligence test to be related to test score but not to classroom behavior. Glutting et al. (1989) suggested that test performance may be explained to some degree by the test-taker's persistence and motivation during testing. Gordon, DiNiro, Mettelman and Tallmadge (1989) did find test behavior to be related to teacher assessments of behavior and classroom achievement, although they also found instances of disagreement among the three areas.
Kaplan and Owen (1991) developed a 43-item Behavioral Observation Scale (BOS) with which to record such testing behaviors as facial expressions, enthusiasm, and irrelevant verbal interruptions. The BOS was completed following the administration of an intelligence test, the WPPSI-R, to 128 children. Although only a few of the correlations computed between the BOS and the criterion measures of WPPSI-R scores and classroom behavior indices reached statistical significance, several results were noteworthy. For example, the children's attention and cooperation during testing was related to teachers' ratings (six to nine months after testing) of their willingness to follow rules (r = .35, p < .01) and to their attention and persistence to school work (r = .33, p < .01). These results provided evidence that clinicians' ratings could predict relevant future behaviors. Kaplan and Owen (1991) did not report correlations between the WPPSI-R and classroom behaviors.
Attention to the manner in which respondents process test stimuli is perhaps most explicit with the Rorschach, the major projective device. Rorschach examinees are presented with stimuli to which they respond and are later probed to gather more information about their experience of the stimuli. Exner (1978) suggested that the manner of articulation of the response to the projective stimulus can be as important as the content of the response itself. Exner (1978) maintained that "it does seem certain that different people will process the same stimulus cues differently; and the matter of articulation becomes one of the fundamental avenues through which some understanding of the processing may evolve" (p. 55). The clinician's observation of this processing and responding thus becomes central to the scoring of Rorschach responses.
Scientist as Observer
What about the observational skills of scientists and researchers? As with clinicians, considerable evidence indicates that researchers (a) tend to find what they expect or wish to find in experimental data and (b) may communicate those expectations to subjects (Rosenthal, 1976). Rosenthal (1976) documented numerous cases throughout the sciences where data refuting accepted ideas were ignored or rejected. He noted that Newton did not see the absorption lines in the solar spectrum produced by a prism because he did not expect to find them there. Similarly, Blondlot's 1903 discovery of N-Rays, which appeared to make reflected light more intense, was confirmed by other scientists, only to be later discounted as observer error. Observers' expectations have also led to faulty counts of blood cells by laboratory technicians and mistaken associations between cancer and cherry angioma, a skin condition.
A good rule of thumb is that the more interpretation and inference that must be made during an observational process, the more likely that error will occur. As Rosenthal (1976) observed, "Some observations require a greater component of interpretation than others" (p. 16).
In most sciences, Rosenthal (1976) indicated, the eventual introduction of modern instruments reduces the effects of human observers. These procedures do not replace the human observer, but place the observation in a time or setting that reduces the likelihood of error. A tape recording of a conversation allows an experimenter to replay segments or record the entirety of the discussion--tasks impossible for most experimenters during the interview itself. Thus, the push is to increase the quality of research methodology so as to minimize the role of expectations and preferences in the interpretation of results.
One may also minimize expectancy problems by keeping subjects and experimenters blind to the purposes of the study. In the case of the latter, one may separate the role of the investigator who formulates study hypotheses from the experimenter who runs the study and interacts with subjects.
Summary and Implications
As with self-reports, the systematic errors made by raters may be classified according to their cognitive, affective/motivational, and behavioral/environmental sources. Figure 13 displays a classification of the errors discussed in this chapter. In comparison to self-report errors (shown in the corresponding figure in the previous chapter), relatively few motivational/affective errors have been discussed in the literature. This makes sense given that most raters should benefit from valid ratings. However, in some cases, as when the rater-ratee relationship might be adversely affected by the consequences of the rating, raters would be expected to display some of the generative response strategies (e.g., social desirability) found with self-reports.
----------------------------------------
Insert Figure 13 About Here
----------------------------------------
It is worth noting that self-reports and other ratings share some similar problems. Raters make systematic errors based on their global impressions of the ratee (i.e., halo) just as persons completing self-reports make errors based on schemas. Similarly, researchers, clinicians, and laypersons all tend to notice confirming information more than disconfirming information. Interestingly, many of these generative response strategies frequently produce data that are too consistent. That is, raters and self-reporters fail to notice and report the full range of characteristics exhibited by the phenomenon (usually behavior) they are observing.
Yet interviews and self-reports continue to be the dominant measurement methods, probably because of their economy and face validity. Adding structure to interviews improves their psychometric properties just as standardization does so for psychological tests. Rater training is crucial in such areas as behavioral assessment in that it makes concrete the observational procedures. That is, the observer is provided with specific information and practice concerning what, who, where, how, and when to rate. Providing individuals who complete self-reports with similar training is a possibility worthy of further investigation.
Although important differences exist between data produced by self- and other observers, few guidelines other than tradition and cost exist to direct the choice of observer. Aggregating multiple observers can be useful, but may also be misleading if one of the observers produces more valid data than the others. Actor-observer theory suggests that individuals may be better observers of the environmental influences affecting their actions, and that others may be better observers of intrapsychic traits. On the other hand, researchers have found that individuals may be unable to accurately identify situational factors which influence their behavior (Nisbett & Wilson, 1977). It would seem incumbent upon theorists and test developers to specify the type of observer as one part of the construct explication process (i.e., specifying how the construct should be measured).
In general, psychologists frequently employ self-reports and interviews together to balance the strengths and weaknesses of both. When that is not possible, however, this chapter has provided information that may be employed to build tentative rules for the choice of self-report or interview. In general, I would suggest that a self-report test only be employed: (a) if no significant mismatches occur between test and test-taker on cognitive, affective/motivational, and behavioral/environmental dimensions, and (b) in any type of selection decision where predictor-criterion data exist. If significant mismatches in (a) occur, then an interview would be preferred, given that: (a) no significant mismatches occur between rater characteristics and rating task on the B-A-C dimensions, and (b) the purpose is to measure intervention-related events.
As noted in Chapter 2, interviews may provide more flexibility than self-report tests when probing for idiographic material relevant to psychological interventions. The remaining question is, what should one do if substantial errors exist in both self-reports and interview data? I would suggest that ethical considerations must then guide whether a client is better served by the provision of no information or error-filled information.
Reviewers of studies comparing clinical and statistical prediction of future events agree on the latter's (typically slight) superiority, but the issue is complex. A key purpose of clinical measurement, hypothesis formation and testing, is often left unaddressed in these studies. That is, clinicians perform their work by forming hypotheses about factors causing client problems and using those hypotheses to guide interventions. Little evidence is available to support the proposition that psychological tests provide better hypotheses than clinical judgment, nor is there evidence that testing significantly contributes to the efficacy of psychological interventions. The tests employed in statistical-clinical comparison studies are instruments designed in the tradition of selection testing. As such, they can be very useful for answering questions that can be approached through the perspective of stable, individual differences. The issues of clinical assessment--such as measuring the interaction between type of intervention, intervenor characteristics, and the psychological states likely to be the focus of the intervention--may require types of psychological tests yet to be developed.