Chapter 2
Consistency of Measurement Across and Within Individuals
Ordinarily we assume that some type of consistency resides with each person and that the job of measurement is to tap into that stable reservoir to reveal the truth. Yet consider these incidents of inconsistency, compiled by Schwarz (1999):
1. When the choice "To think for themselves" is offered on a list of alternatives, 61.5% of individuals choose this as "the most important thing for children to prepare them for life." In an open format, however, only 4.6% offer this choice (Schuman & Presser, 1981).
2. When asked to rate their success in life, 13% report high success when the scale ranges from 0 to 10. When the scale ranges from -5 to 5, 34% do so (Schwarz, Knauper, Hippler, Noelle-Neumann, & Clark, 1991).
3. The correlation between items measuring marital satisfaction and life satisfaction ranges from .18 to .67 depending on the question order and introduction (Schwarz, Strack, & Mai, 1991).
As described in Chapter 1, these are serious violations of the discipline's expectations about measurement and assessment. Instead of tests being passive measurement devices, questionnaires and surveys can function as sources of information about how respondents should respond (Schwarz, 1999). This chapter describes some of the initial theory and empirical work beginning to explain these and other surprises.
Consistency of Measurement Across Individuals
One of the persistent controversies in psychology involves nomothetic versus idiographic approaches to measurement. Although psychologists typically discuss this issue in the context of personality psychology, the nomothetic-idiographic debate has relevance across psychological domains.
Nomothetic measurements observe attributes of populations, while idiographic measures focus on individuals. The objects of nomothetic measurement are assumed to be present in every person. A nomothetic theoretician would maintain that every person could be described as possessing some amount, for example, of the traits of neuroticism, extraversion, openness to experience, agreeableness, and conscientiousness (Goldberg, 1993; McCrae & Costa, 1985; Wiggins, 1982). Idiographic theorists believe that individuals possess unique characteristics that may be shared with all, some, or no other people. An idiographic account is concerned with how the characteristics of a person combine into a unified, unique personality (Allport, 1937). From an idiographic perspective, a particular person might be described as very neurotic and somewhat extraverted, but the dimensions of agreeableness, openness, and conscientiousness simply would not be meaningful to the study of this individual. From an idiographic perspective, error occurs when a score is assigned to an individual for a category which has no relevance for that individual.
Idiographic researchers study one or a few individuals, often over a long period. Nomothetic researchers study large groups, often on one occasion. Nomothetic researchers search for general laws and believe that their research results apply to all persons, although such goals are also common to some idiographic researchers; for example, McArthur (1968) maintained that "we need to know many human natures before we can hope that Human Nature will be revealed to us" (p. 173). Each group has tended to disparage the other. Allport (1937) quoted Meyer (1926): "A description of one individual without reference to others may be a piece of literature, a biography or novel. But science? No." (p. 271). Allport replied: "The person who is a unique and never-repeated phenomenon evades the traditional scientific approach at every step" (1937, p. 5). Although nomothetic approaches dominate many areas of contemporary psychological measurement, it is not surprising that idiographic measurement has its strongest foothold in areas, such as clinical assessment, where psychologists tend to work with single persons.
Large samples and individual differences equal nomothetic measurement. How did nomothetic approaches come to dominate measurement? As noted in Chapter 1, psychometricians developed statistical models to describe the relations among the psychological characteristics they studied. To reach sufficient power to detect such relations, statistical methods require large samples of individuals. The larger the aggregate of individuals, the more likely that random errors of measurement would balance each other, thus increasing the chance of detecting the characteristic. If measurement errors balanced or cancelled, it did not matter who any particular subject was, as long as you had many subjects. If you required many subjects, however, you also needed to assume that everyone in the sample possessed the characteristic. If all persons do not possess the characteristic, you must instead identify an individual who does and study that individual. As Danziger (1990) wrote, "If the subject is an individual consciousness, we get a very different kind of psychology than if the subject is a population of organisms" (p. 88).
Psychologists such as Allport (1937) took the nomothetic approaches to task because of their emphasis on groups of individuals instead of the individuals themselves. Idiographic psychologists were interested in developing laws that generalized across persons instead of groups of persons (Lamiell, 1990). For Allport, there were no psychological laws to be found outside the study of individuals. Lamiell (1990) provided an example of such a strategy in a series of studies which investigated how individuals rate other persons along psychological dimensions. Lamiell found that subjects typically rate other people not by comparing them to others (i.e., looking for differences between individuals), but by comparing the information provided about the person to an imagined polar opposite of that information. Thus, this comparison is not a retrieval of information from memory, but "mental creations of the person formulating the judgment" (Lamiell, 1991, p. 8).
In practice, the nomothetic approach seemed to work--to a point. With large samples, one could produce bell-shaped distributions of psychological characteristics, thus mirroring the distributions found in other sciences. But psychologists often found only weak correlations between psychological characteristics and the behaviors they were supposed to predict. Although correlations of .30 aided selection decisions in large groups, they still surprised psychologists. Why did x and y only correlate at .30? Were internal psychological variables and behavior really correlated in nature at such a low level? Or had psychology reached the limit of its measurement-statistical capabilities? Idiographic proponents have cited the selection of a nomothetic approach to measurement as the major cause of this and other problems reflecting a lack of scientific progress in psychology. Progress in the accumulation of knowledge, they maintained, cannot be achieved with nomothetic approaches. Similarly, more valid prediction of individual behavior might also be possible if measurement were idiographically based (cf. Magnusson & Endler, 1977; Walsh & Betz, 1985).
----------------------------------------
Insert Table 2 About Here
----------------------------------------
From this discussion it is evident that the purposes of nomothetic and idiographic measurement can be considered complementary, not antithetical. As shown in Table 2, the advantages of one approach appear to be matched by the disadvantages of the other, and vice versa. For example, an idiographic approach was not well suited to assist in the selection and administrative decisions that provided the impetus to develop early psychological tests; efficient nomothetic tests, however, did work well for such purposes. Idiographic assessment usually occurs in the context of a relationship between assessor and assessee. Such a relationship allows a greater understanding of the interaction between an individual's perception of traits and other factors, such as psychological states and external situations, that change over time.
Consistency of Measurement Within Individuals
A brief history of traits
Social statistics were developed during the 18th century for the purpose of linking social and economic variables to social reform (Danziger, 1990). Crime rates, for example, appeared related to geographic locale, with the attendant environmental influences (e.g., poverty) readily recognized. To explain these statistical regularities, Quetelet conceived of the idea that individuals might possess "propensities" to commit acts such as homicide or suicide. Buckle argued that "group attributes were to be regarded as nothing but summations of individual attributes" (Danziger, 1990, p. 77). Propensities and attributes became traits, and the application of social statistics to individuals seemed a natural progression.

Psychological measurement and assessment have long been guided by the assumption that psychological phenomena are traits (Maloney & Ward, 1976; Martin, 1988). This assumption provided psychologists with a set to find consistency in such phenomena. Most definitions of attitudes, for example, have assumed some consistency or persistence. Krech and Crutchfield (1948, cited in Scott, 1968) defined an attitude as "an enduring organization of motivational, emotional, perceptual, and cognitive processes with respect to some aspect of the individual's world" (p. 152). Similarly, Campbell (1950) wrote that "a social attitude is...evidenced by consistency in response to social objects" (p. 31). Many contemporary psychologists continue to assume they are measuring traits, as evidenced by the fact that psychologists typically observe individuals and administer tests in their offices and assume that the resulting behavior generalizes outside of that particular situation (Martin, 1988).
Murphy and Davidshofer (1988) suggested that the concept of a trait has three meanings. First, psychological traits are causes. Thus, persons who are introverted avoid extensive social interaction, that is, their introversion motivates them to avoid others. Historically, this is the meaning of traits employed explicitly or implicitly by most psychologists. Second, traits function as convenient organizational schemes for perceiving and remembering similar information. Thus, we might tend to term certain behaviors (e.g., turning in a found wallet, paying all of the taxes you owe) as "honest" although their relatedness may only be illusory. Or the relation may be real: individuals form concepts, about how to act across situations, that others perceive as traits (e.g., Stagner, 1984). Third, traits can be considered descriptive summaries of behavioral consistencies or act frequencies (Buss & Craik, 1983). Anastasi (1985) suggested that this conception of traits is being increasingly accepted. The personality traits identified by factor analytic studies, for example, can be seen as "summarizing behavioral consistencies, rather than as underlying, fixed, causal entities" (Anastasi, 1985, p. 121).
The problem with trait-consistency approaches to measurement is that human behavior is also variable. Behavior changes, yet measurement approaches predominantly depend upon trait ideas. It should be no surprise, then, that measuring change is one of the most difficult tasks for psychological measurement. Regression toward the mean (RTM), for example, is a frequently cited problem in psychological research (e.g., Cook & Campbell, 1979). When a measure is administered twice, scores falling at the extremes of the scale on the first occasion often move toward the mean at the second measurement. This can be a fatal alternative explanation when interpreting the results of research that contrasts treatment and control groups that are not equivalent before an intervention.
----------------------------------------
Insert Figure 5 About Here
----------------------------------------
Suppose, for example, that you design a study to test the effectiveness of an intervention to decrease classroom behavior problems. For the dependent variable, you choose a checklist of classroom behavior problems. Before the intervention, teachers of students in the treatment class and the control class complete the checklist daily for one week. Figure 5 displays the range of daily behavior problems for the treatment and control classes. As shown in Figure 5, the mean score of students in the treatment group is higher at pretest than the mean score of control students. If the treatment students' problems decline from pre-test to post-test while the control group scores remain unchanged, two alternative explanations appear: (a) the treatment worked, thus decreasing behavior problems, or (b) RTM occurred. If the true mean of the problem checklist in this example is 10, then we would expect the treatment group scores to decline upon retesting even without an intervention. RTM is a strong possibility when the treatment and control groups are not randomly assigned, as often occurs in quasi-experimental designs.
Interestingly, tests with more measurement error display more regression toward the mean (Kazdin, 1980). In other words, the more error in a measurement (i.e., the more factors influencing the test score that we do not understand), the more likely that its scores will change.
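Both points can be made concrete with a short simulation. The sketch below is illustrative only; the class size, the true checklist mean of 10, and the error magnitudes are assumed for the example. It draws a stable true score for each student, adds independent random error at pretest and posttest, selects the students with the most extreme pretest scores, and shows that their posttest mean drifts back toward the population mean with no intervention at all, and that the drift grows as the measure becomes noisier.

```python
import random
import statistics

random.seed(0)

def simulate_rtm(n=10_000, true_mean=10.0, true_sd=2.0, error_sd=3.0):
    """Simulate two administrations of a behavior-problem checklist.

    Each student has a stable true score; each observed score adds
    independent random error. We then select the students with the
    highest pretest scores and look at their posttest mean --
    with no intervention of any kind in between.
    """
    true_scores = [random.gauss(true_mean, true_sd) for _ in range(n)]
    pretest = [t + random.gauss(0, error_sd) for t in true_scores]
    posttest = [t + random.gauss(0, error_sd) for t in true_scores]

    # "Treatment" group: the top 10% of pretest scores.
    cutoff = sorted(pretest)[int(0.9 * n)]
    selected = [i for i in range(n) if pretest[i] >= cutoff]

    pre_mean = statistics.mean(pretest[i] for i in selected)
    post_mean = statistics.mean(posttest[i] for i in selected)
    return pre_mean, post_mean

# The extreme group's scores drift back toward the true mean of 10,
# and the amount of regression grows with the measurement error.
for err in (1.0, 3.0, 5.0):
    pre, post = simulate_rtm(error_sd=err)
    print(f"error sd {err}: pretest {pre:5.2f} -> posttest {post:5.2f} "
          f"(regression {pre - post:.2f})")
```

Because selection on the noisy pretest partly selects students who were merely unlucky with error that day, the group's expected posttest score is only its true-score mean, which lies closer to the population mean.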
Systematic Measurement Errors: Sources of Inconsistency
Error refers to factors that influence measurement in ways we do not recognize or understand. Random errors are those which occur unpredictably. Systematic errors occur in some regular manner and may accumulate, as does a trait, with aggregation. Historically, test theorists and developers have assumed errors in measurement to be random rather than systematic. That is, at the scale of the individual, errors might be systematic--one person may distort responses because of fatigue, another because of poor comprehension--but in large groups such a conglomeration of errors will behave as if they were random (Murphy & Davidshofer, 1988).
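The different behavior of the two kinds of error under aggregation can be sketched in a few lines; the score values and error magnitudes below are invented for illustration. Averaging over more and more observations drives the random component toward zero, while a constant systematic bias survives aggregation untouched.

```python
import random
import statistics

random.seed(1)

TRUE_SCORE = 50.0   # the quantity we are trying to measure
BIAS = 2.0          # systematic error, e.g. a consistent response distortion
ERROR_SD = 6.0      # random error attached to each observation

def observed_mean(n_obs):
    """Average of n_obs measurements, each carrying random noise
    plus the same constant bias."""
    scores = [TRUE_SCORE + BIAS + random.gauss(0, ERROR_SD)
              for _ in range(n_obs)]
    return statistics.mean(scores)

for k in (1, 10, 100, 10_000):
    print(f"{k:6d} observations: mean = {observed_mean(k):.2f}")
# As k grows, the mean converges on TRUE_SCORE + BIAS rather than
# TRUE_SCORE: the random errors cancel, but the systematic bias remains.
```

This is why the classical assumption of random error is consequential: aggregation rescues a measure from random error, but no amount of aggregation removes a systematic distortion.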
Psychologists have long sought to identify systematic errors in the measurement process. Table 3 displays errors, examined in the traditional psychological measurement and behavioral assessment literatures, that influence the consistency of individuals' test responses and observers' ratings. For example, individuals change their responses to test items when the items are rephrased from positive ("Do you feel the world is a good place to live in?") to negative ("Do you wish that you had never been born?") (Ong, 1965). Individuals behave differently when they are observed and unobserved (Webb, Campbell, Schwartz, Sechrest, & Grove, 1981). Changing the schedule of self-monitoring (i.e., data recording by a client) can influence the resulting data (Nelson, 1977); retesting in as short a period as one day can reveal changes in test scores (Hough, Eaton, Dunnette, Kamp, & McCloy, 1990; Dahlstrom, 1969). Given the emphasis on traits, these inconsistencies are surprising.
----------------------------------------
Insert Table 3 About Here
----------------------------------------
Human Judgment Ratings
Human Judgment Ratings (HJRs) are defined here as qualitative or quantitative assessments made by individuals, about themselves or others, along a psychological dimension. HJRs consist of two types: self-reports and other ratings. Self-reports have also been referred to in the literature as S data and observers' ratings as R data (Cattell, 1957, 1973). Self-reports are judgments made by individuals about some personal psychological attribute (e.g., rate your current job satisfaction). Other ratings occur when one person rates another on a psychological dimension, as when a manager rates an employee or a teacher rates a child on some attribute (e.g., rate how persistent a student has been over the past semester). Other ratings will be discussed in detail in Chapter 3.
As previously noted, the use of large samples in psychological research meant that one no longer needed to assume, as was the case in early experimental work, that the person who made the report was a psychological expert. As long as the person provided data of some validity, errors would cancel in the aggregation. With both self and other ratings, Danziger (1990) observed that "During the early period of personality psychology, and to a considerable extent thereafter, it was simply assumed that personality ratings were an unproblematic product of attributes of the task, not attributes of the rater" (p. 160). Test items were assumed to be face valid across individuals and individuals were assumed to be able to respond in a valid manner to those items.
Given this history, it should be no surprise that self-reports and ratings by others constitute the method most frequently employed throughout the history of psychological measurement. With the exception of a few areas, such as behavioral assessment, most contemporary psychologists continue to rely on self-reports. Kagan (1988), for example, cited research indicating that most personality research during the past 10 years was based on self-report questionnaires. Noting that self-reports have been employed in alcohol research since the beginning of the 20th century, Babor et al. (1987) observed that verbal reports remain "the procedure of choice for obtaining research data about patient characteristics and the effectiveness of alcoholism treatment" (p. 412). If self-reports are assumed to be credible and valid in some sense, their widespread use is inevitable given their ease of development, administration, and scoring. Indeed, investigators are continually tempted to create new instruments in their research area, as evidenced by a recent American Psychological Association (1992) advertisement in the Monitor that estimated that 20,000 psychological, behavioral and cognitive measures are created annually. Hartman (1984) noted that "if a behavior has been studied by two investigators, the chances are very good that at least two different self-report questionnaires are available for assessing the behavior" (p. 133).
Despite the widespread use of self-report, psychologists often seem to adopt one of the following dichotomous beliefs: (a) because individuals can self-report, self-reports must be valid, or (b) because self-reports can be distorted by test-takers, self-reports are useless. The first position represents that taken by most early measurement psychologists. In contrast, self-report critics espousing the second position have pointed to studies comparing self-reports to what the critics see as a more objective criterion, that is, overt behavior. Psychologists have consistently found discrepancies between self-reports of psychological phenomena and overt behavior indicative of or related to the phenomena (e.g., Doleys et al., 1977; Schroeder & Rakos, 1978). Kagan (1988) summarized a variation on this perspective:
    A serious limitation of self-report information is that each person has only a limited awareness of his or her moods, motives, and bases for behavior, and it is not obvious that only conscious intentions and moods make up the main basis for variation in behavior. . . . Less conscious indexes, including facial expressions, voice quality, and physiological reactions, occasionally have a stronger relation to a set of theoretically predicted behaviors than do self-reports. The reader will remember that when a large number of investigations were derived from psychoanalytic theory, the concept of unconscious ideas was prominent and self-report was treated as replete with error. (p. 617)
Although the discordance between behavior and self-report has led many psychologists to conclude that self-reports are untrustworthy sources of information, other explanations are available for these inconsistencies. Many psychologists have wondered whether the prediction of behavior could be improved if systematic errors in self-reports were identified and corrected in some manner. Some of the most important of these errors are described below.
Response Strategies
Response strategies refer to the processes individuals employ to complete psychological items, problems, and tasks. Response strategies may be classified along a dimension ranging from unbiased retrieval to generation. At one end of this continuum are retrieval strategies, used in the direct recall and reconstruction of extensive information about self or others from long- and short-term memory storage. At the other end are generative strategies, employed when individuals cannot or will not produce the relevant information from memory. In fact, about 30% of respondents in representative surveys have been shown to be willing to offer opinions on fictitious issues (Schwarz, 1999).
Test-takers probably employ strategies from throughout the continuum when completing psychological tests, but test users often assume a predominance of unbiased retrieval strategies. That is, respondents are assumed to be retrieving valid information from memory rather than creating distorted data. Considerable evidence exists, however, to suggest the frequent use of generative strategies. For example, when respondents answer questions that request information about events or beliefs that they do not possess, they create the information rather than retrieve it from memory (Smith, Jobe, & Mingay, 1991). This may occur regardless of whether the items request specific data (as in the question, "How many visits to a physician have you made in the past 6 months?") or beliefs and attitudes ("How often should a person visit a physician per year to maintain good health?"). Whatever the cause, human judgment fallacies have been well-documented (Dahlstrom, 1993; Nisbett & Ross, 1980).
Factors Related to Strategy Selection
The degree to which respondents employ generative as compared to unbiased retrieval strategies appears to be related to a variety of cognitive, affective/motivational, and behavioral factors. These factors are discussed below.
Cognitive Influences
Cognitive factors refer to how individuals think about and process information related to tests. Cognitive psychology's relatively recent rise as an important paradigm has led many measurement psychologists to consider the role of cognitive factors in the testing process.
Level of knowledge and uncertainty
To the extent that respondents find ambiguous the elements of the testing situation--such as test items, instructions, and format--they may be expected to employ more generative strategies. What do people do when asked a question whose answer they are uncertain about? MacGregor, Lichtenstein, and Slovic (1988; see also Bradburn, Rips, & Shevell, 1987; Sudman, Bradburn, & Schwarz, 1996) suggest that the simplest strategy to pursue, when estimating an uncertain quantity, is to intuitively produce a number based on whatever information comes to mind. Although such intuitive judgments are inexpensive, simple, and often approximately correct, research has shown that they lead to systematic biases (MacGregor et al., 1988). For example, MacGregor et al. note that such judgments tend to remain close to an initial value (Tversky & Kahneman, 1974) and may be influenced by recency, salience, and vividness effects.

Some evidence indicates that persons with less experience in a behavioral domain are more likely to exhibit greater attitude-behavior inconsistency in that domain (Regan & Fazio, 1977; Fazio & Zanna, 1978). Regan and Fazio (1977), for example, found that college students who personally experienced a dormitory housing crisis showed greater consistency between their attitudes and behavioral attempts to cope with the crisis than did students with similar attitudes but no direct experience.
Memory processes
Many contemporary descriptions of the item response process rely heavily on cognitive models (e.g., Babor et al., 1990; Biemer, Groves, Lyberg, Mathiowetz, & Sudman, 1991). Theorists usually include the following stages in the item response process:
(a) individuals, over time, notice their behavior or the behavior of others,
(b) they store that information in memory, where it is subject to some degree of decay,
(c) individuals are presented with an item or task which they attempt to comprehend and then relate to information in memory,
(d) individuals retrieve the information and employ it in response to the item or task.
Distortion may arise from a number of complications during this process. For example, individuals usually do not know what information they will be required to retrieve for a subsequent psychological test, and consequently they cannot plan to systematically store relevant information. The longer the period between information storage and recall on the test, the more likely the information will have decayed or been interfered with, altering retrieval in some manner. The most extensive generative strategies must be employed when rating the past or the future, since the information needed to make those ratings must be retrieved from memory or generated on some basis. The least generation should occur with ratings of the present, as when a person self-reports current mood or a teacher rates a student's current classroom behavior. As the length of time between the rated event and the rating itself increases, so does the inaccuracy (cf. Paul, Mariotto, & Redfield, 1986a). Similarly, the number and complexity of the behaviors that observers must process about themselves or others appear to influence reliability. Nay (1979) reviewed research which found high negative correlations (in the -.50 to -.75 range) between the complexity of the categories recorded by an observer (defined as the number of categories employed divided by the total number of entries) and the reliability of those observations. Similarly, Schwarz (1990) reported that the more difficulty respondents experience in recalling information from memory, the more likely they are to rely on cues from response formats in creating their answers.
Schwarz (1999) noted that research evidence indicates that when asked to report information, respondents typically do not retrieve all the applicable knowledge. Instead, they tend to recall information that is chronically accessible, that is, information that comes to mind whenever the person thinks about the topic. In contrast, some information is temporarily accessible: it comes to mind when a context or situation prompts the person to recall it. One such context is the preceding items. For example, Strack, Schwarz, and Gschneidinger (1985) instructed respondents to recall three positive or negative events. Those who reported positive events subsequently indicated higher life satisfaction than those who reported negative events.
Schwarz (1999) also described assimilation effects (where a positive association encourages positive judgments) and contrast effects (where judgments are made by comparison against a standard), using the example of the then-popular Colin Powell. Respondents asked whether they knew which political party Powell had joined tended to subsequently rate the Republican party more positively (an assimilation effect). Respondents asked to rate Bob Dole in comparison with Powell rated Dole lower than did respondents who rated him independently of Powell (a contrast effect).
Schemas
Schemas are cognitive structures or networks which organize information. Usually functioning outside of awareness because they have been overlearned, these existing stores of knowledge and action sequences allow individuals to expect certain events, interpret new information, and solve problems (Gentile, 1990).

Test items and tasks reflect the language and schemas inherent in the culture of the test developer. To the extent that test respondents hold discrepant organizational schemas about test information, distortion from the reference frame of the test developer will occur. Such discrepancies are likely, for example, when the requirements of the test and the experiences of the respondent differ on such dimensions as language and culture. Some evidence suggests that a percentage of individuals store relevant information in a form incompatible with the measurement procedure. Tellegen (1985) reported a study in which 23 subjects completed self-ratings of mood over a 90-day period. Subjects' responses were factor analyzed separately; for 21 of the subjects, a two-factor solution emerged. However, for 2 of the subjects, no interpretable result was found. To search for suspected differences in perceived item meaning, 15 subjects (including the 2 discrepant subjects) were recontacted and asked to sort the mood terms into subsets with similar meanings. Analysis of the discrepant subjects' sortings indicated that they were processing and understanding the terms in a manner different from the remaining individuals. Tellegen concluded that the two-factor model of emotion was valid "provided respondents are able to report emotional experiences in accordance with consensual semantic rules" (p. 704).
Item meaning
Closely related to schemas is research examining how respondents come to create and understand the meaning of the questions asked of them. Schwarz (1999) provided a summary of many of the factors involved in this process.

Although respondents must understand the literal meaning of a question, they also make judgments about the question's pragmatic meaning. Schwarz (1999) relied on Grice's (1975) concepts about the conduct of everyday conversations to suggest four guidelines that influence that pragmatic meaning. In everyday conversation (which includes measurement situations), respondents attempt to make their communications:
1. Relevant to the goals of the conversation.
2. As informative as possible without providing more information than appears necessary.
3. As clear and unambiguous as possible.
4. Truthful.
Schwarz notes that respondents assume that the other person(s) involved in the communication also follow these implicit rules; in the case of self-reports, the other is the creator of the questions or test items. Respondents will assume, for example, that the most obvious meaning of a question is the correct interpretation.
Summary
Cognitive factors such as knowledge level, memory processes, and schemas represent potential sources of systematic error in psychological measurement. However, such internal factors, by themselves, cannot account for the presence of error. Rather, it is the interaction of these cognitive factors with test design and purpose that increases the likelihood of generative responses. Thus, a test which requires the respondent to read at the eighth-grade level is likely to elicit retrieved responses from a group of eighth graders, but guesses or random responses from a group of second graders. I will discuss in more detail the results of such matches and mismatches at this chapter's conclusion.

Motivational and Affective Influences
These factors refer to the effects of individuals' affective characteristics and states on the testing process. Interestingly, even these influences tend to be discussed in the literature in terms of cognitive processes.
Testing consequences
Given the widespread use of psychological tests for selection decisions, it would seem apparent that considerable emotion could result from an individual's perceptions of testing consequences. Tests can help decide whether a person obtains a particular job, is admitted to a desired school, or avoids an unwanted intervention. Cronbach (1984) noted that:

    Draftees have been known to report impressive arrays of emotional symptoms, hoping for discharge. In an ordinary clinical test, exaggerating symptoms may be a gambit to enlist sympathy and attention. An unsuccessful student may prefer to have the tester believe that his troubles are caused by an emotional disturbance rather than to be thought of as stupid or lazy. (p. 471)
In addition, when tests become the vehicle to create a label or diagnosis that becomes known to test-takers and other decision-makers, their consequences can have effects that last long beyond any immediate decision. Such labelling can potentially influence individuals' self-concepts and behavior across a range of situations; for example, a student who is placed in a remedial class on the basis of a test may overgeneralize a lack of skill to other content areas (cf. Fairbanks, 1992). This type of effect is one reason psychologists have increased their attention to ethical issues in testing. For example, Messick (1989a, 1989b, 1980) discusses test validity in terms of the function of testing (interpretation and use) as well as the justification for testing (appraisal of evidence and consequence).
Attempts to cope with problems introduced by testing consequences have ranged from complete openness to concealing testing purposes (Cronbach, 1984). Cronbach (1984) suggested that making the testing purpose transparent is most common in situations where respondents are anonymous (as in some types of opinion polling) or when respondents may potentially benefit from valid self-disclosure (as in symptom reports in preparation for a clinical intervention). At the other extreme is a strategy of concealment where test developers attempt to hide the test purpose. For example, developers frequently create innocuous titles for tests (e.g., "Human Services Inventory" instead of the Maslach Burnout Inventory; Maslach & Jackson, 1981) or provide test-takers with a plausible but false rationale for the testing purpose (Cronbach, 1984).
Test anxiety
Gregory (1992) provided a contemporary review of evidence and theory about test anxiety, the emotional experience of individuals who anticipate failure on a test. While noting that past research has shown that test anxiety negatively affects test performance (e.g., Hembree, 1988), Gregory (1992) also questioned whether poor performance precedes and causes the anxiety. For example, Paulman and Kennelly (1984) found that test-anxious students had ineffective test-taking strategies, while Naveh-Benjamin, McKeachie, and Lin (1987) found that many test-anxious students also possessed poor study habits.
Gregory (1992) cited studies which indicate that test-anxious individuals appear to possess a threshold that, once crossed, results in severe performance drops. For example, Sarason (1961, 1972) found no difference in performance on a memorization task between high- and low-anxiety individuals when the task was presented as neutral and nonthreatening. When the task was presented as an intelligence test, however, the high-anxious students' performance declined significantly. Similarly, Siegman (1956) found that high-anxious patients performed worse on timed as opposed to untimed WAIS subtests. These results may be explained by the cue utilization hypothesis (Easterbrook, 1959), which holds that emotional arousal alters individuals' ability to attend to environmental cues. As arousal increases, attention is narrowed to task-relevant stimuli; once arousal crosses a threshold, however, individuals lose their capacity to effectively process the cues related to the task.
For individuals who perceive a topic or test situation as anxiety-producing, completing the test quickly constitutes escape (negative reinforcement). Gentile (1990; see also Geen, 1987) argues that this is widespread in academic tasks, and it is likely to occur in clinical assessment as well. Similarly, Cronbach (1946) noticed that some students may speed through a test, giving the appearance of random responding.
Emotional states
The emotional states that individuals bring to tests or that are induced by tests can affect test response. Brody and Forehand (1986) found that depressed mothers were more likely than mothers with low depression to interpret their children's noncompliant behavior as indicative of maladjustment. Neufeld (1977) observed that psychologists may avoid testing schizophrenics and some depressed persons because those groups are presumed to be unable to make valid judgments about their psychological attributes. Contrada and Krantz (1987) reported data indicating that illness and accompanying treatment can affect self-reports. Perceptual and cognitive distortions that may interfere with performance on measurement tasks are also apparent in such clinical phenomena as eating disorders, stress, anxiety, and depression (Halmi, Sunday, Puglisi & Marchi, 1989; Meier, 1991).
Some authors have proposed an association between affective disorders and test response style. Freeman (1955), for example, suggested that: (a) obsessive-compulsive persons provide test responses which are too detailed, but also full of uncertainty and doubt; (b) anxious persons have difficulty finding appropriate words or blurt out inappropriate replies; and (c) psychotic individuals demonstrate disorganized thinking and bizarre content in their responses.
Schwarz (1999) reported on research indicating that individuals report more intense emotions retrospectively than concurrently (e.g., Parkinson, Briner, Reynolds, & Totterdell, 1995). This may be due to the reference period employed by respondents. Schwarz noted that concurrent reports typically refer to one day or so, while retrospective reports request information about longer periods. Respondents may infer that with the briefer period the questioner is interested in more frequent events, while the longer period asks about more infrequent (and possibly more intense) events. It is also possible that with more time respondents feel more comfortable holding the intense emotions in consciousness.
Fatigue and boredom
As previously displayed in Table 3, authors who create lists of measurement errors typically include fatigue and boredom. These are psychological states whose effects are presumed to be an interaction between test-taker and test characteristics. Given that traditional personality and cognitive performance tests can require several hours of effort, it is not surprising that some respondents report fatigue and negative thoughts at the end of tests (cf. Galassi, Frierson, & Sharer, 1981). Fatigue effects have been observed, for example, in surveys of magazine readership, crime reports, and reports of symptoms (Sudman & Bradburn, 1982).
In general, humans attempt to minimize their cognitive processing load (e.g., Fisher & Lipson, 1985). Sudman and Bradburn (1982) noted that questionnaire respondents who become aware that "yes" answers are followed by lengthy follow-up questions may quickly learn to answer "no." Similarly, questions may vary in the amount of detail they request respondents to recall (e.g., current salary versus current interest on savings). As Biemer et al. (1991) noted, when questions become too difficult for respondents, they may prematurely terminate their cognitive processing.
Summary
Testing consequences, test anxiety, emotional states, and fatigue and boredom are potential sources of systematic errors. As with cognitive factors, individuals' affect and motivations interact with test design and purpose. Thus, an ability test administered to select new employees may evoke different motivational states and response styles than the same test administered in a research study investigating learning style.
Behavioral Influences
The testing environment can also influence test-takers' responses. Potential factors include the presence or absence of observers, test administrators' gender and race, the physical characteristics of the testing room (e.g., temperature and lighting), and the use of testing apparatus such as pencils or computer keyboards (which may pose difficulties, for example, for persons with physical disabilities). Probably the most studied problem involves the use of behavioral observers.
Reactivity
Although the term has been employed in the literature with slightly different meanings, reactivity is defined here as the possible distortion that may result from individuals' awareness that they are being observed by other persons for the purpose of measurement (Kazdin, 1980). The assumption is that as a result of learning (a) that testing or research is occurring, or (b) the intent of testing or research, individuals may respond differently than they would in unobserved situations. Hartman (1984) reviewed research that found that children's reactivity is influenced by such observer attributes as gender, age, and activity level, while adults are influenced by observers' tact and appearance. Other research shows that respondents will pay attention to different aspects of a research situation depending upon their knowledge of the researcher's interests (Norenzayan & Schwarz, in press).
Reactivity has also been described in terms of the transparency of testing or research; the purpose of test items, for example, can be described as more or less obvious to test-takers. The potential importance of reactivity can be illustrated by results reported in Smith and Glass's (1977) meta-analysis of psychotherapy outcome research. Smith and Glass calculated correlations between the amount of psychotherapy gain and such variables as client intelligence, therapist experience, and the reactivity of outcome measures. To gauge reactivity, Smith and Glass rated the transparency of each measure employed in the 375 psychotherapy studies they examined. Of all factors examined, reactivity correlated the highest, at .30. This means that studies which employed the most transparent measures demonstrated the greatest therapeutic gain, leaving open an important alternative methodological explanation for study results.
Early work by Terman and Miles (1936, cited in Loevinger, 1957) indicated that traits could be measured more accurately when the intent of the measurement was hidden from the test-taker. As noted above, some research indicates that more subtle items may prevent socially desirable responding and make it more difficult for psychiatric patients to generate normal responses. The logical direction to move with such an approach is to make measurement unobtrusive, that is, to collect data from individuals without their knowledge. Unobtrusive measures have been proposed as a viable alternative and supplement to traditional assessment strategies (cf. Webb et al., 1981).
Unobtrusive measurement
Examples of unobtrusive measurement include simple observation in naturalistic settings, observation in contrived situations, examination of archival records, and obtaining physical traces (Kazdin, 1980; Webb et al., 1981). Abler and Sedlacek's (1986) review provided several applied examples of unobtrusive measurement. In one study, researchers attempting to determine the effectiveness of an assertiveness training program posed as magazine salespersons and telephoned former participants to determine the program's effects (McFall & Marston, 1970). Another group of researchers found that prospective college students who made more errors filling out orientation applications were more likely to drop out (Sedlacek, Bailey & Stovall, 1984). Epstein (1979) reported a study in which students' self-reports of tension were significantly correlated with the number of erasures on exam answer sheets, number of absences, and number of class papers that were not turned in.
Several problems are inherent, however, in unobtrusive measurements (Webb et al., 1981; Kazdin, 1980; Meier & Wick, 1992). First, considerable effort may be necessary to obtain an unobtrusive measurement. It is much easier, for example, to administer a self-report scale to alcohol treatment participants than to create a simulated bar or observe subjects drink on weekend nights. Second, collecting unobtrusive measurements without arousing subjects' suspicions may be difficult. Third, construct validity is seldom addressed with unobtrusive measures. The behavior of individuals in naturalistic or contrived situations, for example, may not be a direct reflection of a unidimensional construct. Finally, unobtrusive measures may pose ethical problems.
While researchers who employ unobtrusive measures may reveal this fact at debriefing, practitioners who wish to collect multiple unobtrusive measures (e.g., at the beginning, middle and conclusion of multiple treatments) may be motivated to conceal the measures' true intent.
In experimental situations, researchers have documented that experimenters' expectancies and subjects' desire to receive experimenters' approval influence subjects' behaviors (Kazdin, 1980; Rosenthal, 1976). Several strategies have been employed to decrease experimenter expectancy and experimenter approval effects. One might attempt to keep the person who actually runs the study--the experimenter--as well as the subjects blind to the study's hypotheses. One might also include a control group whose expectations have been set similar or counter to the experimental group's; analyses would contrast the changes in such a control with those of the intervention. With such control groups, expectancies and desirability factors become objects of investigation rather than error.
Summary
The environment and context of testing provide a third category for describing sources of systematic errors. The most central of these problems is reactivity, changes in behavior that occur in individuals who become aware of being observed or measured on some psychological dimension. Unobtrusive measures and concealing knowledge about testing purpose are among the strategies employed to circumvent these problems.
Examples of Generative Response Strategies
Generative response strategies such as socially desirable responding and random responding have been extensively studied by psychologists. As will be seen below, however, little consensus exists about the importance of such strategies, and with a few exceptions, about methods to minimize them.
Random responding
Systematic errors at the level of the individual may result in what appear to be random test responses. For example, respondents' lack of motivation to cooperate with testing may be manifested by responding to items randomly.
Test developers attempt to identify random responding by including items likely to be true or false for all persons (e.g., "I eat every day"). Clinicians may become experienced in recognizing random response profiles. Random response profiles, however, may also resemble those produced by persons with psychiatric diagnoses such as psychosis (Murphy & Davidshofer, 1988). Consequently, psychologists who administer tests will also conduct an interview to separate random responders from individuals with other problems.
Berry et al. (1992) investigated random responding in a series of studies with the MMPI-2. In a study of college students, they found that 60% gave one or more random responses to the 567 items. Seven percent reported random responding to many or most of the items; students who acknowledged some random responding averaged 36 such responses. Berry et al. found few correlations between self-estimates of random responding and subjects' demographic characteristics. In a second study, Berry et al. found that most subjects who admitted to random responding reported having done so at the end of the test, although another sizeable group scattered responses throughout. A third study with subjects from the general population found that the number of self-admitted random responders dropped to 32%. Finally, a study of 32 applicants to a police training program found that 53% indicated that they had randomly responded to some items.
Dissimulation and malingering
Dissimulation refers to faking good or bad on test items (Murphy & Davidshofer, 1988), while malingering occurs when individuals simulate or exaggerate psychological conditions (Smith & Burger, 1993). Given that many test items and tasks are transparent in their intent to detect such phenomena as psychopathology or dishonesty, test-takers may be motivated and able to generate answers that suit their purposes rather than reflect valid or retrieved information. For example, prejudiced individuals may very well tell a poll-taker that they would vote for an African-American presidential candidate when in fact they would not. Similarly, individuals who wish to receive disability payments may exaggerate their complaints and symptoms. Dahlstrom (1985) noted that as early as the 1930s investigators were able to demonstrate the ease of faking answers on psychological tests. Terman and Miles (1936), for example, found that the most discriminating items on a scale designed to show personality differences between men and women were also the most susceptible to change under explicit instructions to fake test answers in a masculine or feminine direction.
Test developers attempt, during test construction, to identify and reject items that may be easily faked. Developers have created scales to detect malingering (e.g., Beaber, Marston, Michelli, & Mills, 1985; Rogers, Bagby, & Dickens, 1992; Schretlen, 1986; Smith & Burger, 1993) as well as tests which include special items designed to detect dissimulation. Psychiatric patients, for example, appear less able to provide normal responses when item subtlety increases (Martin, 1988).
Martin (1988) reviewed the best-known MMPI scales designed to identify distorted responding, including the (a) Lie Scale, items in which the respondent may claim great virtue, (b) F scale, infrequently answered responses that may indicate a tendency to fake illness, and (c) K scale, subtle items intended to assess defensiveness and willingness to discuss oneself. A weighted derivation of the K scale is added to other MMPI clinical scales to correct for the generation of defensive responses. The problem with identifying items sensitive to dissimulation is that such items may also be sensitive to other factors. The F and Fb scales of the MMPI-2, made up of items reflecting clinically aberrant and statistically rare responses, are also affected by symptom exaggeration, psychosis, and random responding (Berry et al., 1992). The Variable Response Inconsistency (VRIN) scale (Tellegen, 1988) is composed of statistically and semantically rare item pairs and appears to be able to separate random responders from other groups (Wetter, Baer, Berry, Smith & Larsen, 1992).
Social desirability
Social desirability is a type of response set, that is, a tendency to respond with answers that the respondent believes are most socially acceptable or make the respondent look good (Edwards, 1953; Nunnally, 1967). Paulhus (1991) noted that psychometricians have been aware of social desirability effects since at least the 1930s (e.g., Bernreuter, 1933; Vernon, 1934). Social desirability researchers maintain that it is a separate trait that varies among individuals (i.e., individuals have different needs for approval) and that it is social desirability that most personality tests actually measure. Edwards (1970), for example, summarized research demonstrating strong correlations between the probability of endorsement of personality items (contained in tests such as the MMPI) and their social desirability value.
While social desirability has been researched primarily with personality tests, the phenomenon has also been noted with other measurement methods, such as the clinical interview. Barlow (1977) described a patient who came to treatment with problems of anxiety and depression which the patient indicated were associated with social situations. Over a one-year period the patient made progress in a treatment emphasizing the learning of social skills, but still complained of anxiety and depression. Finally, the patient blurted out that the real cause of the discomfort was strong feelings of homosexual attraction he experienced in some social situations. Asked why he did not report this previously, Barlow (1977) wrote that "he simply said that he had wanted to report these attractions all year but was unable to bring himself to do so" (p. 287). While homosexuality may not be the taboo subject it was for many people in the 1970s, issues surrounding such sensitive topics as sexuality and substance abuse remain subject to social desirability errors.
Hser, Anglin and Chou (1992), for instance, found that self-reports of male addicts showed greater inconsistency between two interviews for more socially undesirable behaviors, such as narcotics use, than for socially desirable behaviors, such as employment.
Given the evidence that social desirability affects test responses, psychologists have attempted to eliminate its effects during scale construction and during test-taking (Paulhus, 1991). Although no consensus about best methods has been reached, strategies have included:
(a) instructing test-takers to respond honestly (e.g., Benjamin, 1988). Little research is available to document this instruction's effectiveness (Martin, 1988).
(b) developing instruments such as the Social Desirability Scale (Crowne & Marlowe, 1964) to identify and eliminate test items (during scale development) or test-takers (during concurrent administration of other tests) which correlate too highly with social desirability scores. Similarly, judges may rate new test items on a scale from extremely desirable to extremely undesirable in an effort to detect relatively neutral items. Research results suggest considerable agreement among groups of judges, including preschool children, different psychiatric populations, and even judges from different cultures, on the desirability of specific items (Edwards, 1970; Jackson & Messick, 1969).
(c) providing items with two alternatives of equal social desirability value (Edwards, 1970). Some evidence suggests this strategy is ineffective (Waters, 1965).
(d) presenting subtle items which may be less transparent and therefore less easily faked (Martin, 1988). Hough et al.'s (1990) review of the literature provided little support for use of subtle items (e.g., Duff, 1965; Holden, 1989; Holden & Jackson, 1979; Jackson, 1971; McCall, 1958; Nelson & Cicchetti, 1991; Wiener, 1948). Graham (1990) reviewed studies of obvious and subtle items designed to indicate emotional disturbance that Wiener (1948) proposed for the MMPI. He concluded that obvious items are more highly correlated with nontest behaviors than subtle items and that MMPI Subtle-Obvious subscales do not accurately detect faked responses. More recent research by Nelson, Pham, and Uchiyama (1996) also supports Wiener's conclusions.
and (e) warning respondents that methods to detect distortion exist. Hough et al. (1990) cited four studies which found support for the efficacy of these approaches (Haymaker & Erwin, 1980; Lautenschlager & Atwater, 1986; Schrader & Osburn, 1977; Trent, Atwater & Abrahams, 1986).
Despite an acknowledgment of the potential effects of social desirability, no consensus has been reached about the size of its effects. Dahlstrom (1969) suggested that social desirability may simply be another component of, rather than a substitute for, factors such as neuroticism that are measured by scales such as the MMPI. Social desirability, then, becomes not so much an error which must be eliminated or controlled as another component or type of psychological trait. Similarly, Martin (1988) suggested that socially desirable responses may not be invalid because most people typically do behave in a socially desirable manner. That is, individuals do attempt to manage the impressions they present to other people (Messick, 1991).
Problems such as social desirability bias may have persisted partially because of the dominating assumptions of selection testing. For example, McCrae and Costa (1983) wrote that,
As an item or scale characteristic, therefore, SD [social desirability] is a potentially serious problem in situations in which information is required about the absolute level of a response. For most psychological applications, however, absolute levels of scale scores are meaningless or difficult to interpret. Instead, normative information is used to compare the score of one individual to those of other persons from populations of interest. If the scores of all individuals are uniformly inflated or decreased by SD, it will make no difference in interpreting scores, since rank order and position in a distribution are unaffected. (p. 883)
But for many testing purposes, including selection, the absolute level of response is important.
----------------------------------------
Insert Figure 6 About Here
----------------------------------------
In psychotherapy outcome research, it is quite plausible that social desirability effects would influence the mean of individuals' scores on such negative constructs as stress, anxiety, depression, and drug abuse. As shown in Figure 6, pretest scores on a socially undesirable construct such as anxiety might demonstrate a range restriction or floor effect (a). Many psychological interventions, however, teach clients to recognize and accept the experience of some amounts of anxiety as normal. If this education was the primary intervention effect--thereby reducing the socially undesirable perception of anxiety--posttest scores in the intervention condition might demonstrate a greater range as well as an increase in mean anxiety level from pretest to posttest (b). Use of a retrospective pretest (e.g., Howard, Millham, Slaten, & O'Donnell, 1981) might demonstrate the expected pretest-posttest decrease in anxiety (c), but the strength of placebo effects--the intervenee's expectation that change has occurred, regardless of the intervention's actual efficacy--makes acceptance of retrospective reports controversial.
To the extent that social desirability moves test scores toward the floor or ceiling of scale values and thereby restricts the range, interpretation of theoretical research and selection relations becomes problematic. As discussed in Chapter 1, theory development requires precise measurement which demonstrates the smallest distinctions (and therefore, greatest range) possible; to the extent that social desirability reduces such distinctions, measurements cannot reflect the full characteristics of the actual phenomena. The usefulness of selection testing depends upon concurrent and predictive validity coefficients, correlations that will be attenuated when range restriction occurs. Social desirability thus has the potential to affect many types of psychological measurement and assessment.
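The attenuating effect of a restricted range on validity coefficients can be sketched with a small simulation. This is illustrative only: the Gaussian trait, the criterion model, and the clipping point are my assumptions, not values from the studies cited above.

```python
import random
import statistics

rng = random.Random(1)

def pearson(x, y):
    """Pearson correlation for two equal-length lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sx, sy = statistics.pstdev(x), statistics.pstdev(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) * sx * sy)

# Latent trait (e.g., true anxiety level) and a criterion correlated with it.
true_scores = [rng.gauss(0, 1) for _ in range(5000)]
criterion = [t + rng.gauss(0, 1) for t in true_scores]

# Floor effect: socially desirable responding piles low reports at the
# scale minimum, restricting the range of observed scores.
observed = [max(t, 0.0) for t in true_scores]

r_true = pearson(true_scores, criterion)
r_observed = pearson(observed, criterion)
print(round(r_true, 2), round(r_observed, 2))
```

Clipping the lower half of the distribution to the floor value leaves the observed validity coefficient noticeably smaller than the unrestricted correlation, which is the attenuation problem described above.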
Response styles
Lanyon and Goodstein (1982) differentiated between response styles and response sets. They described a response set as a tendency to distort responses so as to generate a specific impression; thus, social desirability is a response set. Response style is a distortion in a particular direction regardless of item content; agreeing with all items regardless of what the item states is an example of a response style. Response style and set have often been employed interchangeably in the measurement literature, causing some confusion. In the discussion below, what researchers have termed both styles and sets refer to Lanyon and Goodstein's description of response styles.
The two most recognized response styles have been the tendency to agree with an item, called acquiescence, and the tendency to disagree with an item, called criticalness (Murphy & Davidshofer, 1988). Other proposed dimensions (see Messick, 1991) include uncritical agreement, where individuals agree because they lack the intellectual abilities necessary to critically respond to items, and impulsive acceptance, where individuals answer quickly and without much thought. A fifth type, acceptance, refers to individuals' willingness to accept personality characteristics on psychological tests as indicative of one's self-system (Messick, 1991). Acceptance relates to the nomothetic-idiographic split: if individuals do not accept particular dimensions as having relevance for them, how can they produce an answer to test items based on those dimensions?
Historically, interest in response styles has been cyclical. Messick (1991) credited reviews published by Cronbach in the 1940s (e.g., Cronbach, 1946) with calling attention to response sets in psychological testing. Sparked by Cronbach's reviews, hundreds of empirical studies investigating response styles were published, only to be followed by another set of publications that disputed the very existence of response styles (Messick, 1991). Interest in response styles has also been motivated by a desire to minimize such effects in popular psychological tests. Acquiescence became an issue in personality measurement when researchers noted that a major measure of authoritarianism, The California F Scale, was keyed so that all true responses were indicative of the trait (Martin, 1988; Messick, 1967). Thus, high scores could indicate a high level of authoritarianism or acquiescence. However, F Scale defenders suggested that acquiescent responding was itself one component of authoritarianism, a claim which never obtained much empirical support (Messick, 1967).
In the 1960s, acquiescence became a concern with the MMPI since 85% of its items are positively phrased (Messick, 1991). Several studies found that factor analysis of MMPI items revealed two factors which could be interpreted in terms of social desirability and acquiescence (e.g., Jackson & Messick, 1961, 1962, 1967). In other words, these researchers suggested that the MMPI did not measure its purported content as much as it reflected individuals' response sets and styles. Block (1965) responded to this criticism by revising MMPI items to reduce the effects of response styles and sets and then subjecting responses to the resulting items to factor analysis. Results indicated that the revised MMPI had a factor structure similar to the unrevised scale. This raised doubts about whether response styles actually confounded content measurement, suggesting instead that sets and styles were actually reflections of the very traits measured by the MMPI (see also Dahlstrom, 1969). The issue continues to be raised in present times, as in Messick's (1991) contention that problems caused by response styles have been documented in the measurement of mood states and androgyny.
Some response styles occur in the presence of certain item and test-taker characteristics. For example, acquiescence appears most pronounced when test-takers are presented with ambiguous items (Jackson & Messick, 1958). Berg (1955) suggested that acquiescence is a logical response of individuals in our culture who are presented with questions about matters they deem to be unimportant. Individuals with low verbal ability or high impulsivity are also more likely to employ response styles (Jackson, 1967).
To reduce acquiescence and criticalness, many test developers maintain a balance of items that can be scored true or false as indicative of the measured attribute. Suggestions have also been made to increase the content saturation of tests (Messick, 1991) and to write items that are clear and relevant to the test-taker (Jackson, 1967). Cronbach (1946) suggested that response sets could be reduced by any procedure that increased the structuredness of a test, such as better defining response alternatives or changing test instructions.
Interestingly, efforts to decrease response styles have been found to increase response sets and vice-versa. Martin (1988) suggested that response sets partially result from the clarity of item content: the clearer the item, the more likely that a response set such as social desirability will occur. If the item is ambiguous, however, then the probability of a response style such as acquiescence increases. Martin (1988) noted that projective tests were partially constructed on the assumption that more ambiguous stimuli would lead to less faking and socially desirable responding. This assumption, however, has not received much empirical support (Lanyon & Goodstein, 1982). Similarly, test experts have debated the usefulness of more subtle but ambiguous items, whose intent may be less transparent to test-takers, but which may also invite acquiescence or criticalness because individuals have little basis on which to respond. For example, Murphy and Davidshofer (1988) suggest that a question like "I hate my mother" is very clear and invites a response based on its content. Test-takers may also suspect, however, that such a question may be intended to measure neuroticism or psychopathology. A question like "I think Lincoln was greater than Washington" is less transparent, but a respondent who must generate a response may simply agree because of the positive item wording. Such a respondent might also agree with the statement that "I think Washington was greater than Lincoln." As noted previously, studies tend to favor the validity of obvious items over subtle ones. Consequently, the use of subtle items to diminish response sets may increase the likelihood of a response style and thereby diminish test validity.
Error Simulations
As noted above, most of the attention paid to systematic errors occurs during the item selection process. Test developers, for example, might examine correlations between test items and scores on a social desirability scale to eliminate items with a high SD relation. Few psychological scales contain items designed to detect systematic errors, and subscales designed to detect such errors often cannot differentiate between error types or their causes. Yet systematic errors are likely to be a function of persons and situations as well as items, so we should expect such errors even with items designed to minimize them. If errors such as random and socially desirable responding are present in test data, how could they be detected? Would such errors, for example, be apparent in the statistical procedures typically applied to describe and evaluate psychological scales?
One approach to this problem is to create a series of data sets containing ideal and error-laden values for comparison. Figure 7 displays responses to 10 5-point Likert items by 100 hypothetical individuals that form a unidimensional Guttman scale: all the persons in a higher level possess the characteristics of those at the next lower level, plus one more (Reckase, 1990). The data in Figure 7 have a mean of 30.00, a standard deviation of 14.00, and a coefficient alpha of 1.00.
----------------------------------------
Insert Figure 7 About Here
----------------------------------------
Suppose I simulate random responding by creating a computer program that takes these data and substitutes random numbers for 50% of the values. Table 4 displays means, standard deviations, and coefficient alphas for 10 such simulations. The means of these data are near the original's value; the standard deviations are considerably lower, but are relatively consistent across simulations. Surprisingly, the alphas are moderately high, near the median reported for actual scales sampled by Meier and Davis (1990). Given these results, I think it unlikely that most researchers and practitioners would be able to identify, in typical psychological test data, the type of systematic errors discussed above.
----------------------------------------
Insert Table 4 About Here
----------------------------------------
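A simulation of this kind can be sketched in a few lines of Python. The exact structure of the Figure 7 data is an assumption here (20 persons at each of five levels, every item answered with the person's level), chosen only because it reproduces the reported mean, a standard deviation near 14, and an alpha of 1.00; the alpha function implements Cronbach's standard formula.

```python
import numpy as np

def coefficient_alpha(data):
    """Cronbach's alpha: (k/(k-1)) * (1 - sum of item variances / variance of totals)."""
    k = data.shape[1]
    item_vars = data.var(axis=0, ddof=1).sum()
    total_var = data.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Assumed Figure 7 data: 100 persons, 10 five-point Likert items.
# 20 persons at each level 1-5 answer every item with their level,
# yielding mean 30, SD near 14, and alpha = 1.00.
levels = np.repeat(np.arange(1, 6), 20)
clean = np.tile(levels[:, None], (1, 10)).astype(float)

# Random responding: replace 50% of values with uniform random 1-5.
rng = np.random.default_rng(0)
noisy = clean.copy()
mask = rng.random(clean.shape) < 0.5
noisy[mask] = rng.integers(1, 6, size=clean.shape)[mask]

totals = clean.sum(axis=1)
print(totals.mean())                # 30.0
print(coefficient_alpha(clean))     # 1.0
print(coefficient_alpha(noisy))     # moderately high despite 50% random responses
```

Across reruns with different seeds, alpha for the degraded data stays roughly in the .7-.8 range, consistent with the point above that a conventional reliability analysis would not flag even massive random responding.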
Consistency within Human Response Modes: Desynchrony of Behavior, Affect and Cognition
Binet and his contemporaries believed that physical and mental processes were closely linked. But Binet's cognitive tasks outpredicted Cattell's physical measures. Why should that be so? Should not individuals' performance on sensory and perceptual tasks be the first link in the chain of psychological processes? The lack of correlation between psychological and physiological measures surprised early psychologists and opened the way to fuller inquiry into the relations between different human systems.
Awareness of desynchrony--the lack of correlation between human systems that seemingly should be interrelated--has expanded over time. Social psychologists initially assumed that individuals' attitudes caused subsequent behavior, only to find that many attitude-behavior correlations are low (Liska, 1975). Researchers who wish to decrease alcohol and drug use in adolescents by improving attitudes and knowledge about alcohol often find little behavioral change (Tobler, 1986). And psychotherapists who assumed that changes in clients' cognitions, affect, or behaviors would result in immediate changes across other systems have been wrong (Rachman & Hodgson, 1974). This is a measurement problem in the sense that psychological theories have often failed to describe what system should be measured when.
Psychological theorists have found it useful to divide human functioning and personality into distinct systems or modes. A large number of such modes have been suggested. A partial, overlapping list includes behavior (e.g., motor and interpersonal), affect, cognition, sensation, imagery, physiology, and communication (e.g., verbal and nonverbal) (Lang, 1968; Lazarus, 1981). Such divisions allow theorists to propose causal mechanisms to explain functioning. Other factors external to individuals, such as social, cultural, and physical environments, have also been discussed as causal factors, but they will not be included here.
Psychologists commonly discuss three modes as distinct and basic: behavior, affect, and cognition (B-A-C). B-A-C represents a simplified system that most psychologists understand. Behavior refers to the overt actions individuals perform; we may measure, for example, how frequently a manager speaks with an employee and the duration of those conversations. Affect is the emotion or mood a person experiences; an employee might experience such feelings as anxiety or satisfaction on the job. Cognition refers to the covert thought processes of an individual; a manager might think "Giving Jones a pay raise might increase her job satisfaction and keep her at this company."
----------------------------------------
Insert Figure 8 About Here
----------------------------------------
Desynchrony refers to inconsistencies within an individual's B-A-C modes. For example, an adolescent might express, during a health education class, a strong belief about the dangers of drinking and driving and then be
arrested the following weekend on a drunken driving charge. A woman might know the importance of getting a mammogram but fail to schedule an examination. As shown in Figure 8, a patient completing systematic desensitization may no longer obsess about snakes, may be able to hold a snake, but may still report considerable anxiety about snakes when walking in the woods.

Desynchrony has measurement implications in terms of what to measure and when (Eifert & Wilson, 1991). If distinct modes exist, measurement of any one mode can provide only an incomplete model of human psychology. Suppose we are attempting to decrease drug abuse by at-risk adolescents and believe that behavior change should occur after an intervention that includes education and some type of affectively-focused group discussion. In such a model, attempts to observe behavior change would be futile until the required cognitive and affective processes had occurred. If those modes are desynchronous, measurement of different modes must be guided by a theory of the desynchrony to make sense of the correlations that result. Such issues become only more complex if the processes of desynchrony vary idiographically.
History. Desynchrony between physiological and cognitive measures constituted one of the first crises in the history of psychological measurement. As noted in Chapter 1, the success of physiology as a science strongly influenced early psychologists. Fechner and Cattell not only borrowed the methods of physiological laboratories, but viewed physiology as the foundation for the higher psychological processes. In 1896 Cattell asked: "To what extent are the several traits of body, of the senses, and of mind interdependent? How far can we predict one thing from our knowledge of another? What can we learn from the tests of elementary traits regarding the higher intellectual and emotional life?" (Sokal, 1987). Psychologists were interested in discovering general laws that would mark psychology as a legitimate science, and a genuine body-mind link would certainly qualify as an important instance of such a law.

It follows that psychologists would first employ physiological tasks such as grip strength in an attempt to predict such psychological phenomena as intelligence. But Binet's tasks, which were distinctly non-physiological, better predicted school performance, leaving open the question of why physiological states did not better correspond with psychological ones. These issues remain important 100 years later. Psychologists still search for physiological markers and causes of psychological states, but this has proven to be a difficult task (cf. Cacioppo & Tassinary, 1990). For example, Goldstein and Hersen (1990) maintained that efforts to identify biological markers of most forms of psychopathology have been unsuccessful. Babor et al. (1990) similarly noted the lack of success in identifying biochemical markers of alcoholism. Matarazzo (1992) predicted that intelligence testing would become increasingly linked with physiological measures, particularly those assessing brain activity.
He reviewed studies (Hendrickson, 1982; Reed & Jensen, 1991) which found moderate to high correlations between brain activity and intelligence scores. However, rather than demonstrating that biology causes intelligence, as hereditarians since Galton have believed, these studies illustrate the concurrence of brain activity and cognitive processes.
Rachman and Hodgson (1974) first discussed the inconsistency of human modes using the term desynchrony. Working with phobic patients, they noticed that avoidance behavior and expressions of fear could correlate positively, negatively, or not at all. Rachman and Hodgson were also aware of Lang's (1968) work in which he referred to fear as a group of "loosely coupled components" (Rachman & Hodgson, 1974, p. 311). Rachman and Hodgson (1974) defined discordance as low correlations between measures at a single time point and desynchrony as low correlations between change scores across time. Rachman (1978) also suggested that desynchrony can be described as mode changes that occur at different speeds. Rachman and Hodgson (1974) suggested desynchrony will be minimized during intense emotional arousal and maximized when an external source influences one of the modes. They indicated that psychotherapy also appears to influence desynchrony; synchrony frequently is both a goal and outcome of therapy.
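Rachman and Hodgson's distinction can be made concrete with a small numerical sketch (all numbers below are hypothetical): discordance is the correlation between two modes measured on a single occasion, while desynchrony concerns the correlation between the modes' change scores across time.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50  # hypothetical phobic patients

# Pre-treatment scores on two modes; the modes agree at pre-test.
behavior_pre = rng.normal(50, 10, n)
affect_pre = 0.9 * behavior_pre + rng.normal(0, 4, n)

# Suppose treatment changes behavior substantially but affect only
# slightly, and the two changes are unrelated across patients.
behavior_post = behavior_pre - rng.normal(20, 5, n)
affect_post = affect_pre - rng.normal(5, 8, n)

# Discordance: correlation between modes at a single point in time.
discordance_r = np.corrcoef(behavior_post, affect_post)[0, 1]

# Desynchrony: correlation between the modes' change scores.
desynchrony_r = np.corrcoef(behavior_post - behavior_pre,
                            affect_post - affect_pre)[0, 1]
```

In this illustration the two modes remain substantially correlated at post-test (low discordance), yet their changes are uncorrelated by construction (marked desynchrony), showing why the two indices must be examined separately.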
Surprisingly few gains in knowledge have occurred about consistency across human response modes. Psychologists have typically assumed that these modes work as we experience them, that is, as a unified whole, or that only one of the modes, typically behavior, is worthy of study. Regarding the latter, Loevinger (1957) maintained that "the common error of classical psychometrics and naively operational experimental-theoretical psychology has been to assume that only behavior is worth predicting" (p. 688).
Evidence for mode inconsistency. Evidence of desynchrony is plentiful. Behavioral, cognitive and affective measures of fear and anxiety, for example, often demonstrate low to moderate intercorrelations (e.g., Abelson & Curtis, 1989; Craske & Barlow, 1988; King, Ollendick & Gullone, 1990; Leitenberg, Agras, Butz, & Wincze, 1971; Mineka, 1979). Measures of subjective sexual arousal do not always correspond to physiological measures of arousal (Hall, Binik, & diTomasso, 1985; Henson, Rubin, & Henson, 1979). Some psychotherapeutic and psychopharmacological treatments appear to produce desynchrony (Kincey & Benjamin, 1984), while others do not (McLeod, Hoehn-Saric, Zimmerli, de Souza, & Oliver, 1990). Desynchrony between cardiac responding and skeletal action has also been observed in animals (Gantt, 1953). In alcohol prevention programs, it is common to find a positive but low correlation between measures of attitudes toward alcohol consumption and alcohol consumption itself (Tobler, 1986).
Mavissakalian and Michelson (1982; see also Barlow, Mavissakalian & Schofield, 1980; Michelson, Mavissakalian, Marchione, Ulrich, Marchione, & Testa, 1990) studied patterns of change with 26 agoraphobics who were assigned to different 12-week treatment programs. They measured clinical, behavioral and physiological variables at pre-test, during treatment, and at a 1-month follow-up. Mavissakalian and Michelson found that the appearance of synchrony and desynchrony was at least partially a function of the length of the interval between measurement periods. In general, behavioral and clinical measures changed most quickly, followed by physiological measures. Hodgson and Rachman (1974) reported similar findings for the order of mode change. The most common form of individual desynchrony during treatment was for self-reports of anxiety to decline while heart rate increased.
Hall, Binik, and diTomasso (1985) employed 20 heterosexual male college students in a study designed to assess physiological and subjective sexual arousal. While listening to audiotapes describing heterosexual intercourse, subjects moved a dial signifying low to high arousal; simultaneously, penile tumescence was measured by a strain gauge. Subjects demonstrated considerable variability on correlations calculated between physiological and subjective arousal measures. The highest correlations were present for individuals who were the most physiologically aroused and the slowest to report maximum levels of subjective arousal.
Causes. Why does desynchrony occur? Rachman and Hodgson (1974) reviewed three possibilities that remain viable. First, different modes could be linked to different types of reinforcement and reinforcement schedules (Gray, 1971). Thus, some agoraphobics maintain their avoidance behavior by attending to their home as a signal of safety; cognitive and affective states might not be reinforced by those same cues. This explanation has found support in studies demonstrating that, when highly motivated, phobic subjects can perform threatening behavior despite the accompaniment of fear (Hodgson & Rachman, 1974). Second, Schachter's (1966) research indicated that affect could be defined in terms of cognitive appraisal of physiological states. From this perspective, individuals could misinterpret their physiological status, thus resulting in desynchronous measures. Only at high arousal would individuals be unlikely to misinterpret their response modes, a hypothesis that has also received empirical support (Craske & Craig, 1984; Kazdin, 1980; Marks, Marset, Boulougouris & Huson, 1971; Rachman, 1978; Watson, 1988) as well as disconfirmation (Kaloupek & Levis, 1983).

Watson (1988) reported that scales measuring positive and negative affect, often observed to be independent factors, exhibit higher (negative) intercorrelations during periods of greater emotion. Craske and Craig (1984) divided a group of pianists into high and low anxious groups and recorded self-report, behavioral and physiological measures during a performance before an audience. The high anxious group displayed increased anxiety across measures, while the low anxious group was desynchronous. Such results seem to support a fight-or-flight stress response in which organisms oriented to a threat focus all their resources on coping with the threat (Selye, 1956). Finally, Lang (1971) suggested that verbal reports of affect may be more precise than data produced by measurement of autonomic systems.
Intercorrelations between self-reports and physiological measures would be reduced because of range restriction in the latter. On the other hand, Kagan (1988) suggested it may be difficult to translate certain physiological phenomena into natural language items, and this may partially account for difficulties that respondents encounter when answering test items related to affect and physiology.
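Lang's range-restriction argument is easy to demonstrate with hypothetical data: when one of two equally valid indicators of the same state varies over only a narrow band, the observed correlation between them shrinks, even though nothing about the underlying relation has changed.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

# Hypothetical latent fear level drives both a self-report and a
# physiological measure with equal fidelity.
latent = rng.normal(0, 1, n)
self_report = latent + rng.normal(0, 0.5, n)
physio = latent + rng.normal(0, 0.5, n)

full_r = np.corrcoef(self_report, physio)[0, 1]  # about .80

# Restrict the physiological measure's effective range (e.g., floor and
# ceiling effects of the recording apparatus, or low autonomic
# reactivity in the sample).
keep = np.abs(physio) < 0.5
restricted_r = np.corrcoef(self_report[keep], physio[keep])[0, 1]
```

Under these assumptions the restricted correlation falls to roughly a third of its unrestricted value, which is Lang's point: low self-report/physiology correlations need not mean the modes are truly independent.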
Psychologists often validate latent traits by correlating behaviors with self-reports of the construct representing that behavior. But Evans (1986) believes that "psychometric principles maintain the fallacy that behaviors are 'measures' of more fundamental underlying entities" (p. 149). Self-reports and behaviors may be organized by a third mechanism: reinforcement (Evans, 1986). Responses can form a cluster because they are under the control of a reinforcement contingency. Cognitions, psychophysiological responses, and overt behaviors "all interact in mutually dependent submodes of individual repertoires, not...alternative measures of a construct from different 'modes' of behavior" (Evans, 1986, p. 152). Evans suggested that reinforcement can create stable response repertoires across modes (i.e., synchrony) and that similar groupings may be shared by different individuals. In other words, stability that has been attributed to personality traits more accurately reflects, in Evans' view, the stability of environments inhabited by individuals.
Evans (1986) noted the common finding in the psychophysiological literature that individual differences exist in the ability to detect physiological processes. For instance, some individuals may be good judges of their heart rate, while others are not. Evans indicated that this variation may be partially explained by different strategies employed to monitor physiology. In a study of the correspondence between self-report and objective measurement of penile circumference during sexual arousal, some subjects appeared to observe their tumescence, while others made judgments on the basis of their appraisal of the erotic materials (Evans, 1986). Evans (1986) also cited research indicating that some alcoholics and obese persons are poor judges of their relevant internal states.
Hodgson and Rachman (1974) offered three methodological hypotheses to explain desynchrony. First, B-A-C scales may differ in specificity, that is, the behavioral test may require a snake phobic to deal with a snake in the psychologist's office, while the cognitive self-report asks the phobic to predict behavior across a range of situations. Second, in treatment groups, a range restriction (not desynchrony) that exists at pre-test will disappear at post-test when measures diverge because of differences between treatment responders and non-responders.
----------------------------------------
Insert Figure 9 About Here
----------------------------------------
Third, change scores are affected by the Law of Initial Values (LIV), which states that physiological responses to stimuli depend upon the pre-stimulus value of the physiological system (Wilder, 1957, 1967, cited in Andreassi, 1980). As shown in Figure 9, the higher the initial level, the smaller will be the response to stimuli which increase responding and the larger the response to stimuli which decrease responding. A person with a high pulse rate should show greater change to a relaxing stimulus than a person with a moderate pulse rate. Similarly, highly phobic patients should be more easily amenable to intervention as compared to moderately phobic patients. It is also possible that the LIV applies differentially across modes. Measures of affect and physiology may change according to LIV mechanisms, while cognition and behavior may be operating under different influences. In a similar vein, Kaloupek and Levis (1983) proposed that desynchrony may partially be an artifact of the different scales employed to measure B-A-C. Changes of a certain magnitude on an affect scale may not be matched by equivalent changes on a behavioral scale, with correlations correspondingly lowered.
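A minimal linear sketch of the LIV (all values hypothetical) reproduces the predicted negative relation between baseline level and response magnitude for a stimulus that increases responding.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical pre-stimulus heart rates (beats per minute).
baseline = rng.uniform(60, 110, 200)

# Simple linear LIV model: the response to an arousing stimulus shrinks
# as the baseline approaches an assumed physiological ceiling.
ceiling = 130.0
response = 0.4 * (ceiling - baseline) + rng.normal(0, 2, 200)

# Higher initial values produce smaller increases, so the
# baseline-response correlation is strongly negative.
r = np.corrcoef(baseline, response)[0, 1]
```

For a stimulus that decreases responding, the same model with the sign of the change reversed yields the complementary prediction: higher baselines produce larger decreases.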
Desynchrony may also occur because it presents individuals with significant opportunities for learning and adapting to their environment. Because environmental information may be differentially processed and acted upon, independence of modes expands the degrees of freedom afforded individuals. A child moderately frightened by a harmless pet cat can talk her- or himself into touching the cat. An adult mildly obsessed with a thought can often stop the obsession by engaging in some behavior. A behavior may be delayed or modified while an individual considers its consequences.
Psychological interventions. Desynchrony poses problems for theorists who devise psychological interventions and for practitioners who implement them. In the past, psychologists examining intervention effects have often assumed that treatments cause uniform effects across modes. Subsequent measurement has often focused on only one mode; misleading results occur, however, if intervention effects differ by mode. Even when multiple modes have been measured, they often have not correlated, thus raising questions about the construct validity of the scales and their constructs.

In psychotherapy outcome research, investigators attempt to change behavior, affect and cognition as part of a therapeutic intervention (Bergin & Lambert, 1978). The effects of the intervention are assessed by measures of these three modes. The common desynchrony problem encountered in such research is that individuals often change on one, but not all, of the measures, and sometimes in unexpected directions. Because the type of change produced by different forms of psychotherapy is difficult to predict, researchers typically include scales that assess all relevant domains (Bergin & Lambert, 1978). Although this enables researchers to describe the varying changes produced by therapy, it does not explain them.
In a study of desynchrony in 21 female agoraphobics, Craske, Sanderson, and Barlow (1987) found that high heart rate was strongly related to positive treatment outcome. Heart rate was associated in these patients with a willingness to approach a feared situation. Craske et al. interpreted these results as indicating the importance of patients' willingness to tolerate intense physical sensations in treatment success. It is also possible, however, the patients employed awareness of their heart rate as concrete evidence of a phobia that they wanted to overcome. Rather than simply tolerating the physiological signs, these patients may have used those signs as motivation to change.
Assumptions of synchrony by psychological intervenors have also been found to influence clinical judgment. Evans (1986) reported a study by Stark (1981) which examined staff ratings of the behavior of an autistic adolescent girl. Staff at the facility treating the girl reported that she frequently had off days in which they believed all aspects of her performance deteriorated and during which they believed it would be unsuitable to provide her with the usual educational programming. Stark found that staff judgments of off days were correlated only with the frequency of echolalic speech (i.e., involuntary repetition of speech said by others), but not with other measures of task performance. The professional staff, then, made the mistake of overgeneralizing from the speech problems to other modes: they assumed a consistency that did not exist. Chapter 3's review will show that this is a common problem with interviewers and raters.
Summary and Implications
Consistency assumptions have led psychologists to pursue such questions as: Are psychological dimensions common to all individuals? A better question might be: For what purposes are nomothetic tests best? One answer is that nomothetic tests make the most sense for selection purposes, where few resources are available for training or intervention. Idiographic assessment seems to fit better in intervention contexts with more resources to spend on individual data-gathering.
In general, few resources are available in most testing situations, and the result has been a reliance on economical, nomothetic self-reports. The major question surrounding self-reports has been: What would cause individuals to respond inconsistently? A categorization of systematic response errors indicates that individuals employ item response strategies influenced by such variables as cognitive, affective/motivational, and behavioral/environmental factors.
----------------------------------------
Insert Figure 10 About Here
----------------------------------------
As shown in the decision tree in Figure 10, the possibility of such variables creating generative response strategies depends upon the degree of match between the test and the test-takers' cognitive, affective, and behavioral states and traits. If test-takers' reading levels preclude them from fully understanding items, for example, it is reasonable to assume that they may guess on such items. To the extent that test-takers' cognitive traits differ from test requirements and purposes--because of language or cultural differences, lack of education and cognitive skills, or insufficient knowledge--even objective tests may function like projective ones. That is, when the mismatch between test items and respondents' cognitive characteristics is sufficiently large, objective test items become ambiguous stimuli to these respondents, with the result being idiosyncratic associative responses. Such responses are desired in projective tests and error in objective tests.
Even when a cognitive match exists, however, test-takers' affective characteristics may not. Unmotivated individuals may respond randomly or employ a response style such as acquiescence; highly motivated test-takers may fake good or bad or employ a response set such as socially desirable responding. Finally, factors in the test environment, such as the presence of observers and test administrators as well as their characteristics, may influence responding. Even respondents who have appropriate motivation (e.g., few concerns about doing well or behaving properly) may restrict or alter their responses or behaviors until they become accustomed to the unusual aspects of the environment.
In general, the larger the number and degree of mismatches, the greater will be the use of generative response strategies. Correspondingly, more generative responses mean lower validity for the test's designed purpose. If mismatches occur, it may be possible to intervene cognitively (e.g., explaining difficult words to a less-educated test-taker), affectively (e.g., increasing motivation for valid responding by informing test-takers that their responses will be checked for accuracy), or behaviorally (e.g., observing unobtrusively or allowing time for adaptation to the observer or administrator). Alternative methods, such as the interview, may be useful for determining mismatch causes or implementing an intervention. For example, listening skills are designed to increase a client's trust in the counselor, with one intent being that the client self-disclose more accurate information as the relationship develops (cf. Egan, 1990). Finally, it is also possible that the type of generative response style (e.g., socially desirable responding) may be indicative of a specific type of mismatch (e.g., affective/motivational errors).
----------------------------------------
Insert Table 5 About Here
----------------------------------------
The sequence of answers to the questions in Figure 10 also has implications for desynchrony. As shown in Table 5, matches and mismatches of tests and respondents can be examined in the cognitive, affective, and behavioral domains. Mismatches within any one mode mean that desynchrony is likely to be evident between that and any other mode. Synchrony can occur only when matches occur first within and then between modes. One solution to desynchrony has been to aggregate scores across modes, but this strategy may produce misleading data if synchrony and desynchrony are the result of systematic and not random errors. If theorists can specify when synchrony and desynchrony should occur in psychological and psychophysiological phenomena, more precise and valid measurement should result.
Desynchrony and systematic errors may be recurring problems partially because test developers and researchers have been guided by assumptions of consistency. Theorists working without consistency assumptions may be more likely to consider the roles and interrelations of cognitive, affective, and behavioral variables when they specify how constructs should be measured. From an idiographic perspective, systematic errors occur when the test developer assumes consistency across individuals, that is, when the test attempts to measure a cognitive, affective, or behavioral characteristic that either is not present or cannot be accessed by an individual. Nomothetically-inspired items and tasks will thus lead to mismatches in many individuals and the use of generative response strategies.