Chapter 7

Integration

The Implications of Measurement History
Constructing Tests
Measurement for Selection
Criterion-referenced and norm-referenced tests
Measurement for Explanatory Theory-Building
Where do test items come from?
Response process validity
Why are items dropped?
Measurement for Intervention
Model building and assessment
Behavioral assessment
Precision
Item and task selection
Change-based measurement
Criteria for selection of intervention items
Administering Tests
Selection
Theory-Building
Item response strategies
The effects of non-test events and interventions
Active measurement
Studying temporal inconsistency:
Psychological states
Intervention
Scoring Tests
Selection
Theory-Building
Intervention
Interpreting Tests
Selection
Theory-Building
Construct explication
Precision
Desynchrony and construct explication
Desynchrony implications for explanations and predictions
Construct explication and phenomenon-data distance
Intervention
Summary and Implications

The major concern of validity, as of science more generally, is not to explain any single isolated event, behavior, or item response, because these almost certainly reflect a confounding of multiple determinants. Rather, the intent is to account for consistency in behaviors or item responses, which frequently reflects distinguishable determinants. In contrast with treating the item responses or behaviors in question separately as a conglomeration of specifics, these behavioral and response consistencies are typically summarized in the form of total scores or subscores. We thus move from the level of discrete behaviors or isolated observations to the level of measurement. (Messick, 1989a, p. 14)

Tests are used to make decisions. (Murphy & Davidshofer, 1988, p. xi)

Counseling psychology research has almost exclusively been, and still is, individual differences research. (Dawis, 1992, p. 13)

The Implications of Measurement History

The above quotations describe several of the most important current assumptions resulting from the evolution of psychological measurement and assessment. Messick indicated, for example, that measurement consists of the aggregation of items. Little attention is given to the possibility that consistency may be found at a level other than that of the aggregated total. The error that occurs at the level of the single item is considered random and unavailable to investigation. Loevinger (1957) noted the prevailing belief in the futility of single-item analysis when she wrote that "the many sources and many meanings of every response induce scepticism about the value of searching for rigorously structured, pure, unidimensional patterns" (p. 644).

Danziger (1990) suggested that during the historical development of psychology, certain methods became the only acceptable alternatives available to investigators. The decision to select these methods, Danziger noted, was not necessarily rational and was certainly influenced by political forces within the profession. From the historical review provided in previous chapters, psychological measurement and assessment can be seen to have been primarily influenced by administrative, selection purposes. Psychologists have underappreciated the role that selection pressures and procedures have played in measurement and assessment. One can frame the purpose of tests in terms of cost-effective decision-making, but such a perspective can also obscure and complicate the use of tests for other purposes.

Tests constructed for selection purposes now dominate theoretical and intervention research in all of soft psychology. I would interpret Dawis' statement to mean that psychological researchers, whatever their purpose, employ tests designed to measure stable traits that distinguish between individuals. Below I will argue in greater depth that psychologists' purposes should influence the design strategy of their measurement devices. For example, rather than employing tests containing items measuring stable traits, researchers interested in evaluating the efficacy of psychological interventions might more appropriately construct tests on the basis of items' ability to distinguish between groups of individuals exhibiting treatment success and failure.

----------------------------------------
Insert Figure 32 About Here
----------------------------------------

As illustrated in Figure 32, the procedures of selection testing have exerted a dominant influence on the four major components of testing: test construction, administration, scoring, and interpretation. Although exceptions can be found, in the typical psychological test:

(a) test items have been selected and evaluated with only a loose connection to psychological theory;

(b) tests have been designed to be administered economically, which often translates into quickly and in large groups;

(c) test items are aggregated to produce summary scores, thus minimizing the "errors" of individuals; and

(d) test scores are interpreted in terms of some placement decision (e.g., should this person be classified as mentally retarded? admitted to this school?), with little attention to how scores could provide information about intervention.

A test's validity must be evaluated in the context of the test's purpose (Cronbach & Gleser, 1965). In addition to selection, tests may be employed for the purposes of intervention and theory-building. One can certainly employ tests for selection into an intervention: for example, a test might be useful in determining degree of psychopathology and thus deciding whether an individual requires an intervention (e.g., Hayes et al., 1987). One could also employ a test to determine which type of intervention would be most effective. However, tests might also be used to develop a model or theory of the target of intervention, whether individuals, groups, or organizations. Finally, selection and intervention testing would certainly be applicable to theory-building, but researchers also seek data for the purpose of accumulating explanatory knowledge independent of any immediate application.

Below I discuss four components of testing--construction, administration, scoring, and interpretation--within the context of three major test purposes: selection, intervention, and theory-building.

Constructing Tests

The first step in test construction is to decide on the form and the content of the test. In traditional measurement, this involves creation of a pool of items and selection of items from that pool. In behavioral assessment, the "test" is constructed by selecting behaviors to be targeted by the intervention. Other types of testing, such as performance appraisal, focus on selecting appropriate tasks for testing.

Measurement for Selection

Selection tests are typically used for the purpose of classification and subsequent prediction. Martin (1988) indicated that such measurement focuses on broad, relatively stable traits. Under this heading, Martin (1988) included testing for the purposes of screening (i.e., isolating a subgroup for further assessment or intervention), differentiating between normal and abnormal behavior, differential diagnosis (i.e., classification into one type of psychopathology), and delineating normal individual differences. Again, psychologists historically have been interested in creating measures that show differences among individuals in order to classify them into academic, occupational, or therapeutic groups. For example, Hathaway and McKinley originally developed the MMPI to function as a substitute for the individual interview in assigning psychiatric diagnoses (Graham, 1990). They selected MMPI items on the basis of their ability to discriminate between normals and hospital patients who possessed some type of psychiatric illness as indicated by clinical judgment (Graham, 1990). However, researchers found that the MMPI could not adequately fulfill its diagnostic role in that individuals who scored highly on a scale matching their initial diagnoses (e.g., depression) also tended to score highly on other clinical scales (Graham, 1990). Graham (1990) suggested that instead of seeking diagnoses, MMPI users came to rely on profiles of MMPI scores that were associated with specific behaviors and traits. But because of the limited reliability of subscales and the difficulties clinicians encounter in applying complex decision rules reliably, the validity of profiles used in classification and prediction tasks remains controversial (Murphy & Davidshofer, 1988).

Given that cost remains the major element in psychological measurement and assessment (Martin, 1988), selection tests for the military, business, and education have usually been empirically developed. That is, large numbers of items are written and administered to a representative sample. Items are then evaluated against a criterion such as job performance or grades. The time-consuming task of developing items and tasks which follow from a theoretical construct or a task analysis is avoided. Such empirical procedures represent the least expensive method of test construction. The source of items is relatively unimportant, as long as the initial pool is relatively large and a criterion to correlate with those items is available.

If a theory is employed in selection test development, it is likely to be trait-based. The construct in question is assumed to be stable and consistent over time; test developers seek items which show differences among individuals. The testing format is likely to be the most economical available (i.e., self-report), unless differences between methods show a significant improvement for a more costly procedure (as is the case with cognitive ability testing versus self-report of ability; see Hunter & Hunter, 1984).

Criterion-referenced and norm-referenced tests. In criterion-referenced tests, scores are compared to some absolute measure of behavior, a criterion; norm-referenced scores are compared among individuals (Glaser, 1963). Gronlund (1988) indicated that developers of norm-referenced tests seek items with the greatest possible variability. With achievement tests, these items are pursued through a selection process which retains items of average difficulty; easy and hard items are likely to be discarded. Aggregation of such items increases the possibility of making valid distinctions among individuals. With criterion-referenced tests, however, items are retained because of their relation to a criterion, regardless of the frequencies of correct or incorrect responses. Since maximizing the similarity between test and criterion tends to increase predictive validity (e.g., Cronbach & Gleser, 1965; Danziger, 1990; Paul, 1986; Wiggins, 1973), selection tests might be improved by following guidelines provided for criterion-referenced tests (and, interestingly, behavioral assessment), where item selection is based upon performance objectives or criteria the test is designed to measure (Swezey, 1981).

Swezey (1981) observed that criterion-referenced tests have received widespread attention only since the 1960s. A plausible guess is that cost played a major role in delaying the development of this approach. Criterion-referenced tests cost more than norm-referenced tests because the former (a) require considerable effort in the analysis and definition of the performance criteria to be measured and (b) may necessitate special facilities and equipment beyond self-report materials. If one is interested in predicting performance on a criterion--the major purpose of selection testing--then criterion-referenced approaches would seem a logical choice. However, norm-referenced testing has been the predominant approach in selection testing. Besides their lower cost, norm-referenced tests also seem more applicable when the test administrator desires to select some portion of a group (e.g., the top 10% of applicants) rather than all applicants who could successfully perform a function. Thus, norm-referenced tests are useful in selection situations where individuals are chosen partially on the basis of scarce resources. Suppose you conduct a research study and find that 95% of all graduate students who score 600 or above on the GRE Verbal scale are able to pass all required graduate school courses. From the perspective of criterion-referenced testing, everyone scoring 600 or above should be admitted. In many graduate departments, however, that would mean admitting more students than the available courses, instructors, and financial support could accommodate. Such a situation certainly occurs in other educational, occupational, and clinical settings with fixed quotas. Norm-referenced testing, then, provides a solution: admit the top scorers, up to the number that available resources allow.
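
The contrast between the two selection logics can be made concrete with a small illustration. The following sketch, written in Python with invented applicant scores and an arbitrary slot count, compares a criterion-referenced rule (admit everyone at or above the cutoff associated with success) with a norm-referenced rule (admit only as many top scorers as resources allow); it illustrates the reasoning above rather than a procedure drawn from the cited sources.

    # Hypothetical applicant pool and resources.
    scores = {"A": 720, "B": 610, "C": 600, "D": 580, "E": 650, "F": 590}
    cutoff = 600   # criterion: score at which nearly all students pass required courses
    slots = 2      # scarce resource: number of openings

    # Criterion-referenced rule: admit everyone who meets the standard.
    criterion_admits = [name for name, score in scores.items() if score >= cutoff]

    # Norm-referenced rule: rank applicants and admit only as many as resources allow.
    ranked = sorted(scores, key=scores.get, reverse=True)
    norm_admits = ranked[:slots]

    print(criterion_admits)   # ['A', 'B', 'C', 'E'] -- may exceed available slots
    print(norm_admits)        # ['A', 'E'] -- fits the resources, ignores the absolute standard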

To the extent that additional resources become available to test developers, selection testers are likely to increase their use of criterion-referenced procedures. Descriptions of the development of criterion-referenced tests (Gronlund, 1988; Swezey, 1981) typically include the following steps:

(a) Task analyses of the criteria must be performed to create operationally defined objectives. Objectives include performances (what the test-taker knows or does), conditions (the testing situation), and standards (the level of satisfactory performance). Popham (1978) discussed this specification in terms of stimulus attributes (the material presented to the test-taker) and response attributes (the test-taker's selected or constructed responses). Descriptions of specifications for criterion-referenced items resemble, in spirit if not in detail, those contained in behavioral assessment manuals, structured interviews, and process research manuals. All are time-consuming.

(b) Items must be planned in terms of cost (time and personnel constraints) and fidelity (realism of the items as compared to the performance criteria). Popham (1993) recommended multiple methods and definitions for criterion-referenced items to avoid test-specific instruction (i.e., teaching only material relevant to the subsequent test).

(c) An item pool must be created, with twice as many items as will eventually be needed.

(d) Final items must be selected. The item pool should be administered in a pilot study to a group of mastery and non-mastery individuals to determine the extent to which the items can discriminate between the two groups. Content experts may review the items, and the items should be evaluated against psychometric standards of reliability and validity. Gronlund (1988; Kryspin & Feldhusen, 1974) suggested the following formula for evaluating an item's sensitivity to instructional effects:

S = (Ra - Rb) / T

where Ra is the number of test-takers correctly responding to an item after instruction, Rb is the number correctly responding before instruction, and T is the total number of test-takers at a single administration. Values near 1 indicate items more sensitive to instruction. Gronlund (1988, p. 148) also described a method for computing the reliability of a criterion-referenced test.
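
As a worked illustration of this sensitivity index, the short Python sketch below applies the formula S = (Ra - Rb) / T to invented counts; the numbers are hypothetical and serve only to show the computation.

    def item_sensitivity(correct_after, correct_before, total_testtakers):
        """Sensitivity to instructional effects: S = (Ra - Rb) / T."""
        return (correct_after - correct_before) / total_testtakers

    # Hypothetical item piloted with 40 test-takers:
    # 12 answered it correctly before instruction, 34 answered it correctly afterward.
    print(item_sensitivity(34, 12, 40))   # 0.55 -- fairly sensitive to instruction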

During item development, Swezey (1981) emphasized the importance of precisely specifying test objectives. Criteria can be described in terms of variables such as quality, quantity, product or process, time to complete, number of errors, precision, and rate (Gronlund, 1988; Swezey, 1981). A criterion may be a product such as "student correctly completes 10 mathematics problems"; a process criterion would be "student completes division problems in the proper sequence." If the test objective is a product, then product measurement is appropriate. Process measurement is useful when diagnostic information is required, when the product always follows from the process, and when product data is difficult to obtain. An adequate specification describes which of these components are included and excluded as part of the task. Criterion-referenced tests should be reliable and valid to the extent that performances, testing conditions, and standards are precisely specified in relation to the criteria. For example, Swezey (1981) preferred "within 5 minutes" to "under normal time conditions" as a precise testing standard. In some respects, the criterion-referenced approach represents a move away from a search for general laws and toward a specification of the meaning of test scores in terms of important measurement facets. Discussing test validity, Wiley (1991) presented a similar theme when he wrote that the labelling of a test ought to be "sufficiently precise to allow the separation of components of invalidity from valid variations in performance" (p. 86). Swezey's and Wiley's statements indicate the field's increasing emphasis on construct explication.

Measurement for Explanatory Theory-Building

Where do test items come from? How do test developers create and select items and tasks for their measurement devices? Golden, Sawicki, and Franzen (1990) listed three sources: theory, nomination by experts, and other tests. Golden et al. (1990) indicated that whatever their source, items initially are selected on the basis of their face validity. That is, the items appear to measure what they are intended to measure. When test developers employ a rational or theory-based approach, item selection stops when theoretically-relevant items are written. Most test developers, however, typically administer those items and score the responses according to empirical strategies. The initial item pool, which Golden et al. (1990) recommended should be two to four times the final number of desired items, is fitted empirically against a psychometric model. Test developers typically search for consistent item responses which consistently predict some criterion. Consistency is often evaluated by selecting only items (a) that are highly correlated with the total score (i.e., internal consistency), and (b) that contribute to total scores highly correlated with a second administration of the same test with the same subjects (i.e., test-retest reliability).
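
A minimal sketch of this empirical winnowing appears below, assuming hypothetical response matrices from two administrations of the same items; the corrected item-total and retest correlation cutoffs (.30 and .50) are arbitrary illustrative values, not standards taken from the sources cited above.

    import numpy as np

    def retained_items(time1, time2, r_itotal_min=0.30, r_retest_min=0.50):
        """Flag items that pass two conventional consistency screens.

        time1, time2: persons x items response matrices from two administrations.
        """
        keep = []
        for j in range(time1.shape[1]):
            rest = np.delete(time1, j, axis=1).sum(axis=1)           # total score without item j
            r_itotal = np.corrcoef(time1[:, j], rest)[0, 1]          # corrected item-total correlation
            r_retest = np.corrcoef(time1[:, j], time2[:, j])[0, 1]   # item-level retest correlation
            if r_itotal >= r_itotal_min and r_retest >= r_retest_min:
                keep.append(j)
        return keep

    # Hypothetical data: ten items sharing one common factor, measured twice.
    rng = np.random.default_rng(0)
    factor = rng.normal(size=(50, 1))
    time1 = factor + rng.normal(scale=1.0, size=(50, 10))
    time2 = time1 + rng.normal(scale=0.5, size=(50, 10))   # retest with added noise
    print(retained_items(time1, time2))                    # indices of retained items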

In another approach, items may first be subjected to factor analyses or other "complex analyses aimed at revealing the truth" (Burisch, 1984, p. 215). That is, the data are transformed statistically in an attempt to reveal a true, latent value, which theoretically should then demonstrate better psychometric properties (e.g., higher reliability and greater predictive validity). Next, scores which predict a criterion are retained. A more rigorous evaluation is passed when those same scores maintain some degree of predictability across samples and occasions. This cross-validation, however, is rare (Schwab & Oliver, 1974, cited in Murphy & Davidshofer, 1988).

Given its importance in the construct explication process, item selection for most psychological tests remains surprisingly loose and disorganized. Loevinger (1957) stated there exists an:

...idiosyncratic, nonreproducible process, the process by which the given investigator or group of investigators constructs or selects items to represent that content. Although this process, the constitution of the pool of items, is a universal step in test construction, it has not been adequately recognized by test theory. But this step in test construction is crucial for the question of whether evidence for the validity of the test will also be evidence for the validity of a construct. (p. 658) [italics added]

Similarly, Lessler, Tourangeau, and Salter (1989) maintained that questionnaire design "remains essentially an art" (p. 1). Some measurement psychologists have offered item selection guidelines which attempt to integrate the theoretical and statistical strategies described above (e.g., Jackson, 1970). Little consensus, however, exists about a best method. In fact, Burisch's review of the literature (1984; Goldberg, 1972; see Murphy & Davidshofer, 1988, and Paunonen & Jackson, 1985, for opposing views) found that few differences existed in the psychometric properties of scales produced by different test construction methods.

Although selection procedures have dominated measurement, I propose that test construction procedures should follow from the intended purpose of the test. Test developers should not default to traditional approaches, that is, searching for consistent items that contribute to aggregated scores and differentiate among individuals in linearly predicting a criterion. Traditional approaches perform well for selection tests, but different strategies are likely to be necessary for tests of explanatory theory-building and for tests which assess the effects of interventions.

Response process validity. Response process validity (RPV) refers to knowledge about the processes employed by individuals when completing psychological tests and tasks. RPV would seem a critical component of construct validity. To the extent that we understand item and task response processes, we should understand what a test measures. RPV applies equally to tests and criteria and ultimately should have implications for the possibility of increasing correlations between any two operations. Embretson (1985) maintained that "understanding the nature of the relationship, as opposed to just its magnitude, puts test developers in a better position to increase validity" (p. 286).

What would constitute a RPV research program? A frequent criticism of single subject research is that the idiosyncrasies of any particular individual will interfere with the ability to generalize to all persons. However, if your purpose is to investigate error, then the single subject (as well as the single item) is exactly the place to start. Qualitative approaches, such as naturalistic observation, interviews, and protocol analysis, should provide information about the processes employed during the production of task and item response. Messick (1989b) noted that a test's construct validity can be ascertained through probes of "the ways in which individuals cope with the items or tasks, in an effort to illuminate the processes underlying item response and task performance" (p. 6). Similarly, Cronbach (1989) suggested that one could learn about construct validity by administering a test to individuals who think aloud as they answer the items.

In N of 1 studies (Barlow & Hersen, 1984; Gentile, 1982), a baseline period (A) is followed by an intervention (B). Next, the intervention is removed during a second baseline period (A); finally, the intervention is instituted for a second time (B). In this ABAB design, the targeted behavior should be stable during the first baseline; behavior should change during the intervention and then revert to previous levels once the intervention is removed; when the intervention is again implemented, behavior should again change. The sequencing of repeated baseline and intervention periods helps to establish the intervention's causal effects on the desired behavioral change.
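
The logic of the ABAB comparison can be sketched in a few lines of Python. The example below uses invented daily ratings, averages the observations within each phase, and checks whether the B phases show the expected drop relative to the adjacent A phases; it illustrates the design's reasoning rather than any analysis recommended by Barlow and Hersen (1984) or Gentile (1982).

    # Hypothetical daily observer ratings (higher = more of the target behavior).
    phases = {
        "A1": [7, 8, 7, 8, 7],   # first baseline
        "B1": [6, 5, 4, 4, 3],   # first intervention period
        "A2": [4, 5, 6, 6, 7],   # second baseline (intervention withdrawn)
        "B2": [5, 4, 3, 3, 2],   # second intervention period
    }

    means = {phase: sum(ratings) / len(ratings) for phase, ratings in phases.items()}
    print(means)

    # Expected ABAB pattern for a behavior the intervention should reduce:
    # drop from A1 to B1, return toward baseline in A2, drop again in B2.
    consistent = (means["B1"] < means["A1"]
                  and means["A2"] > means["B1"]
                  and means["B2"] < means["A2"])
    print("Pattern consistent with an intervention effect:", consistent)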

----------------------------------------
Insert Figure 33 About Here
----------------------------------------

This type of N of 1 research may allow investigation of a test's characteristics as well as its effects as an intervention. Suppose that you were interested in understanding the processes by which self-monitoring (i.e., data recording by a client) affects behavior (Nelson, 1977a, 1977b), which in this example is the level of anxiety displayed by a highly anxious client. As shown in Figure 33, during a baseline period of one week, an observer could unobtrusively rate the individual's level of anxiety each day; the observer rating would continue during the remaining components as the dependent variable. The intervention might then consist of hourly self-monitoring of anxiety by the client for a one-week period; the observer ratings would continue during this period also. The self-monitoring would cease for the second baseline period and then be re-implemented for the final intervention. If the self-monitoring functioned as an effective intervention, we would expect to observe the following sequence of observer anxiety ratings: (a) during the first baseline, stable, high observer ratings of anxiety; (b) during the intervention, a gradual decrease in observed anxiety; (c) during the second baseline, a gradual increase in observed anxiety; and (d) during the second intervention, a decrease in observed anxiety.

To increase our knowledge about the self-monitoring procedure, it would be useful to (a) ask the client to talk about how she or he is gauging anxiety while making the self-reports and (b) explore reasons (during and after the research) for the synchrony or desynchrony between the self- and observer ratings (particularly if the self- and observer ratings were made blind to each other). Inquiry about how the client assesses anxiety might occur during the self-monitoring period or at the end of an intervention period (if one wished to avoid contaminating the effects of self-monitoring alone). Comparison of reports about self- and observer ratings might reveal similarities and differences in the bases on which those ratings were constructed. For example, it might be found that the observer rated anxiety on the basis of the client's facial expressions, while the client rated on the basis of physical sensations such as sweating or upset stomach. In Figure 33, the observer's ratings of anxiety are always lower than the client's. Such differences might occur, for example, if the client tended not to be very facially expressive and the observer had no way of detecting the physiological conditions of the client.

Research with large numbers of subjects assumes that some of the variance contributed by one individual is uncorrelated with variance contributed by another individual (Martin, 1988). But investigation into such problems as self-report and rater error suggests that individuals may share considerable error sources. Expansion of the investigation from a single subject to include additional persons (and items) is likely to lead to greater understanding of the etiology of systematic errors. Measurement psychologists who believe that useful research necessitates large numbers of subjects should note that Spearman employed ability data he collected from children at a small village school to develop the concepts and procedures of test error, correction for attenuation, g, and correlational methods.

Following such small-scale studies, experimental designs are likely to be useful. Research with relatively small groups provides an opportunity to experiment with interventions designed to reduce systematic errors. Cronbach (1957) and Campbell and Fiske (1959) both hoped that their work in construct validity would encourage such measurement experimentation. Cronbach (1957), for example, cited work by Fleishman and Hempel (1954) that found that practice in motor skills changed the factor structure of those skills. Cronbach (1957) maintained that Fleishman and Hempel's efforts "force upon us a theory which treats abilities as a product of learning, and a theory of learning in which previously acquired abilities play a major role" (p. 676). For the most part, however, experimental approaches have yet to be integrated into psychological measurement (cf. Tryon, 1991).

Why are items dropped? The focus in traditional test construction is on finding items which meet the consistency-predictive validity model. Items which do not fit this model are dropped from the test. Little or no study is devoted to understanding why items do not meet test criteria. Test developers commonly assume that dropped items are poorly written, misunderstood by respondents, or fail to tap into the desired construct. In fact, developers seldom know why items fail to be retained.

Loevinger (1957) made a strong case for examining response processes in individual items. Although the empirical selection approaches that dominated at the time advocated against it, Loevinger (1957) maintained that "if theory is fully to profit from test construction as a part of psychology, every item included in a scoring key must be accounted for; a less strong case can be made for explaining the exclusion of items" (p. 657). Given the fact that more items are dropped than retained during item selection (Golden et al., 1990), the typical test development process may provide information about measurement theory as well as the substantive construct(s) under investigation. If the items are not theoretically based, as with empirical selection approaches, rejection of items has little potential import. But for the purpose of deepening measurement theory, it would be very useful to know why items are retained and deleted. If the items and tasks selected for the initial pool are theoretically based, they should have an equal probability of being selected for the test; an item which fails to meet inclusion criteria has the potential to inform theory.

Assume that a set of items measures a particular trait. If a subset of those items fails to show temporal stability, it would be useful to compare dropped and retained items on such methodological factors as item length and the comprehensibility of items to subjects. It would also be useful to examine item content for state material and to compare the internal consistency of the temporally stable and unstable items. The unstable items may in fact be internally consistent, as is the case with the State form of the State-Trait Anxiety Inventory (Spielberger et al., 1970).

In the context of developing trait-based tests, Loevinger (1957) proposed two principles to guide this type of item selection. These principles are based on the idea that structuring the initial item pool allows testing hypotheses about the trait to be measured. First, a set of items should be drawn from a pool broader than the intended trait. Second, the items should sample all alternative theories of the trait. This is rarely performed in contemporary item selection because any contrasting of competing hypotheses occurs at the level of aggregate scores, not items. Both principles are similar to Campbell and Fiske's (1959) proposal for discriminant validity, although their idea pertained to scales and not individual items per se. In Loevinger's (1957) procedure, once the theoretical item pool has been created, items are selected for the trait scale on the basis of tests of structural validity. She suggested that if theory proposed a certain relation between two types of non-test behavior, this relation should also be evident between test items measuring the behavior. Loevinger (1957) noted that aggregating items into total scores would obscure such interitem relations. It remains questionable, however, whether psychological test items have the capacity to function as stand-alone observations; individual items administered repeatedly and then aggregated might possess more utility. Loevinger (1957) believed that interitem correlations would not be subject to the problems associated with individual item response.

If a test developer can identify and control the important factors that influence item response, the developer should be able to reproduce the results of any item selection process. For example, if the test developer creates, based on theory, an initial pool of 100 items, subsequent item analysis frequently results in the selection of 30 or so items. If the processes by which subjects respond to those 100 items are well understood, then the developer should be able to replicate the item selection process. This procedure is not followed in contemporary item development: efforts to cross-validate (i.e., check predictive validity estimates in a new sample) often remain at the level of the total score, not individual items. The failure to demonstrate a link between theory, item creation, and item selection, however, weakens the construct explication process and conclusions about construct validity.

Measurement for Intervention

Why conduct measurements and assessments as part of interventions? Given that one can have no knowledge of the success or failure of the intervention without some type of measurement, the answer seems obvious. But many practitioners who provide counseling and psychotherapy, for example, eschew formal tests, substituting their own or the client's qualitative judgment to determine when the process is complete. This practice is understandable in that it is typically the client's (or a significant other's) judgment that initiated the intervention; such judgment is also inexpensive. Cost is important for all testing purposes, but particularly so in interventions.

The problem with this logic, of course, is that under the circumstances described in Chapters 2 and 3, clinicians and clients may be unable to produce valid judgments about treatment efficacy. Control groups consisting of persons who do not receive a treatment are typically included in research designed to test the efficacy of psychological interventions. Of relevance to this discussion are placebo control groups where individuals are led to believe they are receiving an effective treatment when, in fact, they are not. The placebo control group is designed to assess the effects of increasing individuals' expectancies for improvement on actual behavioral change. Placebos have been found to be as effective as other psychological interventions in such areas as smoking cessation, test anxiety, and speech anxiety (Heppner et al., 1992).

Model building and assessment. Three categories of intervention-related assessment can be described. In the first, practitioners employ a trial-and-error (eclectic) approach to intervention where they experiment with treatments until one works. Assessment is relatively unimportant in this process; to have a successful outcome, one simply needs the client to report feeling (or thinking or behaving) better. The second option, typified by Lazarus' (1973, 1981) multimodal approach, can be characterized as the reverse of the first in that everything is assessed. In his BASIC ID procedure, the intervenor continually assesses Behavior, Affect, Sensation, Imagery, Cognition, Interpersonal relationships, and Drugs/biological influences.

A third alternative, assessment of variables in a causal model of the client (Maloney & Ward, 1976), suggests that intervenors conduct assessments with no more precision (and accompanying cost) than necessary. In an advanced science, an intervention decision could be made on the basis of decision rules involving test scores. In the clinical realm, for example, a certain profile on a psychological test such as the MMPI-2 might automatically result in the assignment of that individual to a fixed treatment (Cronbach & Gleser, 1965). At present, however, test results generally cannot be employed to make such decisions in psychological interventions (Cronbach & Gleser, 1965; Paul, Mariotto, & Redfield, 1986a). Instead, an adaptive treatment is used where test results and other sources of information provide data for the formation of hypotheses about the causes of the problem, which in turn guide the selection of an intervention. To the extent that the intervention is unsuccessful, hypotheses are revised and a modified intervention implemented. The process is repeated until the intervention concludes in success or failure. This type of model-building for intervention is very similar to that of explanatory theory-building, but on a smaller scale. Tests employed in this manner may be termed diagnostic and formative (Bloom et al., 1971).

An integration of different approaches to this model-building assessment suggests the following steps:

(a) select multiple constructs, based on a causal model. Multiple operations and methods should also be included when possible;

(b) pre-plan observations. While this step is assumed in theoretical research and in selection, it is often not included as part of intuitive clinical assessment;

(c) attempt to disconfirm and modify indicators of constructs and hypotheses of the model. Initial indicators of a construct may not be the only or best operations. Similarly, initial hypotheses are likely to be revised with additional data;

and (d) apply such measurement rules as immediate recording of data and standardization of stimuli.

In the clinical realm, Maloney and Ward (1976; also see Goldman, 1971) discussed these procedures in terms of conceptual validity, that is, evaluating the usefulness of constructs relating to a client's functioning. Psychologists use assessments to test hypotheses and constructs about individuals instead of about tests or theories (Groth-Marnat, 1990). Ongoing assessment enables the intervenor to test, disconfirm, and modify hypotheses about the causes of client distress. For example, a clinician might work with a pre-med college student who presents with severe test anxiety. The clinician administers an intelligence test and the MMPI to the student and finds evidence of average verbal ability and high depression. The working hypothesis, then, might be that the combination of depression and average ability is impairing the student's ability to adequately prepare for and perform on difficult tests. Given this hypothesis, it would make sense to attempt to alleviate the depression and determine if anxiety subsequently decreased and test performance improved. If anxiety remained high, however, the clinician should begin to explore and test alternative hypotheses about the causes of the student's anxiety. In the absence of general scientific laws, such as dependable ATIs, valid assessment of factors in a causal model is necessary to adjust treatment (Cronbach & Gleser, 1965; Licht, Paul & Power, 1986).

With more difficult problems in individuals, groups, and organizations, intervenors must develop more complex causal models that require precise--and more expensive--measurement. The need for more precise measurement and assessment is likely to depend upon two factors:

(a) the knowledge base of the science. If a science is new, then relatively little knowledge will be available to guide interventions. Thus, intervenors will frequently need to develop causal models and measure variables in the model;

and (b) the knowledge base of the individual intervenor or intervention team. The more knowledge held by the intervenor, the more automatic will be the application of knowledge to the intervention. Novices particularly require the skills of model-building and intervention-related assessment.

Behavioral assessment. Behavioral assessors believe that idiographic measures are more sensitive to behavior change in individuals (e.g., Cone, 1988; Hartman, 1984). Because no individual is likely to display all or just the right combination (i.e., a prototype) of the indicators of a construct, idiographic measures should better indicate change than nomothetic devices. In other words, unless a nomothetic measure has sampled the universe of a construct's indicators, it is unlikely to function with a specific individual as well as an idiographic measure. Bellack and Hersen (1988) suggested that idiographic approaches can aid in determining treatment targets and environmental contingencies. For example, three clients might define anxiety as the amount of hair-pulling, sweating, and self-reported tension, respectively. A therapist might observe or teach a client to self-monitor level of anxiety across different work tasks, social situations, or therapeutic interventions. Behavior would be assessed before treatment (to provide a baseline), during treatment (to guide treatment), and at the conclusion of treatment (to document outcome). To avoid error, assessors would be careful not to change the measurement procedure from one occasion to another (Martin, 1988).

But how are "items" selected in behavioral assessment? Interestingly, research indicates that selection of target behaviors varies by assessor (Evans & Wilson, 1983; Hartman, 1984). Hartman (1984) proposed that behavioral assessors disagree in their selection of target problems because of different beliefs about what is socially important, the relative desirability of alternative responses, different ideas about deviant behavior, and varying experience with the consequences of problem behaviors. Given these assumptions, Hartman (1984) suggested the following criteria for evaluating potential behavioral targets:

(a) the behavior is important to the client or significant others;

(b) the behavior is dangerous to the client or others;

(c) the behavior is "socially repugnant" (p. 109);

(d) the behavior interferes with the client's functioning;

and (e) the behavior clearly departs from normal functioning.

Precision. Cook and Campbell (1979) observed that precision of measurement can also be discussed in relation to the independent variables of intervention studies. Levels of the intervention can be specified imprecisely, such as when an investigator tests only two levels of an intervention, instead of potential levels 1 through 5. This underspecification may lead to statistically nonsignificant results and perhaps an improper generalization about the lack of efficacy of the named intervention (Cook & Campbell, 1979). For example, six weeks of job enhancement may demonstrate negligible effects when compared with a placebo control. Significant effects might occur, however, when the intervention proceeds for 52 weeks.

Item and task selection. Popham (1993) maintained that educators prefer criterion-referenced over norm-referenced tests because of the former's superiority for evaluating interventions. The advantage of criterion-referenced tests partially results from differences in test construction. Norm-referenced tests are constructed to maximize variability among individuals (Messick, 1983; Swezey, 1981); such dispersion increases the efficiency of selection decisions. However, items which measure infrequent behaviors are not likely to be included in norm-referenced tests. Jackson (1970), for example, suggested that items endorsed by less than 20% of the test development sample be dropped because they will not contribute to total score variability. However, those dropped items may be the very ones of interest to change agents. Criterion-referenced tests, in contrast, are composed of items based on criteria for which the intervention is targeted, regardless of the frequency of endorsement.

Distinctions similar to norm- and criterion-referenced testing are apparent in formative and summative testing (Bloom et al., 1971). Summative tests provide an overall evaluation of an individual's performance in an intervention (e.g., a course grade); summative tests provide data convenient for administrative decision-making. Formative tests, in contrast, provide feedback about an individual's performance on components of the intervention so that the intervention or the individual's place in it can be appropriately adjusted. Test data can be employed for formative and summative interpretations, but differences do exist in their respective test construction strategies. In particular, formative tests require some type of task analysis which separates performance into specific subtasks. Items or tasks tapping those components are then included in the formative test. Bloom et al. (1971) maintained that summative evaluations, often constrained by limited testing time, can sample only portions of the relevant content. In contrast, formative tests must sample all of the relevant content to be useful to the intervenor.

Change-based measurement. In research and practice which is not trait-based (e.g., intervention and longitudinal research), it is unreasonable to employ measures and assessment tasks from which state-type items were deleted during the test construction process. Collins (1991) noted that in the context of longitudinal research, it is possible to distinguish between static (trait) variables and dynamic (state) variables. Practitioners and researchers attempting to measure change examine intraindividual differences over time. However, Collins (1991) observed:

Little in traditional measurement theory is of any help to those who desire an instrument that is sensitive to intraindividual differences. In fact, applying traditional methods to the development of a measure of a dynamic latent variable amounts to applying a set of largely irrelevant criteria. (pp. 138-139)

As noted, traditional approaches to scale development retain items that display variability between persons and drop those that do not. For example, Collins and Cliff (1990) observed that no grade school children may be able to perform certain division tasks at the beginning of the school year, but all will do so at the end of the year. Those items would show no variation at either testing point and might be dropped during development of a traditional test of mathematics skill. Collins (1991) suggested that traditional definitions of reliability and precision, which emphasize variance between individuals, may be misleading indicators of the usefulness of measures assessing change (and the lack thereof) over time.

Because many types of change patterns are possible, the theoretical task involves specifying the expected pattern as closely as possible. School children, for example, may be expected to demonstrate certain patterns of change in their acquisition of mathematical skills. Children may first learn addition, then subtraction, multiplication, and division, in that order. Such a sequence can be characterized as cumulative (i.e., abilities are retained even as new abilities are gained), unitary (i.e., all individuals learn in the same sequence), and irreversible (i.e., development is always in one direction) (Collins, 1991). This theory, in turn, stipulates the form of measurement necessary to test the theory. As displayed in Table 13, Collins (1991) described a Longitudinal Guttman Simplex (LGS) model in which persons, items, and times can be ordered relative to one another. That is, not only can persons and items be ordered jointly, but the matrix can be expanded to include times of measurement. Such a model can be employed to test for developmental sequences. Again, the important consideration is that items are selected on the basis of their ability to reflect change, not their ability to show differences between individuals.

----------------------------------------
Insert Table 13 About Here
----------------------------------------
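
A rough sketch of the kind of consistency check the LGS model implies is given below. For a cumulative, unitary, irreversible sequence, a child's item-by-time record should never show a skill being lost once gained, and at each occasion the items passed should follow the theorized order (here addition, subtraction, multiplication, division). The data and the check are invented for illustration and do not reproduce Collins's (1991) formal model.

    # 1 = the child passes items testing the skill, 0 = does not. Columns follow the
    # theorized order: addition, subtraction, multiplication, division. Rows are
    # successive measurement occasions for one hypothetical child.
    child = [
        [1, 0, 0, 0],   # fall
        [1, 1, 0, 0],   # winter
        [1, 1, 1, 0],   # spring
        [1, 1, 1, 1],   # end of year
    ]

    def cumulative_and_ordered(records):
        """True if no skill is lost over time and each occasion fits the theorized order."""
        for earlier, later in zip(records, records[1:]):
            if any(e > l for e, l in zip(earlier, later)):   # a skill was lost
                return False
        for occasion in records:
            if any(occasion[i] < occasion[i + 1] for i in range(len(occasion) - 1)):
                return False                                 # skills out of the theorized order
        return True

    print(cumulative_and_ordered(child))   # True for this hypothetical record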

Criteria for selection of intervention items. Given the above concepts, it seems possible to summarize and propose a set of criteria for the selection of items and tasks suitable for a test of an intervention. First, such items must show change resulting from the presence of an intervention; for example, items should demonstrate expected changes from pre-test to post-test. Second, changes in scores from pre-intervention to post-intervention demand that such alterations not be attributable to measurement error (Tryon, 1991); thus, pre-test and post-test measures must show stability independent of treatment effects. Third, such items should not change when respondents are exposed to placebos or other types of control conditions; that is, item change should not occur solely as a result of expectations about treatment efficacy. Appropriate item change might occur, however, as a result of an intervention that alters individuals' expectations about (a) their personal competence for performing targeted behaviors, or (b) the skills and knowledge actually required to produce desired outcomes (cf. Bandura, 1977).
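
These three criteria can be expressed as a simple screening rule. The sketch below assumes hypothetical pre-test and post-test means for one item in treatment, no-treatment, and placebo groups; the minimum-change and maximum-drift thresholds are arbitrary values chosen for illustration.

    def keep_for_intervention_test(pre_tx, post_tx, pre_none, post_none,
                                   pre_placebo, post_placebo,
                                   min_change=0.5, max_drift=0.2):
        """Screen one item's group means against the three criteria above."""
        changes_with_treatment = abs(post_tx - pre_tx) >= min_change
        stable_without_treatment = abs(post_none - pre_none) <= max_drift
        unmoved_by_placebo = abs(post_placebo - pre_placebo) <= max_drift
        return changes_with_treatment and stable_without_treatment and unmoved_by_placebo

    # Hypothetical item: shifts under treatment, stays put without treatment and under placebo.
    print(keep_for_intervention_test(3.8, 2.9, 3.7, 3.6, 3.8, 3.7))   # True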

To gauge the reliability of instruments employed repeatedly, Tryon (1991) proposed the use of the coefficient of variation (CV):

CV = (SD / M) x 100

where SD is the standard deviation and M is the mean. Tryon (1991) suggested that stability will be demonstrated when repeated measurements show small standard deviations. CV represents the degree of error and is functionally equivalent to 1 - r², one minus the squared reliability coefficient. Thus, a CV of .05 indicates the measure has 5% error. A psychological test with an r of .80--approximately the median reliability value found for the data reported by Meier and Davis (1990) in Chapter 1--thus contains 36% error.
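
The sketch below shows both computations: the coefficient of variation for a series of hypothetical repeated measurements, and the error percentage implied by a reliability coefficient under the 1 - r² interpretation used in the text. The data are invented for illustration.

    import numpy as np

    def coefficient_of_variation(measurements):
        """CV = (SD / M) x 100 for a series of repeated measurements."""
        m = np.asarray(measurements, dtype=float)
        return m.std(ddof=1) / m.mean() * 100

    def error_percent_from_reliability(r):
        """Percentage of error implied by reliability r under the 1 - r^2 reading."""
        return (1 - r ** 2) * 100

    readings = [98, 101, 100, 99, 102, 100]              # hypothetical repeated measurements
    print(round(coefficient_of_variation(readings), 2))  # 1.41: small SD relative to the mean
    print(error_percent_from_reliability(0.80))          # 36.0, matching the text's example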

Administering Tests

Test administration refers to the procedures involved in the preparation and completion of a psychological test. Given that test conditions affect test performance (Murphy & Davidshofer, 1988), the central concern of test administration has been standardization, the establishment of identical or similar test procedures for each respondent. Dawis (1992) credited Binet with the introduction of standardization, the major purpose being that test results be influenced "neither by the bad humor nor the bad digestion of the examiner" (Binet, as cited in DuBois, 1970, p. 33).

Selection

Traits have been the primary focus of measurement psychologists interested in selection. For administrative decisions, psychologists have often assumed that no change in the phenomena is possible (e.g., because the trait is biologically based) or necessary (e.g., because the resources to intervene are lacking). Given the typical constraints on time and cost in administrative situations, selection tests typically must meet several standards. First, they are designed to be administered to large groups at a single administration. Thus, such tests measure stable, nomothetic constructs that should not change over time and that are possessed by all respondents. Instructions for such tests necessarily are brief, as are the response formats. True-false or multiple-choice tests are much more likely to be employed, for example, than performance tasks. As noted previously, some tests were initially developed as substitutes for time-consuming interviews.

Theory-Building

Traditional, standardized psychological tests are externally structured and internally unstructured. Efforts to standardize tests emphasize the arrangement of the environment external to the test-taker. General instructions are frequently provided, including a brief description of the test's purpose, a sample item and an explanation of response format alternatives. Many group-administered tests, however, have no mechanism to determine if test-takers read the instructions, understand the items, or employ the same strategies for answering items throughout the test. Traditional standardization does not guarantee uniformity of testing experience, as Gould's (1981) account of confused individuals completing the original Army Beta indicated. Clearly, more structure could be added to psychological tests and assessments, particularly regarding the test-taker's internal state. It seems possible that increasing the structure of self-reports could improve these instruments in much the same way additional structure has improved the reliability and validity of interviews.

Item response strategies. MacGregor, Lichtenstein, and Slovic (1988) proposed that when retrieving knowledge, individuals use one of two methods: intuitive or analytic. Most people employ an intuitive strategy which MacGregor et al. describe as holistic, inexpensive, portable, approximately correct--and empirically related to systematic biases. In contrast, analytic strategies produce more precisely correct judgments, but with a few large errors (Peters, Hammond & Summers, 1974). As an example of an analytic strategy, MacGregor et al. describe algorithms, a series of steps that produce a solution to a task. Algorithms provide an unambiguous approach to problem-solving and should lead to similar solutions to problems even when applied by different individuals. MacGregor et al.'s experimental comparison of analytic and intuitive strategies, employing college students who completed estimation problems, found the analytic groups to be more accurate, more consistent, and more confident of their estimates. The question remains: could structured knowledge retrieval aids improve self-reports?

Research exploring the strategies employed by test-takers to answer items is relatively recent. Recall the work of Jobe et al. (1990) described in Chapter 6. They examined the effects of prescribing the order of recall--forward (chronological), backward, and free recall--of health visits on the accuracy of that recall. Free recall subjects correctly remembered 67% of their visits, compared to 47% for forward recall and 42% for backward recall. It is important to note that the structure imposed on respondents in this study impaired their performance. Adding structure to tests may be counterproductive if it interferes with respondents' natural processes; on the other hand, identifying and enhancing such styles may potentially improve test performance. In a similar vein, Osberg & Shrauger (1986) reported the results of research that investigated the types of strategies subjects employed in self-predictions. They found that individuals' predictions were often based upon knowledge of the past frequencies of personal behavior, current and expected conditions, knowledge of personal qualities, intent to perform the behavior, and the frequency of the behavior in the general population. Osberg and Shrauger (1986) also found that predictions based upon subjects' knowledge of their past frequencies of behavior and subjects' personal qualities were the most accurate.

The effects of non-test events and interventions. Evidence of the effects of non-test events on test behavior can be found in the literature on instrumentation (Cook & Campbell, 1979), response-shift bias (Howard, 1982), and alpha-beta-gamma change (Golembiewski, Billingsley, & Yeager, 1976). Instrumentation refers to changes in scores from pre-test to post-test that occur because of changes in the measuring instrument, not as a result of an intervention (Cook & Campbell, 1979). Response-shift bias occurs when an intervention changes respondents' awareness or understanding of the measured construct (Howard et al., 1979). In response-shift research, respondents first complete an intervention. Respondents then answer a post-test as well as a retrospective pre-test where they rate items in reference to how they perceived those items before the intervention (Howard, 1982). As predicted by the response-shift effect, different results are apparent between scores of persons who complete the usual pre-test and post-test and persons who complete a post-test and a retrospective pre-test (Howard, 1982). In addition, some evidence suggests that retrospective pre-tests are free of response styles (Howard, Millham, Slaten & O'Donnell, 1981; Sprangers & Hoogstraten, 1987).

The pre-post changes demonstrated on measuring instruments, Golembiewski et al. (1976) proposed, can result from: (a) alpha change, in which altered scores validly correspond to changes produced by an intervention; (b) beta change, in which respondents alter the intervals of the scale; and (c) gamma change, a shift in the entire meaning or conceptualization of the instrument, perhaps as a result of seeing scale content in a new light. Golembiewski (1989) noted that very little attention has been paid to these ideas when interpreting the results of intervention research in organizational development. While alpha-beta-gamma changes have been recognized as potentially important sources of error, researchers have been limited by a lack of methods appropriate for investigating and demonstrating such changes (Millsap & Hartog, 1988).

Similarly, research indicates that psychotherapy reduces desynchrony in individuals. Such an effect occurs in psychotherapy studies which examine correlations among pre- and post-treatment measurements of client affect, cognition, and behavior. Typically such correlations are low at pretest, but rise after therapy. Correlations should also be greater among members of a treatment group than in a control group. For example, Moore and Haverkamp (1989) reported a study designed to increase the emotional expressiveness of a group of men. Twenty-eight subjects were randomly assigned to a treatment or control group in a posttest-only design. Subjects completed (a) self-report scales that assessed their perceptions of how often they experience emotion and how often they express emotion and (b) two behavioral tests which required subjects to produce written and oral responses to affect-laden situations presented through videotape and written materials. Moore and Haverkamp found that their treatment affected expressiveness as indicated by one of the two behavioral measures, but not by the two self-report measures. They then wondered whether the intervention "altered the experimental group members' perception of their level of verbal expressiveness" (p. 515). Substantially higher correlations were present in the experimental group between self-report and behavioral measures on three of four comparisons, lending indirect support to the proposition that the intervention increased awareness of verbal expression to the point where it matched actual expression. Teaching test-takers a basic emotional vocabulary, then, might increase the validity of their self-reports.

Active measurement. Traditional testing procedures treat test-takers as passive respondents whose primary task is to validly respond to standardized stimuli. Test-takers can also be viewed, however, as active participants in a dynamic process where test stimuli and tasks can be altered.

Teaching test-takers to respond more appropriately to psychological tests is an old but relatively unexplored idea. In the laboratories of the early experimental psychologists, subjects were not naive observers, but members of the research team or others able to observe psychological phenomena in a methodical manner (Danziger, 1990). There were good reasons for this practice. Danziger (1990) reviewed research in psychophysics documenting the difficulty individuals experience when quantifying perception. Danziger cited Boring (1942, p. 482), who wrote that "the meaning of the judgment two is indeterminate unless the criterion has been established." In 1946, Cronbach suggested training subjects to overcome response sets. Acquiescent students, for example, could increase their test-wiseness by learning how many false-marked-true errors they make. Cronbach (1946) similarly believed that "it is relatively easy to teach mature students what is desired in essay examination responses" (p. 489). In contemporary psychophysiological measurement, respondents first complete an adaptation or training period to allow stabilization of physiological variables (Sallis & Lichstein, 1979). Behavior therapists who teach their clients to self-monitor may instruct clients to keep a small notebook with them at all times, record incidents immediately after they occur, and record only a single response at a time (Hayes & Nelson, 1986). In all of these cases, test-takers receive training or employ procedures that increase their ability to perform the required measurement tasks. Assessors have also offered other proposals and conditions for improving test performance (e.g., Babor, Brown, & Del Boca, 1990; Fazio & Zanna, 1978; Klimoski & Brickner, 1987; Laing, 1988; Osberg, 1989; Regan & Fazio, 1977).

To the extent that a research program successfully identifies systematic invalidities in item responding, it may also be possible to manipulate the testing situation to minimize unwanted factors. For example, aggregation of item responses across an empirically determined number of occasions or period of time may eliminate state effects and increase trait variance of a construct (Cone, 1991). With the problem of social desirability, some research has provided evidence for the efficacy of the bogus pipeline technique (Jones & Sigall, 1971). Here respondents are led to believe that the test administrator possesses an objective method of ascertaining their true attitudes or beliefs; the result is often a decrease in socially desirable response tendencies (Botvin, Botvin, Renick, Filazzola, & Allegante, 1984; Brackwede, 1980; Mummendey & Bolton, 1981; Sprangers & Hoogstraten, 1987; Sprangers & Hoogstraten, 1988). Mummendey and Bolton (1981) compared subjects exposed and not exposed to bogus pipeline instructions and found that the former significantly reduced the frequency and intensity of their endorsement of socially desirable items. Sprangers and Hoogstraten (1987) found that a bogus pipeline procedure eliminated response-shift bias in a pre-post intervention design, but failed to replicate these results in a second study (Sprangers & Hoogstraten, 1988).

Studying temporal inconsistency: Psychological states. As described in Chapter 4, one solution to the observation that behavior varies is to contrast psychological traits with psychological states. States vary, traits do not. Success in the measurement of transient psychological states has yet to reach the level associated with trait-based tests, as Murphy and Davidshofer (1988) noted:

It is easier to make reliable inferences about stable characteristics of individuals than about characteristics which vary unpredictably. For example, it would be easier to develop a reliable measure of a person's basic values than of a person's mood state. (p. 84)

The difficulty in measuring mood and emotions, however, may have less to do with any inherent unpredictability than with the trait-based procedures that have typically been employed. The State-Trait Anxiety Inventory (STAI; Spielberger et al., 1970), for example, contains two 20-item scales to measure state and trait anxiety that differ mainly by instructions: the state measure asks respondents to answer according to how they feel "at this moment," while the trait measure requests responses that reflect how test-takers "generally" feel. The STAI, however, leaves an important question unaddressed. By embedding the measurement in a questionnaire administered at one point in time, we still gain little information about how or why the psychological characteristic varies. Heidbreder saw this in 1933:

Even in those systems of psychology which reduce their material to elements, the elements, whether bits of consciousness or bits of behavior, are defined as processes. It is true that, having been defined as processes, they are often, in actual practice, treated as fixed units, for the habit of thinking in terms of fixed units is tenacious. But when attention is turned upon psychological material directly, the character of change presents itself as an inescapable fact. (p. 24)

Repeated measurements, then, may be the most effective method of studying psychological states.

Particularly in the measurement of emotions, motivation, and related psychological states, it may also make sense to measure individuals when they are in those states. Although research results have been equivocal, studies in the areas of mood-state-dependency and mood-congruent learning suggest improved recall when mood at testing matches mood at the time of learning (Bower, 1981; Bower, Monteiro, & Gilligan, 1978; Ellis & Ashbrook, 1989; Mecklenbrauker & Hager, 1984). For example, one of my psychotherapy clients was best able to recall and describe past incidents of obsessive behavior during periods within the session when he was feeling anxious. Similarly, Steinweg's (1993) review found research demonstrating that depressed inpatients (a) better recalled negatively valenced over positively valenced memories (Clark & Teasdale, 1982) and (b) recalled negative information more quickly (Lloyd & Lishman, 1975). Research with the Velten mood induction (Velten, 1968) indicated that mood states such as depression or elation can be induced in individuals. Isen, Clark, Shalker, and Karp (1978), for example, utilized success and failure on a computer game to induce elation and depression, respectively; they found that subjects in the success condition recalled more positive than negative words. Finally, Rorschach content and the behavior of the test administrator are designed to maximize the ambiguity experienced by test-takers so that they reveal how they organize stimuli (Groth-Marnat, 1990).

Intervention

One of the more interesting aspects of test administration to investigate is the extent to which tests function both as measurements and interventions. That is, measurements and assessments can alter the amount of the construct they are intended to measure. As Webb et al. (1981) observed, "interviews and questionnaires...create as well as measure attitudes" and other constructs (p. 1). Bailey and Bhagat (1987) reviewed research indicating that the act of completing questionnaires and interviews can create or alter the level of held beliefs and attitudes. For example, Bridge et al. (1977) randomly assigned individuals to groups that were questioned about their opinions regarding cancer or crime, respectively. Respondents repeated the survey several weeks later. While the initial survey did not find differences between the two groups in attitudes toward cancer and crime, Bridge et al. (1977) found that individuals initially questioned about cancer, compared to individuals questioned about crime, increased their assessment of the importance of good health on retest. Bailey and Bhagat (1987) also cited a study by Kraut and McConahay (1973) which found that prospective voters, randomly sampled for a pre-election interview, showed a significantly higher turnout (48%) than the non-interviewed population (21%).

In behavior therapy, clients are frequently assigned to self-monitor problematic behaviors in preparation for an intervention. Interestingly, the simple act of recording such behaviors often leads to a decrease in their frequency (Nelson, 1977a, 1977b). Research indicates that completing simulations can both provide measurement data and change test-takers' level of the construct (Fulton et al., 1983; Johnson et al., 1987; Meier, 1988; Meier & Wick, 1991). Meier (1988) described research with If You Drink, a computer-assisted instruction (CAI) alcohol education program. This CAI program consists of modules designed to teach high school and college students facts about alcohol, the effects of alcohol consumption on blood alcohol levels, the effects of combining alcohol with other drugs, and responsible decision-making about alcohol. Two of these modules incorporate computer simulations that allow students to input data about alcohol consumption (e.g., number of drinks consumed at a hypothetical party) and receive feedback about subsequent consequences (e.g., blood alcohol level). Meier (1988) found that this CAI program, compared to a placebo control group, significantly improved college students' attitudes toward alcohol. Meier and Wick (1991) employed the same program with 7th and 9th grade students to investigate the program's viability as an unobtrusive measure of alcohol consumption. As reported in Chapter 6, Meier and Wick found that unobtrusively recorded reports of subjects' alcohol consumption in a simulation were significantly correlated with self-reports of recent drinking behavior, drinking intentions, and attitudes toward alcohol.

Scoring Tests

The purpose of most measurement and assessment procedures is to produce quantitative data that reflect the order and distinctions inherent in the measured phenomena. Most tests have explicit rules that allow objective scoring, that is, identical results whenever the rules are followed. While formats such as multiple-choice allow such rules, essay tests and projective tests such as the Rorschach involve considerable judgment in scoring (Murphy & Davidshofer, 1988).

Selection

As noted, summing or averaging item responses is the most common method of scoring tests. If the test purpose is to measure traits, aggregating as many items and occasions of measurement as is feasible should increase the probability of detecting traits. From a practical standpoint, the primary problem with aggregation would seem to be cost. In Epstein's (1979) research, a large number of measurements were required to reach high validity coefficients; in one study, for example, 14 days of daily measurements were required to reach .80. Martin (1988) also noted that to increase the correlation between predictor and criterion, aggregate measurements of both criterion and predictor are necessary to reduce unreliability in the two measures.

While selection testers do aggregate items and individuals, they tend not to aggregate measurements over time. To the extent that a test is intended to measure traits, aggregation of measures over time and occasions would seem a useful step in minimizing random error. In general, it would seem advisable for test administrators to administer tests and criteria more than once and to employ the average of those administrations as trait indicators.
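One way to anticipate how much aggregation over occasions might help is the Spearman-Brown prophecy formula, which projects the reliability of a score averaged over k parallel occasions from the reliability of a single occasion. The following Python fragment is a minimal sketch; the single-occasion reliability of .22 is a hypothetical value chosen for illustration, not a figure from Epstein's studies.

def spearman_brown(single_occasion_r, k):
    """Projected reliability of a score aggregated over k parallel occasions."""
    return (k * single_occasion_r) / (1 + (k - 1) * single_occasion_r)

# Hypothetical single-occasion reliability; not a figure from Epstein's studies.
r1 = 0.22
for k in (1, 7, 14, 28):
    print(f"{k:2d} occasions -> projected reliability {spearman_brown(r1, k):.2f}")

Under this purely illustrative assumption, roughly 14 occasions are needed before the aggregate's projected reliability reaches about .80, which parallels Epstein's observation that about two weeks of daily measurement were required.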

Because the purpose of selection testing is to compare an individual with others for the purpose of making a decision, raw scores are transformed, often into percentile ranks and stanines (Gronlund, 1988). The most common transformation, the z score, is a standard score, that is, a score which reflects the position of an individual in relation to all individuals who took the test. Such z scores allow comparisons across different sets of test scores (even those which have different means and distributions). The following formula computes z:

z = (X1 - M) / SD

where X1 is an individual's score on a test, M is the mean of all individuals' scores, and SD is the standard deviation, a statistic indicating the degree of variability or dispersion among data. A z of 1, for example, indicates that an individual scored 1 standard deviation above the mean. A more familiar type of standard score is the IQ:

IQ = (MA / CA) x 100

where MA is mental age as determined by type of tasks successfully completed by the test-taker and CA is the individual's chronological age. Here mental age is scaled against chronological age, a procedure with an underlying assumption that CA reflects consistent cognitive developmental sequences for individuals, for example, across gender and culture.
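As a minimal sketch of both transformations, the following Python fragment computes a z score for one hypothetical examinee and a ratio IQ from hypothetical mental and chronological ages; all numeric values are invented for illustration.

import statistics

scores = [42, 50, 55, 61, 47, 58, 53, 49]    # hypothetical raw test scores
mean = statistics.mean(scores)
sd = statistics.stdev(scores)                # sample standard deviation

# z score: one examinee's position relative to all who took the test
x1 = 61
z = (x1 - mean) / sd
print(f"z = {z:.2f}")

# Ratio IQ: mental age scaled against chronological age (both in years)
mental_age, chronological_age = 10.5, 9.0    # hypothetical values
iq = (mental_age / chronological_age) * 100
print(f"ratio IQ = {iq:.0f}")

Recomputing z after changing the list of scores shows directly how the transformation depends on the particular sample, the problem taken up next.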

The problem with z scores is that the units change depending upon the individuals who complete the test (Tryon, 1991). Changing the test sample is likely to change the mean and standard deviation, and therefore the z-score units. Tryon (1991) observed that with z scores, persons are measured against other persons instead of against other criteria or theoretically meaningful constructs. Thus, "Z-score units are borne of desperation due to the absence of theoretically meaningful units of measure" (Tryon, 1991, p. 6).

Theory-Building

Approaches such as disaggregation, experimentation, and Generalizability Theory (GT) would seem the most appropriate procedures for deepening our understanding of how tests could be scored. Aggregation can be employed to study consistency across situations (Cone, 1991), but it can also mask the influence of situations (Schwenkmezger, 1984). Thus, investigators interested in situational effects could aggregate items, persons, and occasions, but disaggregate situations (i.e., examine variability over situations). At this point in psychology's history, measurement theory is likely to be deepened by studies which disaggregate combinations of items, persons, occasions, and situations.

----------------------------------------
Insert Figure 34 About Here
----------------------------------------

To the extent that psychologists adapt laboratory-type tasks to function as computerized measurement, it may be possible to supplement or substitute for traditional scoring procedures through replication and experimentation. With replication, the same task is repeatedly presented to determine consistency of results. As shown in Figure 34, a simulation could be presented multiple times, with aggregated scores computed for stable, trait constructs and the stability of scores compared across administrations to provide reliability estimates. Multiple administrations also provide an opportunity to study variables and constructs expected to show change, such as practice and fatigue effects.

In contrast to classical test theory, GT recognizes that individuals have more than one true score on a construct (Cronbach, 1984). That is, various factors influence individuals' test scores or provide a source of variance. Identifying these factors depends upon specifying the conditions of testing we wish to generalize to. Cronbach (1984) provided an example of ratings of friendliness for a preschool child. An observer watches the child for 5 minutes each in the sandbox, on playground equipment, and drinking juice. Potential sources of measurement variation in this example include the observer, the situation, and the occasion (i.e., different instances within the situations, over time). Studies of these different sources could provide an indication of how much each source influences test scores. In general, GT procedures might be useful to investigate variation contributed by (a) dynamic or state constructs, (b) different measurement methods and modes, (c) different types of test-taker characteristics, (d) different types of test structure and test-taker training, and (e) situational differences.
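As a minimal sketch of how such sources of variance can be estimated, the following Python fragment performs a one-facet generalizability analysis for hypothetical ratings of persons crossed with occasions. A full G study for Cronbach's preschool example would add observers and situations as further facets; the ratings below are invented for illustration.

import numpy as np

# Hypothetical ratings: rows are persons, columns are occasions (fully crossed).
X = np.array([
    [4.0, 5.0, 4.0],
    [2.0, 3.0, 2.0],
    [5.0, 5.0, 4.0],
    [3.0, 2.0, 3.0],
    [4.0, 4.0, 5.0],
])
n_p, n_o = X.shape
grand = X.mean()
person_means = X.mean(axis=1)
occasion_means = X.mean(axis=0)

# Mean squares from a persons-by-occasions random-effects ANOVA.
ms_p = n_o * np.sum((person_means - grand) ** 2) / (n_p - 1)
ms_o = n_p * np.sum((occasion_means - grand) ** 2) / (n_o - 1)
residual = X - person_means[:, None] - occasion_means[None, :] + grand
ms_res = np.sum(residual ** 2) / ((n_p - 1) * (n_o - 1))

# Estimated variance components for persons, occasions, and residual error.
var_res = ms_res
var_p = max((ms_p - ms_res) / n_o, 0.0)
var_o = max((ms_o - ms_res) / n_p, 0.0)

# Generalizability coefficient for a score averaged over n_o occasions.
g_coef = var_p / (var_p + var_res / n_o)
print(round(var_p, 2), round(var_o, 2), round(var_res, 2), round(g_coef, 2))

The generalizability coefficient printed at the end estimates how dependably the occasion-averaged score distinguishes among persons; analogous decompositions could treat methods, observers, or situations as the facet of interest.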

Intervention

Swezey (1981) reported two types of scoring systems for criterion-referenced tests. Noninterference scoring occurs when a test-taker completes an item or task without input from the test administrator. With assist scoring, the administrator corrects the test-taker when an error occurs, and the test-taker then completes the remainder of the task. Swezey suggested that assist scoring is appropriate in diagnostic situations where the test administrator wishes to discover which components of the task require further intervention for a particular test-taker.

As an example of the more precise data that can be provided by formative tests, Bloom et al. (1971) described procedures developed by Smith and Tyler (1942). Item response categories can be designed to permit an analysis of errors. For example, students can respond to a test item by marking true, probably true, insufficient data, probably false, and false. These response categories allow identification of the following errors:

(a) general accuracy, that is, how well students' responses match the correctly keyed responses;

(b) caution, the extent to which students underestimate the data available to answer the item;

(c) going beyond the data, the extent to which students provide answers with greater certainty than the data in the item warrant;

and (d) crude errors, the number of true or probably true responses to items keyed false or probably false, and vice versa.

Teachers who conduct an analysis of such student errors should be able to specify more precisely the type of instruction that should occur.
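The sketch below illustrates how such an error analysis might be automated for a set of item responses. The rules used for caution and going beyond the data are my own illustrative operationalizations, since Smith and Tyler's exact scoring rules are not reproduced here, and the key and responses are hypothetical.

def error_analysis(responses, key):
    """Tally Smith-and-Tyler-style errors for one student's responses.

    The rules for 'caution' and 'going beyond the data' are illustrative
    operationalizations, not the original published scoring rules.
    """
    certainty = {"true": 2, "probably true": 1, "insufficient data": 0,
                 "probably false": 1, "false": 2}
    truth_side = {"true": 1, "probably true": 1, "insufficient data": 0,
                  "probably false": -1, "false": -1}
    tallies = {"accurate": 0, "caution": 0, "beyond_data": 0, "crude": 0}
    for response, keyed in zip(responses, key):
        if response == keyed:
            tallies["accurate"] += 1          # general accuracy
        if certainty[response] < certainty[keyed]:
            tallies["caution"] += 1           # less definite than the key warrants
        if certainty[response] > certainty[keyed]:
            tallies["beyond_data"] += 1       # more definite than the key warrants
        if truth_side[response] * truth_side[keyed] == -1:
            tallies["crude"] += 1             # true-side answer to a false-keyed item, or vice versa
    return tallies

key = ["probably true", "false", "insufficient data", "true"]
responses = ["true", "probably false", "false", "probably true"]
print(error_analysis(responses, key))

Crude errors and general accuracy follow directly from the definitions above; the certainty-based rules for the two middle categories would need to be checked against the original procedures before being used in practice.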

Interpreting Tests

How test data are interpreted partially depends upon the purpose of the test. Gronlund (1988) observed that "strictly speaking, the terms norm-referenced and criterion-referenced refer only to the method of interpreting test results" (p. 11). Suppose an individual received a score of 95% on a classroom test. A norm-referenced interpretation would restate that score as "higher than 94% of the rest of the class"; a criterion-referenced statement would be "correctly completed 95 of 100 questions." Criterion-referenced interpretations simply describe performance in relation to a standard other than other persons.

Selection

Norm-referenced tests allow comparison of an individual's test score with all others who have taken the test. When interpreting tests, this is the usual approach to making sense of scores. For example, students who score 600 and above on the Verbal and Quantitative components of the Graduate Record Examination might be retained among a pool of students who receive further consideration for graduate school admission.

The purpose of testing also affects who receives and acts upon the test interpretation. Hough (1962) considered the hypothetical case of a personality test, employed to match couples for marriage, which could reduce the divorce rate by 10 per cent. Despite this benefit, Hough suggested few individuals would place the choice of spouse in the hands of such a test. If social administrators decided to decrease the divorce rate, however, I venture they would use such a test. Paul et al. (1986a) indicated that assessment in mental health settings can provide information for the clinician, facility director, government facility monitor, researcher, client, and family members. All of these individuals may be interested in different components and levels of test data. Clients, for example, may be concerned about whether the assessment data indicates they can be discharged from treatment, while facility directors may focus on the relative efficacy of different treatment units in reaching client discharge by target dates or periods. Tests' utility may partially be evaluated on the basis of the number of relevant interpretations provided to decision-makers (Cronbach & Gleser, 1965).

Theory-Building

Construct explication. From the perspective of theory-building, one of the most important interpretations of test data concerns the extent to which the test adequately measures a construct. The key questions are: What constructs does the test measure? How well does the test measure the construct(s) of interest?

While selection measurement is concerned with predicting outcomes such as performance on a criterion, measurement for explanatory theory-building is focused on process, the system of operations that produces the outcome. Understanding process means that you can create a model of that phenomenon and connect the variables in that model to the real world. Torgerson (1958) believed that "the development of a theoretical science...would seem to be virtually impossible unless its variables can be measured adequately" (p. 2). The initial step in developing adequate measurement is explication of constructs in terms of measurable operations, and it is this task that is most crucial and difficult in new sciences such as psychology (Torgerson, 1958). Until constructs are precisely explicated, scientists in any new discipline will devote "an immense amount of time...to construction of complex and elaborate theoretical superstructures based on unexplicated, inexact constructs" (Torgerson, 1958, p. 8). The scientific literatures will be broad and unconnected.

Cronbach and Meehl (1955) maintained that "unless the network makes contact with observations, and exhibits explicit, public steps of inference, construct validation cannot be claimed" (p. 291). Similarly, Cone (1991) stated that "When scores on an instrument are totally controlled by objective features of the phenomenon being measured, we say the instrument is accurate" (p. viii). As shown in Figure 28, test validity--and the corresponding usefulness of that test data for such purposes as theory development, intervention, and selection--significantly depends upon the linkages between the phenomenon being studied, the data produced by the test process, and any transformations of that data. In many research areas, psychologists do not possess a very detailed understanding of the processes of responding to psychological test items and tasks. In other words, we do not know with much certainty whether those tests are good explications of the constructs they are intended to measure. And without knowledge of a test's construct validity, it becomes difficult to evaluate and modify the construct in question (Mischel, 1968).

----------------------------------------
Insert Figure 28 About Here
----------------------------------------

Psychological researchers often assume that their constructs can be measured through any procedure, with the default method being self-report. Dawis (1987) is among those, however, who have suggested that the method of measurement should be an integral part of construct definition and explication. If the meaning of a construct is affected by the form of its evidence (Kagan, 1988), then theorists must realize that in practice, there is no such thing as a construct or a method, but only construct-methods.

Precision. The construct explication process frequently focuses on whether or not operations accurately reflect the named construct. The methods described by Campbell and Fiske (1959) and Cook and Campbell (1979) attempt to evaluate the naming of a construct through two criteria: (a) convergence, whether different measures of the same construct correlate, even when measurement methods are different, and (b) divergence, whether measures of different constructs fail to correlate even when measurement methods are similar. Here we are searching for the method bias commonly found to affect psychological measurement. Studies of convergence and divergence produce a nomological net of relations which assist the researcher in naming the construct. Finding the predicted relations does not finish the task, however, as another construct may also explain those relations (Cook & Campbell, 1979). In this case, an additional study would be necessary to construct competing explications of the two constructs and then compare them (Platt, 1977). Regardless of the procedure, the first task of construct validity is to name the construct by referencing it to other related and distant constructs (Torgerson, 1958).
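The logic of convergence and divergence can be illustrated with simulated data. The Python sketch below generates two hypothetical constructs, each measured by a self-report and an observer method, with a shared method factor contaminating the self-report measures; the resulting correlations show convergent coefficients alongside same-method, different-construct coefficients inflated by method variance. All variables and loadings are invented for illustration.

import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical latent constructs plus a method factor shared by self-report scales.
construct_a = rng.normal(size=n)          # e.g., job satisfaction
construct_b = rng.normal(size=n)          # e.g., orderliness
self_report_method = rng.normal(size=n)   # method variance common to self-reports

measures = {
    "A_self_report": construct_a + 0.6 * self_report_method + rng.normal(scale=0.5, size=n),
    "A_observer":    construct_a + rng.normal(scale=0.5, size=n),
    "B_self_report": construct_b + 0.6 * self_report_method + rng.normal(scale=0.5, size=n),
    "B_observer":    construct_b + rng.normal(scale=0.5, size=n),
}
names = list(measures)
R = np.corrcoef(np.array([measures[name] for name in names]))

def r(x, y):
    return round(R[names.index(x), names.index(y)], 2)

print("convergent (same construct, different methods):", r("A_self_report", "A_observer"))
print("same method, different constructs:", r("A_self_report", "B_self_report"))
print("different method, different constructs:", r("A_self_report", "B_observer"))

In a Campbell and Fiske (1959) analysis, the same comparisons would be organized into a full multitrait-multimethod matrix.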

A second and underappreciated reason constructs may fail to be represented adequately by measurements is imprecision. The goal of measurement theory and practice should be to generate devices that produce data that mirror the distinctions that occur in nature. In contrast, many test developers validate tests by simply contrasting groups who possess more or less of the construct in question. For example, developers of a measure of depression may contrast mean scores of depressed patients with normal controls. While such differences certainly support validity of the test as a selection device, they constitute weak evidence for other purposes. Theoretically, we would expect to find more than simple mean differences between depressed and non-depressed individuals. High scores on a depression inventory would also provide little information for deciding between types of treatment.

Cook and Campbell (1979) maintained that an inadequate preoperational explication of constructs is a major threat to the construct validity of studies and measures. Cook and Campbell (1979) proposed that "a precise explication of constructs is vital for high construct validity since it permits tailoring the manipulations and measures to whichever definitions emerge from the explication" (p. 65). In a new science, however, constructs tend to be operationalized inadequately; thus, the initial goal of most research programs should be to create more precise measurement devices. This is the essence of the process of theory leading to data and the data enabling the theorist to refine theoretical constructs.

Cook and Campbell (1979) presented such an example in the work of Feldman (1968). Feldman employed five measures--giving street directions, mailing a lost letter, returning money, giving correct change, and charging the correct taxi fare--to determine whether foreigners or compatriots would be more likely to receive cooperation. Feldman found that two of the measures, giving directions and mailing the letter, related to the experimental manipulation differently than the other three measures of cooperation. Feldman interpreted these results to mean that cooperation should be differentiated into low-cost favors and foregoing financial advantage. Work such as Feldman's, however, tends to be the exception. The construct-data-construct revision process can be short-circuited when researchers (and journal reviewers) view data which disconfirms the study's operations as a failure rather than an opportunity for revision and replication. Although Cook and Campbell (1979) noted that often "the data force one to be more specific in one's labelling than originally planned" (p. 69), they also believed that since Feldman's "respecification of the constructs came after the data were received we can place less confidence in them than might otherwise have been warranted" (p. 69). Because such results may or may not be due to chance, the next step would be a replication of the more precise construction of cooperation. This is the process of making error into substance (McGuire, 1969).

During the item development process, test developers often appear to assume that they have access to the entire domain or at least a representative sample of the construct's operations and relations with other variables. Golden, Sawicki, and Franzen (1990) stated that:

A thorough understanding of what the test is expected to measure will guide both initial validation research and later clinical interpretation of individual results. Such theoretical understanding of what a given test is expected to produce also guides the development of an initial item pool. (p. 22)

The explication of a construct, however, begins with only a partial knowledge of the construct. This is the major reason for employing error in the measurement models which dominate psychological testing. To deepen knowledge, test developers must explicate constructs, test them empirically, revise the constructs, and recycle the process. Until the process reaches a sufficient degree of precision--as evidenced, for example, by replication of the item pool--researchers employing the measurement device who encounter negative results will not know whether those results should be attributed to incorrect theory or invalid measurement (Torgerson, 1958).

Feldman's (1968) method is one of disaggregating constructs to increase validity: that is, two types of cooperation fit the data better than one. Cook and Campbell (1979) called this construct underrepresentation. But there also exists the problem of surplus construct irrelevancies (Cook & Campbell, 1979), which refer to operations containing factors unconnected to the construct in question. Underrepresentation and irrelevancies are other ways of stating that the two primary concerns of construct validity are appropriate naming and sufficient precision. We must be able to say what a construct is and is not, and we must be able to match the natural levels of the construct with data produced by measurement devices.

As an example of a construct being underrepresented in measurement, Cook and Campbell (1979) noted that attitudes are typically defined as consistent responses to an object across cognitive, affective, and behavioral domains or across time. They observed that most measurements of attitudes are one-shot measures that do not meet these requirements. As an example of construct irrelevancies, Cook and Campbell (1979) discussed hypothesis-guessing by subjects in studies. If subjects guess how the experimenter wants or expects them to behave, subsequent scores may not solely represent the construct of interest. In addition, measurements can both underrepresent constructs and contain irrelevancies. Cook and Campbell (1979) suggested that this occurs with single measures of constructs: aggregating more than one measure of a construct should decrease error due to invalid factors and increase the possibility that more aspects of the construct are tapped. Similar problems are created when only one method (e.g., self-report) is employed to measure a construct; the single method may contain factors irrelevant to the construct.

Multiple methods, like multiple operations, should decrease error in the construct explication process. But the process is not always straightforward. While multiple operations and methods may strengthen validity, it is also true that changing one's method and operation of measurement may alter what is measured. Examples include (a) Arisohn, Bruch and Heimberg's (1988) findings that the magnitude of self-efficacy ratings for assertive behavior was influenced by method of situation presentation and response generation; (b) Cacioppo and Tassinary's (1990) explanation of how psychophysiological effects can change radically according to different measurement procedures; and (c) Watson's (1988) report that relatively small changes in affective descriptors and response format led to significant changes in correlations between scales measuring positive and negative affect. For some purposes, one may aggregate across methods and operations, but that leaves open the questions of (a) why the methods and operations differ and (b) whether one of the methods or operations is a better measure of the construct.

Expecting broad classes of psychological and physiological phenomena to correlate may represent a contemporary extension of the mistake committed by early psychologists. They expected to find relations between many different types of physical tasks, physiological activities, and intelligence, but discovered that such correlations were largely absent. Over 100 years after early psychologists began the task, Cacioppo and Tassinary (1990) found that attempts to link physiological states to psychological operations remain problematic because of confusion about the relations among the categories of events measured. As shown in Figure 29, they proposed that such relations be conceptualized as:

(a) outcomes, where many physiological events vary as a function of a single psychological operation within a specified context or group of individuals;

(b) markers, where a single physiological event varies with a single psychological operation within a specified context or group of individuals;

(c) concomitants, where many physiological events vary with a single psychological operation across a broad range of situations and individuals;

and (d) invariants, where a single physiological event varies with a single psychological operation across a broad range of situations and individuals.

----------------------------------------
Insert Figure 29 About Here
----------------------------------------

Desynchrony and construct explication. Desynchrony has important implications for the validity of measurement. If individuals were totally synchronous, then measures of B-A-C would be substitutable. One could validate a measure of behavior, for example, by correlating that measure with a corresponding measure of cognition or affect. A more extensive validation procedure, involving a multitrait-multimethod correlation matrix (Campbell & Fiske, 1959), would consist of evaluating two or more psychological phenomena along the modes of behavior, affect and cognition. For example, one might attempt to validate a measure of job satisfaction by (a) explicating behavioral, affective and cognitive indicators of job satisfaction and then (b) correlating those measures with B-A-C indicators of similar (e.g., occupational stress) and different (e.g., personal orderliness or neatness) constructs. But if the Law of Initial Values influences affective measures of job satisfaction and occupational stress, correlations with cognitive and behavioral measures may be attenuated.

Human response modes may also be idiographic. That is, the interrelations among modes may differ by individual. Bandler and Grinder (1975), for example, proposed that individuals differ in their mode preferences for perceiving environmental information and expressing psychological states. Some individuals may think about a stimulus and ignore their feelings before executing behavior. Idiographic desynchrony has important implications for anyone conducting psychological interventions. If desynchrony is caused by idiographic ordering of response modes, then it becomes crucial to understand these linkages in the individuals for whom you wish to facilitate change. The B-A-C causal sequences would influence the intervention design and the assignment of individuals to those interventions. Although no standard procedure exists for measuring the B-A-C sequence, Evans (1986) noted that clinical assessment involves a great deal of effort at understanding how individuals' response modes are organized. In Lazarus' (1981) BASIC ID system, the assessor attempts to observe sequences among modes to develop intervention strategies (Lazarus, 1981; Nay, 1979).

Most contemporary theorists appear to assume that self-reports of B-A-Cs are sufficient operationalizations of the psychological constructs they propose. If self-report requires primarily cognitive activities, however, then most theorists are assuming that cognitive reports of all modes are adequate. The evidence for desynchrony suggests, however, that (a) theorists must specify which mode(s) should be measured to detect the phenomena of interest and (b) theorists should consider the possibility that cognitive reports may be less valid when applied to other modes. Desynchronous individuals may not have cognitive access to valid information about their affect and behavior.

One of the major tasks of the measurement theorist may be to develop precise models that allow prediction and explanation of synchrony and desynchrony. In this context, it is important to emphasize that synchrony does occur (Evans, 1986). For example, decreases in pupil diameter are highly correlated with cognitive task demands (Beatty & Wagoner, 1978). Cardiac deceleration is correlated with an intention to perform a voluntary act (Lacey, 1967). Obrist (1981) conducted extensive research to map out synchronous processes among physiological modes. Pennebaker, Hughes and Uhlmann (in DeAngelis, 1992) found that 50% of their subjects demonstrated a significant covariation between their written expressed affect and skin conductance recorded simultaneously.

Desynchrony implications for explanations and predictions. If theory and research indicate that under certain conditions human modes function independently (i.e., they are parallel and non-redundant), the practical implication is that two very different measurement approaches should be employed for explanatory and predictive purposes. If the research objective is to explain relations among human modes, it would be important to include measures of all modes to detect desynchrony effects. Examples of such research include studies to determine the causal sequence of B-A-C operations in individuals or the effects of psychotherapy. Researchers conducting these studies are likely to be those concerned with general psychological laws.

On the other hand, researchers interested chiefly in prediction might choose a priori which mode they aim to assess and measure that mode only. Not only might past behavior best predict future behavior, but past cognition might best predict future cognition, and so forth. To the extent that modes operate independently, discriminant validity would not be an issue because no correlation should exist among the modes of affect, cognition, and behavior. One would still be well-advised to employ multiple measures (Cook & Campbell, 1979), but multiple measures in the same mode. To maximize prediction, the task would involve specifying the mode one is interested in measuring and measuring that mode in a setting that approximates the criterion setting as closely as possible. Predicting which potential clients might make the best group therapy participants, for instance, could involve establishing a simulated group in which group behavior could be observed. In fact, Yalom (1985) described research that found that direct sampling of relevant group behaviors is a better predictor of individuals' group behavior than personality inventories. The success of work assessment centers may also be explained by the fact that such centers typically employ tasks closely matching those found in the workplace (Howard, 1983; Klimoski & Brickner, 1987).

Construct explication and phenomenon-data distance. A major argument of idiographic proponents is that the scores of large groups of individuals are too distant from the psychological phenomena being measured to inform any theory of those phenomena (Cone, 1988; Loevinger, 1957). As illustrated in Figure 30, traditional self-reports combine the item responses of individuals, thereby aggregating such systematic errors as social desirability and acquiescence. The resulting scores contain difficult-to-separate validities and invalidities (Wiley, 1990; Meehl, 1991).

----------------------------------------
Insert Figure 30 About Here
----------------------------------------

In contrast to such phenomenon-distant research, seminal work in such areas as learning, sensation and perception, and cognitive development has been performed using repeated observation of single subjects. The history of psychological science contains many examples (e.g., Fechner, Wundt, Piaget, Ebbinghaus, Pavlov, and Skinner) in which investigators' work with one or a few subjects produced data sufficient for the start of important research programs or for practical applications. Such research, while time-consuming, can be phenomenon-close: that is, the investigator observes the phenomenon as it occurs. The phenomenon may be multiply determined and the observer may commit errors. Nevertheless, systematic, repeated observation of a single subject, in naturalistic and experimental settings, often represents the simplest available method for studying a phenomenon. The data provided by methods such as single-subject designs, qualitative research, and protocol analysis are likely to be of sufficient quality to permit more precise construct explications. In contrast, an investigator who begins a research program by creating and administering a test to large groups may ultimately possess little sense of how the phenomenon operates, particularly when interpreting test scores influenced by multiple, difficult-to-identify factors.

Intervention

Criterion-referenced tests allow intraindividual comparisons of progress toward performance of specific criteria or objectives. Like Gronlund (1988), Swezey (1981) observed that norm- and criterion-referenced tests can appear identical in terms of instructions, items, and format. The purpose of testing directs the types of data employed in test interpretation.

Criterion-referenced tests have considerable utility in evaluating interventions (Swezey, 1981). The typical procedure is to develop a criterion-referenced test and administer it pre- and post-intervention. Pre-test scores might help select individuals in or out of the intervention. Post-test scores indicate who has reached the necessary performance level as well as who requires additional intervention. As noted previously, selection tests can also be useful at the beginning and end of interventions. To the extent that resources are limited, placement tests may be employed to decide who should receive a treatment, job, or school admission. Similarly, summative tests may be useful in certifying the completion of an intervention or training; in certain circumstances, summative, norm-referenced tests may be appropriate.

Ideally, tests employed for intervention should be able to provide diagnostic and formative information. That is, test data should be able to provide diagnoses of problems (in personal-emotional tests) and errors committed (in cognitive ability tests) to suggest appropriate interventions. During the intervention, formative tests provide feedback to the intervenor and client that reveal progress and guide adjustment of the intervention. An individual's success or failure on formative test items provides feedback about progress and the focus of intervention efforts. In education, Cross and Angelo (1988) described this process as a loop "from teaching technique to feedback on student learning to revision of the technique" (p. 2). Whereas summative tests focus on an aggregate score (of items and components), administrators of formative tests tend to examine item response patterns (Bloom et al., 1971). Summative tests can suggest initial hypotheses relevant to interventions: for example, a standardized achievement test can describe a student's strengths and weaknesses (compared to other students) across subject areas, information that may be relevant to inclusion or exclusion from an intervention. More sensitive measures will be needed to develop and test those hypotheses, however, and it is here that formative tests can be useful (Bloom et al., 1971; Cross & Angelo, 1988).

Formative tests provide data relevant to individuals and groups. Bloom et al. (1971) discussed how a series of formative tests completed by students correlated with those students' eventual summative scores. Different classes demonstrated different patterns. In one class, only 20% of students demonstrated mastery across 7 formative tests, and Bloom et al. (1971) indicated that about the same percentage achieved mastery on the summative test. In another class, an increasing percentage of students demonstrated mastery on the formative tests, with 67% achieving mastery on the final formative test. Bloom et al. (1971) reported that about the same percentage achieved mastery on the summative test.

Bloom et al. (1971) suggested that formative tests can provide teachers and students feedback about progress in learning. Such feedback may be particularly reinforcing to students as they master small units of learning. A similar effect occurs when behavior therapy clients who begin self-monitoring demonstrate improvements in the monitored behavior (Nelson, 1977a, 1977b). Formative evaluation also allows for correct pacing of learning, indicating when more difficult material should be introduced and when additional time is necessary to master material.

Summary and Implications

Since tests may have multiple purposes, evaluations of validity focus on whether test score interpretations are valid for a particular use. Selection tests have influenced many test developers to embrace nomothetic, norm-referenced, trait-based, single administration, and aggregated measurement procedures. These characteristics, however, may not be optimal even for selection purposes; criterion-referenced tests, for example, may hold more potential for increasing selection tests' predictive validity. Given the influence of selection procedures, most contemporary psychological tests would seem to be more valid for decision-making than for explanatory theory-building or assisting psychological interventions.

Clearly, the link between substantive theory, measurement procedures, and empirical data in psychology must be strengthened. Measurement approaches have relied heavily on analyzing data after they have been produced by human subjects, employing statistical techniques to break the code presumably encrypted in the data. Measurement psychologists, more so than those in any other specialty, have believed that "inaccurate data could be made to yield accurate conclusions if the proper statistics were applied (Boring, 1950)" (Barlow & Hersen, 1984, p. 6). I propose that we better understand how those data are produced in the first place. Substantive models of measurement processes must continue to be developed, and substantive psychological theories must be examined for potential measurement implications. Researchers and theorists should spend more resources on explicating and investigating how their constructs should--and should not--be observed and measured. The purpose of measurement should be a central consideration for theorists undertaking construct explication.

Tryon (1991) discussed the problems that result when scientists have yet to settle upon and employ standard measurement units. He believes that "to say that intelligence is what an intelligence test measures both begs the question and is a correct evaluation of the current conceptualization of intelligence" (p. 4). In contrast, Tryon noted that a component of intelligence such as information processing could be defined as information correctly processed within a set time period (e.g., correct bits per second). Whatever unit is constructed will carry implications for theory: one can study correct bits per second or the time required to produce one correct bit, and the two units are likely to lead theory in different directions. Similarly, Evans (1986) observed that the time intervals aggregated for data analysis can have considerable impact upon the interpretation of research results. Citing work by Baerends (1972), Evans (1986) noted that birds' nest-building and preening behaviors are (a) negatively correlated at short time intervals, because the behaviors are topographically incompatible; (b) uncorrelated at intermediate intervals, because preening, the higher frequency behavior, serves multiple functions; and (c) positively correlated over longer intervals, because preening and nest-building are both related to reproduction. Evans (1986) observed similar effects of aggregation in the study of the behavior of severely handicapped children (Evans & Voeltz, 1982; Evans, Voeltz, Freedland, & Brennan, 1981).

Without fundamental measurement units, reliable and valid measurements may also be inaccurate (Tryon, 1991). Tryon provided a hypothetical experiment in which five investigators independently investigate temperature by filling a glass tube with mercury. They assess reliability by repeatedly placing the tube in ice water and recording the height of the mercury, with appropriate resting periods between repetitions. This process would likely demonstrate high test-retest reliability for each investigator's measurements. Similarly, the investigators place their tubes in beakers of ice water, water at room temperature, and boiling water. All find that the height of mercury is highest in the boiling water, moderate at room temperature, and lowest in the ice water: a perfect rank-order correlation. Yet when the five investigators meet and place their respective tubes in the same beaker of water, all register different readings. Because no standard unit of measurement was employed, each investigator's tube shows a different height of mercury, depending upon the tube's volume and the amount of mercury it contains.
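A brief simulation makes the point concrete. In the Python sketch below, five uncalibrated tubes each convert temperature to mercury height with their own slope and offset, standing in for differences in tube volume and amount of mercury; the slopes, offsets, and noise level are arbitrary values chosen for illustration.

import numpy as np

rng = np.random.default_rng(1)
temps = {"ice water": 0.0, "room temperature": 22.0, "boiling water": 100.0}

# Each uncalibrated tube converts temperature to mercury height with its own
# slope and offset, standing in for tube volume and amount of mercury.
tubes = [(0.8, 5.0), (1.3, 2.0), (0.5, 12.0), (1.0, 0.0), (2.1, 7.5)]

def height(tube, temp):
    slope, offset = tube
    return offset + slope * temp + rng.normal(scale=0.2)  # small measurement noise

# Test-retest: each tube is highly consistent with itself in ice water...
for i, tube in enumerate(tubes):
    readings = [height(tube, temps["ice water"]) for _ in range(10)]
    print(f"tube {i}: ice-water readings range over {max(readings) - min(readings):.2f} units")

# ...and every tube orders the three beakers identically (ice < room < boiling),
# yet in the same beaker of room-temperature water the tubes disagree:
print([round(height(tube, temps["room temperature"]), 1) for tube in tubes])

Each tube is reliable and each preserves the rank order of the beakers, yet without a common unit their readings cannot be compared or combined.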

I have described in previous chapters numerous constructs that might form the basis for the creation of units of measurement for psychological phenomena. Such constructs include:

(a) persons,

(b) response modes (e.g., cognition, affect, and behavior),

(c) gender,

(d) group status (e.g., culture and ethnicity),

(e) measurement methods (e.g., physiological, self-report, behavioral),

(f) observer status (e.g., self or other),

(g) reactive or nonreactive methods,

(h) traits,

(i) states,

(j) situations (e.g., environmental types such as those proposed in vocational choice theories; Holland, 1985),

(k) time (e.g., occasions, frequency, duration),

and (l) level of aggregation (items and aggregated scores).

Further theoretical and empirical work is needed to explore the utility of combining these units. Such efforts in applied psychology have already yielded substantial progress. Most measurement psychologists would agree that Campbell and Fiske's (1959) conception of tests as trait-method units has been very useful for thinking about and evaluating tests. Selection tests have benefitted from maximizing the aggregation of persons and items. Validity generalization suggests that if persons, items, and tests are combined within appropriately defined occupations, such aggregation is sufficient to permit moderate to high levels of predictive validity (Schmidt, 1992; Schmidt & Hunter, 1977). Collins (1991) demonstrated the utility of constructing person-item-occasion measurements for the study of longitudinal change.

To the extent possible, the goal of science is to develop a dependable theory-measurement basis upon which to make selection and intervention decisions. Psychology is a new science, and this kind of knowledge base is available for relatively few problem-treatment combinations. In most contemporary intervention contexts, the intervenor's task is to develop a model of the client, measure the variables in that model, and implement treatments in a trial-and-error fashion. It should be no surprise that the majority of psychotherapists and counselors identify themselves as eclectic (Meier & Davis, 1993). As the knowledge base deepens, one would expect fixed treatments to replace adaptive ones, and eclectic practitioners to be replaced by specialists. This process, however, heavily depends upon the quality of the science's measurement theory and methods.