Note: Excerpted from the Canadian Psychological Association (1996). Guidelines for educational and psychological testing. Ottawa: Canadian Psychological Association. © CPA 1996
Validity is the most important consideration in test evaluation. The concept refers to the appropriateness, meaningfulness, and usefulness of the specific inferences made from test scores. Test validation is the process of accumulating evidence to support such inferences. A variety of inferences may be made from scores produced by a given test, and there are many ways of accumulating evidence to support any particular inference. Validity, however, is a unitary concept. Although evidence may be accumulated in many ways, validity always refers to the degree to which that evidence supports the inferences that are made from the scores. It is the inferences regarding specific uses of a test that are validated, not the test itself.
Traditionally, the various means of accumulating validity evidence have been grouped into categories called content-related, criterion-related, and construct-related evidence of validity. These categories are convenient, as are other more refined categorizations (e.g., the division of the criterion-related category into predictive and concurrent evidence of validity), but the use of the category labels should not be taken to imply that there are distinct types of validity or that a specific validation strategy is best for each specific inference or test use. Rigorous distinctions between the categories are not possible. Evidence normally identified with the criterion-related or content-related categories, for example, also is relevant in the construct-related category.
Professional judgement should guide the decisions regarding which forms of evidence are most necessary and feasible in light of the intended uses of the test and likely alternatives to testing. The quality of the evidence is also of importance.
The gathering of evidence may involve not only the examination of the present instrument in the present situation but also the available evidence on the use of the same or similar instruments in similar situations. This process involves the concept of generalization, either on the basis of common elements as in synthetic validation or on the basis of overall job similarity as in validity generalization.
Construct-Related Evidence
The evidence classed in the construct-related category focuses primarily on the test score as a measure of the psychological characteristic of interest. This characteristic may be the construct of general intelligence or, if separate abilities are of interest, reasoning ability, spatial visualization, and reading ability may be the relevant constructs. Sociability and introversion are examples of constructs of personality characteristics. Endurance is a frequently used construct in athletics. Studies of leadership behaviour often refer to constructs such as consideration for subordinates (for example, giving praise, explaining reasons for action, asking opinions) and initiating structure (e.g., setting goals, keeping on schedule). Such characteristics are referred to as constructs because they are theoretical constructions about the nature of human behaviour. In this regard it should be noted that the validity of a measure of a construct is a problem distinct from that of the use of that measure in predicting a second measure, though the latter can often contribute to construct validation.
The construct of interest should be embedded in a conceptual framework, no matter how imperfect that framework may be. The conceptual framework specifies the meaning of the construct, distinguishes it from other constructs, and indicates how measures of the construct should relate to other variables.
The process of compiling construct-related evidence for test validity starts with test development and continues until the pattern of empirical relationships between test scores and other variables clearly indicates the meaning of the test score. Especially when multiple measures of a construct are not available -- as in many practical testing applications -- the validation of inferences about a construct also requires careful attention to aspects of measurement, such as test format, administration conditions, or language level, that may materially affect test meaning and interpretation.
Evidence for the construct interpretation of a test may be obtained from a variety of sources. Intercorrelations among items may be used to support the assertion that a test measures primarily a single construct. Substantial relationships of a test to other measures purportedly of the same construct, and the absence of relationships to measures purportedly of different constructs, support both the identification of constructs and distinctions among them. Relationships across different methods of measurement and across various non-test variables similarly sharpen and elaborate the meaning and interpretation of constructs. Another line of evidence derives from analyses of individual responses. Detailed questioning of test takers regarding their performance strategies or responses to particular items, or the probing of raters regarding the reasons for their ratings, can yield hypotheses that enrich the definition of a construct. Theoretical models of psychological processes involved in the construct can be developed and evaluated on the basis of analysis of test scores. Furthermore, evidence from content- and criterion-related validation studies, to be described in the following sections, contributes to construct interpretations. The choice of one or more approaches to gathering evidence for interpreting constructs -- those described here or others -- will depend on the particular validation problem and the extent to which validation is focused on construct meaning.
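The comparison between relationships with measures of the same construct and relationships with measures of different constructs can be sketched in a few lines. The example below is an illustration only, not a procedure prescribed by these Guidelines: the three score lists are hypothetical data, and the Pearson correlation is assumed as the index of relationship.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical scores for eight test takers (illustrative data only).
new_test = [12, 15, 11, 18, 20, 14, 16, 19]
same_construct = [30, 36, 29, 41, 44, 33, 37, 42]   # established measure, same construct
other_construct = [7, 3, 6, 5, 4, 8, 2, 6]          # measure of a different construct

convergent = pearson_r(new_test, same_construct)
discriminant = pearson_r(new_test, other_construct)

print(f"same construct:      r = {convergent:.2f}")
print(f"different construct: r = {discriminant:.2f}")
```

A pattern in which the first coefficient is substantial and the second is near zero would be consistent with the construct interpretation; how large a difference is sufficient remains a matter of professional judgement.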
Content-Related Evidence
In general, content-related evidence demonstrates the degree to which the sample of items, tasks, or questions on a test is representative of some defined universe or "domain" of content. The methods often rely on expert judgements to assess the relationship between parts of the test and the defined universe, but certain logical and empirical procedures can also be used. For example, the major facets of an academic subject-matter domain can be specified, and then experts in that subject can be asked to assign test items to categories defined by those facets. The representativeness of the sample of items can then be judged. Sometimes algorithms or rules can be constructed to generate items that differ systematically on various domain facets, thus assuring representativeness. As another example, systematic observations of behaviour in a job may be combined with expert judgements to construct a representative or critical sample of job tasks, which then can be administered under standardized conditions in an off-the-job setting. Expert judgements can be used to assess the relative importance or criticality of various parts of the job, instructional program, or item universe (e.g., identifying aspects of job performance that are critical in preventing accidents). A job sample test can then be made to cover those aspects more thoroughly. Also, if some job tasks are judged relatively unimportant, they may be excluded from the test sample.
The use of content-related evidence of validity as a validation strategy is not restricted to job sample measures. Content validity is also an appropriate validation strategy for tests of knowledge, skills, abilities and aptitudes. The distinction between content and construct validity is extremely difficult to maintain.
The first task for test developers is the adequate specification of the universe of content the test is designed to represent, given the proposed uses of the test. Test users who are considering an available test for a purpose other than that for which the test was originally developed need to judge the appropriateness of the original domain definition for the proposed new use. For educational decisions, it is important to determine the agreement between the test and the curricular or instructional domain it is meant to cover.
Another important task is the determination of the degree to which the format and response properties of the sample of items or tasks in a test are representative of the universe. Items included in a test may bear superficial similarity to those in the domain of interest and yet require different kinds of skills than those in the job performance universe, for example. On the other hand, superficial dissimilarity between test and universe does not necessarily constitute evidence against a claim of validity. Methods classed in the content-related category thus should often be concerned with the psychological construct underlying the test as well as with the character of test content. There is often no sharp distinction between test content and test construct.
Content-related evidence for test validity is a central concern during test development, whether such development occurs in a research setting, in a publishing house, or in the context of daily professional practice. Expert professional judgement should play an integral part in developing the definition of what is to be measured: describing the universe of content, generating or selecting the content sample, and specifying the item format and scoring system. Thus, inferences about content are linked to the process of test construction as well as to the process of establishing evidence of validity after the test has been developed and chosen for use.
Criterion-Related Evidence
Criterion-related evidence demonstrates that test scores are systematically related to one or more outcome criteria. In this context the criterion is the variable of primary interest, as determined by a school system, the management of a firm, or clients, for example. The choice of the criterion and the measurement procedures used to obtain criterion scores are of central importance. Logically, the value of a criterion-related study is dependent upon the validity of the criterion measure that is used.
The relationship between scores on a test and a criterion measure may be expressed in a variety of ways, but the fundamental question is always: "How accurately can criterion performance be predicted from scores on the test?" Whether a given degree of accuracy is judged to be high or low or useful or not useful depends on the context in which the decision is to be made.
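One common way of expressing predictive accuracy is a least-squares prediction line together with its standard error of estimate. The sketch below assumes a linear relationship and uses hypothetical test and criterion scores; the Guidelines do not require this particular index.

```python
from math import sqrt

def fit_line(x, y):
    """Least-squares slope and intercept for predicting y from x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def standard_error_of_estimate(x, y, slope, intercept):
    """Typical size of the prediction error, in criterion units."""
    residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
    return sqrt(sum(e ** 2 for e in residuals) / (len(x) - 2))

# Hypothetical data: admission test scores and later grade averages.
test_scores = [50, 55, 60, 65, 70, 75, 80, 85]
criterion = [2.1, 2.4, 2.3, 2.9, 3.0, 3.2, 3.1, 3.6]

slope, intercept = fit_line(test_scores, criterion)
see = standard_error_of_estimate(test_scores, criterion, slope, intercept)

print(f"predicted criterion = {intercept:.3f} + {slope:.3f} * test score")
print(f"standard error of estimate = {see:.3f}")
```

Whether an error of this size is acceptable depends, as stated above, on the context in which the decision is to be made.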
Two designs for obtaining criterion-related evidence, predictive and concurrent, may be distinguished. A predictive study obtains information about the accuracy with which criterion scores obtained in the future can be estimated from earlier test data. A concurrent study serves the same purpose, but obtains prediction and criterion information at approximately the same point in time. The use of a predictive or a concurrent design depends upon the type of test, the use to be made of the test, economic considerations and, most importantly, upon professional judgement.
A decision theory framework can be used to judge the value of a predictor test. One judgement would be that the most important error to avoid is selecting someone who will subsequently fail. Another judgement would focus on avoiding false negatives, the persons who would have succeeded but are not selected. The relative cost assigned to each kind of error is again a value judgement; depending on that judgement, the subsequent interpretation of the utility of testing may differ. Value judgements are always involved in selection decisions, if only implicitly. The question of what value judgements are appropriate in individual applications is not addressed in these Guidelines.
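The dependence of the preferred decision rule on the costs assigned to each kind of error can be illustrated with a small calculation. In the sketch below, the scores, outcomes, candidate cut scores, and cost weights are all hypothetical, and the function names are introduced only for this illustration.

```python
def expected_cost(scores, succeeded, cut_score, cost_fp, cost_fn):
    """Average cost per applicant of selecting everyone at or above cut_score."""
    # False positive: selected, but subsequently failed.
    false_pos = sum(1 for s, ok in zip(scores, succeeded) if s >= cut_score and not ok)
    # False negative: rejected, but would have succeeded.
    false_neg = sum(1 for s, ok in zip(scores, succeeded) if s < cut_score and ok)
    return (false_pos * cost_fp + false_neg * cost_fn) / len(scores)

def best_cut(scores, succeeded, candidates, cost_fp, cost_fn):
    """Candidate cut score with the lowest expected cost."""
    return min(candidates,
               key=lambda c: expected_cost(scores, succeeded, c, cost_fp, cost_fn))

# Hypothetical validation sample: test score and later success or failure.
scores = [40, 45, 50, 55, 60, 65, 70, 75, 80, 85]
succeeded = [False, False, True, False, True, False, True, True, True, True]
candidates = [50, 60, 70]

# Judgement A: selecting someone who later fails is three times as costly.
cut_a = best_cut(scores, succeeded, candidates, cost_fp=3, cost_fn=1)
# Judgement B: rejecting someone who would have succeeded is three times as costly.
cut_b = best_cut(scores, succeeded, candidates, cost_fp=1, cost_fn=3)

print("preferred cut under judgement A:", cut_a)
print("preferred cut under judgement B:", cut_b)
```

The same validation data lead to different cut scores under the two sets of weights, which is the sense in which value judgements enter selection decisions.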
In contrast to selection decisions, classification decisions attempt to allocate individuals within an institution according to a particular outcome criterion in a way that is optimal for the institution and/or for the individuals. Test validation for classification decisions requires a demonstration of statistical interaction between the test variable(s) and the classification variable(s). The evidence required depends upon the test application. For instance, it is possible for tests to be highly predictive of performance across different jobs without providing the information necessary to make a comparative judgement of the efficacy of assignment. Test validation for such selection decisions will not necessarily require the same type of evidence as that for classification decisions. Careful attention should be paid to the decision being made, the criterion used, and the various classifications used.
Validity Generalization
An important issue in educational and employment settings is the degree to which criterion-related evidence of validity obtained in one situation can be generalized to another situation without further study of validity in the new situation. In validity generalization, validity results for a given job type/family and test type are accumulated across situations. The lower the amount of variance in true validity coefficients remaining after the effects of statistical artifacts are removed from the distribution of observed validities, the greater the generalization that is possible to a new situation involving similar job and test types. When generalization is extensive, situation-specific validation is not required. Criterion-related evidence of validity can be demonstrated through such validity generalization without the need to conduct a local criterion-related validation study. When generalization involves only one prior situation (i.e., the validity result for a given job type and test type in one situation is generalized to a new situation) the process is sometimes, more appropriately, referred to as validity transportability. In validity transportability, the requirement to establish situational similarity between the new and prior situations is understandably more stringent than in validity generalization.
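The comparison of observed variance in validity coefficients with the variance expected from statistical artifacts can be sketched as follows. This is a simplified illustration that assumes a "bare-bones" analysis in which sampling error is the only artifact considered; other artifacts such as criterion unreliability or range restriction are ignored, and the study results are hypothetical.

```python
def bare_bones_summary(studies):
    """studies: list of (observed validity coefficient, sample size) pairs."""
    total_n = sum(n for _, n in studies)
    # Sample-size-weighted mean validity.
    mean_r = sum(r * n for r, n in studies) / total_n
    # Sample-size-weighted variance of the observed validities.
    observed_var = sum(n * (r - mean_r) ** 2 for r, n in studies) / total_n
    # Variance expected from sampling error alone, using the average sample size.
    mean_n = total_n / len(studies)
    sampling_var = (1 - mean_r ** 2) ** 2 / (mean_n - 1)
    # Variance left over after sampling error is removed (never below zero).
    residual_var = max(0.0, observed_var - sampling_var)
    return mean_r, observed_var, sampling_var, residual_var

# Hypothetical validity studies for one job family and one test type.
studies = [(0.20, 60), (0.35, 120), (0.28, 90), (0.15, 50), (0.32, 150)]

mean_r, observed_var, sampling_var, residual_var = bare_bones_summary(studies)
print(f"weighted mean validity:  {mean_r:.3f}")
print(f"observed variance:       {observed_var:.5f}")
print(f"sampling-error variance: {sampling_var:.5f}")
print(f"residual variance:       {residual_var:.5f}")
```

When the residual variance is small relative to the observed variance, the differences among the observed coefficients are largely attributable to artifacts, which supports generalization to similar job and test types.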
Guideline 1.1
Evidence of validity should be presented for the major types of inference for which use of the test is recommended.
Guideline 1.2
Statements about validity should refer to the validity of particular interpretations or of particular types of decisions. [Comment: It is incorrect to use the unqualified phrase "the validity of the test". No test is valid for all purposes or in all situations].
Guideline 1.5
The composition of the validation sample should be described in as much detail as practicable. Available data on selective factors that might reasonably be expected to influence validity should be described. [Comment: For example, if a validity study's subjects are patients, the diagnoses of the patients could be reported and the severity of the diagnosed condition stated when feasible. For tests used in educational settings, relevant information may include community characteristics or relevant selection policies as well as the gender and ethnic composition of the sample].
Guideline 1.6
When content-related evidence serves as a significant demonstration of validity for a particular test use, a clear definition of the universe represented, its relevance to the proposed test use, and the procedures followed in generating test content to represent that universe should be described. When the content sampling is intended to reflect criticality rather than representativeness, the rationale for the relative emphasis given to critical factors in the universe should also be carefully described.
Guideline 1.7
When subject-matter experts have been asked to judge whether items are an appropriate sample of a universe or are correctly scored, or when criteria are composed of rater judgements, the relevant training, experience and qualification of the experts should be described. Any procedure used to obtain consensus among judges as to the appropriate specifications of the universe and the representativeness of the samples for the intended objective(s) should also be described.
Guideline 1.9
When a test is proposed as a measure of a construct, evidence should be presented to show that the score is more closely related to that construct measured by different methods than it is to substantially different constructs.
Guideline 1.10
Construct-related evidence of validity should demonstrate that the test scores are more closely associated with variables of theoretical interest than they are with variables not included in the theoretical network.
Guideline 1.11
A report of a criterion-related validation study should provide a description of the sample and the statistical analysis used to determine the degree of predictive accuracy. Basic statistics should include numbers of cases (and the reasons for any eliminated cases), measures of central tendency and variability, and a description of any marked tendency toward nonnormality of distribution.
Guideline 1.12
All criterion measures should be accurately described, and the rationale for their choice as relevant criteria should be made explicit. [Comment: When appropriate, attention should be drawn to significant aspects of performance that the criterion measure does not reflect].
Reliability refers to the degree to which a test score is free from errors of measurement. A test taker may perform differently on one occasion than on another for reasons unrelated to the purpose of measurement. A person may try harder, be more fatigued or anxious, have greater familiarity with the content of questions on one test form but not on another, or simply guess correctly on more questions on one occasion than on another. For these and other reasons, a person's score will not be perfectly consistent from one occasion to the next. Indeed, an individual's scores on two forms of a test that are intended to be interchangeable will rarely be precisely the same. Even the most careful matching of item content and difficulty on two forms of a test cannot ensure that an individual who knows the answer to a particular question on Form A will know the answer to a matched counterpart on Form B. Differences between scores from one form to another or from one occasion to another are attributable to what are commonly called errors of measurement. Such errors reduce the reliability (and therefore the generalizability) of the score obtained for a person from a single measurement. The magnitude of the error notwithstanding, however, the importance of a particular source of error depends on the specific use of a test.
Fundamental to the proper evaluation of a test are the identification of major sources of measurement error, the size of the errors resulting from these sources, the indication of the degree of reliability to be expected between pairs of scores under particular circumstances, and the generalizability of results across items, forms, raters, administrations, and other measurement facets.
Typically, test developers and publishers have the primary responsibility for obtaining and reporting evidence concerning reliability and errors of measurement adequate for the intended uses. The typical user generally will not conduct separate reliability studies. Users do have a responsibility, however, for determining that the available information regarding reliability and measurement error is relevant to their intended uses and interpretations and, in the absence of such information, for providing the necessary evidence.
"Reliability coefficient" is a generic term. Different reliability coefficients and estimates of components of measurement error can be based on various types of evidence; each type of evidence suggests a different meaning. A reliability coefficient based on the relationship between alternate forms of a test administered on two separate occasions is affected by several sources of error, including random response variability, changes in the individuals taking the tests, differences in the content of the forms, and differences in administration. On the other hand, analyses of part scores or item scores from a single administration of a test do not give information on response variability due to the latter three sources.
It is essential, therefore, that the method used to estimate reliability takes into account those sources of error of greatest concern for a particular use and interpretation of a test. Not all sources of error are expected to be relevant for a given test. Thus, the estimation of clearly labeled components of observed and error score variance is the most informative outcome of a reliability study, both for the test developer wishing to improve the reliability of an instrument and for the user desiring to interpret test scores in particular circumstances with maximum understanding. The reporting of standard errors, confidence intervals or other measures of imprecision of estimates is also helpful.
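A standard error of measurement and an approximate confidence band around an observed score can be computed from a reliability estimate and the score standard deviation. The sketch below uses the classical relation SEM = SD * sqrt(1 - reliability) and assumes normally distributed errors for the 95% band; the numerical values are hypothetical.

```python
from math import sqrt

def standard_error_of_measurement(score_sd, reliability):
    """Classical test theory SEM: SD * sqrt(1 - reliability)."""
    return score_sd * sqrt(1.0 - reliability)

def score_band(observed_score, sem, z=1.96):
    """Approximate 95% band around an observed score (assumes normal errors)."""
    return observed_score - z * sem, observed_score + z * sem

# Hypothetical values: score SD of 15, reliability estimate of .90.
sem = standard_error_of_measurement(score_sd=15.0, reliability=0.90)
low, high = score_band(observed_score=110.0, sem=sem)

print(f"SEM = {sem:.2f}")
print(f"approximate 95% band for a score of 110: {low:.1f} to {high:.1f}")
```

The band applies only to the sources of error reflected in the reliability estimate that is used, so the type of estimate should be stated alongside it.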
Estimates of the reliability of a test might consider not only the relevant sources of error but also the types of decisions anticipated to be based on the test scores and the expected level of aggregation of the test scores (individual versus groups of test takers). For example, tests are sometimes used as the primary basis for making dichotomous decisions. In testing to determine certification for successful completion of a course of study, the primary interest is in the decision. Of course, there may be more than two categories, but the pass-fail or mastery-nonmastery decision is common.
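When the primary interest is in a pass-fail decision, one way to examine reliability is to ask how consistently two administrations classify the same individuals. The sketch below assumes two interchangeable forms, a hypothetical cut score of 60, and hypothetical scores; it reports the proportion of consistent decisions and Cohen's kappa, which adjusts that proportion for chance agreement. The Guidelines do not prescribe these particular indices.

```python
def decision_consistency(scores_a, scores_b, cut_score):
    """Proportion of consistent pass-fail decisions and Cohen's kappa."""
    n = len(scores_a)
    pass_a = [s >= cut_score for s in scores_a]
    pass_b = [s >= cut_score for s in scores_b]
    # Observed proportion of test takers classified the same way on both forms.
    agreement = sum(1 for a, b in zip(pass_a, pass_b) if a == b) / n
    # Agreement expected by chance from the two pass rates.
    rate_a, rate_b = sum(pass_a) / n, sum(pass_b) / n
    chance = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    # Kappa is undefined when chance agreement is 1 (everyone passes or fails on both forms).
    kappa = (agreement - chance) / (1 - chance)
    return agreement, kappa

# Hypothetical scores for ten test takers on two forms.
form_a = [55, 62, 71, 48, 66, 59, 80, 61, 45, 73]
form_b = [58, 65, 69, 50, 57, 63, 78, 60, 47, 75]

agreement, kappa = decision_consistency(form_a, form_b, cut_score=60)
print(f"proportion of consistent decisions: {agreement:.2f}")
print(f"kappa: {kappa:.2f}")
```

Inconsistent decisions tend to be concentrated among scores near the cut, so precision in that region matters more for such uses than precision across the whole score range.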
Guideline 2.1
For each score, subscore, or combination of scores that is reported and interpreted, estimates of relevant reliabilities and standard errors of measurement should be provided in adequate detail to enable the test user to judge whether scores are sufficiently accurate for the intended use of the test.
Guideline 2.2
The procedures used to obtain samples of individuals, groups, or observations for estimating reliabilities and standard errors of measurement and the nature of the population involved should be described. The number of individuals in each sample used to obtain the estimates and score means and standard deviations should also be reported.
Guideline 2.3
Each method of estimating a reliability that is reported should be defined clearly and expressed in terms of variance components, correlation coefficients, standard errors of measurement, or equivalent statistics. The conditions under which the reliability estimate was obtained and the situations to which it may be applicable should be clearly explained. [Comment: Because there are many ways of estimating reliability, each influenced by different sources of measurement error, it is unacceptable to say simply, "The reliability of test X is .90". A better statement would be, "Based on the correlation between alternate test forms A and C administered on successive days to a sample of 200 freshman medical students, the alternate form reliability is estimated to be .90, with an approximate 95% confidence interval of (.82 - .98)"].
Guideline 2.5
Estimates of reliability based on alternate forms of a test administered to the same sample of individuals on two separate occasions should indicate the order of administration of forms and the interval between administrations as well as a rationale for the interval chosen. Means and standard deviations of both forms should be provided as well as standard errors of measurement and the estimate of the alternate-form reliability.
Guideline 2.6
Coefficients of reliability (internal consistency, alternate-form, or estimates of stability over time) should not be interpreted as substitutes for one another.
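As an illustration of an internal-consistency coefficient, the sketch below computes coefficient alpha from item scores obtained in a single administration. Coefficient alpha is assumed here as one familiar example of this class; the item scores are hypothetical. Because the data come from one occasion and one form, the result carries no information about stability over time or differences between forms, which is one reason it cannot stand in for the other coefficients.

```python
from statistics import variance

def coefficient_alpha(item_scores):
    """Coefficient alpha; item_scores holds one list of item scores per person."""
    n_items = len(item_scores[0])
    # Sample variance of each item across persons.
    item_vars = [variance([person[i] for person in item_scores])
                 for i in range(n_items)]
    # Sample variance of the total scores.
    total_var = variance([sum(person) for person in item_scores])
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical responses of five persons to four items.
responses = [
    [3, 4, 3, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
    [4, 4, 4, 3],
]

alpha = coefficient_alpha(responses)
print(f"coefficient alpha = {alpha:.3f}")
```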
Guideline 2.8
Where judgemental processes enter into the scoring of a test, evidence on the degree of agreement between independent scorings should be provided. If such evidence has not yet been provided, attention should be drawn to scoring variations as a possible significant source of error of measurement. [Comment: Variance component analyses are especially helpful for judgementally scored tests; they provide separate variance estimates for questions, raters, scales used in the rating process, and time allowance, for example].
Guideline 2.9
When reported scores are derived from models based on item response theory, evidence should be available regarding the degree to which the item response curves, defined by the estimated item parameters, fit the observed data. (Secondary)