EXTERNAL VALIDITY



This discussion of internal and external validity provides the opportunity to contrast random assignment with random selection. These sound like one and the same thing, but in practice they are quite different. Random assignment is the act of assigning subjects in a random fashion to experimental and control groups. The subjects--which constitute the entire sample--could themselves have been selected randomly or by other means (e.g., out of convenience or by way of matching). Random selection is the random process of selecting the sample from a larger, defined population.
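The distinction can be made concrete with a minimal sketch. Everything here is hypothetical: the population of 1,000 subject IDs, the sample size of 40, and the helper names `random_selection` and `random_assignment` are illustrative assumptions, not part of any particular study.

```python
import random

def random_selection(population, n, rng):
    """Draw n subjects at random, without replacement, from the population."""
    return rng.sample(population, n)

def random_assignment(sample, rng):
    """Shuffle a copy of the sample and split it into two halves."""
    shuffled = list(sample)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

rng = random.Random(42)          # fixed seed so the sketch is repeatable
population = list(range(1000))   # hypothetical defined population of subject IDs

# Random selection: the sample itself is drawn at random from the population.
random_sample = random_selection(population, 40, rng)

# Convenience selection: simply take whoever is easiest to reach.
convenience_sample = population[:40]

# Random assignment can be applied to either kind of sample.
experimental, control = random_assignment(random_sample, rng)
conv_experimental, conv_control = random_assignment(convenience_sample, rng)

print(len(experimental), len(control))
```

Note that `random_assignment` works the same way on both samples; how the sample was obtained is a separate question from how it is split into groups.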

Recall that random assignment is an attempt to ensure initial equivalence among the experimental and control groups. This is important in order to "achieve" internal validity, since researchers want to be able to attribute changes in the dependent variable to the treatment and not to subject characteristics (which might be the case with nonequivalent groups). That said, random assignment does not guarantee group equivalence; it can only ensure the absence of systematic bias in group composition.
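A small simulation can illustrate the difference between equivalence and the absence of systematic bias. This is only a sketch under assumed conditions: the 20 pretest scores are randomly generated (mean 100, SD 15), and the function name `group_difference` is hypothetical.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical pretest scores for a sample of 20 subjects.
scores = [rng.gauss(100, 15) for _ in range(20)]

def group_difference(scores, rng):
    """Randomly assign scores to two equal groups; return mean(exp) - mean(ctl)."""
    shuffled = list(scores)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return statistics.mean(shuffled[:half]) - statistics.mean(shuffled[half:])

# Repeat the random assignment many times on the same 20 subjects.
differences = [group_difference(scores, rng) for _ in range(5000)]

largest_gap = max(abs(d) for d in differences)
average_gap = statistics.mean(differences)

print(f"largest gap from a single assignment: {largest_gap:.1f}")
print(f"average gap across all assignments:   {average_gap:.2f}")
```

Any one assignment can leave the groups noticeably unequal on the pretest, but the gaps favor neither group on average. That is the sense in which random assignment removes systematic bias without promising equivalence.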

Simply put, random selection improves external validity while random assignment improves internal validity.
 
 

[Figure: random assignment --> INTERNAL VALIDITY; random selection --> EXTERNAL VALIDITY]

Random selection allows the use of inferences based on statistics and probability to generalize to the population from which the sample was randomly drawn. In fact, non-random selection precludes the use of statistical methods of inference. In such situations the researcher or research consumer must decide how far the findings generalize; that is, she must make the judgment about the external validity of the research findings.

Often a sample is drawn from what is called the accessible population. Accessible populations are so named because they are conveniently accessed by the researcher. Greater access to the population usually permits tighter control over the experiment. But the researcher is typically interested in generalizing her findings to a larger population--beyond that of her accessible population. This larger population is called the target population. Statistical methods cannot be used as a basis for generalizing from the accessible population to this target population because inferential statistics rely on the laws of probability, which rely on random selection. Instead, the generalization is based on judgment and logical reasoning. See the figure below.
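The statistical half of this argument can be sketched as follows. The data are invented: a hypothetical accessible population of 2,000 scores, a random sample of 100, and an approximate 95% confidence interval built with the normal critical value 1.96. It is an illustration of the idea, not a recommended analysis.

```python
import math
import random
import statistics

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical accessible population: scores for 2,000 students in a district.
accessible = [rng.gauss(70, 10) for _ in range(2000)]

# Random selection from the accessible population.
sample = rng.sample(accessible, 100)

mean = statistics.mean(sample)
std_err = statistics.stdev(sample) / math.sqrt(len(sample))

# Approximate 95% confidence interval for the accessible-population mean.
ci_low = mean - 1.96 * std_err
ci_high = mean + 1.96 * std_err

print(f"sample mean: {mean:.1f}")
print(f"approximate 95% CI: {ci_low:.1f} to {ci_high:.1f}")
```

The interval describes the accessible population only, because that is the population the sample was randomly drawn from. Nothing in the calculation says anything about the target population; that step still rests on judgment.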

Let's talk External Validity.

The intent of almost all experimenters is to generalize their findings to some group of subjects and set of conditions that are not included in the experiment. To the extent that the results of an experiment can be generalized to different subjects, settings, and experimenters, the experiment possesses external validity.

There are two types of external validity: population validity and ecological validity. Population validity refers to the generalizability of findings from the sample to a population. Ecological validity involves generalizing findings to another setting.

Questions surrounding population validity include...

  1. Do results of the study generalize to some population of individuals?
  2. Do results apply to just the sample or to a broader group?

Questions concerning ecological validity include...

  1. Are treatment effects dependent to some extent on the use of certain audio-visual aids?
  2. Is the physical setting (size, shape of room, temperature) a factor in the treatment effects?
  3. Are the treatment effects independent of the time of day?

In addition to generalizing results to a population of persons (population validity), the researcher wishes to say the same effect will obtain under other environmental conditions (ecological validity). The researcher is hopeful that the experimental effect is independent of the experimental environment.

Just as there were several potential threats to the internal validity of an experiment, so too are there many possible threats to external validity. Threats to external validity are factors that cause the effects of a treatment to be specific to some limited population of people or set of conditions.

The Threats to External Validity that follow are presented in terms of population and ecological validity.
 



Threats to External Validity


 


Population External Validity...

Noncomparability Threat

This threat occurs when the results do not generalize well from the accessible population to the target population. This, in essence, defines the absence of external validity. Consider the following example.

sample --> accessible population: use statistical inference (if random selection)
accessible population --> target population: use logical inference

SAMPLE: 4 classrooms in the district
ACCESSIBLE: district population
TARGET: state population

If in the example above the researcher chose or had available as her accessible population the entire state, then a random selection (RS) from this accessible population would more closely represent her target population. BUT, there is a tradeoff: the researcher would then have to manage the experiment over a larger area and may lose control (protocols, implementation, etc.). It would take considerable resources to ensure the treatments were being administered properly--to ensure treatment fidelity. Faced with such a tradeoff, many would lean toward tighter control of the experiment and accept greater uncertainty regarding its generalizability to the target population.
 

Interaction of Subject Characteristics and Treatment

This occurs when the effect of the treatment is limited to individuals with certain subject characteristics.

Let's say that advance organizers enhance reading comprehension among children in 4th grade and above (but not among 3rd graders and below). Or that advance organizers enhance reading comprehension more so among individuals with high ability. The treatment will not work as well when applied to populations that don't possess these critical characteristics. These are examples where the subject characteristics determine the extent of generalizability.
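The grade-level example can be sketched as a simulation. All numbers are invented for illustration: a baseline comprehension score of 50, a treatment boost of 8 points that applies only from 4th grade up, and random noise with SD 5. The function names are hypothetical.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the sketch is repeatable

def simulate_student(grade, treated, rng):
    """Hypothetical comprehension score; the +8 boost applies only from grade 4 up."""
    boost = 8 if (treated and grade >= 4) else 0
    return 50 + boost + rng.gauss(0, 5)

def estimated_effect(grade, rng, n=200):
    """Difference in mean scores between n treated and n untreated students."""
    treated = [simulate_student(grade, True, rng) for _ in range(n)]
    control = [simulate_student(grade, False, rng) for _ in range(n)]
    return statistics.mean(treated) - statistics.mean(control)

effects = {grade: estimated_effect(grade, rng) for grade in (2, 3, 4, 5)}

for grade, effect in effects.items():
    print(f"grade {grade}: estimated effect {effect:+.1f}")
```

A study run only on 4th and 5th graders would report a clear effect, and nothing in its data would warn that the effect disappears in the lower grades. The limit on generalizability only shows up when the subject characteristic is varied.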
 
 

Ecological External Validity...

Noncomparability of Research Setting to Natural Setting

Here is a Catch-22: researchers are trying to control for extraneous variables by isolating or controlling for the environment (or research setting), but this limits the generalizability to more realistic settings. It's a tradeoff between internal validity and external validity. An appropriate approach may be to first look for an effect in a laboratory setting and then see if the same treatment works in a more natural environment.
 

Demand Characteristics

Demand characteristics are cues in the research setting that alert subjects that they are involved in a study or that suggest the purposes of the study (special rooms, white coats, special measuring devices, being pulled out of the classroom). If these characteristics have an influence on the results, obviously the results do not generalize very well beyond the culture of the experiment. These demand influences won't exist if the treatment is applied in a non-experimental situation (which is what generalization is all about).

One way to mitigate this threat is to limit communication between the subjects and the researcher.
 

Hawthorne Effect

The Hawthorne Effect (H.E.) is a type of demand characteristic (e.g., a feeling of special treatment).

Origin of H.E.: Hawthorne Plant of Western Electric Co.

The H.E. can be mistaken for the treatment effect. The results of a study would not generalize to the natural setting where no study is being conducted and the clients have no reason to feel specially treated.

The H.E. can also characterize a participant who is motivated by a high regard for science or is motivated by social desirability and wants to respond in the expected manner.
 

Novelty & Disruption Effects

In short, new programs or new interventions lead to high morale and enthusiasm. While this may be a desired response, it is usually fleeting. The novelty of programs threatens external validity because results would be unlikely to replicate if the study were repeated the next year (after the novelty wears off). The generalizability of the treatment effect is limited in that it was really due to the novelty of the program rather than the program itself.

The disruption effect occurs when the novelty creates not high morale, but problems. When service providers are unfamiliar with a new program they may be poorly trained and use it ineffectively. This is common in the early stages of a new program (where evaluation reflects poor results). Beware of evaluations that are critical in the early going--they could be more indicative of start-up problems than of the worth of the program. Some have conjectured that the novelty and disruption effects could cancel each other out.

To diminish these influences, conduct the study after the program has been in place for a while.
 

Experimenter Effects

The effect found in a study may not generalize beyond experimenters with similar characteristics. For instance, the original experimenters could be highly motivated, skilled communicators, very organized, or of a particular gender. Replicating the "experiment" or intervention via a facilitator who is lacking these characteristics may produce different results.

The experimenter can also bias the findings in other more direct ways. For instance, an experimenter who anticipates certain findings may find them when they aren't really there. Some have suggested that the experimenter should be another independent variable in the study because of the potential influence on the findings.
 

Task Effects

The treatment effect may depend on the particular tasks used in the experiment. For example, in a study analyzing the effect of text embedded with pictures on how well young students understand stories, it could be the case that the story itself, and not the pictures, facilitates student comprehension. The story is considered one of the "tasks" of the experiment. Other tasks include the service provider, the setting, etc.
 

Definition of Independent Variable

Ambiguous or loosely defined independent variables are a threat to external validity.
Let's say that a study found individualized instruction to be of great benefit to children. What does individualized instruction actually entail and how is it defined? Obviously, it can be defined many ways, so look for an explicit description of independent variables.

Clear descriptions of procedures and variables are necessary if replication by other researchers is to occur. Clear description is also important to the consumer of research, who must judge the generalizability to other settings/situations.
 

Placebo Effects

A placebo is a treatment that in theory should not influence the dependent variable. If I were trying to determine the effects of vitamin C on reducing illness and I divided the class into two groups (experimental and control), and gave the experimental group OJ and the control group water, you wouldn't expect the water to do much. But if the water did do something (increased hydration, made subjects feel good that I cared enough to give them water), then there might be a placebo effect.

Here's another scenario. A large group of subjects who suffer from agoraphobia are assigned to one of three treatment groups: the experimental treatment group, the placebo group, and the control group. The treatment group receives psychotherapy for one hour a week for several weeks. The placebo group meets for the same amount of time but doesn't receive actual psychotherapeutic treatment--just general consolation and attention. The control group doesn't meet and doesn't receive any attention. In this design, the placebo group allows the researcher to determine the effects of meeting vs. meeting with psychotherapy (vs. no meeting). To the extent that the placebo sessions have an effect on subjects (as measured by the dependent variable) there is a placebo effect. The goal of the researcher is to isolate the treatment effect from the placebo effect so that he can form generalizable knowledge about the treatment.
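The three-group comparison amounts to simple arithmetic on group means. The scores below are invented for illustration (a hypothetical anxiety measure where lower is better), and the variable names are assumptions rather than standard terminology.

```python
import statistics

# Hypothetical post-study anxiety scores (lower is better), invented for illustration.
scores = {
    "treatment": [12, 15, 11, 14, 13],  # weekly psychotherapy
    "placebo": [18, 20, 17, 19, 21],    # weekly meetings, attention only
    "control": [24, 22, 25, 23, 26],    # no meetings
}

means = {group: statistics.mean(values) for group, values in scores.items()}

# Improvement from meeting and attention alone.
placebo_effect = means["control"] - means["placebo"]

# Additional improvement from psychotherapy beyond meeting and attention.
treatment_effect = means["placebo"] - means["treatment"]

# Total improvement of the treatment group over the control group.
total_change = means["control"] - means["treatment"]

print(f"placebo effect:   {placebo_effect}")
print(f"treatment effect: {treatment_effect}")
print(f"total change:     {total_change}")
```

Without the placebo group, only the total change would be observable, and the part of it that is due to attention alone would be credited to the psychotherapy.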
 
 

Definition of Dependent Variable

Example: If we were interested in the extent to which children reach their academic potential as students, we could ask teachers to rate their pupils on this variable. Teacher ratings are only one way of measuring this construct. Another method could involve asking the students directly. Altering the method of measuring the dependent variable--even slightly--could greatly influence the research findings. With one measurement you may find an effect, with another you may not.

Example: Let's say you hypothesized that creating a more powerful student council in a school would enhance student empowerment. You found that when the dependent variable (i.e., student empowerment) was measured by a self-report instrument (forced-choice format), it did not improve. Would you still find the same results if you had measured student empowerment via 30-minute in-depth focus groups?
 

Interaction of Treatment and Time of Measurement

Basically the question is: did the treatment have a lasting effect or just an effect in the short term? For example, rote memorization may be shown to significantly improve recall in the short run, but the improvement may be absent beyond three weeks' time. Try using follow-up measures to gauge the longevity of an effect.