Variable Measurement in Research

Craig Scanlan, EdD, RRT, FAARC

Measurement is the process of collecting and recording data. In research studies, variables typically are measured by (1) instrumentation (e.g., goniometers, pressure transducers, electrodes, imaging systems, lab tests), (2) observation, or (3) administration of surveys or tests. Quality research depends on quality data. Quality data are objective, reliable and valid.


Objectivity

An objective data collection tool provides measurements that are uninfluenced or undistorted by the beliefs or biases of the researcher who applies it. Most instrumentation provides objective data (e.g., a patient's measured height is not likely to be distorted by the researcher). However, data collected by observation, surveys or tests can lack objectivity (e.g., differences in the interpretation of human behaviors or in essay exam grading).


Reliability

Reliability is the consistency with which a data collection tool measures whatever it is measuring. A reliable tool provides consistent or repeatable measurement over time and/or between researchers. For example, let's assume that you plan to use a scale to measure your patients' weights. You should expect to obtain essentially the same readings if (1) you immediately repeat the measurement or (2) another person reads the scale. If there were unreliability or inconsistency in the weighing process, you would not know which reading was correct or which one to trust. In order for research measurements to be meaningful, they must first be reliable.

In most instances where purely physical measurement is used, you can be fairly confident that the measurement device is yielding reliable or repeatable data. This is because most measurement devices used in research provide a high degree of precision, or freedom from random error. However, just because an instrument is precise, don't assume it is accurate. To determine instrument accuracy, you must always compare its measurement to a known or standard value, a process often referred to as calibration.
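The difference between precision and accuracy can be illustrated with a minimal Python sketch. It assumes a scale is checked by repeatedly weighing a reference mass of known value. The function name, the 50.0 kg reference mass and the readings are hypothetical and serve only as an example.

```python
from statistics import mean, stdev

def precision_and_bias(readings, reference_value):
    """Summarize repeated readings of a known standard.

    spread: sample standard deviation of the readings
            (random error; a small spread means high precision)
    bias:   mean reading minus the reference value
            (systematic error; a bias near zero means high accuracy)
    """
    spread = stdev(readings)
    bias = mean(readings) - reference_value
    return spread, bias

# Hypothetical data: the same 50.0 kg reference mass weighed five times.
readings = [51.2, 51.3, 51.2, 51.1, 51.2]
spread, bias = precision_and_bias(readings, reference_value=50.0)
print(f"spread = {spread:.2f} kg, bias = {bias:.2f} kg")
```

In this made-up example the readings cluster tightly (the scale is precise), yet they sit more than a kilogram above the known value (the scale is not accurate). Only the comparison against the standard reveals the second problem.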

Observations, surveys and tests should also be as reliable as possible. One of several methods can be used to assess the reliability of these measurement tools. For observational measurement we are usually interested in the consistency with which the observer or rater gathers data (intra- and inter-rater reliability). For surveys and psychological tests, we usually assess one of the following: test-retest reliability, alternate forms reliability, or internal consistency.

Intra-rater reliability represents the consistency of repeated measurements made by the same observer. Inter-rater reliability is the consistency of measurements made by two or more observers. In general, the reliability of observational assessment depends on the training and skills of the raters.
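As an illustration, agreement between two raters can be quantified in a few lines of Python. This discussion does not name a particular statistic, so the choice here is an assumption: simple percent agreement, plus Cohen's kappa, a commonly used index that adjusts agreement for chance. The function names and the ratings are hypothetical.

```python
from collections import Counter

def percent_agreement(rater_a, rater_b):
    """Proportion of subjects on which the two raters gave the same rating."""
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement for two raters using categorical ratings.

    Undefined (division by zero) when chance agreement equals 1, for
    example when both raters use a single category for every subject.
    """
    n = len(rater_a)
    p_observed = percent_agreement(rater_a, rater_b)
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    categories = set(counts_a) | set(counts_b)
    # Agreement expected by chance, from each rater's category proportions.
    p_chance = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical data: two observers rate the same 10 patients.
rater_a = ["yes", "yes", "no", "no", "yes", "no", "yes", "yes", "no", "no"]
rater_b = ["yes", "yes", "no", "yes", "yes", "no", "yes", "no", "no", "no"]
print(f"agreement = {percent_agreement(rater_a, rater_b):.2f}")
print(f"kappa     = {cohens_kappa(rater_a, rater_b):.2f}")
```

The same functions could be applied to intra-rater reliability by passing two sets of ratings made by one observer on separate occasions.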

Test-retest reliability is the repeatability of survey or test results when administered twice to the same subjects (the two sets of scores are correlated to evaluate the consistency of results). Alternate forms reliability is similar to test-retest reliability, except that two similar, but not identical, instruments are administered to the same subjects. When using two measurements with the same or similar instruments is difficult or impossible, we usually assess the internal consistency of the survey or test. This is done in one of two ways. In the first method, the instrument is divided into two equivalent halves, with the reliability coefficient computed by correlating the subjects' performance on each half (also called split-halves reliability). Alternatively, item statistics from every individual question are used to estimate internal consistency (the KR-20 formula or Cronbach's alpha).
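The reliability coefficients described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a validated psychometric routine: the function names and all scores are hypothetical, the halves are formed from odd- and even-numbered items (one of several possible splits), and Pearson's r is assumed as the correlation.

```python
from statistics import mean, pvariance

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores.

    Undefined (division by zero) if either list has no variation.
    """
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5

def split_half_r(item_scores):
    """Correlate each subject's score on odd items with the score on even items."""
    odd_half = [sum(row[0::2]) for row in item_scores]
    even_half = [sum(row[1::2]) for row in item_scores]
    return pearson_r(odd_half, even_half)

def cronbach_alpha(item_scores):
    """Cronbach's alpha from a subjects-by-items table of scores.

    For items scored 0/1 this computation reduces to the KR-20 form.
    """
    k = len(item_scores[0])
    item_variances = [pvariance([row[i] for row in item_scores]) for i in range(k)]
    total_variance = pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Hypothetical test-retest data: five subjects tested on two occasions.
first_scores = [10, 12, 15, 18, 20]
second_scores = [11, 12, 14, 19, 20]
print(f"test-retest r = {pearson_r(first_scores, second_scores):.2f}")

# Hypothetical item data: five subjects (rows) by four items (columns), scored 0/1.
items = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(f"split-half r  = {split_half_r(items):.2f}")
print(f"alpha         = {cronbach_alpha(items):.2f}")
```

Alternate forms reliability would use the same pearson_r call, with the second list holding scores from the alternate form rather than from a repeat administration.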


Validity

It is not enough that a measurement tool be reliable in providing data. It should also provide valid or meaningful measurements. Validity is the extent to which an instrument measures what it is supposed to measure. Although reliability and validity are different concepts, they are related. In general, a measurement tool that is unreliable (does not provide consistent results) cannot be considered valid.

For many measurement instruments, validity is assumed. For example, we know that a thermometer measures temperature or heat intensity and a pressure transducer measures force per unit area. Of course, we also know that using a thermometer to measure pressure would yield invalid (although potentially reliable) data.

Most concerns about validity involve test or survey measurements. Like reliability, there are different types of validity. The most common types of validity are content validity, concurrent validity, predictive validity and construct validity.

Content validity is evaluated by examining a test or survey to see if the content included is representative of what is being measured. For example, a test of students' clinical problem-solving skills that included mainly questions requiring the recall of simple facts would rate low on content validity.

Concurrent validity is evaluated by comparing one tool's results with those on another well-established measure of the same characteristic. For example, we could assess the concurrent validity of our test of clinical problem-solving skills by comparing it to a standardized measure of this ability, such as the Watson-Glaser Critical Thinking Appraisal. We would expect the results from the two measures to be similar if they are both assessing the same ability.

Predictive validity, as the name implies, is the degree to which a measurement tool can predict the performance or behavior of subjects on some future criterion. This type of validity is assessed by correlating the results of the measure to a measure of future behavior. For example, we might want to know if our students' clinical problem-solving skills scores predict subsequent problem-solving ability on the job.

Construct validity is the degree to which an instrument measures some abstract or nonobservable construct, such as anxiety or motivation. Construct validity normally is assessed by logical analysis and the testing of hypothesized relationships among variables. For example, were you developing an instrument to measure patient anxiety, you would first identify behaviors or responses logically consistent with the presence or absence of anxiety. These behaviors might be the criteria with which you would assess validity. If a subject had a score that indicated high anxiety, you would expect certain behaviors; other behaviors would be hypothesized for the subject with low anxiety scores. By observing or measuring the behavior of subjects, you could determine whether or not your instrument was indeed measuring anxiety.