Patient Self-Reported Outcomes: Reliability and Validity White Paper
Focus On Therapeutic Outcomes, Inc.
May 17, 2005
Collection of outcomes following clinical treatment is now commonplace in
rehabilitation. (Hart 2002) Clinicians use outcomes to track changes in
their patients to assess if the patient is improving with a specific
treatment, (Jette & Delitto 1997) if treatment needs to be changed or
terminated, (Jette & Jette 1997) and if the patient needs to be
referred to another clinician or service. (Jette & Jette 1997)
Administrators use outcomes to compare their department’s success with
other similar departments, to market the department’s services, to
manage resources required to deliver their clinical services, and to
manage their clinicians. (Marino 1997) The federal government has mandated
the collection of outcomes for post-acute rehabilitation in skilled
nursing facilities, nursing homes and inpatient rehabilitation hospitals,
and the government has directed the development of patient assessment
instruments designed to collect outcomes. (Johnson 2001) However, debate
continues concerning what to measure for outcomes and how to measure the
outcome. This paper discusses the collection of patient self-reported
outcomes with particular attention to reliability and validity of the
measures.
The Institute of Medicine (IOM) (2001) examined the American health care
delivery system and determined the “system is in need of fundamental
change”. (p 1) The IOM identified that Americans want and deserve
quality clinical services, but the current system does not routinely
collect measures of quality and therefore cannot assess quality. So, the
IOM proposed six goals for improvement of the 21st century
health care system. Among the goals, health care should be effective
and patient-centered. Effective was operationally defined as
“providing services based on scientific knowledge to all who could
benefit and refraining from providing services to those not likely to
benefit (avoiding underuse and overuse, respectively).” (p 6)
Patient-centered was operationally defined as “providing care that is
respectful of and responsive to individual patient preferences, needs, and
values and ensuring that patient values guide all clinical decisions.”
(p 6) Effective, patient-centered care needs to be measured, so
researchers and clinicians can determine, for a specific patient at a
specific time and intensity, which treatment is effective, which is a
measure of clinical quality. (Hart 2001) For the purpose of this paper,
use of patient self-report of outcomes is proposed as the best means of
obtaining the aim of collecting effective, patient-centered outcomes. (Jette
1993)
Over the past decades, “one of the more important developments in the health
care field has been the recognition of the centrality of the patient point
of view in monitoring the quality of medical care outcomes”. (Ware 1993
p 2:1; Geigle 1990) “A medical outcome has come to mean the extent to
which a change in a patient’s behavioral functioning or well-being meets
that patient’s needs or expectations.” (Ware 1993 p 2:1) The goal of
medical or rehabilitative care for most patients is the preservation of
function, (Ellwood 1988) although other outcomes, e.g. pain relief (Fairbank
1980) and improved self-efficacy, (Lorig 2001) are also
important. Although patients
are the best source of information regarding achievement of their goals of
improved function, information from patients about their improved
functional status and treatment has not been routinely collected in
previous clinical research or medical practice. (Ware 1993)
In a recent report on quality outcomes measurements in post-acute
rehabilitation facilities, (Johnson 2001) researchers acknowledged that
the current federally mandated outcomes tools for post-acute
rehabilitation facilities do not measure whether the patient’s
perception of their functional status was maximized, prepared him/her to
return to and remain in the most independent living environment, and
reintegrated him/her into prior lifestyle. The researchers further
acknowledged that measuring such important information requires input
directly from patients or their proxies as an important aspect of clinical
outcomes. Johnson et al (Johnson 2001) identified that many researchers
focused on patient-centered outcomes believe that functional outcomes
should come from patients because their perception of their function is
more important than “so-called objective measures of function”. (p 13)
Patient self-report of their functional ability represents their
perception of their ability and integrates the relevance of the functional
ability to the patient. If a patient has difficulty performing a
functional task, but the task is not relevant to the patient, the task is
likely of little importance or relevance to his or her life and needs. (p
13) If a clinician had measured the patient’s functional ability to
perform a task that was not relevant to the patient, the clinician may
attach more importance to the task or measure of functional ability than
is appropriate according to the patient.
The existing patient assessment instruments for patients in post-acute
rehabilitation facilities require facility staff to measure outcomes. The
federally mandated instruments include the Functional Independence Measure
(FIM) used in inpatient rehabilitation hospitals, Minimum Data Set (MDS)
used in skilled nursing facilities, and Outcome and Assessment Information
Set (OASIS) used in home health settings. They are all examples of
clinician-generated measures of patient function. The FIM, MDS and OASIS
use clinical staff, e.g. physical or occupational therapists or nurses,
to assess the patient’s functional abilities pertinent to sets of
activities of daily living. The FIM, MDS and OASIS do not consider the
patient’s perception of their functional health status, and therefore
fail to capture the relevance of the individual functional measures to the
patient. Because the goal of rehabilitation is to restore patients to
their previous level of function, capturing their perception of function
both before and after rehabilitation in their terms is essential to
measuring quality and effectiveness of patient-centered treatment.
(Johnson 2001)
Another complicating problem of using clinician-generated outcomes measures
is the
fact that many important clinical outcomes cannot be measured within the
time frames of the delivery of typical rehabilitation services. For
example, it may take time beyond the time when rehabilitation was provided
for a person to maximize their functional abilities, sustain their maximal
level of independence within their normal or modified living environment,
or reintegrate into their prior activities and lifestyle. (Johnson 2001)
If patients continue to improve in their functional health status,
clinician-generated outcomes will not capture the improvement unless the
patient returns to the clinician or clinical team for another assessment,
which costs money. If the patient had been assessed using a patient
self-report process through rehabilitation, the patient could complete
another survey at home using pencil and paper or Internet delivery
mechanisms to capture the change in functional health status. Collecting
patient self-report outcomes data does not require the same clinician or
staff burden as collecting clinician-generated data. Therefore, although
no studies were found comparing cost of collecting patient self-report vs.
clinician-generated outcomes data, it is reasonable to expect that patient
self-report data collection is less costly than clinician-generated measures.
When researchers and clinicians debate collection of measures of outcome, like
functional health status, discussion commonly addresses the psychometrics
of the measures. Outcomes measures need to be assessed for their
reliability, validity and responsiveness. A complete discussion of the
psychometrics of outcomes measures is beyond the scope of this paper, but
key elements are crucial to the understanding of outcomes measures.
Reliability of a measure of outcome is paramount. Reliability means the
measure is
reproducible, consistent and free from error. (Portney 2000) If the
patient’s ability is reliable, a measure of that ability should also be
reliable. No measure is free from error, so there is always a need to
assess the degree to which error influences the measure and its
interpretation. In the case of an outcomes measure, it is reasonable to
insist the measure has good test-retest reliability. Test-retest
reliability means the measure of function is almost the same when the
measure is taken twice on the same patient. Test-retest reliability
strikes at the heart of the argument common for clinicians: patient
self-report is “subjective”. In fact, no measure is ever entirely
subjective or objective. The two concepts are connected along a
measurement continuum by the measurement of reliability. When a measure
has poor reliability, the measure is subjective by definition. When a
measure has good reliability, the measure is objective by definition.
Therefore, a subjective phenomenon may be measured objectively. In other
words, the phenomenon may be either subjective or objective, and the
quality of the measurement may be either subjective or objective.
(Rothstein 1993) Take for example the patient’s perception of their
physical functional health status as measured by the SF-36 Physical
Functioning Scale (PF-10). (Ware 1993) The patient is asked the question
(one of the PF-10 items): Does your health now limit you in vigorous
activities, such as running, lifting heavy objects, participating in
strenuous sports? If so, how much? The patient responds “Yes, limited a
lot”, “Yes, limited a little”, or “No, not limited at all”.
At first glance, clinicians commonly say the patient’s responses are
“subjective”. However, the PF-10 has been studied for several types of
reliability, including test-retest. For patients with diabetes, the
correlation coefficient for repeated measures was .90, and for patients in
a general medical practice, the correlation coefficient was .81. (Ware
1993) For patients receiving outpatient rehabilitation using the FOTO
24-item functional health status instrument with items similar to the
SF-36, (Hart 2001) the test-retest intraclass correlation
coefficient (ICC(2,1)) of an overall functional health status measure was
.92. (Hart 2003) Therefore, in spite of the phenomenon (i.e. patient
self-report of their perception of their functional health status)
appearing “subjective”, the quality of the patient self-report measure
of their functional health status is indeed “objective” because the
test-retest reliability statistic for the measure was good.
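As an illustration of how a test-retest ICC(2,1) of the kind reported above
can be computed, the following is a minimal Python sketch of the two-way
random-effects, absolute-agreement, single-measure intraclass correlation.
The function name icc_2_1 and the scores are hypothetical and are not FOTO
or SF-36 data. The sketch assumes complete data, with every patient measured
on the same number of occasions.

```python
def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single measure.

    scores: list of rows, one row per patient; each row holds that
    patient's score on each administration (same length for every row).
    """
    n = len(scores)          # number of patients
    k = len(scores[0])       # number of administrations (e.g. test, retest)
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Sums of squares for patients, occasions, and residual error.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)               # between-patient mean square
    msc = ss_cols / (k - 1)               # between-occasion mean square
    mse = ss_err / ((n - 1) * (k - 1))    # residual mean square

    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical functional status scores (0-100) for five patients,
# each measured twice without intervening treatment.
test_retest = [
    [42, 45],
    [55, 53],
    [61, 64],
    [70, 69],
    [88, 85],
]
print(round(icc_2_1(test_retest), 3))
```

Because ICC(2,1) assesses absolute agreement, a systematic shift between the
first and second administration lowers the coefficient even when patients
keep the same rank order.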
Patient self-report assessments are becoming common, and most test-retest
reliability statistics support the objectivity of the measures. Some other
examples include:
Oswestry Low Back Pain Disability Questionnaire: correlation coefficient .99 (Fairbank)
Lower Extremity Functional Scale: ICC(2,1)=.94 (Binkley)
Fear-Avoidance Beliefs Questionnaire: k=.74 (Waddell)
Back Pain Functional Scale: ICC(2,1)=.88 (Stratford)
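The examples above report different statistics, so they are not directly
interchangeable. Assuming the k reported for the Fear-Avoidance Beliefs
Questionnaire denotes a kappa statistic (the source does not spell this
out), the following minimal Python sketch shows how Cohen's kappa corrects
the observed test-retest agreement on a categorical item for agreement
expected by chance. The function name and the responses are hypothetical.

```python
from collections import Counter


def cohens_kappa(first, second):
    """Chance-corrected agreement between two sets of categorical answers.

    first, second: equal-length lists holding each patient's response on
    the first and second administration of the same item.
    Undefined (division by zero) when expected agreement equals 1.
    """
    n = len(first)
    observed = sum(a == b for a, b in zip(first, second)) / n
    counts_first, counts_second = Counter(first), Counter(second)
    categories = set(first) | set(second)
    # Agreement expected if the two administrations were independent.
    expected = sum(counts_first[c] * counts_second[c] for c in categories) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical answers to one three-category item, given twice.
test = ["a lot", "a little", "not at all", "a lot", "a little", "not at all"]
retest = ["a lot", "a little", "not at all", "a little", "a little", "not at all"]
print(round(cohens_kappa(test, retest), 2))
```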
The next psychometric quality of interest is validity. “Measurement
validity concerns the extent to which an instrument measures what it is
intended to measure.” (Portney p 79) There are many forms of validity,
such as face, content, criterion-related, and construct. Validity is never
fully “proven”, but data tend to support or not support the validity
of the measure of interest as that measure was used to assess a specific
concept for a specific group of patients. Therefore, validity testing is
an ongoing process rather than a one-time determination.
“Face validity indicates that an instrument appears to test what it is
supposed to and that it is a plausible method for doing so.” (Portney p
82) Face validity is the least rigorous method of testing validity, and
there are no standards for assessing how much face validity a measurement
has. (Portney) Face validity of a rating for getting out of a chair as an
activity of daily living (ADL) could be assessed by observing a patient
get out of a chair and determining whether the patient used any assistance
from another person. It could be argued that the observation of the ADL
has face validity because the observation appears to test the activity of
getting out of a chair. Face validity plays almost no role in assessing
patient self-report of functional health status.
Content validity refers to the adequacy with which the universe of a
theoretical
domain of behaviors or characteristics related to a measure, i.e.
functional health status, are sampled by a test. (Portney) Content
validation in the field of health status is a challenge because of the
breadth of the health status domain. Health status is difficult to define,
can cover many different constructs, and there is no standard against which
the results of the content validation process can be compared. The
developers of the SF-36
(Ware 1992 & 1993; McHorney 1993 & 1994) met this challenge by
comparing the content of the SF-36 to other widely used health status
surveys and by applying factor analytic techniques to see if the
constructs covered by the items of the SF-36 were similar to the
constructs assessed by other health instruments. The items in the SF-36
and its subscales covered most of the content areas of most health
instruments. Factor analytic techniques also supported content validity of
the patient self-report SF-36. (Ware 1993)
FOTO uses 72 functional status items to assess improvement in patient
self-reported functional status. The items are separated into two
measures: one for physical functional status, and one for mental
functional status. Analyses using item response theory (IRT) techniques support
what is clinically logical: the two constructs should be kept separate for
better understanding of the patient’s functional status. Further
analyses supported the content validity of the 50-item physical functional
status item bank.
Using the Oswestry scale to assess disability in patients with low back
pain, (Fairbank)
patients self-reported their perception of their disability, and
clinicians assessed the patient’s observed disability and symptoms. The
two scores were matched over treatment time and produced the predicted
plot, thus supporting concurrent, criterion-related validity of the
patient self-report measure of disability.
“Construct validity reflects the ability of an instrument to measure an
abstract
concept, or construct.” (Portney p 87) Most constructs that are measured
are not directly observable and “exist only as concepts that are
constructed to represent an abstract trait”. (Portney p 87) Health
status is a good example. Clinicians have difficulty agreeing on the
operational definition of health status, and therefore, “the definition
of such a construct as ‘health status’ can be determined only by the
instrument used to measure it.” (Portney p 87-88) The construct of
physical functional health status has been described by applying
instruments designed to assess health status to people who received
physical or occupational therapy in outpatient facilities for the primary
goal of improvement in physical functioning. (Hart et al 2001; Hart 2001;
Hart & Wright 2002) Using the known groups method (Portney) of
construct validity, change in physical functional health status as
assessed by the FOTO 24-item health status instrument was shown to
discriminate (discriminant validity) among patients receiving workers’
compensation benefits: those who received acute work rehabilitation
versus those who received work conditioning, and those with acute versus
chronic symptoms.
(Hart 2001) Using the SF-36, patients were discriminated by their level of
severity of several medical and psychiatric conditions. (McHorney 1993)
The FOTO health status instrument (Hart 2001) and SF-36 (Ware 1993) both
demonstrated good convergent and discriminant validity by having the
physical scales correlate well with each other and the mental scales
correlate well with each other while physical and mental scales did not
correlate with each other.
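The convergent and discriminant pattern described above can be checked by
inspecting the correlations among scale scores. The following minimal
Python sketch computes Pearson correlations for every pair of scales. The
scale names and scores are hypothetical and are not drawn from the FOTO or
SF-36 studies. Support for convergent validity would appear as high
correlations within the physical pair and within the mental pair; support
for discriminant validity would appear as low correlations across them.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length score lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_table(scales):
    """Return {(name_1, name_2): r} for every pair of scales."""
    return {
        (a, b): pearson(scales[a], scales[b])
        for a, b in combinations(scales, 2)
    }


# Hypothetical scores for five patients on four scales.
scales = {
    "physical_a": [40, 55, 60, 72, 85],
    "physical_b": [42, 50, 63, 70, 88],
    "mental_a": [70, 45, 80, 52, 66],
    "mental_b": [68, 50, 77, 55, 60],
}
for pair, r in correlation_table(scales).items():
    print(pair, round(r, 2))
```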
In a recent unpublished analysis (Hart DL. Predictive validity of a patient
self-report outcomes process. Abstract submitted to the Combined Sections
Meeting of the American Physical Therapy Association for 2004.),
risk-adjusted change in physical functional health status (FHS) and visits
per patient could be successfully predicted 76% to 78% of the time (90%
confidence interval) for patients receiving physical or occupational
therapy using the 50-item FOTO physical FHS computer adaptive testing
process (unpublished; Hart DL. Computerized adaptive testing: application
in outcomes measurement. Abstract presented at the American Physical
Therapy Association Combined Sections Meeting in 2002.).
Patient self-report health status instruments require patients to have
cognitive abilities to successfully read, understand and answer each
question. When cognitive abilities are impaired, answers to patient
self-report instruments become less reliable and valid, and proxy report
of the patient’s functional health status becomes an option. However,
measures of functional health status from a proxy who is intimately aware
of the functioning of the patient with cognitive deficits are not exactly
the same as a patient’s perception of their own functional abilities
even under normal conditions. Literature to date has been equivocal at
best for support of proxy health status when the patient has cognitive
deficits. (Ostbye 1997; Pickard 1999; Murrell 1999; Andresen 1999) When
patients are cognitively normal, caregivers as proxies can provide
reliable assessments of the patient’s ability to complete ADLs, but
agreement between patient and proxy assessment of ADL or functional
ability decreases as severity of dementia (Ostbye 1997) and age (Pickard
1999) increase, or as older adults become more medically ill. (Tamim 2002)
Validation of outcomes instruments is rare for instruments designed to
assess health status of patients with cognitive deficits. So, when the
patient’s cognitive deficits are in question, use and interpretation of
patient self-report of health status should be approached with caution. FOTO
supports use of proxy FHS measures if the patient does not have a
cognitive deficit and cannot complete the survey. When the patient has a
cognitive deficit, FOTO uses measures collected from proxies to assess
change over time, but because we cannot correlate the proxy’s assessment
of the patient’s functional status to the patient’s assessment of
their own functional status, interpretation and validity of the functional
status measures for patients with cognitive deficits cannot be confirmed.
Therefore, in patients who are cognitively normal, patient self-report of
health status has a long history of reliability and validity. When
measures have good reliability, the measures are objective. Because the
measures have been supported as valid using several different statistical
techniques, the measures can be used for tracking change in health status
with many populations of patients.
The future vision of the new health care delivery system as presented by the
IOM (IOM 2001) will use patient-centered, self-report measures that, at least
in rehabilitation, will assess change in physical functional health status
at a minimum. Because researchers in health status have the mathematical
techniques (Cella 2000; Hambleton 2000; Hays 2000; Lohr 2000; McHorney
1997 & 2000; Ware 2000) and computer power, which along with the
Internet are readily available to the public as well, patient self-report
will be transformed into efficient, easy to use computerized adaptive
testing processes (CAT) (Mills 2002) that will be used any time of the day
or night, any day of the week, all year long by patients and clinicians.
CAT processes will present impairment- and ability-specific FHS items to
the patient one at a time. This represents the
epitome of patient-centered assessment. CATs will function as follows. Once
the computer knows the patient’s impairment, the computer will ask a
physical FHS item of median level difficulty pertinent to the patient’s
impairment, e.g. a patient who just has the ability to get out of bed will
not be asked about climbing stairs. If the item is too easy, i.e. the
patient is functioning at a level higher than the item describes, the
computer will ask more difficult items until an item is above the
patient’s physical FHS level. Conversely, if the item is too difficult,
i.e. the patient is functioning at a level lower than the item describes,
the computer will ask less difficult items until an item is below the
patient’s physical FHS level. In this way, CATs will reduce the burden
associated with collecting patient self-report physical FHS data while
maintaining a high level of measure precision. Another way to develop a
CAT is to ask the median level difficulty item first, then subsequent
items could be selected if they provide the maximum amount of information
(Lord 1980) given the current estimate of the patient’s functional
ability. (Hart 2005)
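The two item-selection approaches described above can be sketched in Python
as follows. This is a simplified illustration under stated assumptions, not
the FOTO implementation: item difficulties and patient ability are placed
on one scale, the simulated patient answers deterministically, and the
information function assumes a one-parameter (Rasch) model, which the text
does not specify. All function names and values are hypothetical.

```python
import math


def bracket_ability(difficulties, can_do):
    """Start at the median item, then step harder or easier until the
    patient's answer changes.

    can_do(difficulty) returns True when the patient functions at or
    above the level the item describes.
    Returns (hardest item passed, easiest item failed, items asked);
    either bound is None if the item bank ran out before the answer changed.
    """
    items = sorted(difficulties)
    index = len(items) // 2                 # median-difficulty item
    first_answer = can_do(items[index])
    asked = [items[index]]
    passed = items[index] if first_answer else None
    failed = None if first_answer else items[index]
    step = 1 if first_answer else -1        # too easy: go harder; else easier
    index += step
    while 0 <= index < len(items):
        answer = can_do(items[index])
        asked.append(items[index])
        if answer:
            passed = items[index]
        else:
            failed = items[index]
        if answer != first_answer:          # ability is now bracketed
            break
        index += step
    return passed, failed, asked


def item_information(ability, difficulty):
    """Fisher information of one item under the assumed Rasch model."""
    p = 1.0 / (1.0 + math.exp(-(ability - difficulty)))
    return p * (1.0 - p)


def next_item_max_information(ability_estimate, remaining_difficulties):
    """Pick the unasked item that is most informative at the current estimate."""
    return max(
        remaining_difficulties,
        key=lambda d: item_information(ability_estimate, d),
    )


# Hypothetical five-item bank and a simulated patient whose level is 1.5.
bank = [-2.0, -1.0, 0.0, 1.0, 2.0]
print(bracket_ability(bank, lambda d: d <= 1.5))
print(next_item_max_information(0.4, bank))
```

Under the assumed Rasch model, information is greatest for the item whose
difficulty lies closest to the current ability estimate, so the second
approach tends to keep asking items near the patient's level.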
FOTO, Inc. has several functioning CAT processes for physical functional health
status that are commercially operational in hundreds of outpatient clinics
across the country. Experience has demonstrated patients like the CAT
process better than completing paper and pencil forms or even computer
administered surveys. First, the patients like seeing one question at a
time. Second, the letter font is larger, which benefits those with visual
difficulties. Third, the computer produces an aura of credibility, which
the patients like. Fourth, patients like the shorter time, which
translates into reduced patient burden to collect the data. This is
particularly beneficial for older patients.
In summary, CAT processes will be used for patient self-report of health
status more and more. As more knowledge is gained, mathematical models
will be improved, measures will become more precise, and items will become
more impairment- and ability-specific to each patient. The process is
patient-centered, and future CATs will direct interested patients toward
more information about their condition and toward clinicians who
have track records for obtaining good outcomes treating patients with
similar conditions.
References
Andresen EM, Gravitt GW, Aydelotte ME, Podgorski CA. Limitations of the
SF-36 in a sample of nursing home residents. Age Ageing.
1999;28(6):562-566.
Binkley JM, Stratford PW, Lott SA, et al. The Lower Extremity Functional
Scale (LEFS): Scale development, measurement properties, and clinical
application. Phys Ther. 1999;79(4):371-383.
Cella D, Chang CH. A discussion of Item Response Theory and its
applications in health status assessment. Med Care. 2000;38(suppl
II):II-66-II-72.
Ellwood PM. Outcomes management: a technology of patient experience
[Shattuck Lecture]. N Engl J Med. 1988;318:1549-1556.
Fairbank JC, Couper J, Davis JB, O’Brien JP. The Oswestry low back pain
disability questionnaire. Physiotherapy. 1980;66(8):271-273.
Geigle R, Jones SB. Outcomes measurement: a report from the front. Inquiry.
1990;27:7-13.
Hambleton RK. Emergence of item response modeling in instrument development
and data analysis. Med Care. 2000;38(suppl II):II-60-II-65.
Hart DL. The power of outcomes: FOTO Industrial Outcomes Tool – Initial
assessment. Work. 2001;16:39-51.
Hart DL. Test-retest reliability of an abbreviated self-report overall
health status measure. J Orthop Sports Phys Ther. 2003;33(12):734-742.
Hart DL, Mioduski JE, Stratford PW. Simulated computerized adaptive tests
for measuring functional status were efficient with good discriminant
validity in patients with hip, knee, or foot/ankle impairments. J Clin
Epidemiol. 2005;58(6):629-638.
Hart DL, Tepper S, Lieberman D. Changes in health status for persons with
wrist or hand impairments receiving occupational or physical therapy. Am
J Occup Ther. 2001;55:68-74.
Hart DL, Wright BD. Development of an index of physical functional health
status. Arch Phys Med Rehabil. 2002;83(5):655-665.
Hays RD, Morales LS, Reise SP. Item response theory and health outcomes
measurement in the 21st century. Med Care. 2000;38(suppl II):II-28-II-42.
Institute of Medicine. Crossing The Quality Chasm. Washington, DC: National
Academy Press; 2001.
Jette AM. Using health-related quality of life measures in physical therapy
outcomes research. Phys Ther. 1993;73(8):528-537.
Jette AM, Delitto A. Physical therapy treatment choices for musculoskeletal
impairments. Phys Ther. 1997;77(2):145-154.
Jette DU, Jette AM. Professional uncertainty and treatment choices by
physical therapists. Arch Phys Med Rehabil. 1997;78:1346-1351.
Johnson M, Holthaus D, Harvell J, Coleman E, Eilertsen T, Kramer A. Medicare
Post-Acute Care: Quality Measurement Final Report. US Department of Health
and Human Services. University of Colorado Health Sciences Center.
3/29/01. http://aspe.hhs.gov/daltcp/reports/mpacqm.htm
accessed 4/15/03.
Linacre JM. A User’s Guide to WINSTEPS. Chicago, IL: MESA Press; 2005.
Lohr KN. Health outcomes methodology symposium. Summary and
recommendations. Med Care. 2000;38(suppl II):II-194-II-208.
Lord FM. Applications of Item Response Theory To Practical Testing
Problems. Hillsdale, NJ: Lawrence Erlbaum; 1980.
Lorig KR, Ritter P, Stewart A, Sobel DS, Brown BW, Bandura A, Gonzalez VM,
Laurent DD, Holman HR. Chronic disease self-management program. 2-year
health status and health care utilization outcomes. Med Care.
2001;39(11):1217-1223.
Marino MJ. Outcomes management: a new paradigm for leadership. J Rehabil
Outcomes Meas. 1997;1(3):58-62.
McHorney CA. Generic health measurement: past accomplishments and a
measurement paradigm for the 21st century. Ann Intern Med.
1997;127:743-750.
McHorney CA. Use of item response theory to link 3 modules of functional
status from the Asset and Health Dynamics Among the Oldest Old Study. Arch
Phys Med Rehabil. 2002;83:383-394.
McHorney CA, Cohen AS. Equating health status measures with Item Response
Theory. Illustrations with functional status items. Med Care.
2000;38(suppl II):II-43-II-59.
McHorney CA, Ware JE, Lu JFR, Sherbourne CD. The MOS 36-Item Short-Form
Health Survey (SF-36): III. Tests of data quality, scaling assumptions, and
reliability across diverse patient groups. Med Care. 1994;32:40-66.
McHorney CA, Ware JE, Raczek AE. The MOS 36-item short-form health survey
(SF-36), II: psychometric and clinical tests of validity in measuring
physical and mental health constructs. Med Care. 1993;31:247-263.
Mills CN, Potenza MT, Fremer JJ, Ward WC (editors). Computer-Based Testing.
Building The Foundation For Future Assessments. Mahwah, NJ: Lawrence
Erlbaum Associates; 2002.
Murrell R. Quality of life and neurological illness: a review of the
literature. Neuropsychol Rev. 1999;9(4):209-229.
Ostbye T, Tyas S, McDowell I, Koval J. Reported activities of daily living:
agreement between elderly subjects with and without dementia and their
caregivers. Age Ageing. 1997;26(2):99-106.
Pickard AS, Johnson JA, Penn A, Lau F, Noseworthy T. Replicability of SF-36
summary scores by the SF-12 in stroke. Stroke. 1999;30(6):1213-1217.
Portney LG, Watkins MP. Foundations Of Clinical Research. Second Edition.
Upper Saddle River, NJ: Prentice Hall Health; 2000.
Rothstein JM, Echternach JL. Primer On Measurement: An Introductory Guide To
Measurement Issues. Alexandria, VA: American Physical Therapy
Association; 1993.
Stratford PW, Binkley JM, Riddle DL. Development and initial validation of
the Back Pain Functional Scale. Spine. 2000;25(16):2095-2102.
Tamim H, McCusker J, Dendukuri N. Proxy reporting of quality of life using
the EQ-5D. Med Care. 2002;40(12):1186-1195.
Waddell G, Newton M, Henderson I, Somerville D, Main CJ. A Fear-Avoidance
Beliefs Questionnaire (FABQ) and the role of fear-avoidance beliefs in
chronic low back pain and disability. Pain. 1993;52(2):157-168.
Ware JE, Bjorner JB, Kosinski M. Practical implications of Item Response
Theory and computerized adaptive testing. A brief summary of ongoing
studies of widely used headache impact scales. Med Care. 2000;38(suppl
II):II-73-II-82.
Ware JE, Kosinski M, Keller SD. SF-36 Physical and Mental Health Summary
Scales: A User’s Manual. Boston, MA: The Health Institute, New England
Medical Center; 1993.
Ware JE, Sherbourne CD. The MOS 36-item short-form health survey (SF-36) I.
Conceptual framework and item selection. Med Care. 1992;30:473-483.