Patient Self-Reported Outcomes:
 Reliability and Validity White Paper

Focus On Therapeutic Outcomes, Inc.

May 17, 2005

Collection of outcomes following clinical treatment is now commonplace in rehabilitation. (Hart 2002) Clinicians use outcomes to track changes in their patients to assess if the patient is improving with a specific treatment, (Jette & Delitto 1997) if treatment needs to be changed or terminated, (Jette & Jette 1997) and if the patient needs to be referred to another clinician or service. (Jette & Jette 1997) Administrators use outcomes to compare their department’s success with other similar departments, to market the department’s services, to manage resources required to deliver their clinical services, and to manage their clinicians. (Marino 1997) The federal government has mandated the collection of outcomes for post-acute rehabilitation in skilled nursing facilities, nursing homes and inpatient rehabilitation hospitals, and the government has directed the development of patient assessment instruments designed to collect outcomes. (Johnson 2001) However, debate continues concerning what to measure for outcomes and how to measure the outcome. This paper discusses the collection of patient self-reported outcomes with particular attention to reliability and validity of the measures.

The Institute of Medicine (IOM) (2001) examined the American health care delivery system and determined the “system is in need of fundamental change”. (p 1) The IOM identified that Americans want and deserve quality clinical services, but the current system does not routinely collect measures of quality and therefore cannot assess quality. So, the IOM proposed six goals for improvement of the 21st century health care system. Among the goals, health care should be effective and patient-centered. Effective was operationally defined as “providing services based on scientific knowledge to all who could benefit and refraining from providing services to those not likely to benefit (avoiding underuse and overuse, respectively).” (p 6) Patient-centered was operationally defined as “providing care that is respectful of and responsive to individual patient preferences, needs, and values and ensuring that patient values guide all clinical decisions.” (p 6) Effective, patient-centered care needs to be measured, so researchers and clinicians can determine, for a specific patient at a specific time and intensity, which treatment is effective, which is a measure of clinical quality. (Hart 2001) For the purpose of this paper, use of patient self-report of outcomes is proposed as the best means of obtaining the aim of collecting effective, patient-centered outcomes. (Jette 1993)

Over the past decades, “one of the more important developments in the health care field has been the recognition of the centrality of the patient point of view in monitoring the quality of medical care outcomes”. (Ware 1993 p 2:1; Geigle 1990) “A medical outcome has come to mean the extent to which a change in a patient’s behavioral functioning or well-being meets that patient’s needs or expectations.” (Ware 1993 p 2:1) The goal of medical or rehabilitative care for most patients is the preservation of function, (Ellwood 1988) although other outcomes, i.e. pain relief, (Fairbank 1980) improved self-efficacy, (Lorig 2001) among other measures are also important.  Although patients are the best source of information regarding achievement of their goals of improved function, information from patients about their improved functional status and treatment has not been routinely collected in previous clinical research or medical practice. (Ware 1993)

In a recent report on quality outcomes measurements in post-acute rehabilitation facilities, (Johnson 2001) researchers acknowledged that the current federally mandated outcomes tools for post-acute rehabilitation facilities do not measure whether the patient’s perception of their functional status was maximized, prepared him/her to return to and remain in the most independent living environment, and reintegrated him/her into prior lifestyle. The researchers further acknowledged that measuring such important information requires input directly from patients or their proxies as an important aspect of clinical outcomes. Johnson et al (Johnson 2001) identified that many researchers focused on patient-centered outcomes believe that functional outcomes should come from patients because their perception of their function is more important than “so-called objective measures of function”. (p 13) Patient self-report of their functional ability represents their perception of their ability and integrates the relevance of the functional ability to the patient. If a patient has difficulty performing a functional task, but the task is not relevant to the patient, the task is likely of little importance or relevance to his or her life and needs. (p 13) If a clinician had measured the patient’s functional ability to perform a task that was not relevant to the patient, the clinician may attach more importance to the task or measure of functional ability than is appropriate according to the patient.

The existing patient assessment instruments for patients in post-acute rehabilitation facilities require facility staff to measure outcomes. The federally mandated instruments include the Functional Independence Measure (FIM) used in inpatient rehabilitation hospitals, Minimum Data Set (MDS) used in skilled nursing facilities, and Outcome and Assessment Information Set (OASIS) used in home health settings. They are all examples of clinician-generated measures of patient function. The FIM, MDS and OASIS use clinical staff, i.e. physical or occupational therapists, nurses, etc. to assess the patient’s functional abilities pertinent to sets of activities of daily living. The FIM, MDS and OASIS do not consider the patient’s perception of their functional health status, and therefore fail to capture the relevance of the individual functional measures to the patient. Because the goal of rehabilitation is to restore patients to their previous level of function, capturing their perception of function both before and after rehabilitation in their terms is essential to measuring quality and effectiveness of patient-centered treatment. (Johnson 2001)

Another complicating problem of using clinician-generated outcomes measures is the fact that many important clinical outcomes cannot be measured within the time frames of the delivery of typical rehabilitation services. For example, it may take time beyond the time when rehabilitation was provided for a person to maximize their functional abilities, sustain their maximal level of independence within their normal or modified living environment, or reintegrate into their prior activities and lifestyle. (Johnson 2001) If patients continue to improve in their functional health status, clinician-generated outcomes will not capture the improvement unless the patient returns to the clinician or clinical team for another assessment, which costs money. If the patient had been assessed using a patient self-report process through rehabilitation, the patient could complete another survey at home using pencil and paper or Internet delivery mechanisms to capture the change in functional health status. Collecting patient self-report outcomes data does not require the same clinician or staff burden as collecting clinician-generated data. Therefore, although no studies were found comparing cost of collecting patient self-report vs. clinician-generated outcomes data, it is logical that patient self-report data collection is cheaper than clinician-generated measures.

When researchers and clinicians debate collection of measures of outcome, like functional health status, discussion commonly addresses the psychometrics of the measures. Outcomes measures need to be assessed for their reliability, validity and responsiveness. A complete discussion of the psychometrics of outcomes measures is beyond the scope of this paper, but key elements are crucial to the understanding of outcomes measures.

Reliability of a measure of outcome is paramount. Reliability means the measure is reproducible, consistent and free from error. (Portney 2000) If the patient’s ability is reliable, a measure of that ability should also be reliable. No measure is free from error, so there is always a need to assess the degree to which error influences the measure and its interpretation. In the case of an outcomes measure, it is reasonable to insist the measure has good test-retest reliability. Test-retest reliability means the measure of function is almost the same when the measure is taken twice on the same patient. Test-retest reliability strikes at the heart of the argument common among clinicians: patient self-report is “subjective”. In fact, no measure is ever entirely subjective or objective. The two concepts are connected along a measurement continuum by the measurement of reliability. When a measure has poor reliability, the measure is subjective by definition. When a measure has good reliability, the measure is objective by definition. Therefore, a subjective phenomenon may be measured objectively. In other words, the phenomenon may be either subjective or objective, and the quality of the measurement may be either subjective or objective. (Rothstein 1993) Take for example the patient’s perception of their physical functional health status as measured by the SF-36 Physical Functioning Scale (PF-10). (Ware 1993) The patient is asked the question (one of the PF-10 items): “Does your health now limit you in vigorous activities, such as running, lifting heavy objects, participating in strenuous sports? If so, how much?” The patient responds “Yes, limited a lot”, “Yes, limited a little”, or “No, not limited at all”.

At first glance, clinicians commonly say the patient’s responses are “subjective”. However, the PF-10 has been studied for several types of reliability, including test-retest. For patients with diabetes, the correlation coefficient for repeated measures was .90, and for patients in a general medical practice, the correlation coefficient was .81. (Ware 1993) For patients receiving outpatient rehabilitation using the FOTO 24-item functional health status instrument with items similar to the SF-36, (Hart 2001) the test-retest reliability intraclass correlation coefficient (ICC(2,1)) of an overall functional health status measure was .92. (Hart 2003) Therefore, in spite of the phenomenon (i.e. patient self-report of their perception of their functional health status) appearing “subjective”, the quality of the patient self-report measure of their functional health status is indeed “objective” because the test-retest reliability statistic for the measure was good.
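To illustrate how a test-retest statistic such as ICC(2,1) can be obtained, the following Python sketch applies the standard two-way random-effects, absolute-agreement, single-measure formula to scores collected from the same patients on two occasions. The function name icc_2_1 and the scores shown are hypothetical and are not data from the studies cited above.

```python
def icc_2_1(scores):
    """ICC(2,1) for a table of scores.

    scores: list of rows, one row per patient, one column per
    administration of the survey (test, retest, ...).
    """
    n = len(scores)        # number of patients
    k = len(scores[0])     # number of administrations
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way analysis of variance sums of squares
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)                 # between patients
    ms_cols = ss_cols / (k - 1)                 # between administrations
    ms_error = ss_error / ((n - 1) * (k - 1))   # residual

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical functional status scores (0-100) for five patients,
# measured twice without intervening treatment.
example_scores = [[42, 45], [60, 58], [75, 77], [31, 30], [88, 90]]
print(round(icc_2_1(example_scores), 3))
```

Because this form of the ICC counts a systematic shift between the first and second administration as error, it is a stricter test of reproducibility than a simple correlation coefficient.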

Patient self-report assessments are becoming common, and most test-retest reliability statistics support the objectivity of the measures. Some other examples include:

Oswestry Low Back Pain Disability Questionnaire: correlation coefficient .99 (Fairbank)

Lower Extremity Functional Scale: ICC(2,1)=.94 (Binkley)

Fear-Avoidance Beliefs Questionnaire: k=.74 (Waddell)

Back Pain Functional Scale: ICC(2,1)=.88 (Stratford)
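The examples above mix several kinds of coefficient. Assuming the k reported for the Fear-Avoidance Beliefs Questionnaire denotes a kappa statistic, the following Python sketch shows how such a chance-corrected agreement value can be computed when the same categorical item is answered on two occasions. The function name cohen_kappa and the responses are hypothetical.

```python
from collections import Counter


def cohen_kappa(first, second):
    """Chance-corrected agreement between two sets of categorical answers."""
    n = len(first)
    # Proportion of patients who gave the same answer both times
    observed = sum(a == b for a, b in zip(first, second)) / n
    # Agreement expected by chance from the marginal answer frequencies
    counts_first, counts_second = Counter(first), Counter(second)
    expected = sum(
        counts_first[c] * counts_second[c] for c in counts_first
    ) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical answers of eight patients to one item, asked twice.
test = ["a lot", "a lot", "a little", "none", "none", "a little", "none", "a lot"]
retest = ["a lot", "a little", "a little", "none", "none", "a little", "none", "a lot"]
print(round(cohen_kappa(test, retest), 2))
```

A value of 1 indicates perfect agreement between the two administrations, and a value of 0 indicates agreement no better than chance.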

The next psychometric quality of interest is validity. “Measurement validity concerns the extent to which an instrument measures what it is intended to measure.” (Portney p79) There are many forms of validity, such as face, content, criterion-related, construct. Validity is never fully “proven”, but data tend to support or not support the validity of the measure of interest as that measure was used to assess a specific concept for a specific group of patients. Therefore, testing validity of measures is a fluid environment.

 “Face validity indicates that an instrument appears to test what it is supposed to and that it is a plausible method for doing so.” (Portney p 82) Face validity is the least rigorous method of testing validity, and there are no standards for assessing how much face validity a measurement has. (Portney) Face validity of a rating for getting out of a chair as an activity of daily living (ADL) could be assessed by observing a patient get out of a chair and determining whether the patient used any assistance from another person. It could be argued that the observation of the ADL has face validity because the observation appears to test the activity of getting out of a chair. Face validity plays almost no role in assessing patient self-report of functional health status.

Content validity refers to the adequacy with which the universe of a theoretical domain of behaviors or characteristics related to a measure, i.e. functional health status, is sampled by a test. (Portney) Content validation in the field of health status is a challenge because of the breadth of the health status domain. Health status is difficult to define and can cover many different constructs, and there is no standard to which the results of the content validation process can be compared. The developers of the SF-36 (Ware 1992 & 1993; McHorney 1993 & 1994) met this challenge by comparing the content of the SF-36 to other widely used health status surveys and by applying factor analytic techniques to see if the constructs covered by the items of the SF-36 were similar to the constructs assessed by other health instruments. The items in the SF-36 and its subscales covered most of the content areas of most health instruments. Factor analytic techniques also supported content validity of the patient self-report SF-36. (Ware 1993)

FOTO uses 72 functional status items to assess improvement in patient self-reported functional status. The items are separated into two measures: one for physical functional status, and one for mental functional status. Analyses using item response theory (IRT) techniques support what is clinically logical: the two constructs should be kept separate for better understanding of the patient’s functional status. Further analyses supported the content validity of the 50-item physical functional status item bank.

Using the Oswestry scale to assess disability in patients with low back pain, (Fairbank) patients self-reported their perception of their disability, and clinicians assessed the patient’s observed disability and symptoms. The two scores were matched over treatment time and produced the predicted plot, thus supporting concurrent, criterion-related validity of the patient self-report measure of disability.

“Construct validity reflects the ability of an instrument to measure an abstract concept, or construct.” (Portney p 87) Most constructs that are measured are not directly observable and “exist only as concepts that are constructed to represent an abstract trait”. (Portney p 87) Health status is a good example. Clinicians have difficulty agreeing on the operational definition of health status, and therefore, “the definition of such a construct as ‘health status’ can be determined only by the instrument used to measure it.” (Portney p 87-88) The construct of physical functional health status has been described by applying instruments designed to assess health status to people who received physical or occupational therapy in outpatient facilities for the primary goal of improvement in physical functioning. (Hart et al 2001; Hart 2001; Hart & Wright 2002) Using the known groups method (Portney) of construct validity, change in physical functional health status as assessed by the FOTO 24-item health status instrument was shown to discriminate (discriminant validity) between patients receiving workers’ compensation benefits who received acute work rehabilitation and those who received work conditioning, and between patients with acute and chronic symptoms. (Hart 2001) Using the SF-36, patients were discriminated by their level of severity of several medical and psychiatric conditions. (McHorney 1993) The FOTO health status instrument (Hart 2001) and SF-36 (Ware 1993) both demonstrated good convergent and discriminant validity by having the physical scales correlate well with each other and the mental scales correlate well with each other while physical and mental scales did not correlate with each other.

In a recent unpublished analysis (Hart DL. Predictive validity of a patient self-report outcomes process. Abstract submitted to the Combined Sections Meeting of the American Physical Therapy Association for 2004.), risk-adjusted change in physical functional health status (FHS) and visits per patient could be successfully predicted 76% to 78% of the time (90% confidence interval) for patients receiving physical or occupational therapy using the 50-item FOTO physical FHS computer adaptive testing process (unpublished; Hart DL. Computerized adaptive testing: application in outcomes measurement. Abstract presented at the American Physical Therapy Association Combined Sections Meeting in 2002.).

Patient self-report health status instruments require patients to have the cognitive abilities to successfully read, understand and answer each question. When cognitive abilities are impaired, answers to patient self-report instruments become less reliable and valid, and proxy report of the patient’s functional health status becomes an option. However, measures of functional health status from a proxy who is intimately aware of the functioning of the patient with cognitive deficits are not exactly the same as a patient’s perception of their own functional abilities even under normal conditions. Literature to date has been equivocal at best for support of proxy health status when the patient has cognitive deficits. (Ostbye 1997; Pickard 1999; Murrell 1999; Andresen 1999) When patients are cognitively normal, caregivers as proxies can provide reliable assessments of the patient’s ability to complete ADLs, but agreement between patient and proxy assessment of ADL or functional ability decreases as severity of dementia (Ostbye 1997) and age (Pickard 1999) increase, or as older adults become more medically ill. (Tamim 2002) Validation of outcomes instruments is rare for instruments designed to assess health status of patients with cognitive deficits. So, when the patient’s cognitive deficits are in question, use and interpretation of patient self-report of health status should be approached with caution. FOTO supports use of proxy FHS measures if the patient does not have a cognitive deficit and cannot complete the survey. When the patient has a cognitive deficit, FOTO uses measures collected from proxies to assess change over time, but because we cannot correlate the proxy’s assessment of the patient’s functional status to the patient’s assessment of their own functional status, interpretation and validity of the functional status measures for patients with cognitive deficits cannot be confirmed.

Therefore, in patients who are cognitively normal, patient self-report of health status has a long history of reliability and validity. When measures have good reliability, the measures are objective. Because the measures have been supported as valid using several different statistical techniques, the measures can be used for tracking change in health status with many populations of patients.

The future vision of the new health care delivery system as presented by the IOM (IOM) will use patient-centered, self-report measures that, at least in rehabilitation, will assess change in physical functional health status at a minimum. Because researchers in health status have the mathematical techniques (Cella 2000; Hambleton 2000; Hays 2000; Lohr 2000; McHorney 1997 & 2000; Ware 2000) and computer power, which along with the Internet are readily available to the public as well, patient self-report will be transformed into efficient, easy to use computerized adaptive testing (CAT) processes (Mills 2002) that will be used any time of the day or night, any day of the week, all year long by patients and clinicians. CAT processes will direct physical FHS items that are impairment- and ability-specific to the patient one at a time. This represents the epitome of patient-centered assessment. CATs will function as follows. Once the computer knows the patient’s impairment, the computer will ask a physical FHS item of median level difficulty pertinent to the patient’s impairment, i.e. a patient who just has the ability to get out of bed will not be asked about climbing stairs. If the item is too easy, i.e. the patient is functioning at a level higher than the item describes, the computer will ask more difficult items until an item is above the patient’s physical FHS level. Conversely, if the item is too difficult, i.e. the patient is functioning at a level lower than the item describes, the computer will ask less difficult items until an item is below the patient’s physical FHS level. In this way, CATs will reduce the burden associated with collecting patient self-report physical FHS data while maintaining a high level of measure precision.

Another way to develop a CAT is to ask the median level difficulty item first and then select each subsequent item so that it provides the maximum amount of information (Lord 1980) given the current estimate of the patient’s functional ability. (Hart 2005)
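The item selection logic described above can be sketched in Python. This is a minimal illustration and not FOTO’s implementation: it assumes a dichotomous Rasch model in which each item has a single difficulty, a standard normal prior for ability, a grid-based estimate of ability, and a hypothetical seven-item bank with invented difficulty values. All function and variable names are assumptions made for the example.

```python
import math


def prob_can_do(ability, difficulty):
    """Rasch model: probability the patient reports being able to do the item."""
    return 1.0 / (1.0 + math.exp(difficulty - ability))


def item_information(ability, difficulty):
    """Information an item provides at a given ability (largest when they match)."""
    p = prob_can_do(ability, difficulty)
    return p * (1.0 - p)


GRID = [g / 10.0 for g in range(-40, 41)]  # candidate ability values (logits)


def estimate_ability(responses):
    """Posterior mean and SD of ability given [(difficulty, answered_yes), ...]."""
    weights = []
    for theta in GRID:
        w = math.exp(-0.5 * theta * theta)  # standard normal prior (assumption)
        for difficulty, yes in responses:
            p = prob_can_do(theta, difficulty)
            w *= p if yes else 1.0 - p
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)


def run_cat(bank, answer, max_items=5, target_sd=0.5):
    """bank: {item: difficulty}; answer: function(item) -> True if patient can do it."""
    remaining = dict(bank)
    ordered = sorted(remaining, key=remaining.get)
    item = ordered[len(ordered) // 2]  # start with the median difficulty item
    responses, asked = [], []
    while True:
        asked.append(item)
        responses.append((remaining.pop(item), answer(item)))
        ability, sd = estimate_ability(responses)
        if not remaining or len(asked) >= max_items or sd <= target_sd:
            return ability, sd, asked
        # Next item: the one with maximum information at the current estimate
        item = max(remaining, key=lambda name: item_information(ability, remaining[name]))


# Hypothetical item bank; difficulties are invented for illustration only.
BANK = {
    "roll over in bed": -3.0, "get out of bed": -2.0, "rise from a chair": -1.0,
    "walk one block": 0.0, "climb one flight of stairs": 1.0,
    "carry groceries": 2.0, "run a short distance": 3.0,
}


def simulated_patient(true_ability):
    """Answers yes to every item at or below the patient's true level."""
    return lambda item: BANK[item] <= true_ability


ability, sd, asked = run_cat(BANK, simulated_patient(1.5))
print(asked, round(ability, 2), round(sd, 2))
```

In this sketch a patient who answers yes is steered toward more difficult items and a patient who answers no toward easier ones, so each patient sees only a subset of the bank. The stopping rule (five items or a target standard deviation) is an arbitrary choice made for the example.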

FOTO, Inc. has several functioning CAT processes for physical functional health status that are commercially operational in hundreds of outpatient clinics across the country. Experience has demonstrated patients like the CAT process better than completing paper and pencil forms or even computer administered surveys. First, the patients like seeing one question at a time. Second, the letter font is larger, which benefits those with visual difficulties. Third, the computer produces an aura of credibility, which the patients like. Fourth, patients like the shorter time, which translates into reduced patient burden to collect the data. This is particularly beneficial for older patients.

In summary, CAT processes will be used for patient self-report of health status more and more. As more knowledge is gained, mathematical models will be improved, measures will become more precise, and items will become more impairment- and ability-specific to each patient. The process is patient-centered, and future CATs will be directive for the patients who are interested in knowing more about their condition and clinicians who have track records for obtaining good outcomes treating patients with similar conditions.

 

References

Andresen EM, Gravitt GW, Aydelotte ME, Podgorski CA. Limitations of the SF-36 in a sample of nursing home residents. Age Ageing. 1999;28(6):562-6.

Binkley JM, Stratford PW, Lott SA, et al. The Lower Extremity Functional Scale (LEFS): Scale development, measurement properties, and clinical application. Phys Ther. 1999;79(4):371-383.

Cella D, Chang CH. A discussion of Item Response Theory and its applications in health status assessment. Med Care. 2000;38(suppl II):II-66-II-72.

Ellwood PM. Outcomes management: a technology of patient experience [Shattuck Lecture]. N Engl J Med. 1988;318:1549-1556.

Fairbank JC, Couper J, Davis JB, O’Brien JP. The Oswestry low back pain disability questionnaire. Physiotherapy. 1980;66(8):271-3.

Geigle R, Jones SB. Outcomes measurement: a report from the front. Inquiry. 1990;27:7-13.

Hambleton RK. Emergence of item response modeling in instrument development and data analysis. Med Care. 2000;38(Suppl II):II-60-II-65.

Hart DL. The power of outcomes: FOTO Industrial Outcomes Tool – Initial assessment. Work. 2001;16:39-51.

Hart DL. Test-retest reliability of an abbreviated self-report overall health status measure. J Orthop Sports Phys Ther. 2003;33(12):734-742.

Hart DL, Mioduski JE, Stratford PW. Simulated computerized adaptive tests for measuring functional status were efficient with good discriminant validity in patients with hip, knee, or foot/ankle impairments. J Clin Epidemiol. 2005;58(6):629-638.

Hart DL, Tepper S, Lieberman D. Changes in health status for persons with wrist or hand impairments receiving occupational or physical therapy. Am J Occup Ther. 2001;55:68-74.

Hart DL, Wright BD. Development of an index of physical functional health status. Arch Phys Med Rehabil. 2002;83(5):655-665.

Hays RD, Morales LS, Reise SP. Item response theory and health outcomes measurement in the 21st century. Med Care. 2000;38(suppl II):II-28-II-42.

Institute of Medicine. Crossing The Quality Chasm. Washington, DC: National Academy Press; 2001.

Jette AM. Using health-related quality of life measures in physical therapy outcomes research. Phys Ther. 1993;73(8):528-537.

Jette AM, Delitto A. Physical therapy treatment choices for musculoskeletal impairments. Phys Ther. 1997;77(2):145-154.

Jette DU, Jette AM. Professional uncertainty and treatment choices by physical therapists. Arch Phys Med Rehabil. 1997;78:1346-1351.

Johnson M, Holthaus D, Harvell J, Coleman E, Eilertsen T, Kramer A. Medicare Post-Acute Care: Quality Measurement Final Report. US Department of Health and Human Services. University of Colorado Health Sciences Center. 3/29/01. http://aspe.hhs.gov/daltcp/reports/mpacqm.htm accessed 4/15/03.

Linacre JM. A User’s Guide to WINSTEPS. Chicago, IL: MESA Press; 2005.

Lohr KN. Health outcomes methodology symposium. Summary and recommendations. Med Care. 2000;38(suppl II):II-194-II-208.

Lord FM. Applications of Item Response Theory To Practical Testing Problems. Hillsdale, NJ: Lawrence Erlbaum; 1980.

Lorig KR, Ritter P, Stewart A, Sobel DS, Brown BW, Bandura A, Gonzalez VM, Laurent DD, Holman HR. Chronic disease self-management program. 2-year health status and health care utilization outcomes. Med Care. 2001;39(11):1217-1223.

Marino MJ. Outcomes management: a new paradigm for leadership. J Rehabil Outcomes Meas. 1997;1(3):58-62.

McHorney CA. Generic health measurement: past accomplishments and a measurement paradigm for the 21st century. Ann Intern Med. 1997;127:743-750.

McHorney CA. Use of item response theory to link 3 modules of functional status from the Asset and Health Dynamics Among the Oldest Old Study. Arch Phys Med Rehabil. 2002;83:383-94.

McHorney CA, Cohen AS. Equating health status measures with Item Response Theory. Illustrations with functional status items. Med Care. 2000;38(suppl II):II-43-II-59.

McHorney CA, Ware JE, Lu JFR, Sherbourne CD. The MOS 36-Item Short-Form Health Survey (SF-36): III. Tests of data quality, scaling assumptions, and reliability across diverse patient groups. Med Care. 1994;32:40-66.

McHorney CA, Ware JE, Raczek AE. The MOS 36-item short-form health survey (SF-36), II: psychometric and clinical tests of validity in measuring physical and mental health constructs. Med Care. 1993;31:247-263.

Mills CN, Potenza MT, Fremer JJ, Ward WC (editors). Computer-Based Testing. Building The Foundation For Future Assessments. Mahwah, NJ: Lawrence Erlbaum Associates; 2002.

Murrell R. Quality of life and neurological illness: a review of the literature. Neuropsychol Rev. 1999;9(4):209-229.

Ostbye T, Tyas S, McDowell I, Koval J. Reported activities of daily living: agreement between elderly subjects with and without dementia and their caregivers. Age Ageing. 1997;26(2):99-106.

Pickard AS, Johnson JA, Penn A, Lau F, Noseworthy T. Replicability of SF-36 summary scores by the SF-12 in stroke. Stroke. 1999;30(6):1213-1217.

Portney LG, Watkins MP. Foundations Of Clinical Research. Second Edition. Upper Saddle River, NJ: Prentice Hall Health; 2000.

Rothstein JM, Echternach JL. Primer On Measurement: An Introductory Guide To Measurement Issues. Alexandria, VA: American Physical Therapy Association; 1993.

Stratford PW, Binkley JM, Riddle DL. Development and initial validation of the Back Pain Functional Scale. Spine. 2000;25(16):2095-2102.

Tamim H, McCusker J, Dendukuri N. Proxy reporting of quality of life using the EQ-5D. Med Care. 2002;40(12):1186-1195.

Waddell G, Newton M, Henderson I, Somerville D, Main CJ. A Fear-Avoidance Beliefs Questionnaire (FABQ) and the role of fear-avoidance beliefs in chronic low back pain and disability. Pain. 1993;52(2):157-168.

Ware JE, Bjorner JB, Kosinski M. Practical implications of Item Response Theory and computerized adaptive testing. A brief summary of ongoing studies of widely used headache impact scales. Med Care. 2000;38(suppl II):II-73-II-82.

Ware JE, Kosinski M, Keller SD. SF-36 Physical and Mental Health Summary Scales: A User’s Manual. Boston, MA: The Health Institute, New England Medical Center; 1993.

Ware JE, Sherbourne CD. The MOS 36-item short-form health survey (SF-36) I. Conceptual framework and item selection. Med Care. 1992;30:473-483.

©2005 Focus on Therapeutic Outcomes, Inc.
All Rights Reserved.
 
 
