Fisseha Berhane, PhD

Data Scientist

443-970-2353 fisseha@jhu.edu CV Resume Linkedin GitHub twitter twitter

Ranking Hospitals

To compare the performance of hospitals in the USA, let's use data that come from the Hospital Compare web site (http://hospitalcompare.hhs.gov) run by the U.S. Department of Health and Human Services. The purpose of the web site is to provide data and information about the quality of care at over 4,000 Medicare-certfied hospitals in the U.S. This dataset essentially covers all major U.S. hospitals. This dataset is used for a variety of purposes, including determining whether hospitals should be fined for not providing high quality care to patients.

The Hospital Compare web site contains a lot of data and we will only look at a small subset for this analysis. The files used in this analysis are:
outcome-of-care-measures.csv: Contains information about 30-day mortality and readmission rates for heart attacks, heart failure, and pneumonia for over 4,000 hospitals.
hospital-data.csv: Contains information about each hospital.
Hospital_Revised_Flatfiles.pdf: Descriptions of the variables in each file (i.e the code book). From the pdf file, mainly we will focus on the variables for Number 19 (Outcome of Care Measures.csv") and Number 11 (Hospital Data.csv").

30-day mortality rates for heart attack

Let's start our exploration by plotting a histogram of 30-day mortality rates for heart attack.

Let's first read the outcome data into R via the read.csv function and look at the first few rows.

In [18]:
outcome <- read.csv("outcome-of-care-measures.csv", colClasses = "character")

There are many columns in this dataset. We can see how many by typing ncol(outcome) (we can see the number of rows with the nrow function). In addition, we can see the names of each column by typing names(outcome) (the names are also in the PDF document).

Let's make a simple histogram of the 30-day death rates from heart attack (column 11 in the outcome dataset).

In [19]:
options(jupyter.plot_mimetypes = 'image/png') # making my plots inline
outcome[, 11] <- as.numeric(outcome[, 11])
 
hist(outcome[, 11],xlab="30-day death rates from heart attack",
     main ="Histogram of the 30-day death rates from heart attack",
    col='skyblue',border='white')

Finding the best hospital in a state

Let's write a function called best that take two arguments: the 2-character abbreviated name of a state and an outcome name. The function reads the outcome-of-care-measures.csv file and returns a character vector with the name of the hospital that has the best (i.e. lowest) 30-day mortality for the specified outcome in that state. The hospital name is the name provided in the Hospital.Name variable. The outcomes can be one of "heart attack", "heart failure", or "pneumonia". We will exclude hospitals that do not have data on a particular outcome when deciding the rankings. Further, if there is a tie for the best hospital for a given outcome, then the hospital names will be sorted in alphabetical order and the first hospital in that set will be chosen (i.e. if hospitals "b", "c",and "f" are tied for best, then hospital "b" will be returned).

The function will check the validity of its arguments. If an invalid state value is passed to best, the function will throw an error via the stop function with the exact message "invalid state". If an invalid outcome value is passed to best, the function will throw an error via the stop function with the exact message "invalid outcome".

In [20]:
best <- function(state, outcome) {
    
    data <- read.csv("outcome-of-care-measures.csv", colClasses = "character")
    
 
    
    if (state %in% data$State & outcome %in% c("heart failure","heart attack","pneumonia")){
     if(outcome=="heart attack"){
         a<-data[,c(2,7,11)]
     }
     
     if(outcome=="heart failure"){
         a<-data[,c(2,7,17)]
     }
     
     if(outcome=="pneumonia"){
         a<-data[,c(2,7,23)]
     }
     
     b<-split(a,a$State)
     c<-b[[state]]
     c<-c[,c(1,3)]
     colnames(c)<-cbind('Hospital.Name',outcome)
     c[,2]<-as.numeric(c[,2])
                              
     d<- c[which(c[,2]==min(c[,2],na.rm=TRUE)),]

     best_hospitals<-d$Hospital.Name
     best_hospital<-sort(best_hospitals)[1]
     
    }
    
    else if (state %in% data$State ==FALSE) { stop("invalid state")}
    
    else if (outcome %in% c("heart failure","heart attack","pneumonia")==FALSE){
        stop("invalid outcome")
    }
    
    best_hospital   
}

Let's use the function and see best hospitals in some states based on the outcomes.

In [27]:
best("TX", "heart attack")
Out[27]:
"CYPRESS FAIRBANKS MEDICAL CENTER"
In [28]:
best("TX", "heart failure")
Out[28]:
"FORT DUNCAN MEDICAL CENTER"
In [29]:
best("MD", "heart attack")
Out[29]:
"JOHNS HOPKINS HOSPITAL, THE"
In [30]:
best("MD", "pneumonia")
Out[30]:
"GREATER BALTIMORE MEDICAL CENTER"
In [31]:
best("BB", "heart attack")
Error in best("BB", "heart attack"): invalid state
In [32]:
best("NY", "hert attack")
Error in best("NY", "hert attack"): invalid outcome

Great, the function does what we want

Ranking hospitals by outcome in a state

Now, let's write a function called rankhospital that takes three arguments: the 2-character abbreviated name of a state (state), an outcome (outcome), and the ranking of a hospital in that state for that outcome (num). Then the function reads the outcome-of-care-measures.csv file and returns a character vector with the name of the hospital that has the ranking specified by the num argument. For example, the call rankhospital("MD", "heart failure", 5) will return a character vector containing the name of the hospital with the 5th lowest 30-day death rate for heart failure. The num argument can take values "best", "worst", or an integer indicating the ranking (smaller numbers are better). If the number given by num is larger than the number of hospitals in that state, then the function will return NA. Hospitals that do not have data on a particular outcome will be excluded from the set of hospitals when deciding the rankings. If multiple hospitals have the same 30-day mortality rate for a given cause of death, ties will be broken by using the hospital name.

In [33]:
rankhospital <- function(state, outcome,num='best') {
    
    if(is.character(num)){
        if (num %in% c('best','worst')==FALSE){
            stop('Invalid Rank')
        } }
    
    data <- read.csv("outcome-of-care-measures.csv", colClasses = "character")
    
    if (state %in% data$State ==FALSE) { stop("invalid state")}
    
    if (outcome %in% c("heart failure","heart attack","pneumonia")==FALSE){
        stop("invalid outcome")}
    
   
 
    
    if (state %in% data$State & outcome %in% c("heart failure","heart attack","pneumonia")){
     if(outcome=="heart attack"){
         a<-data[,c(2,7,11)]
     }
     
     if(outcome=="heart failure"){
         a<-data[,c(2,7,17)]
     }
     
     if(outcome=="pneumonia"){
         a<-data[,c(2,7,23)]
     }
     
     b<-split(a,a$State)
     c<-b[[state]]
     c<-c[,c(1,3)]
     colnames(c)<-cbind('Hospital.Name',outcome)
     c[,2]<-as.numeric(c[,2])
     
     
     if (num=='best'){
                              
     d<- c[which(c[,2]==min(c[,2],na.rm=TRUE)),]

     hospitals<-d$Hospital.Name
     hospital<-sort(hospital)[1]
     
     }
     
     if (num=='worst'){
         
         d<- c[which(c[,2]==max(c[,2],na.rm=TRUE)),]
         
         hospitals<-d$Hospital.Name
         hospital<-sort(hospitals)[1]
         
     }
     
    if(is.numeric(num) & num < length(c$Hospital.Name)){
        
        
        l<-c[!is.na(c[,2]),]
        ll<-unique(sort(l[,2]))
        
         m<-c()
        
        for ( i in 1:length(ll)){
            temp<-which(l[,2]==ll[i])
            m<-rbind(m,l[temp,])
        }
        
        k<-m[num,2]
        p1<-which(m[,2]<k)
        p<-which(m[,2]==k)
        
        num2<-num-length(p1)
        
        hospitals<-sort(m[p,1])
        
        hospital<-hospitals[num2] 
        
    }
     
    if (is.numeric(num)){
        if (num > length(c$Hospital.Name)){ 
            return(NA)}}

    }
    
    hospital   
}

Let's use our function now.

In [34]:
rankhospital("TX", "heart failure", 4)
Out[34]:
"DETAR HOSPITAL NAVARRO"
In [35]:
rankhospital("MD", "heart attack", "worst")
Out[35]:
"HARFORD MEMORIAL HOSPITAL"
In [36]:
rankhospital("MN", "heart attack", 5000)
Out[36]:
[1] NA

That is great! The function retursn what we want!

Ranking hospitals in all states

Finally, let's rank hospitals in all states.

Let's write a function called rankall that takes two arguments: an outcome name (outcome) and a hospital ranking (num). The function reads the outcome-of-care-measures.csv file and returns a 2-column data frame containing the hospital in each state that has the ranking specified in num. For example the function call rankall("heart attack", "best") will return a data frame containing the names of the hospitals that are the best in their respective states for 30-day heart attack death rates. The function will return a value for every state (some may be NA). The first column in the data frame is named hospital, which contains the hospital name, and the second column is named state, which contains the 2-character abbreviation for the state name. Hospitals that do not have data on a particular outcome will be excluded from the set of hospitals when deciding the rankings.


The rankall function will handle ties in the 30-day mortality rates in the same way that the rankhospital function handles ties.

In [37]:
rankall <- function(outcome,num='best') {
    
    if(is.character(num)){
        if (num %in% c('best','worst')==FALSE){
            stop('Invalid Rank')
        } }
    
    data <- read.csv("outcome-of-care-measures.csv", colClasses = "character")
    
    
    if (outcome %in% c("heart failure","heart attack","pneumonia")==FALSE){
        stop("invalid outcome")}
    
   
 
    
    if (outcome %in% c("heart failure","heart attack","pneumonia")){
     if(outcome=="heart attack"){
         a<-data[,c(2,7,11)]
     }
     
     if(outcome=="heart failure"){
         a<-data[,c(2,7,17)]
     }
     
     if(outcome=="pneumonia"){
         a<-data[,c(2,7,23)]
     }}
    
    
    a[,3]<-as.numeric(a[,3])
    
    
    colnames(a)<-cbind('hospital','state','value')
    
    states<-sort(unique(a$state))
    
    b<-split(a,a$state)
    
    ans_state<-c()
    ans_hospital<-c()
    
    
    
    
    for (letter in states){

   
    g<-b[[letter]]
    
     g<-g[,c(1,3)]

    if (num=='best'){
        
        d<- c[which(g[,2]==min(g[,2],na.rm=TRUE)),]
        
        hospitals<-d$hospital
        hospital<-sort(hospitals)[1]
        
    }
    
    if (num=='worst'){
        
        d<- g[which(g[,2]==max(g[,2],na.rm=TRUE)),]
        
        hospitals<-d$hospital
        hospital<-sort(hospitals)[1]
        
    }
    
    if(is.numeric(num) & num < length(g$hospital)){
        
        
        l<-g[!is.na(g[,2]),]
        ll<-unique(sort(l[,2]))
        
        m<-c()
        
        for ( i in 1:length(ll)){
            temp<-which(l[,2]==ll[i])
            m<-rbind(m,l[temp,])
        }
        
        k<-m[num,2]
        p1<-which(m[,2]<k)
        p<-which(m[,2]==k)
        
        num2<-num-length(p1)
        
        hospitals<-sort(m[p,1])
        
        hospital<-hospitals[num2]
    }
    
    if (is.numeric(num)){
        if (num > length(g$hospital)){ 
            hospital<-NA}}
    
    ans_state<-c(ans_state,letter)
    ans_hospital<-c(ans_hospital,hospital)
    
    }
    
    df = data.frame(ans_hospital, ans_state)       # df is a data frame 
    colnames(df)<-c('hospital','state')
    
    df
}

Let's see some examples.

In [38]:
head(rankall("heart attack", 20), 10)
Out[38]:
hospitalstate
1NAAK
2D W MCMILLAN MEMORIAL HOSPITALAL
3ARKANSAS METHODIST MEDICAL CENTERAR
4JOHN C LINCOLN DEER VALLEY HOSPITALAZ
5SHERMAN OAKS HOSPITALCA
6SKY RIDGE MEDICAL CENTERCO
7MIDSTATE MEDICAL CENTERCT
8NADC
9NADE
10SOUTH FLORIDA BAPTIST HOSPITALFL


In [39]:
tail(rankall("pneumonia", "worst"), 3)
Out[39]:
hospitalstate
52MAYO CLINIC HEALTH SYSTEM - NORTHLAND, INCWI
53PLATEAU MEDICAL CENTERWV
54NORTH BIG HORN HOSPITAL DISTRICTWY




comments powered by Disqus