Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Reshape long to wide help

    Hello all:

    1) I'm working with clinical data that is currently in long format and I wanted to reshape to wide. The data was collected using multiple forms and also has repeat instances for daily visits. I wanted to reshape from long to wide and than label the repeat instances for daily visit 1,2,3,etc. Please see an example of the data below.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str5 sid str13 form byte(repeat_instance age) str6 sex double temp str5 status
    "bn_01" "clinical_data" . 23 "male"      . ""     
    "bn_01" "daily_visit"   1  . ""         38 ""     
    "bn_01" "daily_visit"   2  . ""         37 ""     
    "bn_01" "30_day_status" .  . ""          . "alive"
    "bn_02" "clinical_data" . 67 "female"    . ""     
    "bn_02" "daily_visit"   1  . ""       36.2 ""     
    "bn_02" "daily_visit"   2  . ""       36.4 ""     
    "bn_02" "daily_visit"   3  . ""         37 ""     
    "bn_02" "30_day_status" .  . ""          . "dead"
    end
    2) I tried to reshape using the code below and got the output below. I could work with that and clean it up but wanted to know if there was a more efficient strategy.

    drop form repeat_instance
    sort sid
    by sid: gen num = _n
    reshape wide age-status, i(sid) j(num)

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str5 sid byte age1 str6 sex1 double temp1 str5 status1 byte age2 str6 sex2 double temp2 str5 status2 byte age3 str6 sex3 double temp3 str5 status3 byte age4 str6 sex4 double temp4 str5 status4 byte age5 str6 sex5 double temp5 str5 status5
    "bn_01" 23 "male"   . "" . ""   38 "" . ""   37 "" . ""  . "alive" . "" . ""    
    "bn_02" 67 "female" . "" . "" 36.2 "" . "" 36.4 "" . "" 37 ""      . "" . "dead"
    end

    3) Ideally, I want an output like that below. Again, I wanted the temp variable (which is repeated daily while the patient is hospitalized) to be labeled 1,2,3, etc to indicate days of hospital say

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str5 sid byte age str6 sex str5 status double(temp1 temp2) byte temp3
    "bn_01" 23 "male"   "alive"   38   37  .
    "bn_02" 67 "female" "dead"  36.2 36.4 37
    end
    Thanks for any insights.

    DM





  • #2
    There are a few obstacles to overcome to get to your desired result. First, the variables age, sex, and status are reported only once per sid, but in order for them to both be retained in the result but neither block -reshape- nor themselves be made wide (with most of the values missing) you need to spread their values out to all of an sid's observations. For that to work properly, you also have to verify that somewhere in the data there isn't some sid that has non-missing values of those in two or more places (instead of just one as in the example) and those values contradict each other! In my experience, large clinical data sets are often riddled with internal inconsistencies like that.

    Once you do that, it's a straightforward -reshape-.
    Code:
    //  MARK AND NOTE ORIGINAL SORT ORDER
    sort sid, stable
    by sid: gen long seq = _n
    
    //  SPREAD VALUES OF AGE, SEX, AND STATUS TO ALL OBSERVATIONS
    //  WHILE VERIFYING THAT THERE ARE NO INCONSISTENT REPORTS
    foreach v of varlist age sex status {
        capture confirm string var `v', exact
        if c(rc) == 0 { // STRING VARIABLES
            by sid (`v'), sort: assert missing(`v') | `v' == `v'[_N]
            by sid (`v'): replace `v' = `v'[_N]
        }
        else {  //  NUMERIC VARIABLES
            by sid (`v'), sort: assert missing(`v') | `v' == `v'[1]
            by sid (`v'): replace `v' = `v'[1]
        }
    }
    
    keep sid age sex status temp seq
    drop if missing(temp)
    by sid (seq), sort: replace seq = _n // PATCH THE NUMBERING
    reshape wide temp, i(sid) j(seq)
    If the code gives you an -assertion is false r(9)- error message and halts without completing, that means it has found inconsistencies, and you will need to chase them down and fix them, and then try again.

    Added: By the way, do you have a good reason for transforming this data to wide layout? Usually in Stata you are better off with a long layout and converting to wide just makes life difficult.
    Last edited by Clyde Schechter; 10 Jan 2023, 14:40.

    Comment


    • #3
      Thanks Clyde. I received an error message but I was able to sort out the inconsistencies.

      To answer your question, I wanted to transform the data because most of the data we are collecting for this study will be in the wide format. The version I am working with is currently in a long format but that is largely because of the way the electronic data collection system we are using stores the different forms of the data. Some of our forms will actually be in paper and will get uploaded in a wide format.

      Just one question for you if you have the time, could you explain what the code is doing here: sort: assert missing(`v') | `v' == `v'[_N]
      I tried to figure it out but still unsure.

      Thanks again

      Comment


      • #4
        Let's unpack
        Code:
        by sid (`v'), sort: assert missing(`v') | `v' == `v'[_N]
        This line of code appears in the branch of the code that applies when `v' is a string variable. In Stata, missing values of string variables are represented as "" and in sorting they come first. If a given sid has only one non-missing value of `v', that non-missing value will appear last among that sid's observations in that designated sort order. -assert- is the command that is used to verify that assumptions/requirements of code are met: it tests the logical expression in the command. If the assertion is true, it leaves no output and execution proceeds to the next command. If the assertion is false, it prints an error message and halts execution. In this case, the logical expression asserts that values of `v' are either missing or equal to the value of `v' that sorted last (i.e., the non-missing value.) Now, this particular logical expression will also be true if there is more than one non-missing value of `v' but they all agree with each other. So this command verifies that all of the non-missing values of `v' for a given sid are equal.

        Comment


        • #5
          Thanks Clyde

          Comment

          Working...
          X