Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Summary stats - checking associations between a categorical and continuous variables

    Hi All

    I've been trying to study relationships between the following three variables, one of which is slightly unusual; the 'labtest' variable indicates various tests conducted for a patient during a clinic visit. Ideally, the single 'labtest' variable below would be several different variables, each one for particular test such as BMI, pH, cholesterol etc. This is a panel (longitudinal) dataset where a subject can have between 1 and 12 clinic visits. Ethnicity and labtest are categorical variables with ethnicity having four categories and labtest having five categories (coded 1 to 5) indicating the the kind of test run.

    Code:
    ID   ethnicity labtest   labtestvalue
    
    1   white   bmi    22
    1   white   pH     6.4
    1   white   tsh    12
    1   white   chol   18
    2   mixed   bmi    23
    2   mixed   pH     7.4
    2   mixed   tsh    18
    3   indian  bmi    26
    3   indian  pH     4.3
    3   indian  tsh    18
    3   indian  chol   21
    3   indian  height 182
    If I wanted to find out the mean bmi by ethnic group (above is just a snap shot of three subjects), I tried the following:

    Code:
    tabstat labtestvalue, by(ethnicity) stat(mean sd n), if labtestvalue==1
    However, the above does not return what I'm looking for. How does one get what I'm after?



    A second question: How would I go about creating a new variable BMI by extracting BMI values from the 'labtest' and 'labtestvalue' variables as follows. The labtest indicates which subject has a recorded BMI and the labtestvalue provides the actual value.

    Code:
    ID   ethnicity labtest   labtestvalue   bmi 
    
    1   white   bmi    22    22
    1   white   pH     6.4   22
    1   white   tsh    12    22
    1   white   chol   18    22
    2   mixed   bmi    23    23
    2   mixed   pH     7.4   23
    2   mixed   tsh    18    23
    3   indian  bmi    26    26
    3   indian  pH     4.3   26
    3   indian  tsh    18    26
    3   indian  chol   21    26
    3   indian  height 182   26
    Thanks!

    /Amal

  • #2
    While I and others are often pressing people to put their data into long rather than wide layout, your situation is one where the wide layout will make life easier. Having labtest as a variable and then having a labtestvalue that contains a very heterogeneous bunch of results doesn't make a lot of sense given how those data are likely to be used (and given, specifically, how you are looking to use them here).

    So the first step, which actually answers your second question and then some, is:

    Code:
    rename labtestvalue _
    reshape wide _, i(id) j(labtest) string
    rename _* *
    That will give you data with separate variables for bmi, pH, tsh, chol, and height. Now you can find means of these by ethnic group in several ways. If you just want a table, you can do something like

    Code:
    tabstat bmi pH tsh chol height, by(ethnicity) stat(mean sd n)
    If you need to create variables containing ethnic means in your data set, see -help egen-. If you need a data set containing one observation per ethnicity and the corresponding means, see -help collapse-.....

    Comment


    • #3
      Hi Clyde

      Thanks for you suggestion. I also prefer to work with data in the long format, especially in this case as I plan to run longitudinal analysis at a later stage.

      However, I did try to reshape my dataset - from long to wide - but received an error message, that i() and j() do not uniquely identify the observations in the long dataset. My example above was just a very basic made-up version (I'm not allowed to upload or the actual data in any way). In the actual dataset, a subject may have repeated rows of the same data, for example below subject 1, the recording of BMI=22.3 is entered twice but it is essentially the same recording taken at the same visit. Similarly, the BMI recording=23.4 is entered three times but they are all from the same visit. This data was entered in this manner, I think, to accommodate other variables such as drug prescriptions. This is the first time I'm working with a dataset in this format.

      Code:
      ID  Visitdate labtest labtestvalue ethnicity drugname
      1   20/5/2014  BMI     22.3  white    A
      1   20/5/2014  BMI     22.3  white    B
      1   11/7/2015  BMI     23.4  white    A
      1   11/7/2015  BMI     23.4  white    B
      1   11/7/2015  BMI     23.4  white    A
      All other variables such as visit dates, ethnicity, age etc are entered consecutively for all rows/observations. Is there a way to work around the above? I tried to reshape the data to wide indicating the drugname variable and others similarly coded but I get the same error message.

      Thanks

      /Amal

      Comment


      • #4
        As Clyde points out, the quest is for separate variables. So, consider using separate:

        Code:
        clear
        input ID  str6 ethnicity str6 labtest   labtestvalue
        1   white   bmi    22
        1   white   pH     6.4
        1   white   tsh    12
        1   white   chol   18
        2   mixed   bmi    23
        2   mixed   pH     7.4
        2   mixed   tsh    18
        3   indian  bmi    26
        3   indian  pH     4.3
        3   indian  tsh    18
        3   indian  chol   21
        3   indian  height 182
        end
        separate labtestvalue, by(labtest) veryshortlabel
        tabstat labtestvalue1, by(ethnicity) stat(mean sd n)
        Note that in your original

        Code:
        ...  if labtestvalue==1
        can't be expected to work as that variable isn't 1 in your dataset and that criterion would not select BMI even in principle.

        Comment


        • #5
          Nick gives excellent advice. Another possibility is that if the entries you describe are fully duplicates, that is, there are no other variables on which they differ, you could just elminate them all in one line:

          Code:
          duplicates drop
          and the rest of your work will be simpler. Evidently, if there are other variables which take on different values for these partially-duplicate observations, this approach would lose information and should not be undertaken (or, at least, not undertaken lightly.)

          I would add one more caution to Nick's advice. If you have two values of BMI for the same patient on the same date, do you want to count them both when you calculate your statistics? If not, I would do something like this, after -separate-:

          Code:
          egen flag = tag(ID date)
          tabstat labtestvalue1 if flag, by(ethnicity) statistics(mean sd n)
          I would also give strong consideration to renaming labtestvalue* to bmi, chol, etc. to make further analyses easier and to make your logs more understandable when you come back to them another day.
          Last edited by Clyde Schechter; 08 Mar 2016, 11:06.

          Comment

          Working...
          X