Summary stats - checking associations between a categorical and continuous variables

Amal Khanolkar

Join Date: Feb 2015

Posts: 142
#1

Summary stats - checking associations between a categorical and continuous variables

08 Mar 2016, 06:20

Hi All

I've been trying to study relationships between the following three variables, one of which is slightly unusual; the 'labtest' variable indicates various tests conducted for a patient during a clinic visit. Ideally, the single 'labtest' variable below would be several different variables, each one for particular test such as BMI, pH, cholesterol etc. This is a panel (longitudinal) dataset where a subject can have between 1 and 12 clinic visits. Ethnicity and labtest are categorical variables with ethnicity having four categories and labtest having five categories (coded 1 to 5) indicating the the kind of test run.

Code:

ID ethnicity labtest labtestvalue 1 white bmi 22 1 white pH 6.4 1 white tsh 12 1 white chol 18 2 mixed bmi 23 2 mixed pH 7.4 2 mixed tsh 18 3 indian bmi 26 3 indian pH 4.3 3 indian tsh 18 3 indian chol 21 3 indian height 182

If I wanted to find out the mean bmi by ethnic group (above is just a snap shot of three subjects), I tried the following:

Code:

tabstat labtestvalue, by(ethnicity) stat(mean sd n), if labtestvalue==1

However, the above does not return what I'm looking for. How does one get what I'm after?

A second question: How would I go about creating a new variable BMI by extracting BMI values from the 'labtest' and 'labtestvalue' variables as follows. The labtest indicates which subject has a recorded BMI and the labtestvalue provides the actual value.

Code:

ID ethnicity labtest labtestvalue bmi 1 white bmi 22 22 1 white pH 6.4 22 1 white tsh 12 22 1 white chol 18 22 2 mixed bmi 23 23 2 mixed pH 7.4 23 2 mixed tsh 18 23 3 indian bmi 26 26 3 indian pH 4.3 26 3 indian tsh 18 26 3 indian chol 21 26 3 indian height 182 26

Thanks!

/Amal
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30114
#2

08 Mar 2016, 09:38

While I and others are often pressing people to put their data into long rather than wide layout, your situation is one where the wide layout will make life easier. Having labtest as a variable and then having a labtestvalue that contains a very heterogeneous bunch of results doesn't make a lot of sense given how those data are likely to be used (and given, specifically, how you are looking to use them here).

So the first step, which actually answers your second question and then some, is:

Code:

rename labtestvalue _ reshape wide _, i(id) j(labtest) string rename _* *

That will give you data with separate variables for bmi, pH, tsh, chol, and height. Now you can find means of these by ethnic group in several ways. If you just want a table, you can do something like

Code:

tabstat bmi pH tsh chol height, by(ethnicity) stat(mean sd n)

If you need to create variables containing ethnic means in your data set, see -help egen-. If you need a data set containing one observation per ethnicity and the corresponding means, see -help collapse-.....
Comment
Amal Khanolkar

Join Date: Feb 2015

Posts: 142
#3

08 Mar 2016, 10:37

Hi Clyde

Thanks for you suggestion. I also prefer to work with data in the long format, especially in this case as I plan to run longitudinal analysis at a later stage.

However, I did try to reshape my dataset - from long to wide - but received an error message, that i() and j() do not uniquely identify the observations in the long dataset. My example above was just a very basic made-up version (I'm not allowed to upload or the actual data in any way). In the actual dataset, a subject may have repeated rows of the same data, for example below subject 1, the recording of BMI=22.3 is entered twice but it is essentially the same recording taken at the same visit. Similarly, the BMI recording=23.4 is entered three times but they are all from the same visit. This data was entered in this manner, I think, to accommodate other variables such as drug prescriptions. This is the first time I'm working with a dataset in this format.

Code:

ID Visitdate labtest labtestvalue ethnicity drugname 1 20/5/2014 BMI 22.3 white A 1 20/5/2014 BMI 22.3 white B 1 11/7/2015 BMI 23.4 white A 1 11/7/2015 BMI 23.4 white B 1 11/7/2015 BMI 23.4 white A

All other variables such as visit dates, ethnicity, age etc are entered consecutively for all rows/observations. Is there a way to work around the above? I tried to reshape the data to wide indicating the drugname variable and others similarly coded but I get the same error message.

Thanks

/Amal
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35709

08 Mar 2016, 10:52

As Clyde points out, the quest is for separate variables. So, consider using separate:

Code:

clear
input ID  str6 ethnicity str6 labtest   labtestvalue
1   white   bmi    22
1   white   pH     6.4
1   white   tsh    12
1   white   chol   18
2   mixed   bmi    23
2   mixed   pH     7.4
2   mixed   tsh    18
3   indian  bmi    26
3   indian  pH     4.3
3   indian  tsh    18
3   indian  chol   21
3   indian  height 182
end
separate labtestvalue, by(labtest) veryshortlabel
tabstat labtestvalue1, by(ethnicity) stat(mean sd n)

Note that in your original

Code:

...  if labtestvalue==1

can't be expected to work as that variable isn't 1 in your dataset and that criterion would not select BMI even in principle.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30114
#5

08 Mar 2016, 11:02

Nick gives excellent advice. Another possibility is that if the entries you describe are fully duplicates, that is, there are no other variables on which they differ, you could just elminate them all in one line:

Code:

duplicates drop

and the rest of your work will be simpler. Evidently, if there are other variables which take on different values for these partially-duplicate observations, this approach would lose information and should not be undertaken (or, at least, not undertaken lightly.)

I would add one more caution to Nick's advice. If you have two values of BMI for the same patient on the same date, do you want to count them both when you calculate your statistics? If not, I would do something like this, after -separate-:

Code:

egen flag = tag(ID date) tabstat labtestvalue1 if flag, by(ethnicity) statistics(mean sd n)

I would also give strong consideration to renaming labtestvalue* to bmi, chol, etc. to make further analyses easier and to make your logs more understandable when you come back to them another day.

Last edited by Clyde Schechter; 08 Mar 2016, 11:06.
Comment

Announcement

Summary stats - checking associations between a categorical and continuous variables

Comment

Comment

Comment

Comment