Reshaping dataset

Karoline Andersen

Join Date: Feb 2017

Posts: 13
#1

Reshaping dataset

21 Mar 2017, 07:54

Hi
I have a dataset of 2017 observations equalling hospital admissions in 2015. Some admissions are the same patient being hospitalized more than once, therefore the dataset includes a variable of a patient identification number. I would like to describe the group of patients by age gender and civilstatus. When I do this I do not want patients who are hospitalized more than once to figure several times. I have tried to use the command "reshape wide" but I have some problems:

Data long -> wide
-----------------------------------------------------------------------------
Number of obs. 2070 -> 1691
Number of variables 5 -> 6211
j variable (2070 values) contactid -> (dropped)
xij variables:
civilstatus -> civilstatus318142 civilstatus320197 ... civilstatus10763667
gender -> gender318142 gender320197 ... gender10763667
age -> age318142 age320197 ... age10763667

Stata is telling me that I have exceeded my maxvar on 5000

I am not sure what I am doing wrong...

I have made the following dataset as an example:

input long patient_id contact_id civilstatus gender age
1124 1054758 3 1 58
1124 1067050 3 1 58
1264 5679291 2 1 77
1600 5758910 2 2 46
519 1224024 1 2 77
519 9833793 1 2 77
end
label define civilstatus 1 "Married" 2 "Unmarried" 3"Divorced"
label values civilstatus civilstatus
label define gender 1 "Female" 2 "Male"
label values gender gender
Tags: None
Jorrit Gosens

Join Date: Jan 2015

Posts: 1019
#2

21 Mar 2017, 08:57

It would be good if you could better describe what you are after, in terms of analysis or data structure. Very most likely, reshaping to wide will not be the correct way to approach the problem. A guess is that you would be better helped with collapse: http://www.stata.com/manuals13/dcollapse.pdf, or the creation of new variables (totals, averages, etc) by the groups you are interested in.
Comment
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#3

21 Mar 2017, 09:11

Karoline:
elaborating a bit on Jorrit's helpful reply, you may want to try something along the following lines:

Code:

collapse (mean) civilstatus gender age, by( patient_id) label values gender gender label values civilstatus civilstatus

The code above ruled out -contact_id- as it changes for the same patients, too if he has been hospitalized >1 time.
As an aside, the code above also implies that demographics are time-invarying for the same patient (I would check whether, say, -civilstatus- changes across the hospitalizations).
Closing-out remark: remember to save the original dataset before -collapse-!

Kind regards,
Carlo
(Stata 19.0)
Comment
Karoline Andersen

Join Date: Feb 2017

Posts: 13
#4

04 Apr 2017, 02:21

Dear Jorrit and Carlo
I am so thankful that you take time trying to help me with Stata.
Maybe this collapse command is a soluation to a part of my question. I'm not sure. The generated mean might help me with the problem that some individuals have aged since their last admission, or changed civilstatus.
Still, I end up with 2017 observations instead of decreasing the dataset to the number of individuals. To answer your question Jorrit, I would like to analyse the proportion of males/females, agegroups (extracted from age) and civilstatus of the individuals that have been admitted to the hospital. If I analyze this by each observation (hospitalization) some individuals will count for more than one individual because they have been hospitalized more than once, and that may bias the estimated proportions.
Do you have any soluation to this problem?
Comment
Max Piper

Join Date: Dec 2015

Posts: 61
#5

04 Apr 2017, 06:46

There are some problems I see here:
The maxvar problem is quite clear. When reshaping from long to wide, you have less observations but more variables. In your case, the variable count goes above 5000.
You can try to correct this using set maxvar 7000. This depends on your version of Stata.

What exactly is the reshape command you are trying out that leads to this odd result? What are your specifications?

Then, if a person goes to the hospital twice, why not counting them twice? After all, they use the resources (on average) twice as much (expectedly) as a person that is admitted to the hospital only once. Hence, their age and gender should be accounted for twice in the statistics.

Last edited by Max Piper; 04 Apr 2017, 06:49.
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

04 Apr 2017, 09:25

Perhaps you do not need to reshape or collapse your data, but rather to pick just one observation to represent each individual, and then restrict your commands to that individual. In the example below I choose the record with the highest age (ties broken arbitrarily). Note that I changed one of the ages to differ to better demonstrated this.

Code:

clear
input long patient_id contact_id civilstatus gender age
1124 1054758 3 1 58
1124 1067050 3 1 59
1264 5679291 2 1 77
1600 5758910 2 2 46
519 1224024 1 2 77
519 9833793 1 2 77
end
label define civilstatus 1 "Married" 2 "Unmarried" 3"Divorced"
label values civilstatus civilstatus
label define gender 1 "Female" 2 "Male"
label values gender gender
bysort patient_id (age): generate oldest = _n==_N
list,sepby(patient_id)
tab age gender if oldest
summarize age if oldest

Code:

. list,sepby(patient_id)

     +---------------------------------------------------------+
     | patien~d   contac~d   civilst~s   gender   age   oldest |
     |---------------------------------------------------------|
  1. |      519    9833793     Married     Male    77        0 |
  2. |      519    1224024     Married     Male    77        1 |
     |---------------------------------------------------------|
  3. |     1124    1054758    Divorced   Female    58        0 |
  4. |     1124    1067050    Divorced   Female    59        1 |
     |---------------------------------------------------------|
  5. |     1264    5679291   Unmarried   Female    77        1 |
     |---------------------------------------------------------|
  6. |     1600    5758910   Unmarried     Male    46        1 |
     +---------------------------------------------------------+

. tab age gender if oldest

           |        gender
       age |    Female       Male |     Total
-----------+----------------------+----------
        46 |         0          1 |         1 
        59 |         1          0 |         1 
        77 |         1          1 |         2 
-----------+----------------------+----------
     Total |         2          2 |         4 


. summarize age if oldest

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         age |          4       64.75    15.10794         46         77

Comment

Jorrit Gosens

Join Date: Jan 2015

Posts: 1019
#7

05 Apr 2017, 00:46

I am a little confused with your question now, actually.
First, the collapse command as Carlo suggested really should result in just one observation per patient_id, so I'm not sure what the problem is, not without seeing a data example.

As for the analysis, I guess the tabulate or summarize commands suggested by William get you what you want? If not, clarify the research questions a bit more.
Whether or not you should or should not consider duplicate patient_id's info depends on your research questions a lot. For hospitalization cost, for example, it might make more sense to take the sum of all cost rather than the mean of cost of all hospital visits, whilst summing age would be silly. Collapse can give different statistics for different variables, e.g.:

Code:

collapse (mean) civilstatus gender age (sum) cost, by( patient_id)
Comment
Karoline Andersen

Join Date: Feb 2017

Posts: 13
#8

26 Apr 2017, 07:33

I have now tried William Lisowski's suggestion, and the new variable "oldest" works wonders. I can use this variable to select the latest admission for patients who have been admitted several times in 2015. So thank you very much.
Comment

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment