Combining categorical variables

Michael Perkin

Join Date: Nov 2015

Posts: 3
#1

05 Nov 2015, 10:06

I am trying to summarise a categorical variable in Stata from a question that was asked repeatedly in a cohort study. I would like to pool the results for posseting for these three months, i.e. the never category at each of the time points would be added, as would each of the other categories. I’m after a new variable that represents the distribution of posseting in these categories for the three months combined. None of the various egen commands do what I require, and whilst I suspect that there may be a straightforward solution, I have not been able to work out what it is so far.

Many thanks,

Michael

. tab tcposset_q4m

  posseted since |
        3m visit |      Freq.     Percent        Cum.
-----------------+-----------------------------------
           never |         94        7.83        7.83
 monthly or less |        103        8.58       16.42
          weekly |        107        8.92       25.33
2-4 times a week |        185       15.42       40.75
5-6 times a week |        109        9.08       49.83
           daily |        230       19.17       69.00
 more than daily |        372       31.00      100.00
-----------------+-----------------------------------
           Total |      1,200      100.00

. tab tcposset_q5m

  posseted since |
        3m visit |      Freq.     Percent        Cum.
-----------------+-----------------------------------
           never |        117       10.07       10.07
 monthly or less |         96        8.26       18.33
          weekly |        134       11.53       29.86
2-4 times a week |        209       17.99       47.85
5-6 times a week |        102        8.78       56.63
           daily |        214       18.42       75.04
 more than daily |        290       24.96      100.00
-----------------+-----------------------------------
           Total |      1,162      100.00

. tab tcposset_q6m

  posseted since |
        3m visit |      Freq.     Percent        Cum.
-----------------+-----------------------------------
           never |        174       15.30       15.30
 monthly or less |        160       14.07       29.38
          weekly |        171       15.04       44.42
2-4 times a week |        194       17.06       61.48
5-6 times a week |         96        8.44       69.92
           daily |        178       15.66       85.58
 more than daily |        164       14.42      100.00
-----------------+-----------------------------------
           Total |      1,137      100.00

I would like to pool the results for posseting for these three months, i.e. the never category at each of the time points would be added, as would each of the other categories. I’m after a new variable that represents the distribution of posseting in these categories for the three months combined. Does this make sense?!
Nick Cox

Join Date: Mar 2014

Posts: 35730
#2

05 Nov 2015, 10:17

Not to me. Underneath the value labels are presumably numeric values, say 1 to 6. What are your rules for combining 1 to 6? There are in principle 6 cubed = 216 possible joint values, so how are they to be reduced to a composite? If it is just straight addition, which you seem to be implying, then that is

Code:

* two alternatives; run one or the other, not both
gen tcposset = tcposset_q4m + tcposset_q5m + tcposset_q6m
egen tcposset = rowtotal(tcposset_q?m)

But it can't be that, as you have explained that egen does not help.
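
The two commands above are not equivalent when values are missing, which may matter here because the three totals differ (1,200, 1,162 and 1,137). A minimal sketch on a toy dataset, where the variables a, b and c are hypothetical stand-ins for the three monthly variables:

```stata
* toy data: the second observation has a missing value in b
clear
input a b c
1 2 3
4 . 6
end

* plain addition returns missing if any component is missing
gen sum_gen = a + b + c

* egen's rowtotal() treats missing values as 0
egen sum_egen = rowtotal(a b c)

list
```

In the second observation sum_gen is missing while sum_egen is 10, so the choice between the two depends on how incomplete records should be handled.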
Michael Perkin

Join Date: Nov 2015

Posts: 3
#3

05 Nov 2015, 11:29

Thanks for the prompt response! To clarify, the new variable I am trying to create, representing posseting during the three months combined, would have the following values:

Never (which, as you say, has the underlying numeric value 1) = 385 (94+117+174)
Monthly or less (numeric value 2) = 359 (103+96+160)
etc etc

The percentage distribution of this new variable would represent the relative frequency of posseting over the three-month period combined.

I hope that helps.
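
The arithmetic described here can be sketched directly from the frequencies in the three tables in post #1, without touching the data. This is only an illustration of the pooling being asked for; the matrix names F, pooled, total and pct are made up for the example.

```stata
* rows: never, monthly or less, weekly, 2-4 times a week,
*       5-6 times a week, daily, more than daily
* columns: 4m, 5m, 6m frequencies copied from the tables in post #1
matrix F = (94, 117, 174 \ 103, 96, 160 \ 107, 134, 171 \ ///
            185, 209, 194 \ 109, 102, 96 \ 230, 214, 178 \ 372, 290, 164)

* row sums: pooled frequency for each category
matrix pooled = F * J(3, 1, 1)

* grand total and percentage distribution of the pooled counts
matrix total = J(1, 7, 1) * pooled
scalar N = total[1,1]
matrix pct = (100 / N) * pooled

matrix list pooled
matrix list pct
```

The first two rows of pooled reproduce the 385 and 359 given above, and N equals 1,200 + 1,162 + 1,137.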

Nick Cox

Join Date: Mar 2014

Posts: 35730
#4

05 Nov 2015, 11:40

I see: you want a combined contingency table. That isn't anything to do with a new variable that could be consistent with your present data structure. Think of it this way: in which observations would those values go?

tabm (from tab_chi (SSC)) is one way to get that table. Here's a sandbox and a demonstration.

Code:

clear 
set obs 1200 
set seed 2803 

forval j = 1/3 { 
     gen y`j' = ceil(6 * runiform()) 
} 

* next line done just once
ssc inst tab_chi 


. tabm y?, transpose  

           |             variable
    values |        y1         y2         y3 |     Total
-----------+---------------------------------+----------
         1 |       191        196        202 |       589 
         2 |       193        193        218 |       604 
         3 |       202        200        204 |       606 
         4 |       202        221        176 |       599 
         5 |       206        201        199 |       606 
         6 |       206        189        201 |       596 
-----------+---------------------------------+----------
     Total |     1,200      1,200      1,200 |     3,600

If you want to do anything else with the results, tabm has an option to save that table as a new dataset.
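
For comparison, a similar pooled table can be sketched with official commands only, by stacking the variables into one long variable and tabulating it. This is an alternative to tabm, not what tabm does internally; it reuses the sandbox variables y1-y3 from above, and the name y for the stacked variable is an arbitrary choice.

```stata
* same sandbox as above
clear
set obs 1200
set seed 2803

forval j = 1/3 {
    gen y`j' = ceil(6 * runiform())
}

* stack replaces the data in memory, so work on a temporary copy
preserve
stack y1 y2 y3, into(y) clear

* pooled distribution with percentages
tab y

* one column per original variable; stack records the source in _stack
tab y _stack
restore
```

Because tabulate omits missing values by default, the pooled total reflects only the non-missing responses at each time point.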


Michael Perkin

Join Date: Nov 2015

Posts: 3
#5

06 Nov 2015, 14:43

Thanks Nick,

tabm did exactly what I needed. I then used the tabi command to compare the two study groups. Much appreciated.

Michael
