creating the proportion of each categorical value by year

Fuga Iwama

Join Date: Dec 2018

Posts: 28
#1

creating the proportion of each categorical value by year

02 Jan 2019, 08:22

HI,

Could anyone help me out?

I have one categorical variable (1-4 with an increasing level of smoking prevalence) and year dummy (2009-2017).

I want to generate the new variable with the proportion of each category for each year.

category 1 category 2 category 3 category 4 total %

2009 100

2011 100

2012 100

2014 100

To simply put, I want my summarized table shown above filled with percentage numbers.

I have tried according to the link below but did not work as I intended...

Thanks in advance!

Stata | FAQ: Creating percent summary variables

https://www.stata.com

How can I create variables containing percent summaries?
Tags: None

Bruce Weaver

Join Date: May 2014
Posts: 1129

02 Jan 2019, 08:30

Hello Fuga. One way is to use tabulate with the row and nofreq options. Here is an example.

Code:

clear *
webuse citytemp2
* Two-way table of frequencies
tabulate region agecat
* Include row percentages
tabulate region agecat, row
* Show row percentages only
tabulate region agecat, row nofreq

Output from the last tabulate command:

Code:

. tabulate region agecat, row nofreq

    Census |              agecat
    Region |     19-29      30-34        35+ |     Total
-----------+---------------------------------+----------
        NE |     27.71      50.00      22.29 |    100.00
   N Cntrl |     57.04      32.39      10.56 |    100.00
     South |     55.60      27.20      17.20 |    100.00
      West |     62.50      28.52       8.98 |    100.00
-----------+---------------------------------+----------
     Total |     53.03      33.05      13.91 |    100.00

HTH.

--
Bruce Weaver
Email: [email protected]
Version: Stata/MP 18.5 (Windows)

Comment

Fuga Iwama

Join Date: Dec 2018

Posts: 28
#3

02 Jan 2019, 18:11

Bruce, thank you so much for your response. I will use that for now but I will need that percentage variable in the regression analysis later on....

Would you also show me how to create a new variable with proportions as well?

Many thanks,
Comment

David Benson

Join Date: Oct 2018
Posts: 489

02 Jan 2019, 18:32

So I don't know if your data is already summarized as you show it in post #1, but if they are, you could do it like this:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int year byte(cat1 cat2 cat3 cat4)
2009 40 25 11 38
2010 22 38 18 36
2011 16 28 20 38
2012 27 16 11 20
end

egen total = rowtotal(cat1 - cat4)  // this adds var cat1 through cat4. Same as rowtotal(cat1 cat2 cat3 cat4)
list

     +------------------------------------------+
     | year   cat1   cat2   cat3   cat4   total |
     |------------------------------------------|
  1. | 2009     40     25     11     38     114 |
  2. | 2010     22     38     18     36     114 |
  3. | 2011     16     28     20     38     102 |
  4. | 2012     27     16     11     20      74 |
     +------------------------------------------+

Code:

* Creates a var that is the percentage of people in category1
forvalues i = 1/4 {
gen p_cat`i' = cat`i' / total
}

* Just checking the total
format p_cat1 p_cat2 p_cat3 p_cat4 %9.4f  // just limiting to 4 decimal places
egen p_total = rowtotal( p_cat1- p_cat4)

. list

     +----------------------------------------------------------------------------------------+
     | year   cat1   cat2   cat3   cat4   total   p_cat1   p_cat2   p_cat3   p_cat4   p_total |
     |----------------------------------------------------------------------------------------|
  1. | 2009     40     25     11     38     114   0.3509   0.2193   0.0965   0.3333         1 |
  2. | 2010     22     38     18     36     114   0.1930   0.3333   0.1579   0.3158         1 |
  3. | 2011     16     28     20     38     102   0.1569   0.2745   0.1961   0.3725         1 |
  4. | 2012     27     16     11     20      74   0.3649   0.2162   0.1486   0.2703         1 |
     +----------------------------------------------------------------------------------------+

Comment

Bruce Weaver

Join Date: May 2014
Posts: 1129

02 Jan 2019, 18:52

Extending the example in #2, you could do something like this:

Code:

* Add a new variable rowpc that stores the row percentages.
by region agecat, sort: generate freq = _N
by region: generate rowpc = 100 * freq / _N
egen byte tagged = tag(region agecat)
list region agecat rowpc if tagged, noobs sep(3)
tabulate region agecat, row nofreq

Output from the list and tabulate commands:

Code:

. list region agecat rowpc if tagged, noobs sep(3)

  +-----------------------------+
  |  region   agecat      rowpc |
  |-----------------------------|
  |      NE    19-29   27.71084 |
  |      NE    30-34         50 |
  |      NE      35+   22.28916 |
  |-----------------------------|
  | N Cntrl    19-29   57.04226 |
  | N Cntrl    30-34   32.39437 |
  | N Cntrl      35+   10.56338 |
  |-----------------------------|
  |   South    19-29       55.6 |
  |   South    30-34       27.2 |
  |   South      35+       17.2 |
  |-----------------------------|
  |    West    19-29       62.5 |
  |    West    30-34   28.51563 |
  |    West      35+   8.984375 |
  +-----------------------------+

. tabulate region agecat, row nofreq

    Census |              agecat
    Region |     19-29      30-34        35+ |     Total
-----------+---------------------------------+----------
        NE |     27.71      50.00      22.29 |    100.00
   N Cntrl |     57.04      32.39      10.56 |    100.00
     South |     55.60      27.20      17.20 |    100.00
      West |     62.50      28.52       8.98 |    100.00
-----------+---------------------------------+----------
     Total |     53.03      33.05      13.91 |    100.00

The approach used above is borrowed from this FAQ, by the way.

HTH.

Last edited by Bruce Weaver; 02 Jan 2019, 18:55. Reason: Added reference to Stata FAQ.

--
Bruce Weaver
Email: [email protected]
Version: Stata/MP 18.5 (Windows)

Comment

Fuga Iwama

Join Date: Dec 2018

Posts: 28
#6

02 Jan 2019, 19:39

Thanks again,

Now that I run your code, I have different results for

Code:

list region agecat rowpc if tagged, noobs sep(3)

and

Code:

tabulate region agecat, row nofreq

for the row pc in your dataset and percentages written on the table.

I have also created the histogram for each year and it looks very similar to one in

Code:

tabulate region agecat, row nofreq

so do you have any idea why is it different?

Thanks,
Comment
Fuga Iwama

Join Date: Dec 2018

Posts: 28
#7

02 Jan 2019, 19:53

to add, the reason why it is different is that for the code

Code:

by region agecat, sort: generate freq = _N

Code:

by region: generate rowpc = 100 * freq / _N

Code:

egen byte tagged = tag(region agecat)

Code:

list region agecat rowpc if tagged, noobs sep(3)

it includes the observation that are missing and hence the percentage shows lower than it should be.(for my dataset)

On the other hand,

Code:

tabulate region agecat, row nofreq

it excludes the missing observations and gives correct percentage.

Could you help me out with the code for the first method using observations that are non-missing ?

Thank you so much in advance
Comment

David Benson

Join Date: Oct 2018
Posts: 489

02 Jan 2019, 21:23

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte id int year str4 category
 1 2009 "cat1"
 2 2009 "cat2"
 3 2009 "cat3"
 4 2009 "cat3"
 5 2009 "cat4"
 6 2009 "cat2"
 7 2009 "cat3"
 8 2009 "cat3"
 9 2009 "cat2"
10 2009 "cat4"
11 2009 "cat4"
12 2009 "cat4"
 1 2010 "cat3"
 2 2010 "cat4"
 3 2010 "cat4"
 4 2010 "cat4"
 5 2010 "cat1"
 8 2010 "cat3"
 9 2010 "cat1"
10 2010 "cat1"
11 2010 "cat2"
12 2010 "cat1"
 1 2011 "cat2"
 2 2011 "cat3"
 3 2011 "cat2"
 4 2011 "cat2"
 6 2011 "cat4"
 7 2011 "cat2"
 8 2011 "cat4"
 9 2011 "cat2"
10 2011 "cat1"
end

Code:

. tabulate year category

           |                  category
      Year |      cat1       cat2       cat3       cat4 |     Total
-----------+--------------------------------------------+----------
      2009 |         1          3          4          4 |        12
      2010 |         4          1          2          3 |        10
      2011 |         1          5          1          2 |         9
-----------+--------------------------------------------+----------
     Total |         6          9          7          9 |        31

. tabulate year category, row

+----------------+
| Key            |
|----------------|
|   frequency    |
| row percentage |
+----------------+

           |                  category
      Year |      cat1       cat2       cat3       cat4 |     Total
-----------+--------------------------------------------+----------
      2009 |         1          3          4          4 |        12
           |      8.33      25.00      33.33      33.33 |    100.00
-----------+--------------------------------------------+----------
      2010 |         4          1          2          3 |        10
           |     40.00      10.00      20.00      30.00 |    100.00
-----------+--------------------------------------------+----------
      2011 |         1          5          1          2 |         9
           |     11.11      55.56      11.11      22.22 |    100.00
-----------+--------------------------------------------+----------
     Total |         6          9          7          9 |        31
           |     19.35      29.03      22.58      29.03 |    100.00

Code:

bysort year category: gen cat_total = _N
bysort year category: egen cat_total2 = count( category)  // this one won't include if category==""
* In the above dataset they will be the same because there are no missing data
bysort year: egen yr_total = count( category)  // counting number of obs in each year if category!=""
gen cat_pc = cat_total / yr_total  // the rowpc from previous
* replace cat_pc = cat_pc * 100 if you want 25% to be 25.0 rather than 0.25

* If done correctly, these % should match row percentage from "tabulate year category, row" above
* I made 2010 obs red to make it easier to distinguish different years
. list year id category cat_total cat_total2 yr_total cat_pc , sepby(year category ) noobs abbrev(12)

  +---------------------------------------------------------------------+
  | year   id   category   cat_total   cat_total2   yr_total     cat_pc |
  |---------------------------------------------------------------------|
  | 2009    1       cat1           1            1         12   .0833333 |
  |---------------------------------------------------------------------|
  | 2009    2       cat2           3            3         12        .25 |
  | 2009    9       cat2           3            3         12        .25 |
  | 2009    6       cat2           3            3         12        .25 |
  |---------------------------------------------------------------------|
  | 2009    7       cat3           4            4         12   .3333333 |
  | 2009    4       cat3           4            4         12   .3333333 |
  | 2009    8       cat3           4            4         12   .3333333 |
  | 2009    3       cat3           4            4         12   .3333333 |
  |---------------------------------------------------------------------|
  | 2009   10       cat4           4            4         12   .3333333 |
  | 2009   11       cat4           4            4         12   .3333333 |
  | 2009   12       cat4           4            4         12   .3333333 |
  | 2009    5       cat4           4            4         12   .3333333 |
  |---------------------------------------------------------------------|
  | 2010    5       cat1           4            4         10         .4 |
  | 2010   12       cat1           4            4         10         .4 |
  | 2010   10       cat1           4            4         10         .4 |
  | 2010    9       cat1           4            4         10         .4 |
  |---------------------------------------------------------------------|
  | 2010   11       cat2           1            1         10         .1 |
  |---------------------------------------------------------------------|
  | 2010    8       cat3           2            2         10         .2 |
  | 2010    1       cat3           2            2         10         .2 |
  |---------------------------------------------------------------------|
  | 2010    4       cat4           3            3         10         .3 |
  | 2010    3       cat4           3            3         10         .3 |
  | 2010    2       cat4           3            3         10         .3 |
  |---------------------------------------------------------------------|
  | 2011   10       cat1           1            1          9   .1111111 |
  |---------------------------------------------------------------------|
  | 2011    9       cat2           5            5          9   .5555556 |
  | 2011    1       cat2           5            5          9   .5555556 |
  | 2011    3       cat2           5            5          9   .5555556 |
  | 2011    7       cat2           5            5          9   .5555556 |
  | 2011    4       cat2           5            5          9   .5555556 |
  |---------------------------------------------------------------------|
  | 2011    2       cat3           1            1          9   .1111111 |
  |---------------------------------------------------------------------|
  | 2011    6       cat4           2            2          9   .2222222 |
  | 2011    8       cat4           2            2          9   .2222222 |
  +---------------------------------------------------------------------+

Last edited by David Benson; 02 Jan 2019, 21:40.

Comment

Nick Cox

Join Date: Mar 2014
Posts: 35641

03 Jan 2019, 02:44

You can just segregate the missings.

Code:

. sysuse auto, clear
(1978 Automobile Data)

. gen OK = !missing(foreign, rep78)

. bysort OK foreign rep78 : gen freq = _N

. by OK foreign : gen total = _N

. gen pc = 100 * freq / total

. tabdisp rep78 foreign, c(pc)

------------------------------
Repair    |
Record    |      Car type     
1978      | Domestic   Foreign
----------+-------------------
        1 | 4.166667          
        2 | 16.66667          
        3 |    56.25  14.28571
        4 |    18.75  42.85714
        5 | 4.166667  42.85714
        . |      100       100
------------------------------

. tab rep78 foreign, col

+-------------------+
| Key               |
|-------------------|
|     frequency     |
| column percentage |
+-------------------+

    Repair |
    Record |       Car type
      1978 |  Domestic    Foreign |     Total
-----------+----------------------+----------
         1 |         2          0 |         2 
           |      4.17       0.00 |      2.90 
-----------+----------------------+----------
         2 |         8          0 |         8 
           |     16.67       0.00 |     11.59 
-----------+----------------------+----------
         3 |        27          3 |        30 
           |     56.25      14.29 |     43.48 
-----------+----------------------+----------
         4 |         9          9 |        18 
           |     18.75      42.86 |     26.09 
-----------+----------------------+----------
         5 |         2          9 |        11 
           |      4.17      42.86 |     15.94 
-----------+----------------------+----------
     Total |        48         21 |        69 
           |    100.00     100.00 |    100.00

Comment

amanda watson

Join Date: Oct 2022

Posts: 10
#10

11 Dec 2022, 17:39

Is it possible to put the info generate using the below command into a stacked bar charrt?
tabulate region agecat, row nofreq
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35641
#11

12 Dec 2022, 06:10

#10 Naturally. You don't supply a data example or use a standard Stata dataset as reference, so here is an analogue.

As you asked for stacked bar charts, I used catplot from SSC, which must be installed using

ssc install catplot

before you can use it. But I think (much) better plots are possible with tabplot (Stata Journal) and floatplot (SSC).

Code:

sysuse auto, clear tab foreign rep78, row nofreq set scheme s1color catplot rep78 foreign, percent(foreign) asyvars stack bar(1, lcolor(red) fcolor(red*0.6)) bar(2, lcolor(red) fcolor(red*0.2)) bar(3, lcolor(blue) fcolor(blue*0.2)) bar(4, lcolor(blue) fcolor(blue*0.6)) bar(5, color(blue)) legend(row(1)) title(Repair record 1978)

On tabplot https://www.statalist.org/forums/for...updated-on-ssc gives the flavour.

On floatplot my London 2021 talk at https://www.stata.com/meeting/uk21/ perhaps is the best source still.

I am hoping to publish on graphical for ordinal outcomes in 2023.
Comment

Announcement