Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • creating the proportion of each categorical value by year

    HI,

    Could anyone help me out?

    I have one categorical variable (1-4 with an increasing level of smoking prevalence) and year dummy (2009-2017).

    I want to generate the new variable with the proportion of each category for each year.
    category 1 category 2 category 3 category 4 total %
    2009 100
    2011 100
    2012 100
    2014 100
    To simply put, I want my summarized table shown above filled with percentage numbers.

    I have tried according to the link below but did not work as I intended...

    Thanks in advance!
    How can I create variables containing percent summaries?

  • #2
    Hello Fuga. One way is to use tabulate with the row and nofreq options. Here is an example.

    Code:
    clear *
    webuse citytemp2
    * Two-way table of frequencies
    tabulate region agecat
    * Include row percentages
    tabulate region agecat, row
    * Show row percentages only
    tabulate region agecat, row nofreq
    Output from the last tabulate command:
    Code:
    . tabulate region agecat, row nofreq
    
        Census |              agecat
        Region |     19-29      30-34        35+ |     Total
    -----------+---------------------------------+----------
            NE |     27.71      50.00      22.29 |    100.00
       N Cntrl |     57.04      32.39      10.56 |    100.00
         South |     55.60      27.20      17.20 |    100.00
          West |     62.50      28.52       8.98 |    100.00
    -----------+---------------------------------+----------
         Total |     53.03      33.05      13.91 |    100.00

    HTH.

    --
    Bruce Weaver
    Email: [email protected]
    Version: Stata/MP 18.5 (Windows)

    Comment


    • #3
      Bruce, thank you so much for your response. I will use that for now but I will need that percentage variable in the regression analysis later on....

      Would you also show me how to create a new variable with proportions as well?

      Many thanks,

      Comment


      • #4
        So I don't know if your data is already summarized as you show it in post #1, but if they are, you could do it like this:

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input int year byte(cat1 cat2 cat3 cat4)
        2009 40 25 11 38
        2010 22 38 18 36
        2011 16 28 20 38
        2012 27 16 11 20
        end
        
        egen total = rowtotal(cat1 - cat4)  // this adds var cat1 through cat4. Same as rowtotal(cat1 cat2 cat3 cat4)
        list
        
             +------------------------------------------+
             | year   cat1   cat2   cat3   cat4   total |
             |------------------------------------------|
          1. | 2009     40     25     11     38     114 |
          2. | 2010     22     38     18     36     114 |
          3. | 2011     16     28     20     38     102 |
          4. | 2012     27     16     11     20      74 |
             +------------------------------------------+

        Code:
        * Creates a var that is the percentage of people in category1
        forvalues i = 1/4 {
        gen p_cat`i' = cat`i' / total
        }
        
        * Just checking the total
        format p_cat1 p_cat2 p_cat3 p_cat4 %9.4f  // just limiting to 4 decimal places
        egen p_total = rowtotal( p_cat1- p_cat4)
        
        . list
        
             +----------------------------------------------------------------------------------------+
             | year   cat1   cat2   cat3   cat4   total   p_cat1   p_cat2   p_cat3   p_cat4   p_total |
             |----------------------------------------------------------------------------------------|
          1. | 2009     40     25     11     38     114   0.3509   0.2193   0.0965   0.3333         1 |
          2. | 2010     22     38     18     36     114   0.1930   0.3333   0.1579   0.3158         1 |
          3. | 2011     16     28     20     38     102   0.1569   0.2745   0.1961   0.3725         1 |
          4. | 2012     27     16     11     20      74   0.3649   0.2162   0.1486   0.2703         1 |
             +----------------------------------------------------------------------------------------+

        Comment


        • #5
          Extending the example in #2, you could do something like this:

          Code:
          * Add a new variable rowpc that stores the row percentages.
          by region agecat, sort: generate freq = _N
          by region: generate rowpc = 100 * freq / _N
          egen byte tagged = tag(region agecat)
          list region agecat rowpc if tagged, noobs sep(3)
          tabulate region agecat, row nofreq
          Output from the list and tabulate commands:
          Code:
          . list region agecat rowpc if tagged, noobs sep(3)
          
            +-----------------------------+
            |  region   agecat      rowpc |
            |-----------------------------|
            |      NE    19-29   27.71084 |
            |      NE    30-34         50 |
            |      NE      35+   22.28916 |
            |-----------------------------|
            | N Cntrl    19-29   57.04226 |
            | N Cntrl    30-34   32.39437 |
            | N Cntrl      35+   10.56338 |
            |-----------------------------|
            |   South    19-29       55.6 |
            |   South    30-34       27.2 |
            |   South      35+       17.2 |
            |-----------------------------|
            |    West    19-29       62.5 |
            |    West    30-34   28.51563 |
            |    West      35+   8.984375 |
            +-----------------------------+
          
          . tabulate region agecat, row nofreq
          
              Census |              agecat
              Region |     19-29      30-34        35+ |     Total
          -----------+---------------------------------+----------
                  NE |     27.71      50.00      22.29 |    100.00
             N Cntrl |     57.04      32.39      10.56 |    100.00
               South |     55.60      27.20      17.20 |    100.00
                West |     62.50      28.52       8.98 |    100.00
          -----------+---------------------------------+----------
               Total |     53.03      33.05      13.91 |    100.00
          The approach used above is borrowed from this FAQ, by the way.

          HTH.

          Last edited by Bruce Weaver; 02 Jan 2019, 18:55. Reason: Added reference to Stata FAQ.
          --
          Bruce Weaver
          Email: [email protected]
          Version: Stata/MP 18.5 (Windows)

          Comment


          • #6
            Thanks again,

            Now that I run your code, I have different results for
            Code:
             
             list region agecat rowpc if tagged, noobs sep(3)
            and
            Code:
             
             tabulate region agecat, row nofreq
            for the row pc in your dataset and percentages written on the table.

            I have also created the histogram for each year and it looks very similar to one in
            Code:
             
             tabulate region agecat, row nofreq
            so do you have any idea why is it different?

            Thanks,

            Comment


            • #7
              to add, the reason why it is different is that for the code
              Code:
               
               by region agecat, sort: generate freq = _N
              Code:
              by region: generate rowpc = 100 * freq / _N
              Code:
              egen byte tagged = tag(region agecat)
              Code:
              list region agecat rowpc if tagged, noobs sep(3)
              it includes the observation that are missing and hence the percentage shows lower than it should be.(for my dataset)

              On the other hand,
              Code:
               
               tabulate region agecat, row nofreq
              it excludes the missing observations and gives correct percentage.

              Could you help me out with the code for the first method using observations that are non-missing ?

              Thank you so much in advance

              Comment


              • #8


                Code:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input byte id int year str4 category
                 1 2009 "cat1"
                 2 2009 "cat2"
                 3 2009 "cat3"
                 4 2009 "cat3"
                 5 2009 "cat4"
                 6 2009 "cat2"
                 7 2009 "cat3"
                 8 2009 "cat3"
                 9 2009 "cat2"
                10 2009 "cat4"
                11 2009 "cat4"
                12 2009 "cat4"
                 1 2010 "cat3"
                 2 2010 "cat4"
                 3 2010 "cat4"
                 4 2010 "cat4"
                 5 2010 "cat1"
                 8 2010 "cat3"
                 9 2010 "cat1"
                10 2010 "cat1"
                11 2010 "cat2"
                12 2010 "cat1"
                 1 2011 "cat2"
                 2 2011 "cat3"
                 3 2011 "cat2"
                 4 2011 "cat2"
                 6 2011 "cat4"
                 7 2011 "cat2"
                 8 2011 "cat4"
                 9 2011 "cat2"
                10 2011 "cat1"
                end
                Code:
                . tabulate year category
                
                           |                  category
                      Year |      cat1       cat2       cat3       cat4 |     Total
                -----------+--------------------------------------------+----------
                      2009 |         1          3          4          4 |        12
                      2010 |         4          1          2          3 |        10
                      2011 |         1          5          1          2 |         9
                -----------+--------------------------------------------+----------
                     Total |         6          9          7          9 |        31
                
                . tabulate year category, row
                
                +----------------+
                | Key            |
                |----------------|
                |   frequency    |
                | row percentage |
                +----------------+
                
                           |                  category
                      Year |      cat1       cat2       cat3       cat4 |     Total
                -----------+--------------------------------------------+----------
                      2009 |         1          3          4          4 |        12
                           |      8.33      25.00      33.33      33.33 |    100.00
                -----------+--------------------------------------------+----------
                      2010 |         4          1          2          3 |        10
                           |     40.00      10.00      20.00      30.00 |    100.00
                -----------+--------------------------------------------+----------
                      2011 |         1          5          1          2 |         9
                           |     11.11      55.56      11.11      22.22 |    100.00
                -----------+--------------------------------------------+----------
                     Total |         6          9          7          9 |        31
                           |     19.35      29.03      22.58      29.03 |    100.00
                Code:
                bysort year category: gen cat_total = _N
                bysort year category: egen cat_total2 = count( category)  // this one won't include if category==""
                * In the above dataset they will be the same because there are no missing data
                bysort year: egen yr_total = count( category)  // counting number of obs in each year if category!=""
                gen cat_pc = cat_total / yr_total  // the rowpc from previous
                * replace cat_pc = cat_pc * 100 if you want 25% to be 25.0 rather than 0.25
                
                * If done correctly, these % should match row percentage from "tabulate year category, row" above
                * I made 2010 obs red to make it easier to distinguish different years
                . list year id category cat_total cat_total2 yr_total cat_pc , sepby(year category ) noobs abbrev(12)
                
                  +---------------------------------------------------------------------+
                  | year   id   category   cat_total   cat_total2   yr_total     cat_pc |
                  |---------------------------------------------------------------------|
                  | 2009    1       cat1           1            1         12   .0833333 |
                  |---------------------------------------------------------------------|
                  | 2009    2       cat2           3            3         12        .25 |
                  | 2009    9       cat2           3            3         12        .25 |
                  | 2009    6       cat2           3            3         12        .25 |
                  |---------------------------------------------------------------------|
                  | 2009    7       cat3           4            4         12   .3333333 |
                  | 2009    4       cat3           4            4         12   .3333333 |
                  | 2009    8       cat3           4            4         12   .3333333 |
                  | 2009    3       cat3           4            4         12   .3333333 |
                  |---------------------------------------------------------------------|
                  | 2009   10       cat4           4            4         12   .3333333 |
                  | 2009   11       cat4           4            4         12   .3333333 |
                  | 2009   12       cat4           4            4         12   .3333333 |
                  | 2009    5       cat4           4            4         12   .3333333 |
                  |---------------------------------------------------------------------|
                  | 2010    5       cat1           4            4         10         .4 |
                  | 2010   12       cat1           4            4         10         .4 |
                  | 2010   10       cat1           4            4         10         .4 |
                  | 2010    9       cat1           4            4         10         .4 |
                  |---------------------------------------------------------------------|
                  | 2010   11       cat2           1            1         10         .1 |
                  |---------------------------------------------------------------------|
                  | 2010    8       cat3           2            2         10         .2 |
                  | 2010    1       cat3           2            2         10         .2 |
                  |---------------------------------------------------------------------|
                  | 2010    4       cat4           3            3         10         .3 |
                  | 2010    3       cat4           3            3         10         .3 |
                  | 2010    2       cat4           3            3         10         .3 |
                  |---------------------------------------------------------------------|
                  | 2011   10       cat1           1            1          9   .1111111 |
                  |---------------------------------------------------------------------|
                  | 2011    9       cat2           5            5          9   .5555556 |
                  | 2011    1       cat2           5            5          9   .5555556 |
                  | 2011    3       cat2           5            5          9   .5555556 |
                  | 2011    7       cat2           5            5          9   .5555556 |
                  | 2011    4       cat2           5            5          9   .5555556 |
                  |---------------------------------------------------------------------|
                  | 2011    2       cat3           1            1          9   .1111111 |
                  |---------------------------------------------------------------------|
                  | 2011    6       cat4           2            2          9   .2222222 |
                  | 2011    8       cat4           2            2          9   .2222222 |
                  +---------------------------------------------------------------------+
                Last edited by David Benson; 02 Jan 2019, 21:40.

                Comment


                • #9
                  You can just segregate the missings.

                  Code:
                  . sysuse auto, clear
                  (1978 Automobile Data)
                  
                  . gen OK = !missing(foreign, rep78)
                  
                  . bysort OK foreign rep78 : gen freq = _N
                  
                  . by OK foreign : gen total = _N
                  
                  . gen pc = 100 * freq / total
                  
                  . tabdisp rep78 foreign, c(pc)
                  
                  ------------------------------
                  Repair    |
                  Record    |      Car type     
                  1978      | Domestic   Foreign
                  ----------+-------------------
                          1 | 4.166667          
                          2 | 16.66667          
                          3 |    56.25  14.28571
                          4 |    18.75  42.85714
                          5 | 4.166667  42.85714
                          . |      100       100
                  ------------------------------
                  
                  . tab rep78 foreign, col
                  
                  +-------------------+
                  | Key               |
                  |-------------------|
                  |     frequency     |
                  | column percentage |
                  +-------------------+
                  
                      Repair |
                      Record |       Car type
                        1978 |  Domestic    Foreign |     Total
                  -----------+----------------------+----------
                           1 |         2          0 |         2 
                             |      4.17       0.00 |      2.90 
                  -----------+----------------------+----------
                           2 |         8          0 |         8 
                             |     16.67       0.00 |     11.59 
                  -----------+----------------------+----------
                           3 |        27          3 |        30 
                             |     56.25      14.29 |     43.48 
                  -----------+----------------------+----------
                           4 |         9          9 |        18 
                             |     18.75      42.86 |     26.09 
                  -----------+----------------------+----------
                           5 |         2          9 |        11 
                             |      4.17      42.86 |     15.94 
                  -----------+----------------------+----------
                       Total |        48         21 |        69 
                             |    100.00     100.00 |    100.00

                  Comment


                  • #10
                    Is it possible to put the info generate using the below command into a stacked bar charrt?
                    tabulate region agecat, row nofreq

                    Comment


                    • #11
                      #10 Naturally. You don't supply a data example or use a standard Stata dataset as reference, so here is an analogue.

                      As you asked for stacked bar charts, I used catplot from SSC, which must be installed using


                      ssc install catplot

                      before you can use it. But I think (much) better plots are possible with tabplot (Stata Journal) and floatplot (SSC).

                      Code:
                      sysuse auto, clear
                      tab foreign rep78, row nofreq
                      
                      set scheme s1color 
                      
                      catplot rep78 foreign, percent(foreign) asyvars stack bar(1, lcolor(red) fcolor(red*0.6)) bar(2, lcolor(red) fcolor(red*0.2)) bar(3, lcolor(blue) fcolor(blue*0.2)) bar(4, lcolor(blue) fcolor(blue*0.6)) bar(5, color(blue)) legend(row(1)) title(Repair record 1978)
                      Click image for larger version

Name:	catplot4.png
Views:	1
Size:	16.6 KB
ID:	1693061


                      On tabplot https://www.statalist.org/forums/for...updated-on-ssc gives the flavour.

                      On floatplot my London 2021 talk at https://www.stata.com/meeting/uk21/ perhaps is the best source still.

                      I am hoping to publish on graphical for ordinal outcomes in 2023.

                      Comment

                      Working...
                      X