Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Contract multiple categorical variables one-way with same levels?

    I must be missing something really basic here, but how can I contract the frequencies of multiple categorical variables that have the same levels in a single line of code?

    For example, I want to transform this dataset:

    Code:
    clear
    
    set obs 20
    
    gen id = _n
    
    set seed 1234
    gen year1 = round(runiform(1, 3))
    
    set seed 19283
    gen year2 = round(runiform(1, 3))
    
    set seed 91827
    gen year3 = round(runiform(1, 3))
    
    label define categories 1 "Low" 2 "Medium" 3 "High"
    label values year* categories
    Into a dataset with the simple counts, one row per level:

    Code:
    Category      year1     year2     year3
    Low           8         1         5
    Medium        2         10        9
    High          10        9         6
    Something really inefficient would be the following, but what very basic command am I missing to do this quickly?

    Code:
    foreach var of varlist year* {
        preserve
        contract `var', freq(`var'_count)
        rename `var' category
        save `var', replace
        restore
    }
    
    use year1, clear
    merge 1:1 category using year2, nogen
    merge 1:1 category using year3, nogen

  • #2

    Nothing downloadable from ssc springs to mind.

    Here is another way that does not require looping or merge.
    Code:
    reshape long year, i(id) j(j)
    rename year category
    collapse (count) year=id, by(category j)
    reshape wide year, i(category) j(j)
    list
    Here is the resulting dataset.
    Code:
    . list
    
         +----------------------------------+
         | category   year1   year2   year3 |
         |----------------------------------|
      1. |      Low       8       1       5 |
      2. |   Medium       2      10       9 |
      3. |     High      10       9       6 |
         +----------------------------------+

    Comment


    • #3
      I don't think you are missing any simple command to do this in one or two steps. The reason is that you are asking Stata to transform the data in a way that discards all of the information about the correspondence between values of year1, year2, and year3. You are just interested in their totals as if they were unrelated variables. The most inefficient part of your approach is that it thrashes the disk. You can do this all in memory by passing the data through a matrix instead, although unless your real data set is very large the difference in performance probably will not be noticeable:
      Code:
      clear
      
      set obs 20
      
      gen id = _n
      
      set seed 1234
      gen year1 = round(runiform(1, 3))
      
      set seed 19283
      gen year2 = round(runiform(1, 3))
      
      set seed 91827
      gen year3 = round(runiform(1, 3))
      
      label define categories 1 "Low" 2 "Medium" 3 "High"
      label values year* categories
      
      forvalues i = 1/3 {
          tab year`i', matcell(count`i') matrow(values)
      }
      matrix M = values, count1, count2, count3
      matrix colnames M = category year1 year2 year3
      drop _all
      svmat M, names(col)
      label values category categories
      Added: Crossed with #2.
      Last edited by Clyde Schechter; 12 Aug 2023, 11:09.

      Comment


      • #4
        Many thanks to both of you, Jeff and Clyde, for these clever approaches. Much appreciated!

        Comment


        • #5
          tabm from tab_chi on SSC can help here too.

          Code:
          . clear
          
          . 
          . set obs 20
          Number of observations (_N) was 0, now 20.
          
          . 
          . gen id = _n
          
          . 
          . set seed 1234
          
          . gen year1 = round(runiform(1, 3))
          
          . 
          . set seed 19283
          
          . gen year2 = round(runiform(1, 3))
          
          . 
          . set seed 91827
          
          . gen year3 = round(runiform(1, 3))
          
          . 
          . label define categories 1 "Low" 2 "Medium" 3 "High"
          
          . label values year* categories
          
          . 
          . tabm year*, transpose replace 
          
                     |             variable
              values |     year1      year2      year3 |     Total
          -----------+---------------------------------+----------
                 Low |         8          1          5 |        14 
              Medium |         2         10          9 |        21 
                High |        10          9          6 |        25 
          -----------+---------------------------------+----------
               Total |        20         20         20 |        60 
          
          . 
          . list 
          
               +------------------+
               | _stack   _values |
               |------------------|
            1. |  year1      High |
            2. |  year1       Low |
            3. |  year1      High |
            4. |  year1      High |
            5. |  year1       Low |
               |------------------|
            6. |  year1      High |
            7. |  year1      High |
            8. |  year1      High |
            9. |  year1       Low |
           10. |  year1      High |
               |------------------|
           11. |  year1      High |
           12. |  year1       Low |
           13. |  year1    Medium |
           14. |  year1      High |
           15. |  year1       Low |
               |------------------|
           16. |  year1      High |
           17. |  year1       Low |
           18. |  year1       Low |
           19. |  year1       Low |
           20. |  year1    Medium |
               |------------------|
           21. |  year2       Low |
           22. |  year2    Medium |
           23. |  year2    Medium |
           24. |  year2      High |
           25. |  year2      High |
               |------------------|
           26. |  year2    Medium |
           27. |  year2    Medium |
           28. |  year2      High |
           29. |  year2      High |
           30. |  year2      High |
               |------------------|
           31. |  year2      High |
           32. |  year2    Medium |
           33. |  year2    Medium |
           34. |  year2      High |
           35. |  year2    Medium |
               |------------------|
           36. |  year2    Medium |
           37. |  year2      High |
           38. |  year2    Medium |
           39. |  year2    Medium |
           40. |  year2      High |
               |------------------|
           41. |  year3    Medium |
           42. |  year3      High |
           43. |  year3       Low |
           44. |  year3      High |
           45. |  year3      High |
               |------------------|
           46. |  year3      High |
           47. |  year3    Medium |
           48. |  year3      High |
           49. |  year3      High |
           50. |  year3    Medium |
               |------------------|
           51. |  year3    Medium |
           52. |  year3    Medium |
           53. |  year3    Medium |
           54. |  year3       Low |
           55. |  year3       Low |
               |------------------|
           56. |  year3       Low |
           57. |  year3    Medium |
           58. |  year3    Medium |
           59. |  year3       Low |
           60. |  year3    Medium |
               +------------------+
          tabm goes back to 1998. although the replace option was perhaps not in the original.

          Comment


          • #6
            Thank you so much, Nick! I didn't know your tabm, thanks for sharing! I guess what this does to the data is similar to a reshape, like in Jeff's first line,

            Code:
            reshape long year, i(id) j(_stack)
            But what I would like is the actual data to look like the output that appears in the results window when you run this:

            Code:
            tabm year*, transpose replace
            So I guess I would still need a few extra steps to get to that. In any case, very helpful to learn about tabm, thank you!

            Comment


            • #7
              You can get a row total variable by adding, e.g.


              Code:
              gen Total = year1 + year2 + year3
              Having an extra observation with column totals is un-Stataish, but most people who want that know how to do it in a spreadsheet. Do you need it for any Stata purpose?

              Comment

              Working...
              X