Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Creating variable with egen count() and conditions

    I'm working with a labour force survey and want to attach to each observation a new variable recording the unemployment rate in the geographical region of that observation.

    I have some code that works, but it seems extremely inelegant to me and I find it hard to believe there isn't a more efficient way. Given I will be running this code repeatedly on sequential quarterly datasets, I'd like it to be cleaner if it can be! (I also want to attach new variables for other statistics, eg. the employment:population ratio. I've left out the code for these since it's structurally the same, but that also makes me want to streamline the code as much as I can.)

    At present my approach:
    1. Creates two dummies, with values if the person is employed / unemployed (respectively), and missing otherwise.
    2. Uses egen count() with by, to create two new variables recording the raw number of employed / unemployed people in the region.
    3. Creates a fifth new variable, the unemployment rate I actually want, by calculation from the raw numbers.
    4. Drops the variables used only for this process.

    Originally I thought I could skip at least one step by using egen count() with an if qualifier. But if that's possible I haven't been able to figure out how, and if there's some other more efficient way of doing it I haven't figured that out either. Any advice appreciated!

    Here's a sample of my code - the original variable empl_stat records 1 for employed, 2 for unemployed, other values for various types of inactivity.

    Code:
    gen unempl=1 if empl_stat==2
    gen empl=1 if empl_stat==1
    
    /*By region*/
    bysort region: egen reg_unempl = count(unempl)
    bysort region: egen reg_empl = count(empl)
    
    gen reg_unempl_r = reg_unempl/(reg_empl+reg_unempl)
    
    drop empl unempl reg_empl reg_unempl

  • #2
    The best approach is to present data (under - dataex - or code delimiters) to work on.

    Also, to start, you won't need to create to opposite dummies.

    That said, something like this code will do fine:

    Code:
    gen unempl1 = empl_stat ==0
    Hopefully that helps.
    Last edited by Marcos Almeida; 19 Jun 2019, 04:54.
    Best regards,

    Marcos

    Comment


    • #3
      Taking Marcos Almeida's idea a little further, you can slim that down to

      Code:
      bysort region: egen reg_unempl = total(empl_stat==2)  
      bysort region: egen reg_empl = total(empl_stat==1)  
      gen reg_unempl_r = reg_unempl/(reg_empl+reg_unempl)
      I wouldn't drop any of those variables. They look useful.
      Last edited by Nick Cox; 19 Jun 2019, 05:03.

      Comment


      • #4
        Thanks Marcos.

        The opposite dummies are one of the things that seems inelegant about it, but I found that I did need to use them: with a single dummy, the count() function of egen counts the 0s as well as 1s.

        For example, in the data extract at the bottom of this post, the North East region has 6 people employed & 0 unemployed. Running the code you suggest:
        Code:
         gen unempl1 = empl_stat ==2
        bysort region: egen reg_unempl = count(unempl1)
        tab region reg_unempl
        assigns value 6 to the new variable for all North East observations, because it's counted the 0s and 1s together. Hence I had to create dummies with missing values rather than zeroes - unless there is some way to put a condition in the egen count. (If there is, then I don't need the dummies at all, which would be ideal.)


        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input byte(empl_stat region)
        1  4
        1  4
        1  7
        1  7
        2 12
        1  9
        1  9
        1  8
        2  8
        1  1
        1  1
        1  5
        1  5
        1  6
        1  6
        1  6
        1 13
        1 13
        1 10
        1  2
        1  9
        1  9
        1  9
        1 10
        1  8
        1  8
        1  1
        1  1
        1  1
        1  1
        end
        label values empl_stat ILODEFR
        label def ILODEFR 1 "In employment", modify
        label def ILODEFR 2 "ILO unemployed", modify
        label values region GOVTOF2
        label def GOVTOF2 1 "North East", modify
        label def GOVTOF2 2 "North West", modify
        label def GOVTOF2 4 "Yorkshire and Humberside", modify
        label def GOVTOF2 5 "East Midlands", modify
        label def GOVTOF2 6 "West Midlands", modify
        label def GOVTOF2 7 "East of England", modify
        label def GOVTOF2 8 "London", modify
        label def GOVTOF2 9 "South East", modify
        label def GOVTOF2 10 "South West", modify
        label def GOVTOF2 12 "Scotland", modify
        label def GOVTOF2 13 "Northern Ireland", modify

        Comment


        • #5
          Nick Cox's code is giving the right results. (I may have misunderstood your suggestion, Marcos, apologies!)
          I don't quite understand how it's working, or how egen total() functions. But that I can look in the manuals for! Thanks both for your help.

          Comment


          • #6
            The clue is in the name. total() gives the total or sum. If fed a set of values for a variable that are 0 or 1, then the total is precisely the count of observations with value 1.

            If you feed 0 0 0 1 1 1 1 1 1 1 to the function then the total is 7, which is what you want.

            For more, see https://www.stata.com/support/faqs/d...rue-and-false/ on how true or false expression such as
            Code:
              
             empl_stat==2
            are evaluated as 1 if true and 0 if false.

            Comment


            • #7
              Thanks again to all - I've just come back to this project & have a follow-up question.
              I'd also like to generate weighted-up totals for employment and unemployment.

              To recap, t he data structure is still as in my sample above. I am using Nick Cox's code for the simple totals:
              Code:
              bysort region: egen reg_unempl = total(empl_stat==2)
              Using a similar trick, I thought that the following code should give me the totals weighted up to the full population:
              Code:
              bysort region: egen wt_reg_unempl = total(weight*empl_stat==2)
              For some reason this is returning all zeroes, though the expression empl_stat==2 is obviously true for many observations, so should be evaluating to 1. Not sure what I'm missing, would appreciate any advice!

              Comment


              • #8
                Precedence rules are biting you. I will illustrate with local macros, but the idea is generic.

                Code:
                . local weight 42
                
                . local empstat 2
                
                . di `weight' * `empstat' == 2
                0
                
                . di `weight' * (`empstat' == 2)
                42
                Stata first calculates the product of weight and value and then compares the result with 2. You need parentheses to force your intended meaning.

                See 13.2.5 in https://www.stata.com/manuals/u13.pdf

                Comment


                • #9
                  Aha! Great, thanks very much.

                  Comment

                  Working...
                  X