Creating variable with egen count() and conditions

Fergus Peace

Join Date: Jun 2019

Posts: 6
#1

Creating variable with egen count() and conditions

19 Jun 2019, 04:11

I'm working with a labour force survey and want to attach to each observation a new variable recording the unemployment rate in the geographical region of that observation.

I have some code that works, but it seems extremely inelegant to me and I find it hard to believe there isn't a more efficient way. Given I will be running this code repeatedly on sequential quarterly datasets, I'd like it to be cleaner if it can be! (I also want to attach new variables for other statistics, eg. the employment:population ratio. I've left out the code for these since it's structurally the same, but that also makes me want to streamline the code as much as I can.)

At present my approach:
1. Creates two dummies, with values if the person is employed / unemployed (respectively), and missing otherwise.
2. Uses egen count() with by, to create two new variables recording the raw number of employed / unemployed people in the region.
3. Creates a fifth new variable, the unemployment rate I actually want, by calculation from the raw numbers.
4. Drops the variables used only for this process.

Originally I thought I could skip at least one step by using egen count() with an if qualifier. But if that's possible I haven't been able to figure out how, and if there's some other more efficient way of doing it I haven't figured that out either. Any advice appreciated!

Here's a sample of my code - the original variable empl_stat records 1 for employed, 2 for unemployed, other values for various types of inactivity.

Code:

gen unempl=1 if empl_stat==2 gen empl=1 if empl_stat==1 /*By region*/ bysort region: egen reg_unempl = count(unempl) bysort region: egen reg_empl = count(empl) gen reg_unempl_r = reg_unempl/(reg_empl+reg_unempl) drop empl unempl reg_empl reg_unempl
Tags: None
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#2

19 Jun 2019, 04:40

The best approach is to present data (under - dataex - or code delimiters) to work on.

Also, to start, you won't need to create to opposite dummies.

That said, something like this code will do fine:

Code:

gen unempl1 = empl_stat ==0

Hopefully that helps.

Last edited by Marcos Almeida; 19 Jun 2019, 04:54.

Best regards,

Marcos
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35709
#3

19 Jun 2019, 05:01

Taking Marcos Almeida's idea a little further, you can slim that down to

Code:

bysort region: egen reg_unempl = total(empl_stat==2) bysort region: egen reg_empl = total(empl_stat==1) gen reg_unempl_r = reg_unempl/(reg_empl+reg_unempl)

I wouldn't drop any of those variables. They look useful.

Last edited by Nick Cox; 19 Jun 2019, 05:03.
Comment

Fergus Peace

Join Date: Jun 2019
Posts: 6

19 Jun 2019, 05:12

Thanks Marcos.

The opposite dummies are one of the things that seems inelegant about it, but I found that I did need to use them: with a single dummy, the count() function of egen counts the 0s as well as 1s.

For example, in the data extract at the bottom of this post, the North East region has 6 people employed & 0 unemployed. Running the code you suggest:

Code:

 gen unempl1 = empl_stat ==2
bysort region: egen reg_unempl = count(unempl1)
tab region reg_unempl

assigns value 6 to the new variable for all North East observations, because it's counted the 0s and 1s together. Hence I had to create dummies with missing values rather than zeroes - unless there is some way to put a condition in the egen count. (If there is, then I don't need the dummies at all, which would be ideal.)

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte(empl_stat region)
1  4
1  4
1  7
1  7
2 12
1  9
1  9
1  8
2  8
1  1
1  1
1  5
1  5
1  6
1  6
1  6
1 13
1 13
1 10
1  2
1  9
1  9
1  9
1 10
1  8
1  8
1  1
1  1
1  1
1  1
end
label values empl_stat ILODEFR
label def ILODEFR 1 "In employment", modify
label def ILODEFR 2 "ILO unemployed", modify
label values region GOVTOF2
label def GOVTOF2 1 "North East", modify
label def GOVTOF2 2 "North West", modify
label def GOVTOF2 4 "Yorkshire and Humberside", modify
label def GOVTOF2 5 "East Midlands", modify
label def GOVTOF2 6 "West Midlands", modify
label def GOVTOF2 7 "East of England", modify
label def GOVTOF2 8 "London", modify
label def GOVTOF2 9 "South East", modify
label def GOVTOF2 10 "South West", modify
label def GOVTOF2 12 "Scotland", modify
label def GOVTOF2 13 "Northern Ireland", modify

Comment

Fergus Peace

Join Date: Jun 2019

Posts: 6
#5

19 Jun 2019, 05:18

Nick Cox's code is giving the right results. (I may have misunderstood your suggestion, Marcos, apologies!)
I don't quite understand how it's working, or how egen total() functions. But that I can look in the manuals for! Thanks both for your help.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35709
#6

19 Jun 2019, 06:52

The clue is in the name. total() gives the total or sum. If fed a set of values for a variable that are 0 or 1, then the total is precisely the count of observations with value 1.

If you feed 0 0 0 1 1 1 1 1 1 1 to the function then the total is 7, which is what you want.

For more, see https://www.stata.com/support/faqs/d...rue-and-false/ on how true or false expression such as

Code:

empl_stat==2

are evaluated as 1 if true and 0 if false.
1 like
Comment
Fergus Peace

Join Date: Jun 2019

Posts: 6
#7

11 Jul 2019, 05:28

Thanks again to all - I've just come back to this project & have a follow-up question.
I'd also like to generate weighted-up totals for employment and unemployment.

To recap, t he data structure is still as in my sample above. I am using Nick Cox's code for the simple totals:

Code:

bysort region: egen reg_unempl = total(empl_stat==2)

Using a similar trick, I thought that the following code should give me the totals weighted up to the full population:

Code:

bysort region: egen wt_reg_unempl = total(weight*empl_stat==2)

For some reason this is returning all zeroes, though the expression empl_stat==2 is obviously true for many observations, so should be evaluating to 1. Not sure what I'm missing, would appreciate any advice!
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35709
#8

11 Jul 2019, 05:44

Precedence rules are biting you. I will illustrate with local macros, but the idea is generic.

Code:

. local weight 42 . local empstat 2 . di `weight' * `empstat' == 2 0 . di `weight' * (`empstat' == 2) 42

Stata first calculates the product of weight and value and then compares the result with 2. You need parentheses to force your intended meaning.

See 13.2.5 in https://www.stata.com/manuals/u13.pdf
Comment
Fergus Peace

Join Date: Jun 2019

Posts: 6
#9

11 Jul 2019, 08:05

Aha! Great, thanks very much.
Comment

Announcement