Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Difficulty with egen

    Hi all,

    I usually have no problem with egen but I am finding it impossible to complete a relatively easy egen function and I'm hoping someone can help me.

    I have a dataset of schools, years, ethnicities, and the quantity of each ethnicity in each school. The dataset is long and looks like this (a snapshot of a massive dataset):

    School Year Ethnicity quantity
    1 1998 1 20
    1 1998 2 5
    1 1999 1 30
    2 1998 1 25

    In this snapshot I have 2 schools. The first is observed in 1998 and 1999 and the second only in 1998. I have two Ethnicities. The quantity of ethnicity 1 in 1998 in school 1 is 20 learners. The quantitiy of ethnicity 2 in 1998 in school 1 is 5 learners. And so on.

    I want to be able to say "school 1 had 80% learners who were ethnicity 1 in 1998 and school 2 had 100% learners who were ethnicity 1 in 1998". Of course this is easy to see from a browse but this is a massive dataset and I obviously want to automate the calculation.

    I started by creating a variable which identifies the quantity and the learner at the same time using egen:

    egen eth1_quant=mean(quantity) if Ethnicity==1, by(School Year) // mean or max or min are irrelevant since there is only one value per school and year
    egen eth2_quant=mean(quantity) if Ethnicity==2, by(School Year)

    This worked as expected and gave me a dataset which looks like this:

    School Year Ethnicity quantity eth1_quant eth2_quant
    1 1998 1 20 20 .
    1 1998 2 5 . 5
    1 1999 1 30 30 .
    2 1998 1 25 25 .

    Then I wanted to create a variable which identifies the quantity of each ethnicity in each school within the same observation. I used egen and this is where I am having trouble.

    My code is:

    egen eth1_inschool=min(eth1_quant), by(School Year) // min is used but it is irrelevant since there is only one value per school and year
    egen eth2_inschool=min(eth2_quant), by(School Year)

    What I expected to get out was this:

    School Year Ethnicity quantity eth1_quant eth2_quant eth1_inschool eth2_inschool
    1 1998 1 20 20 . 20 5
    1 1998 2 5 . 5 20 5
    1 1999 1 30 30 . 30 .
    2 1998 1 25 25 . 25 .


    But instead I got

    School Year Ethnicity quantity eth1_quant eth2_quant eth1_inschool eth2_inschool
    1 1998 1 20 20 . 20 .
    1 1998 2 5 . 5 . 5
    1 1999 1 30 30 . 30 .
    2 1998 1 25 25 . 25 .


    Essentially the egen did not pick up the value within school and year and simply gave me the row value.

    Any ideas as to why? I've tried using other egen functions rather than min such as total or max or mean but it is all the same.

    Any help will be greatly appreciated.


  • #2
    The calculation has to be broken down into stages.

    Calculating percents does depend here on calculating totals first. That obliges you to work on two scales, as below.

    Calculating means over single observations --- which is what your calculation does for your data example -- doesn't get you any nearer to totals.



    Code:
    clear 
    input School Year Ethnicity quantity
    1 1998 1 20
    1 1998 2 5
    1 1999 1 30
    2 1998 1 25
    end 
    
    egen total = total(quantity), by(School Year)
    egen total1 = total(quantity * (Ethnicity == 1)), by(School Year)
    egen total2 = total(quantity * (Ethnicity == 2)), by(School Year)
    gen pc1 = 100 * total1 / total 
    gen pc2 = 100 * total2 / total 
    
    l, sepby(School Year)
    
         +---------------------------------------------------------------------------+
         | School   Year   Ethnic~y   quantity   total   total1   total2   pc1   pc2 |
         |---------------------------------------------------------------------------|
      1. |      1   1998          1         20      25       20        5    80    20 |
      2. |      1   1998          2          5      25       20        5    80    20 |
         |---------------------------------------------------------------------------|
      3. |      1   1999          1         30      30       30        0   100     0 |
         |---------------------------------------------------------------------------|
      4. |      2   1998          1         25      25       25        0   100     0 |
         +---------------------------------------------------------------------------+
    I can say that it took me a fair amount of practice before I got comfortable with this kind of calculation. Using true-or-false expressions to include what you want and exclude what you want to exclude is helpful for many problems. I've tried to write up some of the most useful trickery at https://www.stata-journal.com/articl...article=dm0055

    As a matter of English usage, I would recommend talking about "number of learners" rather than "quantity of learners" in a report. Mathematically or statistically, the number of people is certainly a quantity, but in ordinary usage, quantity refers more narrowly to measured amounts and doesn't happily apply when people are counted.

    Comment

    Working...
    X