  • Dividing weighted variables only returns missing values

    Hello,

    I'm working on wage gaps using a dataset with weights. I am using code by Mr. Cox, as seen in "https://www.statalist.org/forums/forum/general-stata-discussion/general/1479555-problem-using-bysort-egen-mean-and-weight-together".

    sort year
    bysort year: egen double den = total(weight) if GroupA==1
    bysort year: egen double weightedwageGroupA = total(hourlywage*weight) if GroupA==1
    replace weightedwageGroupA = weightedwageGroupA/den if GroupA==1
    drop den

    sort year
    bysort year: egen double den = total(weight) if GroupB==1
    bysort year: egen double weightedwageGroupB = total(hourlywage*weight) if GroupB==1
    replace weightedwageGroupB = weightedwageGroupB/den if GroupB==1
    drop den

    // summarizing shows that code above worked, existing mean, sd,...

    gen wagegap = (weightedwageGroupA / weightedwageGroupB) -1
    "(300,000 missing values generated)"

    There has to be a logic error somewhere. "weight" is a probability weight.
    I was able to bypass this problem, but this only gets me the mean wage gap, without a standard deviation:

    egen meanweightedwageGroupA = mean(weightedwageGroupA)
    egen meanweightedwageGroupB = mean(weightedwageGroupB)
    gen wagegap = (meanweightedwageGroupA / meanweightedwageGroupB) -1

    Thank you for your help in advance!
    Best, Aron Mueller.

  • #2
    Without example data, I am reduced to guessing. But based on the kinds of mistakes that I have commonly seen made, here's my best guess as to what is going on:

    You have calculated weightedwagegroupA values for the subset of your data in which the variable groupA takes on the value 1. You have calculated weightedwagegroupB values for the subset of your data in which the variable groupB takes on the value 1. If there are no observations in which both groupA = 1 and groupB = 1, then you will get only missing value results, since either weightedwagegroupA or weightedwagegroupB will be missing in every observation. To check this:
    Code:
    count if groupA == 1 & groupB == 1
    If the answer is 0, you have found your problem.
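
    To see the mechanism on a small scale, here is a sketch with made-up data. The four observations and the helper names (denA, numA, wA, and so on) are purely hypothetical and not taken from the real dataset; the variable names otherwise follow the first post.

    ```stata
    clear
    input year GroupA GroupB hourlywage weight
    2020 1 0 20 1.5
    2020 1 0 30 0.5
    2020 0 1 10 1.0
    2020 0 1 14 1.0
    end

    * egen with an -if- qualifier leaves the new variable missing
    * in every observation that fails the condition
    bysort year: egen double denA = total(weight) if GroupA == 1
    bysort year: egen double numA = total(hourlywage*weight) if GroupA == 1
    gen double wA = numA/denA

    bysort year: egen double denB = total(weight) if GroupB == 1
    bysort year: egen double numB = total(hourlywage*weight) if GroupB == 1
    gen double wB = numB/denB

    * no observation has both wA and wB nonmissing ...
    count if !missing(wA) & !missing(wB)

    * ... so the ratio is missing in every observation
    gen double wagegap = wA/wB - 1
    list year GroupA GroupB wA wB wagegap
    ```

    In this toy example wA exists only in the GroupA rows and wB only in the GroupB rows, so the division has nothing to work with in any single row.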

    I'll go further out on a limb and guess that your variables groupA and groupB are 0/1 variables that, respectively, indicate membership in two different groups (A and B, that is.) I'll go even farther out on that limb and guess that every observation belongs to either group A or group B (but never to both). If so, you will be better off eliminating those variables in favor of a single variable, call it group, that takes on values A and B. Then you can change your code to:
    Code:
    bysort group year: egen double den = total(weight)
    by group year: egen double num = total(weight*hourlywage)
    gen average_weighted_wage = num/den
    From there you want the gap, which will not really be a variable: it is just a single number. So we may as well compute it that way:
    Code:
    summ average_weighted_wage if group == "A", meanonly
    scalar wage_gap = r(mean)
    summ average_weighted_wage if group == "B", meanonly
    scalar wage_gap = (wage_gap/r(mean)) - 1
    display wage_gap
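
    If you would rather have one gap per year than a single overall number, one possible sketch follows. It assumes the code above has already been run, that group is a string variable holding "A" and "B", and that every year contains both groups; wage_A, wage_B, and wage_gap_by_year are names I made up for illustration.

    ```stata
    * copy each group's yearly weighted mean to every observation of that year;
    * average_weighted_wage is constant within group and year, so max() just picks it up
    bysort year: egen double wage_A = max(cond(group == "A", average_weighted_wage, .))
    by year: egen double wage_B = max(cond(group == "B", average_weighted_wage, .))

    * now both means are nonmissing in the same observations, so the ratio is defined
    gen double wage_gap_by_year = wage_A/wage_B - 1

    * show one row per year
    egen byte first_in_year = tag(year)
    list year wage_A wage_B wage_gap_by_year if first_in_year
    ```

    The point of the cond() step is that both group means end up side by side in each row, which is exactly what was missing in the original approach.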



    • #3
      Dear Mr. Schlechter,

      thank you for your swift reply. Unfortunately, I am working with disclosed data and am therefore not able to show any. Nevertheless, you guessed right: I divided individuals into mutually exclusive groups, defined by dummies, so that count if groupA == 1 & groupB == 1 is 0.
      I applied your code. It does not yet return plausible mean wages, and the standard deviations are missing, but I think that is due to errors in my own code.
      Still, thank you; the misspecification of the groups explains the majority of the trouble I ran into.
      Last edited by Aron Mueller; 12 Jun 2023, 10:36. Reason: should have dropped unemployed individuals, which were responsible for low average wages



      • #4
        Unfortunately, I am working with disclosed data, therefore not being able show any.
        This is not a reason not to show example data. The example data, in most situations, need not be the real data. The need for example data is to show the overall organization of the data, and metadata (storage types, labels, etc.). So, for example, in your case, you could have shown a data set that looks like yours, but with the actual values of weight and hourly wage changed to random numbers. If in your real data the variables you called groupA and groupB actually had names that were informative about what the groups were, you could just change the names to groupA and groupB. That way you would not breach any confidentiality, but you would provide the information needed for people to help you out.

        While there are situations where the actual data values are needed to provide help, those are uncommon (at least in the type of posts I usually see).
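
        As a rough sketch of what that could look like: the code below overwrites the sensitive values with random numbers and then posts a small extract. It assumes a recent Stata (runiform() with two arguments, and the dataex command, which ships with current versions and is otherwise available from SSC). The file name, the names realnameA and realnameB, and the ranges of the random numbers are placeholders I invented; work on a copy, never on the original file.

        ```stata
        * work on a copy of the data (hypothetical file name)
        use mydata_copy, clear

        * replace the confidential values with random numbers
        set seed 12345
        replace weight     = runiform(0.5, 2)
        replace hourlywage = runiform(8, 40)

        * hide informative group names behind neutral ones
        rename (realnameA realnameB) (groupA groupB)

        * produce a copy-and-paste example of the first 20 observations
        dataex year groupA groupB weight hourlywage in 1/20
        ```

        The output of dataex preserves storage types and labels, which is the metadata that matters for diagnosing problems like this one.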

