  • Generating new variables containing summary statistics with 'importance' weights?

    Dear Statalist,

    I have microdata on individuals, where I have assigned those individuals to geographical locations on a probabilistic basis. In other words, in some cases I know with 100% certainty that individual i is in location z. But in other cases, I might know that there is an 80% likelihood she is in location z, and 20% that she is in location x.

    I have thus generated n copies of individual i, where n is the number of locations in which i might be located. Each copy of i has a weight variable a, reflecting the likelihood of being in that particular location (i.e., a = 0.8, 0.2, or 1).

    What I am trying to do now is generate some new variables that contain location-level summary statistics on certain economic variables like wages and rents. In other words, for each individual i I know their annual wages and their rents, and I am trying to build location-specific mean and median wages and rents.

    Here is a snippet of my data to make things concrete. In terms of variables, serial is the household identifier; pernum is the person identifier; czone is the location identifier; afact is the probability of being in czone; rent is monthly rent; and wage is self-explanatory.

    Code:
    serial    pernum    czone    afact       rent    wage
    10        1         11600    1             35    1200
    11        3         21600    .4866168      10    1370
    11        3         26001    .0246062      10    1370
    11        3         26002    .1607224      10    1370
    11        3         26003    .0144012      10    1370
    11        3         26004    .0839739      10    1370
    11        3         26701    .2296794      10    1370
    So in this case, person 1 in household 10 has a 100 percent chance of being in czone 11600, whereas person 3 in household 11 could be in 6 different locations. I'm basically ignoring the household level for the moment - it just helps uniquely identify individuals.

    What I am struggling with is how to incorporate the probability weight. A person who has only a 20% chance of being in a location and another who has an 80% chance should not contribute equally to the mean or median wages of that location.

    I started with the collapse command, but realized it cannot handle weights for means or medians. Plus I'm uncertain how the weights I have fit into the standard Stata weight categories.

    What is the right way to do what I want?

    Thanks in advance for helping me think this through.

    Tom



  • #2
    Suppose that there is 1 individual in a zone with a wage of 3000, and a probability of 0.20 that there is a second individual with a wage of 2000. The average wage here is the weighted sum of the average wages across the two scenarios: individual 2 present and individual 2 not present.

    Avg. wage | Individual 2 is present = 2500
    Avg. wage | Individual 2 is not present = 3000

    Avg. wage = 0.2(2500) + 0.8(3000) = 2900

    The issue you will face with more than one probabilistic scenario per zone is the correlation between the various probabilities. Is the probability of individual 2 being present independent of that of individual 3, for example? I do not think that the implementation is the hard part as long as you can correctly calculate the joint probabilities.
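
    A minimal sketch of that calculation in Stata, using the hypothetical numbers above rather than the actual data, might look like this:

    Code:
    * Two scenarios for the zone, each with its probability and its average wage:
    * with prob .2 individual 2 is present (mean of 3000 and 2000 = 2500),
    * with prob .8 she is not (only individual 1, wage 3000).
    clear
    input double(prob avgwage)
    .2 2500
    .8 3000
    end

    * Probability-weighted mean across the scenarios (weights already sum to 1)
    summarize avgwage [aweight=prob], meanonly
    display "Expected average wage = " r(mean)    // 2900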

    Comment


    • #3
      Thanks for your thoughts, Andrew. I had not thought of the problem in this way, though your explanation makes sense.

      On the conceptual front, other than members of the same household, I think people can be assumed to be independent.

      Given this, I would then need a series of possible location-specific populations. But the scale here is hard: my actual dataset has around 7 million observations (a fair bit of which is people who appear several times because they might be in different locations, as my data fragment suggests).

      Any ideas on how I can actually do all this at that scale?

      Comment


      • #4
        Note that -collapse- can indeed compute means and medians with weights. You may have to experiment with the different weight types, but means and medians do not depend on the normalization chosen, so you don't have to worry too much about which weight you use (no fweights though, since they must be integers).
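
        For instance, a minimal sketch along these lines, using the variable names from the original post (aweights are used here, though the particular weight type should not matter for means and medians), might be:

        Code:
        * czone-level weighted means and medians of wage and rent,
        * weighted by the probability (afact) of being in that czone.
        collapse (mean) mean_wage=wage mean_rent=rent      ///
                 (median) med_wage=wage med_rent=rent      ///
                 [aweight=afact], by(czone)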

        Comment


        • #5
          On the conceptual front, other than members of the same household, I think people can be assumed to be independent.

          Given this, I would then need a series of possible location-specific populations. But the scale here is hard: my actual dataset has around 7 million observations (a fair bit of which is people who appear several times because they might be in different locations, as my data fragment suggests).
          With independent probabilities, the joint probability is just the product of the individual probabilities. However, with such a large data set, I would not attempt to specify the probabilities of each scenario and would go with Jean-Claude Arbaut's advice.

          Code:
          collapse (mean) wage rent [pw=afact], by(czone)
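
          If the goal is to attach these czone-level statistics back to the individual-level records, one possible follow-up is sketched below (the new variable names and the tempfile name are hypothetical):

          Code:
          * Run the collapse on a copy of the data, save the czone-level results
          * to a tempfile, and merge them back onto the individual-level file.
          * Renaming inside collapse (mean_wage=wage, mean_rent=rent) avoids a
          * clash with the existing individual-level wage and rent variables.
          preserve
          collapse (mean) mean_wage=wage mean_rent=rent [pw=afact], by(czone)
          tempfile czstats
          save `czstats'
          restore
          merge m:1 czone using `czstats', nogenerate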

          Comment


          • #6
            Thanks both. This was a really helpful conversation!

            Comment
