  • Generating new variables containing summary statistics with 'importance' weights?

    Dear Statalist,

    I have microdata on individuals, where I have assigned those individuals to geographical locations on a probabilistic basis. In other words, in some cases I know with 100% certainty that individual i is in location z. But in other cases, I might know that there is an 80% likelihood she is in location z, and 20% that she is in location x.

    I have thus generated n copies of individual i, where n is the number of locations in which i might be located. Each copy of i has a weight variable a, reflecting the likelihood of being in that particular location (i.e., a = 0.8, 0.2, or 1).

    What I am trying to do now is generate some new variables that contain location-level summary statistics on certain economic variables like wages and rents. In other words, for each individual i I know their annual wages and their rents, and I am trying to build location-specific mean and median wages and rents.

    Here is a snippet of my data to make things concrete. In terms of variables, serial is the household identifier; pernum is the person identifier; czone is the location identifier; afact is the probability of being in czone; rent is monthly rent; and wage is self-explanatory.

    Code:
    serial    pernum    czone    afact       rent    wage
    10        1         11600    1             35    1200
    11        3         21600    .4866168      10    1370
    11        3         26001    .0246062      10    1370
    11        3         26002    .1607224      10    1370
    11        3         26003    .0144012      10    1370
    11        3         26004    .0839739      10    1370
    11        3         26701    .2296794      10    1370
    So in this case, person 1 in household 10 has a 100 percent chance of being in czone 11600, whereas person 3 in household 11 could be in 6 different locations. I'm basically ignoring the household level for the moment - it just helps uniquely identify individuals.

    What I am struggling with is how to incorporate the probability weight. A person who has only a 20% chance of being in a location and another who has an 80% chance should not contribute equally to the mean or median wages of that location.

    I started with the collapse command, but realized it cannot handle weights for means or medians. Plus I'm uncertain how the weights I have fit into the standard Stata weight categories.

    What is the right way to do what I want?

    Thanks in advance for helping me think this through.

    Tom



  • #2
    Suppose that there is 1 individual in a zone with a wage of 3000, and a probability of 0.20 that there is a second individual with a wage of 2000. The average wage here is the weighted sum of the average wages across the two scenarios: individual 2 present and individual 2 not present.

    Avg. wage | Individual 2 is present = 2500
    Avg. wage | Individual 2 is not present = 3000

    Avg. wage = 0.2(2500) + 0.8(3000) = 2900

    The issue you will face with more than one probabilistic scenario per zone is the correlation between the various probabilities. Is the probability of individual 2 being present independent of that of individual 3, for example? I do not think that the implementation is the hard part as long as you can correctly calculate the joint probabilities.
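
    A minimal sketch of that calculation in Stata, using the hypothetical numbers above rather than the actual data, might look like this:

    Code:
    * Two scenarios for the zone, each with its probability and its average wage:
    * with prob .2 individual 2 is present (mean of 3000 and 2000 = 2500),
    * with prob .8 she is not (only individual 1, wage 3000).
    clear
    input double(prob avgwage)
    .2 2500
    .8 3000
    end

    * Probability-weighted mean across the scenarios (weights already sum to 1)
    summarize avgwage [aweight=prob], meanonly
    display "Expected average wage = " r(mean)    // 2900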

    Comment


    • #3
      Thanks for your thoughts, Andrew. I had not thought of the problem in this way, though your explanation makes sense.

      On the conceptual front, other than members of the same household, I think people can be assumed to be independent.

      Given this, I would then need a series of possible location-specific populations. But the scale here is hard: my actual dataset has around 7 million observations (a fair bit of which is people who appear several times because they might be in different locations, as my data fragment suggests).

      Any ideas on how I can actually do all this at that scale?

      Comment


      • #4
        Note that -collapse- can indeed compute means and medians with weights. You may have to experiment with the different weight types, but means and medians do not depend on the normalization chosen, so you don't have to worry too much about which weight you use (no fweights though, since they must be integers).
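
        For instance, a minimal sketch along these lines, using the variable names from the original post (aweights are used here, though the particular weight type should not matter for means and medians), might be:

        Code:
        * czone-level weighted means and medians of wage and rent,
        * weighted by the probability (afact) of being in that czone.
        collapse (mean) mean_wage=wage mean_rent=rent      ///
                 (median) med_wage=wage med_rent=rent      ///
                 [aweight=afact], by(czone)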

        Comment


        • #5
          On the conceptual front, other than members of the same household, I think people can be assumed to be independent.

          Given this, I would then need a series of possible location-specific populations. But the scale here is hard: my actual dataset has around 7 million observations (a fair bit of which is people who appear several times because they might be in different locations, as my data fragment suggests).
          With independent probabilities, the joint probability is just the product of the individual probabilities. However, with such a large data set, I would not attempt to specify the probabilities of each scenario and would go with Jean-Claude Arbaut's advice.

          Code:
          collapse (mean) wage rent [pw=afact], by(czone)
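
          If the goal is to attach these czone-level statistics back to the individual-level records, one possible follow-up is sketched below (the new variable names and the tempfile name are hypothetical):

          Code:
          * Run the collapse on a copy of the data, save the czone-level results
          * to a tempfile, and merge them back onto the individual-level file.
          * Renaming inside collapse (mean_wage=wage, mean_rent=rent) avoids a
          * clash with the existing individual-level wage and rent variables.
          preserve
          collapse (mean) mean_wage=wage mean_rent=rent [pw=afact], by(czone)
          tempfile czstats
          save `czstats'
          restore
          merge m:1 czone using `czstats', nogenerate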

          Comment


          • #6
            Thanks both. This was a really helpful conversation!

            Comment
