help with regression weights

Oscar Ll

Join Date: Mar 2015

Posts: 35
#1

28 May 2017, 08:53

Dear users,

I have a large dataset in which every observation can belong to one, two or three groups.
Most of these observations belong to only one group, but a few of them belong simultaneously to two, or even three, distinct groups.
If an observation belongs to one group, it appears once in the dataset. If it belongs simultaneously to two groups, it appears twice. And if it belongs to three groups, it appears three times.

Now, I want to run a negative binomial regression, and I want to weight observations according to how often they appear in the dataset. That is: if an observation appears twice, each of its rows should count as 1/2 in the regression. If an observation appears three times, each of its rows should count as 1/3.
I did this to create a variable counting the number of times that each observation appears in the dataset:

Code:

. unique id_document
Number of unique values of id_document is  6888177
Number of records is  7553910

. unique id_document dom_cat
Number of unique values of id_document dom_cat is  7553910
Number of records is  7553910

. bys id_document: gen freq =_n

. tab freq

       freq |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |  6,888,177       91.19       91.19
          2 |    644,570        8.53       99.72
          3 |     21,163        0.28      100.00
------------+-----------------------------------
      Total |  7,553,910      100.00

What I understood from my readings is that:

- For tabulations and descriptives I should add [pweight=freq]
- For regressions, I should add [fweight=freq]

Could you confirm that I understood this correctly?

Thanks in advance for the help.

Regards,
Oscar
Clyde Schechter

Join Date: Apr 2014

Posts: 30046
#2

28 May 2017, 10:19

I'm afraid you did not understand it correctly.

In your situation, you want iweight for everything. Moreover, the iweight you want is 1/freq.
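As a minimal sketch of what that could look like (not tested on the poster's data): the outcome `y` and the covariates `x1` and `x2` are hypothetical placeholders, and `n_groups` and `w` are names invented here for illustration. Note that the sketch counts rows per document with `_N` (the group size) rather than `_n` (the running row index within the group). The tabulation in #1 shows 6,888,177 rows with freq equal to 1, which is exactly the number of unique documents, so the `freq` variable created there appears to be a running index rather than a per-document count.

```stata
* Assumes the poster's dataset is in memory, with id_document and dom_cat.

* Number of rows each document contributes: _N is the size of the by-group.
bysort id_document: generate n_groups = _N

* Inverse-frequency weight: 1, 1/2 or 1/3 for every row of a document.
generate double w = 1 / n_groups

* Hypothetical outcome (y) and covariates (x1, x2); replace with real names.
nbreg y x1 x2 [iweight = w]

* The same weight can be used for a weighted tabulation.
tabulate dom_cat [iweight = w]
```

With weights built this way, the weights of all rows belonging to one document sum to 1, so each document counts once in total regardless of how many groups it belongs to.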

pweights are used when the data are not the product of a simple random sample, and where different observations had different likelihoods of inclusion in the sample to begin with. Nothing you describe about your data suggests that this is the case here. A typical situation in which this arises is when a survey is taken and one person is selected from each household to participate. Then any person from a household of 2 people has a higher probability of being in the sample than a person from a household of 4 people. pweights are not used for simple random samples (nor for censuses).

fweights are used when we have contracted the data so that if there are multiple observations that are identical on all relevant variables, we replace that with a single observation having those values plus a count variable showing how many such observations there originally were. This situation typically arises when we have a very large data set and a relatively small number of variables, each with a limited range of values, so that many of the observations are duplicates of each other, even though they come from different units of analysis. So to save on memory, and, for some commands, to reduce computation time, we shrink the data file by having just one record for each combination of variable values and a count of how often it occurred originally. When we do that, fweights tell Stata how to virtually reconstruct the original data from that.
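To illustrate the fweight case, here is a small sketch using the `auto` example dataset that ships with Stata (chosen only for illustration; it has nothing to do with the data in #1). The variable name `count` is an arbitrary choice made here.

```stata
* Load a small example dataset shipped with Stata.
sysuse auto, clear

* Tabulate on the full data, one row per car.
tabulate foreign

* Collapse to one row per distinct (foreign, rep78) combination and
* store how many original rows each combination represents.
contract foreign rep78, freq(count)

* fweights tell Stata to treat each row as `count' identical rows,
* so this tabulation should match the one run before contracting.
tabulate foreign [fweight = count]
```

The contracted file has far fewer rows than the original, yet the weighted tabulation is computed as if every original row were still present. That is a different purpose from down-weighting rows that appear more than once.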
Oscar Ll

Join Date: Mar 2015

Posts: 35
#3

28 May 2017, 10:30

Dear Clyde,

Many thanks for the response.
You're totally right, now I got it.

Best regards,
Oscar