  • help with regression weights

    Dear users,

    I have a large dataset in which every observation belongs to one, two or three groups.
    Most observations belong to only one group, but a few belong simultaneously to two or even three distinct groups.
    An observation that belongs to one group appears once in the dataset. An observation that belongs simultaneously to two groups appears twice, and one that belongs to three groups appears three times.

    Now, I want to run a negative binomial regression, and I want to weight observations according to how often they appear in the dataset. That is: if an observation appears twice, each copy should carry a weight of 1/2 in the regression. If an observation appears three times, each copy should carry a weight of 1/3.
    I did this to create a variable counting the number of times that each observation appears in the dataset:


    Code:
    . unique id_document
    Number of unique values of id_document is  6888177
    Number of records is  7553910
    
    . unique id_document dom_cat
    Number of unique values of id_document dom_cat is  7553910
    Number of records is  7553910
    
    . bys id_document: gen freq =_n
    
    . tab freq
    
           freq |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              1 |  6,888,177       91.19       91.19
              2 |    644,570        8.53       99.72
              3 |     21,163        0.28      100.00
    ------------+-----------------------------------
          Total |  7,553,910      100.00

    What I understood from my readings is that:

    - For tabulations and descriptives I should add [pweight=freq]
    - For regressions, I should add [fweight=freq]

    Could you confirm that I understood this correctly?

    Thanks in advance for the help.

    Regards,
    Oscar



  • #2
    I'm afraid you did not understand it correctly.

    In your situation, you want iweight for everything. Moreover, the iweight you want is 1/freq.
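
    A minimal sketch of how that weight could be built and used, on simulated data. Everything except id_document and dom_cat is hypothetical (y, x, k, n_copies, w, wsum and the simulated values are assumptions for illustration). Note that freq = _n, as generated in #1, is a running index within each document (1, 2, 3), not the number of copies; the sketch uses _N so that every copy of a document gets the same weight.

```stata
* Simulated stand-in for the real data (hypothetical values)
clear
set seed 12345
set obs 500
generate id_document = _n
generate x = rnormal()
* overdispersed count outcome: Poisson with gamma heterogeneity (mean 1)
generate y = rpoisson(exp(0.5 + 0.3*x) * rgamma(2, 0.5))
* most documents belong to one group, a few to two or three
generate k = 1 + (runiform() < 0.10) + (runiform() < 0.02)
expand k
bysort id_document: generate dom_cat = _n

* _N is the number of records per document; _n would only number them
bysort id_document: generate n_copies = _N
generate double w = 1/n_copies

* check: the weights of each document sum to 1
bysort id_document: egen double wsum = total(w)
assert abs(wsum - 1) < 1e-8

* the same weight is used for descriptives and for the regression
tabulate dom_cat [iweight = w]
nbreg y x [iweight = w]
```

    With weights built this way, each document contributes a total weight of 1 no matter how many groups it belongs to.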

    pweights are used when the data are not the product of a simple random sample, and where different observations had different probabilities of inclusion in the sample to begin with. Nothing you describe about your data suggests that this is the case here. A typical situation in which this arises is when a survey is taken and one person is selected from each household to participate. Then any person from a household of 2 people has a higher probability of being in the sample than a person from a household of 4 people. pweights are not used for simple random samples (nor for censuses).
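
    For contrast, the household example can be sketched as follows. The four records and the variable names (hhsize, income, pw) are made up for illustration. If one respondent is drawn at random from each sampled household, a person's selection probability within the household is 1/hhsize, so the sampling weight is the inverse of that.

```stata
* Hypothetical survey: one respondent per sampled household
clear
input hhsize income
1 30
2 45
4 50
4 38
end

* pweight = inverse of the within-household selection probability (1/hhsize)
generate pw = hhsize

* weighted mean; respondents from larger households count for more
mean income [pweight = pw]
```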

    fweights are used when we have contracted the data so that if there are multiple observations that are identical on all relevant variables, we replace that with a single observation having those values plus a count variable showing how many such observations there originally were. This situation typically arises when we have a very large data set and a relatively small number of variables, each with a limited range of values, so that many of the observations are duplicates of each other, even though they come from different units of analysis. So to save on memory, and, for some commands, to reduce computation time, we shrink the data file by having just one record for each combination of variable values and a count of how often it occurred originally. When we do that, fweights tell Stata how to virtually reconstruct the original data from that.
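
    The contraction described above can be sketched with Stata's contract command on the auto dataset that ships with Stata. The choice of variables and the name n for the count variable are assumptions for illustration only.

```stata
sysuse auto, clear
keep foreign rep78
count
local n_original = r(N)

* one record per distinct (foreign, rep78) combination, plus a count n
contract foreign rep78, freq(n)

* fweights tell Stata that each record stands for n identical observations
tabulate foreign [fweight = n]

* the counts add back up to the original number of observations
summarize n, meanonly
assert r(sum) == `n_original'
```

    Running expand n on the contracted data would physically rebuild the original number of records, which is what fweights do virtually.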



    • #3
      Dear Clyde,

      Many thanks for the response.
      You're totally right, now I got it.

      Best regards,
      Oscar
