  • Using weighted least squares to include additional information when analysing observations within percentiles

    Dear statalist,

    I am analysing a large survey dataset (which includes frequency weights). For analytical purposes, I have divided the entire dataset into percentiles and analyse the individuals within each percentile. The analyses are straightforward OLS regressions.

    The problem I am struggling with is the following: I base the initial percentile division on another variable, which I calculated at the group level. That is, there are more than 100 groups in the dataset, and for every group I calculated an index. The percentile division is then performed along this index. Because I want to maintain (roughly) equal numbers of observations in every percentile, some percentiles include only part of an original group (the groups don't fit perfectly within the 100 percentile bins). My initial strategy was to randomly divide a group across two percentiles whenever it straddled a percentile boundary.

    When analysing the observations within a single percentile that includes a 'sub-group', i.e. a percentile that only partly covers at least one group, I would like to use all the information in that group, and thus include its observations beyond the percentile boundary as well. However, I still want to retain the group's original weight in that specific percentile when estimating the coefficients. As an example:

    Consider percentile p, which consists of four groups along which the original percentile division was made. Of these four groups, one group, j, falls partly in the p'th percentile and partly in the (p+1)'th. For ease of exposition, let's say that group j accounts for 20% of the observations in the p'th percentile and 10% of the observations in the (p+1)'th percentile. With equally sized percentiles, 2/3 of group j's observations are therefore currently used when I analyse all observations within percentile p. I would instead like to use the entire group in the analysis, but still let the information from group j count only for the 20% of observations it actually represents in the p'th percentile.

    My question, then (after a long introduction, apologies), is whether it is valid to simply run a weighted least-squares regression on all four groups (thus on a total N that is 110% of the original N, because the entire j'th group is included) while giving a weight of 2/3 to each of group j's observations. My intuition says this is completely fine, but I wanted to run it by you since I have only used weighted least squares in the context of heteroskedasticity. The intuition seems similar, though.
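
To make the arithmetic behind this proposal concrete, here is a minimal Python sketch. The counts (1,000 observations per percentile) and the helper `wls_slope` are made up for illustration and are not taken from my data. It only checks that a weight of 2/3 restores group j's 20% share and shows how such weights would enter a one-regressor weighted least-squares fit. It does not settle whether the approach is statistically valid.

```python
from fractions import Fraction

# Hypothetical counts, assuming equally sized percentiles.
n_pct = 1000            # observations in percentile p
j_in_p = 200            # group j observations inside percentile p (20%)
j_in_next = 100         # group j observations inside percentile p+1 (10%)
j_total = j_in_p + j_in_next

# Proposed weight for every group j observation.
weight_j = Fraction(j_in_p, j_total)        # 2/3

# Expanded sample: percentile p plus the rest of group j.
n_expanded = n_pct + j_in_next              # 110% of n_pct
n_others = n_pct - j_in_p                   # other groups, weight 1 each

# Weighted share of group j in the expanded sample.
weighted_total = n_others + j_total * weight_j
share_j = (j_total * weight_j) / weighted_total   # 1/5, i.e. 20%


def wls_slope(x, y, w):
    """Slope of a weighted least-squares fit of y on x with an intercept."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    num = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    den = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    return num / den


# Toy usage: the last two points stand in for group j and get weight 2/3.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0 * xi + 1.0 for xi in x]
w = [1.0, 1.0, 1.0, float(weight_j), float(weight_j)]
slope = wls_slope(x, y, w)
```

Because the toy data lie exactly on a line, the weights do not change the slope here. With real data they would shift the estimate towards the groups carrying more weight.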

    All best,
    Mark