Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Grouping residuals by month and taking the difference

    Using STATA 18 and I am trying to capture variation in the location, (London wards), time fixed effects for property prices in London, using a hedonic pricing model.
    This is using property price data with dummies for property characteristics.

    I am trying to capture that using reghdfe and create two groups, wards with high/low residents born of foreign countries', with the idea of taking the difference in the residuals between those two groups, in each month.

    Problem is that when I create this price difference variable (pdiff_) for each country combination, STATA does not take the difference between the two groups but between each observation. An observation (transaction) cannot be in a ward that has both high pop. of resident from country X and low at the same time - it is one or the other. Hence there will be an empty value when taking the difference.

    Can someone please point me out on how I can group these categories so that STATA does not try and take the difference at the individual observation level?

    Many thanks

    Code:
    encode stata_date, generate(stata_date1)
    encode ward_code, generate(ward_num)
    *Calculating ward fixed effects and time effects
    . reghdfe ln_price_paid propType_* oldNew_* duration_*, absorb(ward_fe=ward_num time_fe=stata_date1) cluster(ward_num) residuals(resid)
    (MWFE estimator converged in 5 iterations)
    note: propType_5 omitted because of collinearity
    note: oldNew_2 omitted because of collinearity
    note: duration_2 omitted because of collinearity
    
    HDFE Linear regression                            Number of obs   =  1,172,017
    Absorbing 2 HDFE groups                           F(   6,    679) =    1301.66
    Statistics robust to heteroskedasticity           Prob > F        =     0.0000
                                                      R-squared       =     0.4221
                                                      Adj R-squared   =     0.4217
                                                      Within R-sq.    =     0.2309
    Number of clusters (ward_num) =        680        Root MSE        =     0.6098
    
                                 (Std. err. adjusted for 680 clusters in ward_num)
    ------------------------------------------------------------------------------
                 |               Robust
    ln_price_p~d | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
      propType_1 |   .4683893   .0138025    33.94   0.000     .4412886      .49549
      propType_2 |   .1664204   .0305239     5.45   0.000     .1064879    .2263529
      propType_3 |  -.5524648   .0416714   -13.26   0.000     -.634285   -.4706446
      propType_4 |   .1341584   .0059325    22.61   0.000     .1225101    .1458066
      propType_5 |          0  (omitted)
        oldNew_1 |  -.1989015   .0101902   -19.52   0.000    -.2189096   -.1788934
        oldNew_2 |          0  (omitted)
      duration_1 |   .7961511   .0327772    24.29   0.000     .7317943     .860508
      duration_2 |          0  (omitted)
           _cons |   12.75263   .0318903   399.89   0.000     12.69001    12.81525
    ------------------------------------------------------------------------------
    
    Absorbed degrees of freedom:
    -----------------------------------------------------+
     Absorbed FE | Categories  - Redundant  = Num. Coefs |
    -------------+---------------------------------------|
        ward_num |       680         680           0    *|
     stata_date1 |       124           1         123     |
    -----------------------------------------------------+
    * = FE nested within cluster; treated as redundant for DoF computation
    
    . 
    . predict residuals, xbd

    Code:
    *Looping through countries to calculate the price difference based on residuals
    
    foreach country in country1 country2 {
        *Categorising wards into quantiles based on the number of residents from each country. *country variables represent the residents demography in each ward.
        by stata_date1: egen quantile_`country' = xtile(`country'), nquantiles(5)
    
        *Calculating the mean residual for wards in the highest and lowest quantile.
        by stata_date1: egen high_resid_`country' = mean(resid) if quantile_`country' == 5
        by stata_date1: egen low_resid_`country' = mean(resid) if quantile_`country' == 1
        
        *Calculating the price difference for each time period within the same country and date.
        by stata_date1: gen pdiff_`country' = high_resid_`country' - low_resid_`country'
        
    .
    (940,103 missing values generated)
    (933,254 missing values generated)
    (1,172,017 missing values generated)
    (939,203 missing values generated)
    (928,913 missing values generated)
    (1,172,017 missing values generated)
    
    }

  • #2
    You don't provide any example data, but the following commands make no sense to me, and I suspect this is the source of your difficulty:
    Code:
    foreach country in country1 country2 {
        *Categorising wards into quantiles based on the number of residents from each country. *country variables represent the residents demography in each ward.
        by stata_date1: egen quantile_`country' = xtile(`country'), nquantiles(5)
    `country' refers to either country1 or country2, which, I assume, are either names of countries or alphanumeric or numeric country codes. If that's not the case, then I can't make sense of what the loop is even trying to do. Assuming it is the case, then -xtile(`country')- makes no sense. If `country' is a name or alphanumeric code, then it doesn't even have numeric values and you can't take percentiles of it. If `country' is a numeric code for the countries, then this is definable, but useless since the country codes themselves must be arbitrary numbers, and percentiles make no sense for variables that have no intrinsic order.

    My guess is you meant -by stata_date1: egen quantile_`country' = xtile(rbfc) if country == `country', nquantiles(5)-, where rbfc should be replaced by the name of a variable that gives the proportion of residents born in foreign countries.

    If this is not correct, please post back including example data, using the -dataex- command to do that. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    When asking for help with code, always show example data. When showing example data, always use -dataex-.

    Comment


    • #3
      Thank you for pointing that out, here is a shorter version of the dataset (one country only).

      Each country variable contains a number which is the demography for a specific ward_code across my timeseries.
      egen quantile_`country' = xtile(`country'), nquantiles(5) is supposed to be assigning each observation into quantiles for each country/columns depending on how many residents born abroad live in that location, so it contains numeric values.
      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input int unitedkingdom float ln_price_paid byte(propType_1 propType_2 propType_3 propType_4 propType_5 oldNew_1 oldNew_2 duration_1 duration_2) long(stata_date1 ward_num) double(resid ward_fe time_fe residuals) float quantile_unitedkingdom
      3015 13.924365 0 1 0 0 0 1 0 0 1 1  42   .2885960223142587    1.2848389001657345 -.36921889461336177 13.635769021325878 1
      4557 13.206704 0 1 0 0 0 1 0 0 1 1 313 -.21841384454139406    1.0741878644209915 -.36921889461336177 13.425117984250866 1
      4557 13.560618 0 1 0 0 0 1 0 0 1 1 313  .13550041632286508    1.0741878644209915 -.36921889461336177 13.425117984250866 1
      3333 12.341477 0 1 0 0 0 0 1 0 1 1 509  -.2078067353061163 -.0005474776466130511 -.36921889461336177  12.54928412941012 1
      3086 13.353476 0 1 0 0 0 1 0 0 1 1  43 -.44312630386356205    1.4456717640703318 -.36921889461336177 13.796601874542272 1
      11462  12.15478 0 1 0 0 0 1 0 0 1 1 294  -.09823725430433979 -.09791341600256881 -.36921889461336177 12.253016688508442 5
      11314 13.028052 1 0 0 0 0 1 0 1 0 1 299  -.46677471466636233 .045776907909534426 -.36921889461336177 13.494827044683452 5
      11654 13.854732 0 0 0 0 1 1 0 1 0 1 610    .5348149006973558  .33925582768317936 -.36921889461336177 13.319916659056062 5
      10798  12.79386 0 1 0 0 0 1 0 0 1 1 664    .2113696577472681  .23155970456890498 -.36921889461336177 12.582489824064256 5
      11775 13.062508 0 0 0 0 1 0 1 1 0 1 569 -.025785394110011604 -.09126934181707551 -.36921889461336177 13.088293023504542 5
      end
      label values stata_date1 stata_date1
      label def stata_date1 1 "2013m1", modify
      label values ward_num ward_num
      label def ward_num 42 "E05009388", modify
      label def ward_num 43 "E05009389", modify
      label def ward_num 313 "E05013659", modify
      label def ward_num 509 "E05013907", modify
      label def ward_num 294 "E05013640", modify
      label def ward_num 299 "E05013645", modify
      label def ward_num 569 "E05013987", modify
      label def ward_num 610 "E05014028", modify
      label def ward_num 664 "E05014104", modify


      Comment


      • #4
        OK, now that I see the layout of your data, that makes more sense. I assumed you had a single variable called country, with separate observations for separate countries (long data layout), but you have wide data here. I also assumed that the values of the variable country were names of, or codes for, countries themselves, whereas what you have are population counts for the country the variable is named after.

        So, the root of your problem is not where I first though. Rather the problem is with
        Code:
        *Calculating the mean residual for wards in the highest and lowest quantile.
        by stata_date1: egen high_resid_`country' = mean(resid) if quantile_`country' == 5
        by stata_date1: egen low_resid_`country' = mean(resid) if quantile_`country' == 1
        Those should instead be:
        Code:
        *Calculating the mean residual for wards in the highest and lowest quantile.
        by stata_date1: egen high_resid_`country' = mean(cond(quantile_`country' == 5, resid, .))
        by stata_date1: egen low_resid_`country' = mean(cond(quantile_`country' == 1, resid, .))

        Comment


        • #5
          Thank you ! Output as expected.

          Comment

          Working...
          X