Grouping residuals by month and taking the difference

Andrew Clarke

Join Date: Nov 2023
Posts: 6

Grouping residuals by month and taking the difference

09 Nov 2023, 06:37

Using STATA 18 and I am trying to capture variation in the location, (London wards), time fixed effects for property prices in London, using a hedonic pricing model.
This is using property price data with dummies for property characteristics.

I am trying to capture that using reghdfe and create two groups, wards with high/low residents born of foreign countries', with the idea of taking the difference in the residuals between those two groups, in each month.

Problem is that when I create this price difference variable (pdiff_) for each country combination, STATA does not take the difference between the two groups but between each observation. An observation (transaction) cannot be in a ward that has both high pop. of resident from country X and low at the same time - it is one or the other. Hence there will be an empty value when taking the difference.

Can someone please point me out on how I can group these categories so that STATA does not try and take the difference at the individual observation level?

Many thanks

Code:

encode stata_date, generate(stata_date1)
encode ward_code, generate(ward_num)
*Calculating ward fixed effects and time effects
. reghdfe ln_price_paid propType_* oldNew_* duration_*, absorb(ward_fe=ward_num time_fe=stata_date1) cluster(ward_num) residuals(resid)
(MWFE estimator converged in 5 iterations)
note: propType_5 omitted because of collinearity
note: oldNew_2 omitted because of collinearity
note: duration_2 omitted because of collinearity

HDFE Linear regression                            Number of obs   =  1,172,017
Absorbing 2 HDFE groups                           F(   6,    679) =    1301.66
Statistics robust to heteroskedasticity           Prob > F        =     0.0000
                                                  R-squared       =     0.4221
                                                  Adj R-squared   =     0.4217
                                                  Within R-sq.    =     0.2309
Number of clusters (ward_num) =        680        Root MSE        =     0.6098

                             (Std. err. adjusted for 680 clusters in ward_num)
------------------------------------------------------------------------------
             |               Robust
ln_price_p~d | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
  propType_1 |   .4683893   .0138025    33.94   0.000     .4412886      .49549
  propType_2 |   .1664204   .0305239     5.45   0.000     .1064879    .2263529
  propType_3 |  -.5524648   .0416714   -13.26   0.000     -.634285   -.4706446
  propType_4 |   .1341584   .0059325    22.61   0.000     .1225101    .1458066
  propType_5 |          0  (omitted)
    oldNew_1 |  -.1989015   .0101902   -19.52   0.000    -.2189096   -.1788934
    oldNew_2 |          0  (omitted)
  duration_1 |   .7961511   .0327772    24.29   0.000     .7317943     .860508
  duration_2 |          0  (omitted)
       _cons |   12.75263   .0318903   399.89   0.000     12.69001    12.81525
------------------------------------------------------------------------------

Absorbed degrees of freedom:
-----------------------------------------------------+
 Absorbed FE | Categories  - Redundant  = Num. Coefs |
-------------+---------------------------------------|
    ward_num |       680         680           0    *|
 stata_date1 |       124           1         123     |
-----------------------------------------------------+
* = FE nested within cluster; treated as redundant for DoF computation

. 
. predict residuals, xbd

Code:

*Looping through countries to calculate the price difference based on residuals

foreach country in country1 country2 {
    *Categorising wards into quantiles based on the number of residents from each country. *country variables represent the residents demography in each ward.
    by stata_date1: egen quantile_`country' = xtile(`country'), nquantiles(5)

    *Calculating the mean residual for wards in the highest and lowest quantile.
    by stata_date1: egen high_resid_`country' = mean(resid) if quantile_`country' == 5
    by stata_date1: egen low_resid_`country' = mean(resid) if quantile_`country' == 1
    
    *Calculating the price difference for each time period within the same country and date.
    by stata_date1: gen pdiff_`country' = high_resid_`country' - low_resid_`country'
    
.
(940,103 missing values generated)
(933,254 missing values generated)
(1,172,017 missing values generated)
(939,203 missing values generated)
(928,913 missing values generated)
(1,172,017 missing values generated)

}

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 30087
#2

09 Nov 2023, 10:54

You don't provide any example data, but the following commands make no sense to me, and I suspect this is the source of your difficulty:

Code:

foreach country in country1 country2 { *Categorising wards into quantiles based on the number of residents from each country. *country variables represent the residents demography in each ward. by stata_date1: egen quantile_`country' = xtile(`country'), nquantiles(5)

`country' refers to either country1 or country2, which, I assume, are either names of countries or alphanumeric or numeric country codes. If that's not the case, then I can't make sense of what the loop is even trying to do. Assuming it is the case, then -xtile(`country')- makes no sense. If `country' is a name or alphanumeric code, then it doesn't even have numeric values and you can't take percentiles of it. If `country' is a numeric code for the countries, then this is definable, but useless since the country codes themselves must be arbitrary numbers, and percentiles make no sense for variables that have no intrinsic order.

My guess is you meant -by stata_date1: egen quantile_`country' = xtile(rbfc) if country == `country', nquantiles(5)-, where rbfc should be replaced by the name of a variable that gives the proportion of residents born in foreign countries.

If this is not correct, please post back including example data, using the -dataex- command to do that. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

When asking for help with code, always show example data. When showing example data, always use -dataex-.
Comment

Andrew Clarke

Join Date: Nov 2023
Posts: 6

10 Nov 2023, 04:32

Thank you for pointing that out, here is a shorter version of the dataset (one country only).

Each country variable contains a number which is the demography for a specific ward_code across my timeseries.
egen quantile_`country' = xtile(`country'), nquantiles(5) is supposed to be assigning each observation into quantiles for each country/columns depending on how many residents born abroad live in that location, so it contains numeric values.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input int unitedkingdom float ln_price_paid byte(propType_1 propType_2 propType_3 propType_4 propType_5 oldNew_1 oldNew_2 duration_1 duration_2) long(stata_date1 ward_num) double(resid ward_fe time_fe residuals) float quantile_unitedkingdom
3015 13.924365 0 1 0 0 0 1 0 0 1 1  42   .2885960223142587    1.2848389001657345 -.36921889461336177 13.635769021325878 1
4557 13.206704 0 1 0 0 0 1 0 0 1 1 313 -.21841384454139406    1.0741878644209915 -.36921889461336177 13.425117984250866 1
4557 13.560618 0 1 0 0 0 1 0 0 1 1 313  .13550041632286508    1.0741878644209915 -.36921889461336177 13.425117984250866 1
3333 12.341477 0 1 0 0 0 0 1 0 1 1 509  -.2078067353061163 -.0005474776466130511 -.36921889461336177  12.54928412941012 1
3086 13.353476 0 1 0 0 0 1 0 0 1 1  43 -.44312630386356205    1.4456717640703318 -.36921889461336177 13.796601874542272 1
11462  12.15478 0 1 0 0 0 1 0 0 1 1 294  -.09823725430433979 -.09791341600256881 -.36921889461336177 12.253016688508442 5
11314 13.028052 1 0 0 0 0 1 0 1 0 1 299  -.46677471466636233 .045776907909534426 -.36921889461336177 13.494827044683452 5
11654 13.854732 0 0 0 0 1 1 0 1 0 1 610    .5348149006973558  .33925582768317936 -.36921889461336177 13.319916659056062 5
10798  12.79386 0 1 0 0 0 1 0 0 1 1 664    .2113696577472681  .23155970456890498 -.36921889461336177 12.582489824064256 5
11775 13.062508 0 0 0 0 1 0 1 1 0 1 569 -.025785394110011604 -.09126934181707551 -.36921889461336177 13.088293023504542 5
end
label values stata_date1 stata_date1
label def stata_date1 1 "2013m1", modify
label values ward_num ward_num
label def ward_num 42 "E05009388", modify
label def ward_num 43 "E05009389", modify
label def ward_num 313 "E05013659", modify
label def ward_num 509 "E05013907", modify
label def ward_num 294 "E05013640", modify
label def ward_num 299 "E05013645", modify
label def ward_num 569 "E05013987", modify
label def ward_num 610 "E05014028", modify
label def ward_num 664 "E05014104", modify

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30087
#4

10 Nov 2023, 09:52

OK, now that I see the layout of your data, that makes more sense. I assumed you had a single variable called country, with separate observations for separate countries (long data layout), but you have wide data here. I also assumed that the values of the variable country were names of, or codes for, countries themselves, whereas what you have are population counts for the country the variable is named after.

So, the root of your problem is not where I first though. Rather the problem is with

Code:

*Calculating the mean residual for wards in the highest and lowest quantile. by stata_date1: egen high_resid_`country' = mean(resid) if quantile_`country' == 5 by stata_date1: egen low_resid_`country' = mean(resid) if quantile_`country' == 1

Those should instead be:

Code:

*Calculating the mean residual for wards in the highest and lowest quantile. by stata_date1: egen high_resid_`country' = mean(cond(quantile_`country' == 5, resid, .)) by stata_date1: egen low_resid_`country' = mean(cond(quantile_`country' == 1, resid, .))
Comment
Andrew Clarke

Join Date: Nov 2023

Posts: 6
#5

13 Nov 2023, 04:00

Thank you ! Output as expected.
Comment

Announcement

Grouping residuals by month and taking the difference

Comment

Comment

Comment

Comment