  • Extract individual coefficients from a regression (-reghdfe-) to map them

    Hello everyone,

    I want to extract individual coefficients from a regression run with -reghdfe- (from SSC). The goal is then to map those coefficients with -geoplot- (from SSC) for each postal code, grouped into distance bins measured from the city center of Madrid. I've created several distance bins based on whether the postcode in question falls within the 25th, 50th, or 75th percentile of distance.

    I don't know if what I'm doing is appropriate and correct, so I'd like some help, please.

    Here's my code for building the dataset:

    Code:
    clear all
    version 17
    set more off
    set varabbrev off
    
        if c(username) == "MIDUARTE" {
            global project "C:/Users/miduarte/Desktop"
        }
    
    use "${project}/Cars_PredictedPrices_Metals_IV_xoutliers.dta", clear
    
    drop if province != 30
    drop make model engine_cap_cc liters weight_max seating_cap kw_power co2_g_km car_type car_origin car_renting fueltype euro_emissions_standard ln_avg_price_original ln_avgp_hat_all_fe avg_price_pred_fe_original ln_avgp_hat_all_fe avg_price_pred_re_original ln_avgp_hat_all_re hh_size steel_avg_month_p_eur_per_ton steel_avg_month_p_eur_per_kg iv_steel_p_times_car_weight iv_neodymium_eur_kg_hev_ev iv_neodymium_eur_ton_hev_ev iv_lithium_eur_kg_hev_ev iv_lithium_eur_ton_hev_ev
    
    // Create a dummy for EVs and hybrids and another one for combustion
    decode powertraintype, gen(tipo)
    gen EV = volumes_per_postal_code if inlist(powertraintype, 1, 3, 5, 6)
    gen combustion = volumes_per_postal_code if powertraintype == 2
    gen electric = volumes_per_postal_code if inlist(powertraintype, 1)
    gen hybrid = volumes_per_postal_code if inlist(powertraintype, 3, 5, 6)
    
    
    // Sum all of the EVs and combustion cars per month per zip code
    collapse (sum) EV combustion electric hybrid volumes_per_postal_code (first) pop geodesic_distance_km pp_euro hh_i1 hh_i2 hh_i3 hh_i4 hh_i5 province, by(ym_date postal_code)
    
    // Find the total vehicles for each zip code and each month by summing up the previous variables
    gen total_cars = electric + hybrid + combustion
    
    // Now find the share of EVs per month per zip code
    gen EV_share = EV/total_cars
    replace EV_share = 0 if missing(EV_share)
    gen combustion_share = combustion/total_cars
    replace combustion_share = 0 if missing(combustion_share)
    gen electric_share = electric/total_cars
    replace electric_share = 0 if missing(electric_share)
    gen hybrid_share = hybrid/total_cars
    replace hybrid_share = 0 if missing(hybrid_share)
    
    label var EV_share "Share of EVs in total vehicles by postal code and month"
    
    label var combustion_share "Share of combustion cars in total vehicles by postal code and month"
    
    label var electric_share "Share of electric cars in total vehicles by postal code and month"
    
    label var hybrid_share "Share of hybrid cars in total vehicles by postal code and month"
    
    replace geodesic_distance_km = 0 if postal_code == 28004 | postal_code == 28005 | postal_code == 28012 | postal_code == 28013
    
    * Get quartile values
    _pctile geodesic_distance_km, p(10 25 50 75)
    local p10 = r(r1)
    local p25 = r(r2)
    local p50 = r(r3)
    local p75 = r(r4)
    
    gen dist_category_mad = .
    replace dist_category_mad = 1 if geodesic_distance_km < 5
    replace dist_category_mad = 2 if geodesic_distance_km >= 5  & geodesic_distance_km <= `p25'
    replace dist_category_mad = 3 if geodesic_distance_km > `p25' & geodesic_distance_km <= `p50'
    replace dist_category_mad = 4 if geodesic_distance_km > `p50' & geodesic_distance_km <= `p75'
    replace dist_category_mad = 5 if geodesic_distance_km > `p75'
    
    replace pp_euro = round(pp_euro) // Rounds values to nearest integer
    
    _pctile pp_euro, p(10 20 30 40 50 60 70 80 90)
    local p10 = r(r1)
    local p20 = r(r2)
    local p30 = r(r3)
    local p40 = r(r4)
    local p50 = r(r5)
    local p60 = r(r6)
    local p70 = r(r7)
    local p80 = r(r8)
    local p90 = r(r9)
    
    gen pp_euro_bins = .
    replace pp_euro_bins = 1 if pp_euro < `p10'
    replace pp_euro_bins = 2 if pp_euro >= `p10' & pp_euro < `p20'
    replace pp_euro_bins = 3 if pp_euro >= `p20' & pp_euro < `p30'
    replace pp_euro_bins = 4 if pp_euro >= `p30' & pp_euro < `p40'
    replace pp_euro_bins = 5 if pp_euro >= `p40' & pp_euro < `p50'
    replace pp_euro_bins = 6 if pp_euro >= `p50' & pp_euro < `p60'
    replace pp_euro_bins = 7 if pp_euro >= `p60' & pp_euro < `p70'
    replace pp_euro_bins = 8 if pp_euro >= `p70' & pp_euro < `p80'
    replace pp_euro_bins = 9 if pp_euro >= `p80' & pp_euro < `p90'
    replace pp_euro_bins = 10 if pp_euro >= `p90'
    
    gen pp_eurok = pp_euro / 1000
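
    As a side note, the manual decile construction above could probably be collapsed into a single -xtile- call (an untested sketch; note that -xtile- handles ties at the cutoffs slightly differently from the manual >=/< conditions):

    Code:
    * Possible one-line alternative to the manual pp_euro decile bins:
    xtile pp_euro_bins = pp_euro, nq(10)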

    Here is the resulting dataset, which I use for my subsequent regressions:


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float(EV_share dist_category_mad) int ym_date float pp_eurok long postal_code
       .015625 1 660 26.472 28001
     .02586207 1 660 23.912 28002
             0 1 660 25.015 28003
             0 1 660 22.623 28004
           .02 1 660 22.329 28005
    .013157895 1 660  25.63 28006
     .05384615 1 660 24.233 28007
    .033333335 1 660 23.012 28008
    .014084507 1 660 24.512 28009
     .05769231 1 660  25.14 28010
             0 1 660 17.867 28011
             0 1 660 19.981 28012
          .125 1 660 23.254 28013
           .04 1 660 24.181 28014
    .033333335 1 660 23.283 28015
    .015873017 2 660 25.648 28016
    .012195121 2 660 17.046 28017
     .07017544 2 660 15.323 28018
    .029411765 1 660 17.012 28019
             0 1 660 23.919 28020
             0 2 660 14.479 28021
             0 2 660 20.346 28022
     .02173913 2 660 26.487 28023
      .0483871 2 660 17.947 28024
    .013157895 2 660 15.854 28025
     .01923077 1 660 15.737 28026
             0 2 660 21.453 28027
    .011235955 1 660 23.434 28028
             0 2 660 18.995 28029
    .030612245 2 660 19.353 28030
    .010526316 2 660 16.957 28031
    .012658228 2 660 17.841 28032
          .016 2 660 20.785 28033
             0 2 660 21.165 28034
    .022727273 2 660 22.512 28035
    .014492754 2 660 26.025 28036
             0 2 660 15.264 28037
             0 1 660 14.943 28038
     .02469136 1 660 18.626 28039
             0 1 660 25.726 28040
    .008333334 2 660 15.152 28041
     .01869159 3 660 22.868 28042
    .007633588 2 660 23.397 28043
    .011627907 2 660   16.1 28044
     .04166667 1 660 21.333 28045
    .023255814 2 660 25.088 28046
     .01369863 1 660 17.104 28047
             0 3 660 16.456 28048
             0 3 660 14.601 28049
    .016574586 2 660 24.136 28050
    .021276595 2 660 18.561 28051
    .024390243 2 660 14.242 28053
             0 2 660 17.158 28054
     .10526316 2 660 21.651 28055
             0 3 660 18.062 28100
    .004474273 3 660 24.742 28109
    .033333335 4 660 17.082 28110
             0 4 660 23.326 28120
             0 4 660 17.941 28130
          .125 4 660 15.265 28140
             0 4 660 13.099 28150
             0 5 660 14.472 28160
             0 5 660 18.329 28200
             0 4 660 17.654 28210
             0 5 660 17.185 28211
             0 5 660 13.296 28212
             0 5 660 17.186 28219
        .01875 3 660 21.174 28220
        .03125 3 660 24.532 28221
             0 3 660 24.323 28222
             0 2 660 23.521 28223
    .010638298 2 660 25.633 28224
     .02857143 4 660 17.882 28229
             0 3 660 23.321 28231
             0 3 660 24.169 28232
     .11111111 4 660 18.421 28240
             0 4 660 22.299 28250
             0 4 660 16.259 28260
             0 4 660 18.017 28270
     .04761905 5 660 17.567 28280
             0 3 660 22.944 28290
             0 5 660 14.522 28293
             0 5 660 12.889 28296
    .014492754 5 660 14.446 28300
    .015873017 3 660 15.271 28320
             0 4 660 12.366 28330
             0 4 660 14.595 28341
             0 4 660 14.528 28342
             0 4 660 14.554 28343
             0 4 660 13.199 28350
             0 4 660 10.634 28359
             0 5 660  9.942 28360
             0 5 660 12.282 28370
             0 5 660 11.103 28380
             0 5 660 11.216 28390
             0 4 660 16.347 28400
     .06666667 4 660 16.937 28410
             0 4 660 16.613 28412
             0 5 660 19.674 28413
             0 4 660 21.046 28420
    end
    format %tm ym_date

    Next, the aim is to run the regression below, extract the coefficients for each postcode, average them within the distance bin each postcode belongs to, and then map those averages.

    I'm not sure I'm doing the right thing. Here's my code:

    Code:
    
    reghdfe EV_share, absorb(postal_code) resid
    predict double resid1, resid
    gen EV_hat_postal = EV_share - resid1
    
    
    reghdfe EV_share, absorb(ym_date) resid
    predict double resid2, resid
    gen EV_hat_date = EV_share - resid2
    
    reghdfe EV_share ib5.dist_category_mad##ib700.ym_date c.pp_eurok##ib700.ym_date, absorb(postal_code ym_date) vce(cluster postal_code)
    predict double xb_hat, xb
    
    
    gen EV_hat = xb_hat + EV_hat_postal + EV_hat_date
    
    collapse (mean) EV_hat, by(postal_code dist_category_mad)
    
    preserve
    collapse (mean) EV_hat, by(dist_category_mad)
    rename EV_hat avg_EV_bin
    save bin_avg.dta, replace
    restore
    
    merge m:1 dist_category_mad using bin_avg.dta, nogen


    Is this the right way to extract the coefficients? Is it correct to use -predict- in this setup? Does -predict- take fixed effects into account from the outset or not?
    Thank you very much.
    Last edited by Michael Duarte Goncalves; 22 Apr 2025, 06:12.

  • #2
    The dataex is unusable: it contains only a single time period, so every observation is a singleton.

    When you include the "resid" option, reghdfe already stores the residuals as _reghdfe_resid (or under whatever name you give the option). There's no need to predict them afterwards.
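
    For example, a minimal sketch using the variable names from #1 (assuming -resid()- accepts a name, as the reghdfe help describes for -residuals()-):

    Code:
    reghdfe EV_share, absorb(postal_code) resid(resid1)
    * resid1 now holds the residuals directly; no -predict- step needed
    gen EV_hat_postal = EV_share - resid1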

    I'm not sure what the first two models are for.

    If you want the prediction, you can run -predict- after reghdfe with the xbd option, which includes the absorbed fixed effects.
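
    A sketch of what that would look like with the main model from #1 (same variable names assumed):

    Code:
    reghdfe EV_share ib5.dist_category_mad##ib700.ym_date c.pp_eurok##ib700.ym_date, absorb(postal_code ym_date) vce(cluster postal_code)
    predict double EV_hat, xbd    // xb plus the absorbed postal_code and ym_date effects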

    The collapse is giving you averaged predictions, not coefficients. The coefficients are stored in r(table) and e(b).
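
    For instance, after the estimation you can pull them out like this (a sketch; the factor-variable name shown is just one example coefficient from the model in #1):

    Code:
    matrix b = e(b)                       // row vector of all coefficients
    matrix T = r(table)                   // coefficients, SEs, CIs, p-values
    display _b[1.dist_category_mad]       // a single coefficient by name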


