  • Predict-command after ppmlhdfe

    Hi all,

    In a dataset with one observation per firm, year and destination, I try two different ways of predicting trade flows.

    The data set has a lot of zeros. Therefore, I started by estimating a PPML model:

    Approach 1: PPML
    Code:
    ppmlhdfe exports log(distance) ... , vce(r) d absorb(industry firm_id)
    predict pred_exps1 if year == 2005, mu
    An alternative approach is to only take strictly positive trade flows into account and estimate the model in logs. After this, I predict trade flows in levels:

    Approach 2: Log-model
    Code:
    reghdfe log_exports log(distance) ... , vce(r) absorb(industry firm_id) resid
    predict pred_exps2_biased if year == 2005, xbd 
    gen pred_exps2 = exp(pred_exps2_biased) * exp(0.5 * e(rmse)^2) if year == 2005
    (The data set runs from 2005 to 2019.)

    How do I get the right standard errors in the PPML case, so that the two predictions become comparable? I would expect the same number of predicted observations and a correlation of 1 between pred_exps1 and pred_exps2.

    Best.
    Kathrin

  • #2
    You may want to cluster standard errors by firm to allow for within-firm dependence. In approach 2, all observations with 0 exports are dropped. Approach 1 is better for reasons detailed by Santos Silva and Tenreyro (2006, 2011, 2022).

    One of these reasons is Jensen's inequality: the expectation of the logarithm is not the logarithm of the expectation, so a model for E[ln(exports)] does not in general recover E[exports]. This explains a divergence in the coefficient estimates, and therefore in the fitted values. I would focus on approach 1 if I were you.
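
    To spell out the Jensen's inequality point for the two approaches in #1, here is a rough sketch. The symbols x, \beta, \varepsilon and \sigma^2 are my own notation (with the absorbed fixed effects folded into x'\beta), and the normality assumption in the last line is only what the exp(0.5 * e(rmse)^2) correction in approach 2 implicitly relies on, not something the data guarantee.

    ```latex
    % Approach 2 models the log of exports:
    \ln y = x'\beta + \varepsilon, \qquad \mathbb{E}[\varepsilon \mid x] = 0
    % Exponentiating and taking expectations gives
    \mathbb{E}[y \mid x] = \exp(x'\beta)\,\mathbb{E}[\exp(\varepsilon) \mid x]
    % Jensen's inequality (exp is convex):
    \mathbb{E}[\exp(\varepsilon) \mid x] \;\ge\; \exp\!\big(\mathbb{E}[\varepsilon \mid x]\big) = 1
    % Only if \varepsilon \mid x \sim N(0, \sigma^2) with constant \sigma^2:
    \mathbb{E}[y \mid x] = \exp\!\big(x'\beta + \tfrac{1}{2}\sigma^2\big)
    % Approach 1 (PPML) instead specifies the level directly:
    \mathbb{E}[y \mid x] = \exp(x'\beta)
    ```

    If the variance of \varepsilon depends on x, the factor E[exp(\varepsilon) | x] also varies with x, so a single constant scaling such as exp(0.5 * e(rmse)^2) cannot restore the conditional mean, and the predictions from approach 2 need not line up with the PPML ones.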

    • #3
      Thanks for your input Maxence Morlet. I'll cluster the std errors.

      Regarding the two approaches: Yes, I am dropping all observations with 0 exports. However, the predictions will be for all observations (also for the ones that had 0 exports).
      I am currently just trying to understand why the two approaches do not give me the same results. You are right about the exponent expectation relation, but I would nevertheless expect a correlation close to 1 (if not 1). But I only get a correlation of .4 at the moment.

      Is the mu-option in the ppml-case considering std. errors at all? How can I take care of the std. errors in the ppml-prediction-case?

      Best,
      Kathrin

      • #4
        You should never log a variable with zero values.

        I am currently just trying to understand why the two approaches do not give me the same results.

        As for why the predictions are only weakly correlated, you are underestimating the role of the zeros in the outcome. As Maxence states, logging zero values turns them into missing values, so the two models are estimated on different samples, and with many zeros in the data those samples differ a lot. If you want to compare like for like, use the same estimation sample.

        Code:
        ppmlhdfe exports log(distance) ... if exports>0 , cluster(firm_id) d absorb(industry firm_id)
