PPML, panel data - Statalist

Jorrit Gosens

Join Date: Jan 2015
Posts: 1019

#301

09 Aug 2019, 09:00

Many thanks Tom, your paper does indeed very specifically address my issue, and will be very helpful in writing up the discussion about these estimation techniques in my article.

I have some further questions now, regarding your ppmlhdfe command.
For my project, I divide the world's countries into 2 groups (A and B).
I also divide trade value into two groups (interesting and not-interesting), so I have two observations for each country-pair and year.
The question I have is whether trade in the goods of interest is higher (i.e., has a higher share) between A->A, A->B, B->A, or B->B.

For this purpose I create dummies for the interaction (as ppml did not allow factor variables), and drop dummies that I consider baselines (AA and 'not-interesting').
I then estimate:

Code:

ppmlhdfe tradevalueusd AB BA BB interesting intAB intBA intBB, a(expf#year impf#year expf#impf) cluster(expf#impf) nolog

If I do this in ppml (with time-varying country dummies, but excluding country-pair dummies), I get an estimate for all dummies.
If I use ppmlhdfe with the same variables, two dummies are dropped because of collinearity, and I cannot figure out why this would be. Do you have any insight?

Code:

. ppmlhdfe tradevalueusd AB BA BB interesting intAB intBA intBB if year>2010, a(expf#year impf#year expf#impf) cluster(expf#impf) nolog
(dropped 181656 observations that are either singletons or separated by a fixed effect)
warning: dependent variable takes very low values after standardizing (2.1388e-10)
note: 2 variables omitted because of collinearity: BA BB
Converged in 20 iterations and 104 HDFE sub-iterations (tol = 1.0e-08)

HDFE PPML regression                              No. of obs      =    142,680
Absorbing 3 HDFE groups                           Residual df     =     25,452
Statistics robust to heteroskedasticity           Wald chi2(5)    =    8356.41
Deviance             =  8.93317e+11               Prob > chi2     =     0.0000
Log pseudolikelihood = -4.46659e+11               Pseudo R2       =     0.9975

Number of clusters (expf#impf)=    25,453
                         (Std. Err. adjusted for 25,453 clusters in expf#impf)
------------------------------------------------------------------------------
             |               Robust
tradevalue~d |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          AB |  -.0059766   .0827166    -0.07   0.942    -.1680982     .156145
          BA |          0  (omitted)
          BB |          0  (omitted)
 interesting |   -4.99851   .0725082   -68.94   0.000    -5.140623   -4.856396
       intAB |  -.2686544   .1358391    -1.98   0.048    -.5348942   -.0024146
       intBA |   .0959186   .1709414     0.56   0.575    -.2391204    .4309577
       intBB |  -.0540459   .2374066    -0.23   0.820    -.5193544    .4112625
       _cons |   23.61517   .0165171  1429.74   0.000      23.5828    23.64754
------------------------------------------------------------------------------

Absorbed degrees of freedom:
-----------------------------------------------------+
 Absorbed FE | Categories  - Redundant  = Num. Coefs |
-------------+---------------------------------------|
   expf#year |       453           0         453     |
   impf#year |       693           3         690     |
   expf#impf |     25453       25453           0    *|
-----------------------------------------------------+
* = FE nested within cluster; treated as redundant for DoF computation

Last edited by Jorrit Gosens; 09 Aug 2019, 09:02.

Comment

Tom Zylkin

Join Date: Nov 2016

Posts: 188
#302

09 Aug 2019, 12:03

Hi Jorrit,
The reason ppml does not appear to drop any variables is that Stata often drops collinear variables from right to left (it does not know which variables you really care about). So in this case, the variables that are being dropped by ppmlhdfe are perfectly collinear with your fixed effects; there is actually no ambiguity about it here. If you don't believe me, you can also change your syntax for ppml so that your FE dummies are to the right of the other variables.

To see why this is, consider your country-pair (expf#impf) fixed effect. Note that this absorbs all time- and industry-invariant sources of variation in trade that are specific to each pair. Your AB, BA, and BB variables are pair-specific based on how you defined them, and they do not appear to vary by either time or industry. Thus, they cannot be identified independently of the pair fixed effect in this case. That said, it seems like you are mostly interested in the interaction terms, yes? In that case, it does not seem like a problem that these main effects are not identified.

Regards,
Tom
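
One way to check this point directly is to test whether each dummy ever changes within a country pair. The following is only a sketch on toy data: the variable names follow the thread, but the four-country panel and the assignment of countries to groups A and B are assumptions made for illustration.

```stata
* Toy panel (hypothetical data): 4 countries, 3 years, 2 product groups.
clear
set obs 4
gen byte expf = _n
expand 4
bysort expf: gen byte impf = _n
drop if expf == impf
expand 3
bysort expf impf: gen int year = 2010 + _n
expand 2
bysort expf impf year: gen byte interesting = _n - 1

* Assumption for the toy data: countries 1-2 are group A, 3-4 are group B.
gen byte AB = (expf <= 2) & (impf > 2)
gen byte BA = (expf > 2) & (impf <= 2)
gen byte BB = (expf > 2) & (impf > 2)
foreach v of varlist AB BA BB {
    gen byte int`v' = interesting * `v'
}

* A regressor that never varies within a pair is absorbed by expf#impf.
egen long pair_id = group(expf impf)
foreach v of varlist AB BA BB intAB intBA intBB {
    bysort pair_id (`v'): gen byte varies_`v' = `v'[1] != `v'[_N]
    quietly count if varies_`v'
    display "`v': obs in pairs where it varies = " r(N)
}
```

In this toy panel the count is zero for AB, BA, and BB and positive for the three interaction terms, which is why only the interactions can be identified alongside a pair fixed effect.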

Comment
Jorrit Gosens

Join Date: Jan 2015

Posts: 1019
#303

09 Aug 2019, 12:34

Ah, I see, thanks. This makes sense now.
I wouldn't have guessed that the two commands differ in the order in which they pick dummies to drop.
I was somehow also confused that one dummy was left in. I don't know why, but I figured that I had already dropped one of the possible 4 combinations of AA AB BA BB, and that all 3 remaining dummies should therefore not be reported if they were perfectly collinear with the country-pair FE. Maybe it was already too deep into the Friday afternoon to understand that.

But yes, I am entirely interested in the interaction, and can also say that in this case, the sign and significance level hardly change a bit when including the full set of FE. Maybe this is because the variable of interest is not a country characteristic?

Thanks a lot in any case, both for the paper/package, and the explanation here. Much appreciated.
Comment
Tom Zylkin

Join Date: Nov 2016

Posts: 188
#304

10 Aug 2019, 07:53

Hi Jorrit,
Glad to hear it.
Tom
Comment
Qiyangfan Feng

Join Date: Sep 2019

Posts: 9
#305

13 Sep 2019, 06:30

Dear Joao,

Nice to meet you!

Does this mean ppml could be used in almost all estimations? Could it be used in a model that treats FDI flows as the dependent variable, even when the data contain no zeros?
Comment
Joao Santos Silva

Join Date: Apr 2014

Posts: 3008
#306

13 Sep 2019, 07:57

Dear Qiyangfan Feng,

PPML can be used for any multiplicative model (e.g. gravity equation, Cobb-Douglas). For FDI the problem is that sometimes the data take on negative values, and in those cases PPML is unlikely to be suitable.

Best wishes,

Joao
Comment
Qiyangfan Feng

Join Date: Sep 2019

Posts: 9
#307

13 Sep 2019, 08:15

Dear Joao,

Thanks for your reply!

Please forgive my misrepresentation.
I want to measure the effect of FDI on economic growth, and treat economic growth (E-G) as the dependent variable. It has no negative values, but FDI does. Can I use ppml for this? And is it necessary to log E-G (it has a large standard deviation)?
Comment
Joao Santos Silva

Join Date: Apr 2014

Posts: 3008
#308

13 Sep 2019, 09:38

You should not log the dependent variable in a model estimated by PPML. I do not know the form of the model you want to estimate but PPML is suitable if you have a multiplicative model.

Joao
Comment
Qiyangfan Feng

Join Date: Sep 2019

Posts: 9
#309

13 Sep 2019, 18:23

Dear Joao,

Many thanks for your reply and patience.

I established a theoretical model, as follows: national endowments (including elementary education, R&D level, business environment, etc.) help countries absorb FDI. So I take national economic growth (GDP per capita) as the dependent variable, and FDI and all the endowment variables as the independent variables. The main independent variables are the interaction terms of FDI and the endowments. I think a linear equation may not be suitable for this, so I decided to use ppmlhdfe. I do not even know whether this is correct. Any suggestions?

Thanks Joao !
Comment
Qiyangfan Feng

Join Date: Sep 2019

Posts: 9
#310

14 Sep 2019, 09:03

Hello Joao
Comment
Umar Boodoo

Join Date: Mar 2020

Posts: 11
#311

01 Feb 2021, 03:52

Hello

I have some data on grants to geographic units: about 125 entities giving to 3134 counties over an 8-year period. About 90% of the observations are zero. Would PPML work in this instance?
My variables are as follows:

totalgrants - grants from entity X to county C in year t
distance - geographic distance from X to C
comm_normalized - normalized community score of entity X
population, MedianHouseholdIncome, gini, unemployment, socioecogrants - these all apply to county C at time t
size roa cashratio - these apply to entity X at time t
year* - year fixed effects (not sure if I should have them)

pair = pair of entity X and county C

I am running the following: ppml totalgrants distance comm_normalized population MedianHouseholdIncome gini unemployment socioecogrants size roa cashratio year*, cluster(pair);

In particular, I have the following warnings when I run PPML:

WARNING: totalgrants has very large values, consider rescaling
WARNING: population has very large values, consider rescaling or recentering
WARNING: gini has very large values, consider rescaling or recentering
WARNING: unemployment has very large values, consider rescaling or recentering
WARNING: socioecogrants has very large values, consider rescaling or recentering
WARNING: roa has very large values, consider rescaling or recentering

Number of regressors excluded to ensure that the estimates exist: 0
Number of observations excluded: 0

... the regression runs, but the estimates differ quite massively from OLS and Hausman-Taylor with log(1+totalgrants). It also comes with this warning:

WARNING: The model appears to overfit some observations with totalgrants=0

Any tips?
Comment
Joao Santos Silva

Join Date: Apr 2014

Posts: 3008
#312

01 Feb 2021, 04:07

Dear Umar Boodoo,

Yes, PPML should work in this case and it is not surprising that the results are different because using log(1+totalgrants) as the dependent variable leads to meaningless results. I suggest you use the command ppmlhdfe rather than ppml.

Best wishes,

Joao
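
One reason log(1+y) is problematic can be seen in a small simulation: its slope depends on the units in which y is measured, while the PPML slope does not. This is only a sketch under stated assumptions. The data are simulated, the true slope of 0.5 is made up, and the built-in poisson command with robust standard errors stands in for ppml/ppmlhdfe.

```stata
* Simulated data (hypothetical): E[y|x] = exp(-2 + 0.5*x), mostly zeros.
clear
set seed 12345
set obs 20000
gen double x = rnormal()
gen double y = rpoisson(exp(-2 + 0.5*x))
gen double y_big = 1000 * y      // same outcome in units 1000 times smaller

* PPML via the built-in poisson command with robust standard errors.
quietly poisson y x, vce(robust)
scalar b_ppml = _b[x]
quietly poisson y_big x, vce(robust)
scalar b_ppml_big = _b[x]

* OLS on log(1 + y): the slope changes with the units of y.
gen double ln1p_y = ln(1 + y)
gen double ln1p_big = ln(1 + y_big)
quietly regress ln1p_y x
scalar b_log = _b[x]
quietly regress ln1p_big x
scalar b_log_big = _b[x]

display "PPML slope:      " b_ppml "  (rescaled y: " b_ppml_big ")"
display "log(1+y) slope:  " b_log  "  (rescaled y: " b_log_big ")"
```

Rescaling y only shifts the PPML constant, so the two PPML slopes agree, whereas the two log(1+y) slopes differ from each other and from the value used to generate the data.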
Comment
Etienne Jenni

Join Date: Jan 2021

Posts: 2
#313

02 Feb 2021, 11:30

Dear Joao Santos Silva

I would like to analyse the effect of education on innovation. I have a panel data set with over 100 countries for around 20 years. My dependent variable is approximated by adjusted patent applications per country. Since this is not normally distributed, but non-negative and right-skewed (much like most trade data, if I am correct), I wanted to use the PPML estimator. I am unsure about the model specification, though. The gravity equation in trade is a multiplicative model, whereas I assume that my model is additive. My plan was to define it as follows: Patent = exp(b₁*education + b₂*controls + ... + e), so without logging the independent variables. Is this possible? Or is PPML estimation only applicable to multiplicative models, and therefore with logged variables?

Best wishes,
Etienne
Comment
Joao Santos Silva

Join Date: Apr 2014

Posts: 3008
#314

03 Feb 2021, 01:31

Dear Etienne Jenni

You can estimate a model like that with PPML, but note that it is still a multiplicative model because you have the exponential function on the right-hand side.

Best wishes,

Joao
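
To illustrate with a regressor entered in levels, here is a minimal sketch on simulated data. The coefficient values and the error distribution are assumptions, and the built-in poisson command with robust standard errors is used as a stand-in for ppml. Because exp(a + b*x) = exp(a) * exp(b*x), the model is multiplicative even though x is not logged.

```stata
* Simulated data (hypothetical): y = exp(1 + 0.5*x) * eta, with E[eta | x] = 1.
clear
set seed 2021
set obs 5000
gen double x = rnormal()                      // regressor in levels, not logged
gen double eta = exp(rnormal(-0.125, 0.5))    // multiplicative error with mean 1
gen double y = exp(1 + 0.5*x) * eta

* PPML: Poisson pseudo-ML with robust standard errors; y need not be a count.
poisson y x, vce(robust)

* With x in levels, b is a semi-elasticity: one more unit of x multiplies
* E[y | x] by exp(b), i.e. changes it by 100*(exp(b) - 1) percent.
display "multiplier per unit of x: " exp(_b[x])
```

If a regressor were entered in logs instead, its coefficient would be read as an elasticity; either form fits the same exponential conditional mean.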
Comment
Etienne Jenni

Join Date: Jan 2021

Posts: 2
#315

03 Feb 2021, 02:57

Dear Joao

Ah indeed, the overall specification is still multiplicative, just the term in the exponential function is additive I guess. Great, thank you for your help!

Best wishes,
Etienne
Comment
