How to transform or work with heavy-tailed dependent variable

Iris Voncken

Join Date: Jun 2022

Posts: 11
#1

23 Jun 2022, 04:36

Dear all,

I'm working on my master's thesis and am looking at the effects of culture (individualism vs. collectivism) on vaccination acceptance across European countries. I am using panel data and a random effects model, and the data range from February 2020 to February 2022. This is because I use both static (cultural) and dynamic (COVID-19 indicator) variables in my research. My dependent variable is "fully vaccinated people per hundred". Judging by its histogram and quantile plot, it seems to be heavy-tailed. I am wondering what I can do to make sure my results make sense. Can I transform the dependent variable in any way? Or do you recommend something else?

See my histogram and quantile plot below:
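Plots of this kind can be produced along the following lines. This is only a sketch: it assumes the dependent variable is named people_fully_vaccinated_per_hund, as in the do-file further down, and uses -qnorm- for the quantile plot, which may not be the exact command used for the original figures.

```stata
* Histogram with a normal density overlaid for comparison
histogram people_fully_vaccinated_per_hund, normal

* Quantiles of the variable against quantiles of a normal distribution;
* points bending away from the reference line at the ends suggest heavy tails
qnorm people_fully_vaccinated_per_hund

* Summary statistics including skewness and kurtosis
* (kurtosis well above 3 points to tails heavier than the normal)
summarize people_fully_vaccinated_per_hund, detail
```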

My random effects model can be seen below:

Individualism: Every country is rated between 0-100 (Hofstede index) (static)
Excess mortality: Logged, weekly data, calculated as: excess deaths = reported deaths - expected deaths
GDP per capita: Logged (static)
Population: Logged (static)
Government trust: Percentage (static)
Median age: Absolute number (static)

DO-FILE:
clear
// ssc inst asdoc
import excel "/Users/Administrator/Documents/MASTERTHESISDATA.xlsx", sheet("TRY-OUT") firstrow

// set time: convert the string date to a Stata daily date and declare the panel
gen sdate = date(date, "YMD")
format sdate %td
rename location country
encode country, gen(scountry)
xtset scountry sdate

// change variables
gen logpopulation = ln(population)
gen loggdppercapita = ln(gdp_per_capita)
rename total_vaccinations_per_hundred vaccinationsperhundred
// note: ln() returns missing for zero or negative values, so observations
// with non-positive excess mortality drop out of the regression below
gen logexcessmortality = ln(excess_mortality)

// xtline: one panel per country, then all countries overlaid
xtline vaccinationsperhundred
xtline vaccinationsperhundred, overlay

// reg: random-effects model
xtreg people_fully_vaccinated_per_hund Individualism logexcessmortality loggdppercapita logpopulation Governmenttrust median_age, re

Please mind: I am still finding my way when it comes to this analysis. Other suggestions regarding other variables / research methods are welcome as well.

Last edited by Iris Voncken; 23 Jun 2022, 05:10.
Tags: heavy-tailed, master thesis, panel data, random effects, regression
Maxence Morlet

Join Date: Mar 2021

Posts: 652
#2

23 Jun 2022, 06:06

You could try winsorizing your data (e.g. replacing values below the 1st percentile and above the 99th percentile of each variable with those percentile values). Log transformations might help.
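A minimal sketch of both ideas using only built-in commands, assuming the variable names from the do-file in #1. The new variable names vacc_w and ln_vacc are made up for illustration. Winsorizing caps extreme values at the chosen percentiles and keeps the observations; dropping them instead is usually called trimming.

```stata
* Store the 1st and 99th percentiles of the dependent variable
quietly summarize people_fully_vaccinated_per_hund, detail
local p1  = r(p1)
local p99 = r(p99)

* Winsorized copy: cap values at those percentiles, leave missing values alone
generate double vacc_w = people_fully_vaccinated_per_hund
replace vacc_w = `p1'  if vacc_w < `p1'  & !missing(vacc_w)
replace vacc_w = `p99' if vacc_w > `p99' & !missing(vacc_w)

* Log transformation: only defined for strictly positive values,
* so country-dates with zero vaccinations become missing
generate double ln_vacc = ln(people_fully_vaccinated_per_hund) if people_fully_vaccinated_per_hund > 0
```

Either new variable can then replace the original dependent variable in the -xtreg- call to see how sensitive the results are.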

As an aside, you mentioned you are using a random effects model. I presume you ran a Hausman test that indicated that the random effects model will be valid.

There's a big issue with this model though; one of its identification assumptions is that regressors are uncorrelated with the unobserved heterogeneity (which the fixed-effects estimation wipes out). This assumption is simply implausible; you'll have a hard time convincing anyone that it holds unfortunately... If you are aiming to perhaps publish your MSc thesis especially, you may want to throw a two-way fixed effects estimation at this problem...
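For reference, a two-way fixed effects specification could be sketched roughly as below. The variable names come from the do-file in #1; the month-level period dummies (mdate) are an assumption made here to avoid one dummy per day, not something taken from the original analysis.

```stata
* Monthly period identifier derived from the daily date
generate mdate = mofd(sdate)
format mdate %tm

* Country fixed effects via -fe-, period effects via i.mdate,
* standard errors clustered by country.
* Regressors that are constant within country (Individualism, GDP per capita,
* population, government trust, median age) are collinear with the country
* effects and would be omitted, so only the time-varying regressor is listed.
xtreg people_fully_vaccinated_per_hund logexcessmortality i.mdate, fe vce(cluster scountry)
```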
Iris Voncken

Join Date: Jun 2022

Posts: 11
#3

23 Jun 2022, 06:27

Originally posted by Maxence Morlet

You could try winsorizing your data (e.g. replacing values below the 1st percentile and above the 99th percentile of each variable with those percentile values). Log transformations might help.

As an aside, you mentioned you are using a random effects model. I presume you ran a Hausman test that indicated that the random effects model will be valid.

There's a big issue with this model though; one of its identification assumptions is that regressors are uncorrelated with the unobserved heterogeneity (which the fixed-effects estimation wipes out). This assumption is simply implausible; you'll have a hard time convincing anyone that it holds unfortunately... If you are aiming to perhaps publish your MSc thesis especially, you may want to throw a two-way fixed effects estimation at this problem...

Thank you for your response. Yes, my Hausman test looks as follows:

The do-file commands I used:
//hausman
xtreg people_fully_vaccinated_per_hund Individualism logexcessmortality loggdppercapita logpopulation Governmenttrust median_age, fe
estimates store fe
xtreg people_fully_vaccinated_per_hund Individualism logexcessmortality loggdppercapita logpopulation Governmenttrust median_age, re
estimates store re
hausman fe re

I shall take on your suggestions and try to see if a two-way fixed effects estimation works better.
Any additional comments are always welcome.
Thanks!
Maxence Morlet

Join Date: Mar 2021

Posts: 652
#4

23 Jun 2022, 06:39

Well first of all the Hausman test makes very unrealistic assumptions, e.g. homoscedasticity.

But even if it indicates that random effects is valid, would you be willing to believe that regressors are uncorrelated with the unobserved heterogeneity?

Let me give an example:

Suppose I have panel data on final year university students, and regress their marks obtained at the final exams on a dummy indicating whether they resided in university accommodation during their studies. Suppose I also have the following covariates: their age, socio-economic status, and parental income.

Random effects assumes that for instance their age is uncorrelated with unobserved heterogeneity, for instance their motivation. Would you be willing to believe that? I would personally argue that age, and with it maturity, strongly correlates with motivation. Same for parental income.

To sum up, the assumption that your regressors are uncorrelated with unobserved time-invariant factors is simply extremely difficult to support...
Clyde Schechter

Join Date: Apr 2014

Posts: 30356
#5

23 Jun 2022, 10:40

While I think Maxence Morlet makes a good argument against the plausibility of the assumption that regressors are uncorrelated with unobserved time-invariant factors, this overlooks the fact that one of the key variables of interest here is cultural, which will be invariant within country, and therefore its effects cannot be estimated in a fixed-effects model. A fixed effects model here is simply incompatible with the research goals. I would suggest using the Mundlak correlated random effects model, which is implemented in the -xthybrid- command, available from SSC.
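The idea behind the Mundlak approach can also be sketched by hand with built-in commands. This is an illustration under the variable names from #1, not the -xthybrid- syntax itself; the name mean_logexcess is made up here. The country mean of each time-varying regressor is added to the random-effects model, so the time-invariant cultural variable stays estimable.

```stata
* Country-specific mean of the only time-varying regressor,
* computed over observations where the outcome is observed
bysort scountry: egen double mean_logexcess = mean(logexcessmortality) if !missing(people_fully_vaccinated_per_hund)

* Random-effects model augmented with the country mean (Mundlak device),
* with standard errors clustered by country
xtreg people_fully_vaccinated_per_hund Individualism logexcessmortality mean_logexcess ///
    loggdppercapita logpopulation Governmenttrust median_age, re vce(cluster scountry)

* A significant coefficient on the mean indicates that the plain
* random-effects assumption is violated for this regressor
test mean_logexcess
```

The coefficient on logexcessmortality in this model is the within-country estimate, while Individualism is still identified from between-country variation. The -test- at the end works as a Hausman-type check that does not rely on homoscedasticity.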
Iris Voncken

Join Date: Jun 2022

Posts: 11
#6

26 Jun 2022, 09:27

Originally posted by Clyde Schechter

While I think Maxence Morlet makes a good argument against the plausibility of the assumption that regressors are uncorrelated with unobserved time-invariant factors, this overlooks the fact that one of the key variables of interest here is cultural, which will be invariant within country, and therefore its effects cannot be estimated in a fixed-effects model. A fixed effects model here is simply incompatible with the research goals. I would suggest using the Mundlak correlated random effects model, which is implemented in the -xthybrid- command, available from SSC.

Thank you for your response! I was indeed experiencing issues with a fixed-effects model due to my cultural variable Individualism. Your recommendation is very helpful!
Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#7

27 Jun 2022, 10:29

Focusing just on this part of the question:

I'm working on my master's thesis and am looking at the effects of culture (individualism vs. collectivism) on vaccination acceptance across European countries. I am using panel data and a random effects model, and the data range from February 2020 to February 2022. This is because I use both static (cultural) and dynamic (COVID-19 indicator) variables in my research. My dependent variable is "fully vaccinated people per hundred". Judging by its histogram and quantile plot, it seems to be heavy-tailed.

The DV is a proportion. You are effectively scaling it in percentage points, but it is still a proportion. If this were not panel data, it would lend itself to things like fractional regression (fracreg).
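As a rough sketch of what that could look like, ignoring the panel structure apart from clustering the standard errors by country. The variable names come from #1, and vacc_prop is a hypothetical name introduced here.

```stata
* Rescale from "per hundred" to a proportion between 0 and 1
generate double vacc_prop = people_fully_vaccinated_per_hund / 100

* fracreg requires the outcome to lie in [0, 1]; stop if it does not
assert inrange(vacc_prop, 0, 1) if !missing(vacc_prop)

* Pooled fractional logit with standard errors clustered by country
fracreg logit vacc_prop Individualism logexcessmortality loggdppercapita ///
    logpopulation Governmenttrust median_age, vce(cluster scountry)

* Average marginal effect of Individualism on the proportion scale
margins, dydx(Individualism)
```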

Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.
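For example, something along these lines would post the first 20 observations of a few key variables. The variable list is an assumption based on the do-file in #1.

```stata
* dataex is built into recent Stata versions; if the command is not found,
* it can be installed with: ssc install dataex
dataex country sdate people_fully_vaccinated_per_hund Individualism logexcessmortality in 1/20
```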

When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.
