  • How to transform or work with heavy-tailed dependent variable

    Dear all,

    I'm working on my master's thesis and am looking at the effects of culture (individualism vs. collectivism) on vaccination acceptance across European countries. I am using panel data and a random effects model, and the data range from February 2020 to February 2022. This is because I use both static (cultural) and dynamic (COVID-19 indicator) variables in my research. My dependent variable is "fully vaccinated people per hundred". Judging from its histogram and quantile plot, it seems to be heavy-tailed. I am wondering what I can do to make sure my results make sense. Can I transform the dependent variable in any way? Or do you recommend something else?

    See my histogram and quantile below:
    [Image: Screenshot 2022-06-23 at 12.24.45.png]
    [Image: Screenshot 2022-06-23 at 12.24.58.png]

    My random effects model can be seen below:
    [Image: Screenshot 2022-06-23 at 12.28.50.png]

    Individualism: Every country is rated between 0-100 (Hofstede index) (static)
    Excess mortality: Logged, weekly data, calculated as: excess deaths = reported deaths - expected deaths
    GDP per capita: Logged (static)
    Population: Logged (static)
    Government trust: Percentage (static)
    Median age: Absolute number (static)

    DO-FILE:
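    clear
    // ssc inst asdoc
    import excel "/Users/Administrator/Documents/MASTERTHESISDATA.xlsx", sheet("TRY-OUT") firstrow

    // set time (assumes date was imported as a string in year-month-day order)
    gen sdate = date(date, "YMD")
    format sdate %td
    rename location country
    encode country, gen(scountry)
    xtset scountry sdate

    // change variables
    gen logpopulation = ln(population)
    gen loggdppercapita = ln(gdp_per_capita)
    rename total_vaccinations_per_hundred vaccinationsperhundred
    // note: ln() returns missing for zero or negative values, so observations
    // with non-positive excess mortality drop out of the estimation sample
    gen logexcessmortality = ln(excess_mortality)

    // xtline (plots total vaccinations per hundred, not the dependent variable)
    xtline vaccinationsperhundred
    xtline vaccinationsperhundred, overlay

    // reg: random effects model with fully vaccinated people per hundred as the DV
    xtreg people_fully_vaccinated_per_hund Individualism logexcessmortality loggdppercapita logpopulation Governmenttrust median_age, re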

    Please note: I am still finding my way with this analysis. Suggestions regarding other variables or research methods are welcome as well.


    Last edited by Iris Voncken; 23 Jun 2022, 05:10.

  • #2
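    You could try winsorizing your data (e.g. replacing values below the 1st percentile and above the 99th percentile of each variable with those percentile values; deleting them instead would be trimming). Log transformations might also help.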

    As an aside, you mentioned you are using a random effects model. I presume you ran a Hausman test that indicated that the random effects model will be valid.

    There is a big issue with this model, though: one of its identifying assumptions is that the regressors are uncorrelated with the unobserved heterogeneity (which fixed-effects estimation wipes out). This assumption is simply implausible, and unfortunately you will have a hard time convincing anyone that it holds. Especially if you are aiming to publish your MSc thesis, you may want to try a two-way fixed effects estimation on this problem.
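    A minimal sketch of what winsorizing (as opposed to trimming) could look like for the dependent variable, assuming the variable names from the do-file in #1. The names vacc_w, lo_cut and hi_cut are hypothetical, and the percentiles here are pooled over all observations, which is itself a modelling choice rather than a recommendation:

    ```stata
    * Pooled 1st and 99th percentiles of the dependent variable
    _pctile people_fully_vaccinated_per_hund, percentiles(1 99)
    scalar lo_cut = r(r1)
    scalar hi_cut = r(r2)

    * Winsorize: cap extreme values at the cut-offs instead of deleting them
    generate double vacc_w = people_fully_vaccinated_per_hund
    replace vacc_w = scalar(lo_cut) if vacc_w < scalar(lo_cut)
    replace vacc_w = scalar(hi_cut) if vacc_w > scalar(hi_cut) & !missing(vacc_w)

    * Compare the original and winsorized distributions
    summarize people_fully_vaccinated_per_hund vacc_w, detail
    ```

    The number of observations stays the same after winsorizing; only the values in the two tails change.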



    • #3
      Originally posted by Maxence Morlet (#2):
      Thank you for your response. Yes, my Hausman test looks as follows:

      [Image: Screenshot 2022-06-23 at 14.16.24.png]

      The do-file commands I used:
      //hausman
      xtreg people_fully_vaccinated_per_hund Individualism logexcessmortality loggdppercapita logpopulation Governmenttrust median_age, fe
      estimates store fe
      xtreg people_fully_vaccinated_per_hund Individualism logexcessmortality loggdppercapita logpopulation Governmenttrust median_age, re
      estimates store re
      hausman fe re


      I will follow your suggestion and see whether a two-way fixed effects estimation works better.
      Any additional comments are always welcome.
      Thanks!




      • #4
        Well, first of all, the Hausman test rests on very unrealistic assumptions, e.g. homoscedasticity.

        But even if it indicates that random effects is valid, would you be willing to believe that regressors are uncorrelated with the unobserved heterogeneity?

        Let me give an example:

        Suppose I have panel data on final year university students, and regress their marks obtained at the final exams on a dummy indicating whether they resided in university accommodation during their studies. Suppose I also have the following covariates: their age, socio-economic status, and parental income.

        Random effects assumes that, for instance, their age is uncorrelated with unobserved heterogeneity such as their motivation. Would you be willing to believe that? I would personally argue that age, and with it maturity, strongly correlates with motivation. The same goes for parental income.

        To sum up, the assumption that your regressors are uncorrelated with unobserved time-invariant factors is simply extremely difficult to support...



        • #5
          While I think Maxence Morlet makes a good argument against the plausibility of the assumption that regressors are uncorrelated with unobserved time-invariant factors, this overlooks the fact that one of the key variables of interest here is cultural, which will be invariant within country, and therefore its effects cannot be estimated in a fixed-effects model. A fixed effects model here is simply incompatible with the research goals. I would suggest using the Mundlak correlated random effects model, which is implemented in the -xthybrid- command, available from SSC.
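          To see what the Mundlak approach does without relying on the exact -xthybrid- syntax (install it with ssc install xthybrid and see help xthybrid for that), it can be sketched by hand: add the country mean of each time-varying regressor to the random effects model. According to the variable list in #1, only logexcessmortality varies over time, so only its mean is added. This is an illustrative sketch under that assumption, not the -xthybrid- implementation; insample and mean_logexcess are hypothetical names, and the /// line continuations require running it from a do-file:

          ```stata
          * Flag observations with complete data on all model variables
          generate byte insample = !missing(people_fully_vaccinated_per_hund, ///
              Individualism, logexcessmortality, loggdppercapita, ///
              logpopulation, Governmenttrust, median_age)

          * Country mean of the time-varying regressor (the Mundlak term)
          bysort scountry: egen double mean_logexcess = mean(logexcessmortality) if insample

          * Random effects with the Mundlak term and country-clustered standard errors
          xtreg people_fully_vaccinated_per_hund Individualism logexcessmortality ///
              mean_logexcess loggdppercapita logpopulation Governmenttrust median_age ///
              if insample, re vce(cluster scountry)

          * Robust analogue of the Hausman test: H0 is that the Mundlak term is zero
          test mean_logexcess
          ```

          In this specification the coefficient on logexcessmortality should be close to the fixed-effects (within) estimate, while the coefficient on Individualism remains estimable. Its interpretation still rests on the time-invariant regressors being uncorrelated with the country effects.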



          • #6
            Originally posted by Clyde Schechter (#5):
            Thank you for your response! I was indeed experiencing issues with a fixed-effects model due to my cultural variable Individualism. Your recommendation is very helpful!



            • #7
              Focusing just on this part of the question:

              Originally posted by Iris Voncken (#1):
              I'm working on my master's thesis and am looking at the effects of culture (individualism vs. collectivism) on vaccination acceptance across European countries. I am using panel data and a random effects model, and the data range from February 2020 to February 2022. This is because I use both static (cultural) and dynamic (COVID-19 indicator) variables in my research. My dependent variable is "fully vaccinated people per hundred". Judging from its histogram and quantile plot, it seems to be heavy-tailed.
              Even leaving the panel structure aside, the DV is a proportion: you are effectively scaling it in percentage points, but it is still bounded between 0 and 100. In a non-panel setting, this would lend itself to methods such as fractional regression (fracreg).
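              As a rough sketch of the fractional regression idea, assuming the variable names from the do-file in #1: fracreg expects a dependent variable in [0, 1], so the per-hundred figure is rescaled first. This is a pooled model that ignores the panel structure apart from clustering the standard errors by country; vacc_share is a hypothetical name, and the /// line continuation requires running it from a do-file:

              ```stata
              * Rescale "per hundred" to a proportion in [0, 1]
              generate double vacc_share = people_fully_vaccinated_per_hund / 100
              assert inrange(vacc_share, 0, 1) if !missing(vacc_share)

              * Pooled fractional logit with country-clustered standard errors
              fracreg logit vacc_share Individualism logexcessmortality loggdppercapita ///
                  logpopulation Governmenttrust median_age, vce(cluster scountry)

              * Average marginal effect of Individualism on the vaccinated share
              margins, dydx(Individualism)
              ```

              The marginal effect is expressed on the proportion scale, so multiplying it by 100 gives percentage points.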
              Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

              When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.
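              For illustration, a dataex call for this thread might look like the following. The variable selection and the 20-observation range are only assumptions about what would be useful to share:

              ```stata
              * dataex is bundled with recent Stata releases; otherwise install it once:
              * ssc install dataex

              * Post the first 20 observations of the key variables
              dataex country sdate people_fully_vaccinated_per_hund Individualism logexcessmortality in 1/20
              ```

              The output can be pasted into a post as-is, since dataex wraps it in code delimiters.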
