  • Panel Data - Help sought in steps to perform and the order of the steps

    Dear everyone,

    I have (based on the literature) chosen a model for my thesis that is actually slightly over my head. I know the basics of Stata (OLS and so on), but never had any lectures on the more advanced models. Unfortunately, both my promoters are of the kind: "everything more advanced than OLS is scary..." I have checked various websites and read various books and PowerPoints I found online, but it's simply too much to comprehend at one time. I condensed everything into the information below, and I would kindly ask you: is my reasoning correct? Am I missing something? Is the order of steps correct? Thanks a million in advance for your reply/replies, and my apologies for the long post, but I wanted to be as complete as possible...

    Situation: I want to test if digital (il)literacy has an effect on corruption.

    Dataset: 142 countries, 13 years -> +/- 30000 observations (but it's unbalanced, i.e. not every country has a value for a specific variable for a specific year; I have 810 missing values, which I coded as -9999, and then set -9999 as missing)
    Main model: 1 Dependent Var, 3 Independent Vars, 5 control vars, 2 moderating vars.
    *a For one independent var I am not sure whether it is independent; it is possibly a mediating variable, and I want to test this as well.
    *b Moreover, I also want to test whether region or subregion has an effect (or whether countries in a (sub)region are homogeneous or heterogeneous), so I have added the possibility in my dataset to group countries by region or by subregion.

    So my thoughts are as follows, and that I perform this in this order as well:
    1) The DV is a rank, which is not normally distributed but, of course, uniformly distributed. Do I therefore need to transform the rank to a normal distribution? I found that I can do this by converting the rank with the Stata command:
    Code:
    generate zscore = invnorm(pctrank/100)
    or
    Code:
    generate nce = invnorm(pctrank/100)*21.06 + 50
    ?

    2) I want to test whether to use fixed effects or random effects, so I use the Hausman test? I would do that by following the commands provided in http://www.stata.com/manuals13/rhausman.pdf

    3) Then, I test whether to use a dynamic model or a static model (i.e. whether or not to include lagged values of the IV). I would create the lagged IV by
    Code:
    sort Country Year
    xtset Country Year
    gen iv_lag = L1.iv
    Right? And then a 'stupid question': what would be my decision criterion?

    4) I perform the
    Code:
    xtmixed
    command for a mixed model, and test either the fixed effect or random effects model?

    Is this correct so far? what about the order? what am I missing?
    ----

    Then a few additional questions which I haven't figured out yet:
    *a: For one independent variable, I am not sure whether it's independent or whether it mediates the other independent variables. For simple OLS regression, I would do a Baron & Kenny test, and then a Sobel test to confirm. How would this work for panel data? I found http://www.ats.ucla.edu/stat/stata/f...mediation2.htm for multilevel data, which I have, but it doesn't look like it's for panel data?

    *b: To test whether to leave countries ungrouped, or to group them by either region or subregion, I would make 3 models: 1 'flat', 1 multilevel grouped by region and 1 multilevel grouped by subregion.
    Is that correct? And where in the order of steps would this fit in?
    ----

    Then some more additional questions:
    a) I want to reduce the model to be as simple as possible, but I am in doubt about how. Would I start from 1 DV and 1 IV and slowly build up my model, performing all tests again for every extra variable I include? Or would I go the other way around, starting with the most complete model and deleting variable by variable?
    On top of that, is there a single command for it? Or do I do it manually, variable after variable?

    b) To test the moderating variables: in simple OLS regression, I would add an interaction term, and compare the p-values and VIF values. Would it be just as simple for panel data?

    c) On the internet there is much ambiguity and lack of clarity as to the assumptions I have to check. Do I check, just as for OLS, for outliers, heteroskedasticity, normality of error terms, multicollinearity, and independence? Or do I check other assumptions, and if so, which ones?

    d) There are other measurements of corruption, so I have other DVs at hand. I would want to use the other DVs to confirm my findings, and thus will need to test for robustness. Any suggestions on how to perform this?

    e) In the end, I would like to say something about 'Granger causality' and would want to test for vector autoregression. For that, I would want to follow http://paneldataconference2015.ceu.h...ael-Abrigo.pdf, which is quite a lengthy process. Is there a quick way to do it? Or shall I keep it as in this paper?

    ---

    And finally: am I complete now? Am I missing something? Is the order correct? Any other useful feedback?

    Thank you in advance for your feedback and reply,
    Trebor

  • #2
    Nothing stops partial replies, but being complete doesn't necessarily help motivate replies.

    This looks like guidance for an entire project. If you don't get more response, consider splitting this. (Don't you have a supervisor/advisor/mentor/committee who can help?)

    Concrete, specific, clear Stata questions usually get attention here.

    Essentially statistical questions often get answered, but there is more caprice about that.

    • #3
      You have a lot of questions, and I don't feel able to answer them all right now. In general, I agree that you should not simply avoid models more complex than OLS, because they are not that scary. And the mixed model command is basically going to produce coefficients that you just interpret as if they were OLS coefficients. However, you also need to know what you're getting into when you go into more advanced models.

      1) If you want to consider a lagged value of the dependent variable, you should a) not use xtmixed and b) think about why you want a lagged DV. Some info below, but the command that you might want to consider instead is xtabond. If you put a lagged DV into a regular mixed model, you won't be able to estimate your coefficients consistently (i.e. the estimates stay biased even in large samples).

      http://blog.stata.com/2015/11/12/xtabond-cheat-sheet/

      http://statisticalhorizons.com/lagge...dent-variables
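
      As a minimal sketch of what an xtabond call could look like (not a recommendation of this specification): the variable names corruption, diglit, gdp and country_id below are hypothetical placeholders, and country_id is assumed to be a numeric country identifier, e.g. one created with encode.

      ```stata
      * Hypothetical variable names; replace them with your own.
      * xtset needs a numeric panel identifier (e.g. created with encode).
      xtset country_id Year

      * Arellano-Bond estimator with one lag of the dependent variable.
      xtabond corruption diglit gdp, lags(1) vce(robust)

      * Test for serial correlation in the first-differenced errors.
      estat abond
      ```

      Whether one lag is appropriate, and whether the instruments are valid, are exactly the things the reading linked above should inform.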

      Basically, in xtmixed, each country gets its own random intercept, i.e. its own, country-specific average level of corruption. That's how you handle the fact that you have repeated measures on each country in that framework. Each year can be considered to be a deviation from that country-specific average. You can further handle autocorrelation by imposing a structure on the residuals (e.g. you can say that the year-specific residuals for each country have an autoregressive 1 structure, or an exchangeable one, or whatever; there are a few options).
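
      A rough sketch of that setup, with a random intercept per country and AR(1) residuals within country, might look like this. The variable names (corruption, diglit, gdp, country_id) are hypothetical, and in Stata 13 and later the xtmixed command is called mixed.

      ```stata
      * Hypothetical variable names; replace them with your own.
      * Random intercept for each country, with AR(1) residuals
      * ordered by Year within each country.
      mixed corruption diglit gdp || country_id: , residuals(ar 1, t(Year))
      ```

      Dropping the residuals() option gives the plain random-intercept model, which is a reasonable baseline to compare against.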

      If you have a theoretical basis for which to use a dynamic panel model, then you need to go do reading on that subject. I have absolutely no exposure to those at all, and hopefully someone else can offer better insight.

      2) You aren't actually interested in whether or not the DV has a normal distribution. You want the errors/residuals to have a normal distribution (or close enough to one), i.e. if you take each country's actual rank for corruption in each year, and you subtract the predicted value of the rank, that's your error term, and you want those to be normally distributed. The things you throw into your regression model should predict all or most of the systematic variation in corruption, leaving just random noise. Expecting the DV itself to be normal is a common misconception, and indeed I didn't quite get this until not long ago.

      But, you're right, you have ranked data, and you want a principled method to handle that. I haven't come across this sort of data before, so I'm not sure! The transformation you suggested would probably be acceptable, and it looks like it came from the UCLA website. However, I think you mentioned you have 142 countries, so I believe you'd need to calculate the z-score by dividing by 142.

      http://www.ats.ucla.edu/stat/stata/faq/prank.htm
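
      A sketch of the rescaled transformation, under some assumptions of mine: corr_rank is a hypothetical variable holding the rank within each year, and I subtract 0.5 before dividing, because dividing the last rank by the number of countries gives exactly 1, where the inverse normal is undefined. That adjustment is not from the UCLA page. The count is taken per year because the panel is unbalanced.

      ```stata
      * Hypothetical variable names: corr_rank (1..N within each year), Year.
      * Count the countries actually ranked in each year.
      bysort Year: egen n_ranked = count(corr_rank)

      * Map ranks to probabilities strictly inside (0, 1), then to z-scores.
      generate p_rank = (corr_rank - 0.5) / n_ranked
      generate zscore = invnormal(p_rank)
      ```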

      Hopefully someone has a better recommendation, though. Another thought is that your score probably didn't come from a panel of experts sitting down and sorting the 142 countries into rank order by consensus. There was probably some sort of continuous score underlying the ranking. You may be able to get your hands on the actual scores, and I think that would be preferable.

      3) If you are going to do a Hausman test for fixed vs random effects, I think the more logical command to start with is xtreg. You can ask xtreg to use a random or a fixed effect for each panel, which in your case is a country. You can then do a Hausman test. But again, I do not believe xtreg is usable with lagged DVs.
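
      For what it's worth, the usual sequence looks roughly like the following. The variable names are hypothetical, and country_id is assumed to be a numeric country identifier.

      ```stata
      * Hypothetical variable names; replace them with your own.
      xtset country_id Year

      * Fixed-effects model, stored for the test.
      xtreg corruption diglit gdp, fe
      estimates store fe

      * Random-effects model with the same regressors.
      xtreg corruption diglit gdp, re
      estimates store re

      * Hausman test: the consistent estimator (fe) goes first.
      hausman fe re
      ```

      A small p-value is usually read as evidence against the random-effects assumptions, which points toward fixed effects.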
      Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

      When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.

      • #4
        Originally posted by Nick Cox
        This looks like guidance for an entire project. If you don't get more response, consider splitting this. (Don't you have a supervisor/advisor/mentor/committee who can help?)
        Thank you for your reply. It's indeed guidance for an entire project. Like I said, I am new to this and it's a little over my head, as I have only had experience with OLS, and my supervisor and promoter are both 'scared' of everything beyond OLS/logistic/factor analysis. I thought that being complete might help me get more answers, as it showed that I did *some* research before hopelessly asking you all... My idea was also that it would be more readable in one topic, instead of various split topics...

        Originally posted by Weiwen Ng
        If you have a theoretical basis for which to use a dynamic panel model, then you need to go do reading on that subject. I have absolutely no exposure to those at all, and hopefully someone else can offer better insight.
        Thank you for your elaborate comments. To answer this part: There is no theoretical basis, although a variety of methods have been used in the past. I just thought that maybe it would be good to test whether or not to use dynamic modelling, as no one has done it before, and therefore, it might really add something.

        Originally posted by Weiwen Ng
        it looks like it came from the UCLA website. However, I think you mentioned you have 142 countries, so I believe you'd need to calculate the z-score by dividing by 142.
        You are absolutely right! I took that part from the UCLA website, and forgot to change 100 to 142 after the copy/paste.

        Originally posted by Weiwen Ng
        There was probably some sort of continuous score underlying the ranking. You may be able to get your hands on the actual scores, and I think that would be preferable.
        Indeed, there are actual scores, which go from -2.5 to 2.5, but then again, every value would only come up once (i.e. one time -1.456, one time -1.457, etc.). Would it not give the same problem?

        Thank you for all the other comments!

        Trebor

        • #5
          Regarding dynamic modeling, just because nobody has done it before, doesn't mean that someone should go do it now. It may add something, it may not. You want to be familiar with the theoretical basis for dynamic modeling before proposing it. It may also help to talk to someone who has actually used it. Are there any economics faculty at your university who have?

          Regarding the scoring issue. Say you had a continuous outcome, perhaps blood pressure reported in millimeters of mercury to 10 decimal places. Imagine that yes, the blood pressure machine was really that good. You can expect that any two people would probably not have the exact same value of mmHg. But, say the mmHg readings were normally distributed. You can see that this would not be a problem for an OLS regression model, right?

          If the scores are really skewed, then you certainly might want to think about how you'd handle them. But again, ultimately, you want the residuals, not the raw data, to be normally distributed. Now, in OLS regression, we are trying to fit this model:

          Y = XB + e

          Where X is the matrix of independent variables for each unit in the sample, B is the vector of betas, and XB is their product. e is an error term, and we ideally want it to be normally distributed with mean 0, i.e. you have predicted all the systematic influences on Y, thus leaving random noise. When Y is normally distributed to begin with, then the way I see it, our job is probably simpler. But again, it's ultimately the distribution of e that matters for OLS. And even if e isn't quite normally distributed, your inferences about the mean effect of one of the Xs may still be usable.
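
          To make that concrete, here is a sketch of how you might look at the residuals rather than at the raw DV. The variable names are hypothetical, and I use a fixed-effects xtreg only as an example model.

          ```stata
          * Hypothetical variable names; replace them with your own.
          xtset country_id Year
          xtreg corruption diglit gdp, fe

          * Idiosyncratic residuals (the e term) from the fitted model.
          predict e_hat, e

          * Visual checks against the normal distribution.
          histogram e_hat, normal
          qnorm e_hat
          ```

          If the histogram and the quantile plot look roughly normal, the distribution of the raw scores is much less of a concern.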

          That said, you are more likely to get help with discrete programming questions here. The Statalist commenters come from many disciplines, and we all have our own day jobs, so if you ask a very broad and general question, you may or may not get a response. Help with the entire project is probably not going to get traction.
          Last edited by Weiwen Ng; 14 Feb 2017, 20:09.