  • Regression with first differences and lagged variable

    Dear Statalists,

    I am trying to replicate an empirical paper and therefore I am trying to understand the author's regression. He is using first differences for all variables, a lagged dependent variable as an additional regressor and logarithms for some of the variables.
    So far, I have figured out the following:

    xi: areg lnFIAS_th_USD lngdp lninflation EATR EMTR statutory_corptax, i.year absorb(year) robust

    Firstly, to get the first differences, I am trying to do it as follows:
    generate fdgdp = d.gdp
    Do you think that this is the right way to do it or can anyone help me to get the first differences?

    Secondly, I am trying to set up the lagged dependent variable:
    gen lag1 = FIAS_th_USD[_n-1]
    gen lag2 = FIAS_th_USD[_n-2]

    Can anyone help me how to get the first differences and the lagged dependent variable for my regression?

    Thank you very much in advance for your help!!

    Kind regards,
    Ferdi

  • #2
    There is no need to generate new variables for the differences and the lags. Just use the variables with the corresponding d. and l. operators in the regression model. Whatever you do, don't use the [_n-1], etc. constructs for this because you will get wrong results if there are gaps in your time series. The lag and lead and difference operators are "smart enough" to avoid that pitfall.
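    As a minimal sketch of the gap problem, assuming a made-up panel (the variables id, year and y below are hypothetical, not from your data) in which the year 2002 is missing:

    ```stata
    * Hypothetical toy panel: one unit, with no observation for 2002
    clear
    input id year y
    1 2000 10
    1 2001 12
    1 2003 15
    end
    xtset id year

    * Positional lag: simply takes the previous row, whatever its year
    gen lag_wrong = y[_n-1]

    * Time-series lag: looks for the value at year - 1
    gen lag_right = L.y

    * In 2003, lag_wrong holds the 2001 value (12), while lag_right is
    * missing because there is no 2002 observation to lag from
    list id year y lag_wrong lag_right
    ```

    The same applies to D.y: in 2003 it is missing rather than a spurious two-year change.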

    Also, there is no point in including i.year in the model when you are also absorbing the year variable. All that will happen is that the year indicators from i.year will be collinear with the absorbed year effects, and Stata will omit the year indicators anyway.

    Finally, don't use -xi:-. Use factor variable notation instead. It's easier, and it enables you to then explore your model's results with the -margins- command. -xi- is at this point close to obsolete. There are only a handful of situations in modern Stata where xi is needed or helpful--factor variable notation has almost entirely replaced it.
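    To illustrate, here is a small sketch using the auto dataset that ships with Stata (not your data; the choice of price, mpg and rep78 is only for demonstration):

    ```stata
    * Built-in example dataset, used here purely for illustration
    sysuse auto, clear

    * Old style would be:  xi: regress price mpg i.rep78
    * Factor-variable notation needs no prefix:
    regress price mpg i.rep78, vce(robust)

    * Because rep78 entered as a factor variable, -margins- knows its levels
    * and can report the adjusted prediction at each one
    margins rep78
    ```

    With -xi:-, the indicators are ordinary generated variables, so -margins- cannot tell that they represent levels of a single categorical variable.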

    Do read -help fvvarlist- and -help tsvarlist-.
    Last edited by Clyde Schechter; 22 Jun 2017, 09:53.


    • #3
      Clyde Schechter Thank you very much for your quick answer and your help!

      I tried to run my regression with d. and l., but then I received the error message "time variable not set", r(111). I tried to fix it using tsset, but unfortunately it didn't work.
      Do you maybe have any idea how I could solve this problem or to which particularities I have to pay attention?

      Regarding the use of -xi:-, I was only using it because my advisor said that I should try it this way, but of course I am always open to better ways of doing it. Basically, with my regression I am trying to show that taxes have a negative impact on the amount of MNEs' investments, which is why I am using fixed assets as the dependent variable.

      Thank you very much in advance for your help!


      • #4
        Well, "it didn't work" could mean a lot of different things, even in the restricted context of the -tsset- command. Exactly what happened? Did you get an error message? If so, what did it say? If not, what happened that was different from what you wanted? Show the exact -tsset- command you gave and show exactly how Stata responded. Since details matter, be sure to do this by copy/pasting directly from the Stata Results window or your log file into a code block here in the Forum editor, and do not edit it in any way. (If you don't know about code delimiters, read FAQ #12 for instructions.)


        • #5
          Clyde Schechter Sorry for my imprecise answer!
          I tried it several times and finally I could solve it the following way:
          Code:
          egen double ID = group(BVDID year)
          sort ID
          by ID: gen nr=_n
          tab nr
          drop if nr> 1
          xtset ID year, yearly
          To get my Regression running, I did the following:
          Code:
          generate lnFIAS_th_USD = ln(FIAS_th_USD)
          generate lnGDP = ln(GDP)
          generate lnInflation = ln(Inflation)
          gen dlnFIAS_th_USD = lnFIAS_th_USD-lnFIAS_th_USD[_n-1]
          gen dlnGDP = lnGDP-lnGDP[_n-1]
          gen dlnInflation = lnInflation-lnInflation[_n-1]
          xi: areg dlnFIAS_th_USD dlnGDP dlnInflation EATR EMTR statutory_corptax i.year, absorb(BVDID) robust
          I am still not sure about my regression and about using -areg-.
          Do you think it does make sense or would you regress it another way?
          Thank you very much for your help, I really appreciate it!!


          • #6
            Well, your first block of code is worrisome: you have retained only a single observation for each combination of BVDID and year. I certainly understand why you want that, but you have done it by simply picking one arbitrarily (not even truly at random) to retain. Now, if the observations being dropped are purely duplicates, this is not a problem (though the safe way to ensure that this is all you remove is with -duplicates drop-, not -drop if nr > 1-). If you have different values of some variables in the "surplus" observations, then you are discarding information arbitrarily. If that is the case, you need to reconsider your approach and either select the observations you retain based on their being the correct ones, or somehow combine the observations into one that summarizes the variables appropriately (e.g. mean values of variables, or maxima, or first, or something like that).

            I don't know your data so I can't tell if this is actually a problem or not. But the code you have used doesn't guard against this problem and is not good practice.
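            A rough sketch of a safer workflow, on made-up data (firm, year and assets are hypothetical names, and taking the mean of conflicting records is only an assumption for illustration; what is appropriate depends on your data):

            ```stata
            * Hypothetical toy data: firm A has an exact duplicate in 2000,
            * firm B has two conflicting records in 2000
            clear
            input str4 firm year assets
            "A" 2000 10
            "A" 2000 10
            "A" 2001 12
            "B" 2000 5
            "B" 2000 7
            end

            * Remove only observations that are identical on every variable
            duplicates drop

            * See whether any firm-year still occurs more than once, and inspect it
            duplicates report firm year
            duplicates tag firm year, generate(dup)
            list if dup > 0

            * One possible way to combine conflicting records (an assumption here)
            collapse (mean) assets, by(firm year)

            * Stops with an error unless firm and year uniquely identify observations
            isid firm year
            ```

            The -isid- check at the end is what guards against silently carrying surplus observations forward.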

            Once you have properly reduced to a single observation per BVDID year pairing, your -xtset- command should be -xtset BVDID year-, not what you have.
            Next, you are still using [_n-1] notation to compute lags; another unsafe practice. If there are no gaps in the time series, it'll be OK. But if there are, you will calculate lags and differences that are incorrect. Also, in this case, because each combination of BVDID year is a single ID, and you have (incorrectly) -xtset ID year-, the sort order of the data within BVDID is not guaranteed to be correct.

            So, after correcting the -xtset- command, I would get rid of all those [_n-1] commands. There is no need to calculate these differences yourself. Then I would run the regression as:

            Code:
            areg d.lnFIAS_th_USD d.lnGDP d.lnInflation EATR EMTR statutory_corptax i.year, absorb(BVDID) robust
            (Again, -xi:- is, at best, of no help to you here, and it may well get in the way of further analysis. You really should try to pretend that you never heard of -xi- and stop using it. The situations in modern Stata where it is actually needed are quite rare, and this certainly isn't one of them.)



            • #7
              Clyde Schechter Thank you very much for your answer, I deeply appreciate your time and help! Regarding my data, I am using the current ORBIS database.
              I followed your advice and dropped all the duplicates, which worked well. Then I tried to correct my -xtset- command the way you suggested, which led to the following error message:
              Code:
              xtset BVDID year
              string variables not allowed in varlist;
              BVDID is a string variable
              r(109);
              I got this error message because my original BVDID is in string format (e.g. AE0000346220). I tried to change it using -encode- and -recast- but couldn't figure out a way to do it.

              I used -xtset ID year- instead and then tried to run my regression with your command:
              Code:
              areg d.lnFIAS_th_USD d.lnGDP d.lnInflation EATR EMTR statutory_corptax i.year, absorb(BVDID) robust
              no observations
              r(2000);
              The regression is working when I run it the following way:
              Code:
              areg lnFIAS_th_USD lnGDP lnInflation EATR EMTR statutory_corptax i.year, absorb(BVDID) robust
              I tried to generate first differences for the three variables but only missing values were generated.
              Code:
              gen dlnGDP = d.lnGDP
              (25,720,760 missing values generated)
              Do you have any suggestions how I could solve this problem?
              Thank you very much in advance for your help!


              • #8
                "I got the error message since my original BVDID is in string format (e.g. AE0000346220). I tried to change it using -encode- and -recast- but couldn't figure out a way to do it."
                Well, -recast- isn't suitable for this purpose. It's surprising that -encode- didn't work (unless there are more than 65,000 different values of BVDID, in which case you would be exceeding its limits). I wonder what your actual -encode- command looked like; perhaps it had an error. But be that as it may, you can handle this with
                Code:
                egen ID = group(BVDID)
                and then you will be able to successfully
                Code:
                xtset ID year
                and your regression will run when you do it as I have suggested. Notice that the variable year is not mentioned in the -egen ID = group(...)- command. This is crucial.

                You need to think carefully about your code and what it does. When you run -egen ID = group(BVDID year)-, your variable ID identifies combinations of BVDID and year. But, having removed the duplicates as you needed to, you now have only one observation for each combination of BVDID and year. So, your version of ID just identifies one observation with each value of ID. When you then -xtset ID year- (with your version of ID), each panel is just a single observation. So then when you try to apply the d. operator, there is nothing for it to work with: you can't calculate a difference unless you have two things--and you only have one. Consequently, when you code -gen dlnGDP = d.lnGDP- you come up with all missing values. And for exactly the same reason, when you run -areg d.lnFIAS_th_USD...- there are no observations: because all of those d.* terms are missing values.

                If you do it as I have suggested here, you will see that you get a distinct numeric value of ID for each value of BVDID. ID is, in fact, just an arbitrary numeric coding of BVDID. So now you will have multiple observations for each value of ID--in fact as many as there are years. And the difference operator has something to work with.
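                Here is a small sketch of the difference, on made-up data (firm, year and y are hypothetical stand-ins for BVDID, year and one of your variables):

                ```stata
                * Hypothetical toy panel: two firms, two years each
                clear
                input str4 firm year y
                "A" 2000 10
                "A" 2001 12
                "B" 2000 5
                "B" 2001 9
                end

                * Grouping on firm AND year gives every observation its own ID,
                * so each "panel" contains a single observation
                egen id_wrong = group(firm year)
                xtset id_wrong year
                gen d_wrong = D.y

                * Grouping on firm alone gives one ID per firm
                egen id = group(firm)
                xtset id year
                gen d_right = D.y

                * d_wrong is missing everywhere; d_right is missing only in each
                * firm's first year and holds the year-on-year change otherwise
                list firm year y id_wrong d_wrong id d_right
                ```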


                • #9
                  Clyde Schechter Thank you very much again for your help!

                  My initial encode command looked like this:
                  Code:
                  encode BVDID, gen(ID)
                  too many values
                  r(134);
                  So probably there are more than 65,000 different values of the variable.

                  I followed your advice and did the following:
                  Code:
                  egen ID = group(BVDID)
                  (2260 missing values generated)
                  sort ID
                  by ID: gen nr=_n
                  duplicates drop
                  Duplicates in terms of all variables
                  (0 observations are duplicates)
                  drop if nr > 1
                  (22,000,108 observations deleted)
                  xtset ID year
                  panel variable:  ID (unbalanced)
                  time variable:  year, 1978 to 2017
                  delta:  1 unit
                  gen dlnGDP = d.lnGDP
                  (4,223,979 missing values generated)
                  Unfortunately, I still get only missing values generated. Do I have to sort the variables before generating the first difference?
                  Thank you very much in advance for your help!


                  • #10
                    Ferdi, once again, you are getting all missing values because you are deleting all but one observation per ID, so there is nothing for d. to take differences of. You are using snippets of code from the thread and interspersing them among your own commands, but you need to think through what they do and how they interact with each other.

                    The two lines
                    Code:
                    by ID: gen nr = _n
                    drop if nr > 1
                    remove all but one observation per ID. GET RID OF THOSE TWO LINES OF CODE. Then you will have multiple observations per ID and there will be something to calculate differences on. The rest looks OK.

                    No, you do not have to explicitly -sort- the data because -xtset- does that when it runs.
