
  • Nested foreach and forvalue for fixed effect analysis (using xtreg), with saving outputs


    I am completely new to Statalist, to Stata and to "decent" econometrics in general, so I apologise in advance for my ignorance. I couldn't find the answer to what I am struggling with, even though I did find some threads that are related in a way. I believe that the following is closest to my question:

    What I am doing is a natural experiment on changes in the proximity of firms to banks/branches, analysing whether it has influenced their business outcomes. I have an unbalanced panel of roughly 15,000 firms over 9 years (2010 to 2018). It is paired with an unbalanced panel of roughly 470 branches, but that part only mattered for calculating distances (getting the values of the independent variables). My main variables of interest, as “measures of proximity”, are: (i) distance to the closest branch in km, (ii) distance to the closest branch in minutes, (iii) number of branches within a 10-km radius, (iv) number of branches within a 30-min travel distance, (v) number of branches within a 5-km radius, (vi) number of branches within a 15-min travel distance, etc. In total, I have 8 such variables, 2 of which are floats and 6 integers. I will refer to those as x_1, x_2, ..., x_8. Since I want to do a multiple regression analysis, I have squared each of those, so I technically have 16 independent variables making 8 pairs, and in each unique regression I want to regress my dependent variable on 1 pair (e.g. x_1, sq_x_1).

    I have 20 dependent variables of interest, each of which indicates what I could potentially observe as a firm’s outcome (revenues, profits, number of effective employees…). Each of those is a float. I will refer to those as: y_1, y_2, …, y_20.

    In terms of fixed effects, I am still not sure how many of those I will have, but I know that some will be “simple” and some “multiple”. Say I have 5 “simple” fixed effects (like year FE, municipality FE): fe1_1, fe1_2, …, fe1_5, and 3 “multiple” ones (such as year-industry FE) denoted as fe2_1, fe2_2, fe2_3. So in my regressions, three cases are possible:
    1. To have one or a few “simple” fixed effects
    2. To have one or a few “multiple” fixed effects
    3. To have a combination of one or a few “simple” and “multiple” fixed effects
    I want to run all regressions both as level-level and as log-log, which doubles the number of regressions. For the sake of simplicity, I will continue as if I am only doing one of those, because I assume that they are analogous to each other.

    Since this is the first time ever that I am doing panel data analysis, I would really appreciate any comment both on the syntax and on the way I use it (possible options, etc.). This is how I think I should start:

    global id id_firm
    global t year
    global ylist y_1 y_2 y_3 … y_20
    global xlist x_1 sq_x_1 … x_8 sq_x_8
    global felist .fe1_1 … fe1_5 .fe2_1 .fe2_2 .fe2_3
    sort $id $t
    xtset $id $t
    xtsum $id $t $ylist $xlist
    Then I would like to figure out a way to loop over regressions while swapping in different independent, dependent and fixed effect variables. I do not have even a vague idea how to handle different combinations of fixed effects, but I would be satisfied if I could make it work using one fixed effect variable from $felist at a time; the rest I could sort out by copying and pasting, if nothing else. This is what I have got so far, but it is not working:

    foreach y of $ylist {
       foreach fe of $felist {
           forvalues i=1(2)16 {
              forvalues j=1/2 {
              local v`j' : word `i' of `$xlist'
              local ++i
    xtreg `y' `v1' `v2' `fe', fe
    * or
    * xtreg `y' `v1' `v2' `fe', fe vce(robust)?
    * For saving output, could it be something like this:
    outreg2 sum using file, tex append
    * I would prefer my output both in tex and dta format

    I would be grateful if anyone could help me make these loops work and figure out a way to store the output. I would prefer storing only individual coefficients on the fixed effects.

    The final question I have is whether I should run the Hausman test only once for my whole panel, or in each regression. Looking at this syntax, and by my understanding of it, it is enough to do it once for the dataset:

    quietly xtreg $ylist $xlist, fe
    estimates store fixed
    quietly xtreg $ylist $xlist, re
    estimates store random
    hausman fixed random
    I hope no one will mind such a long post, and various questions that I asked.


    Last edited by Jelica Rastoka; 17 May 2019, 20:08.

  • #2
    Welcome to Statalist, and to Stata.

    It appears to me that you are trying to run every possible model of your data at once, having yet to gain the experience you would by exploring your data on a smaller scale.

    I notice right away from the formulation of your models that you write

    Since I want to do a multiple regression analysis, I have squared each of those
    Squaring your independent variables does not follow as an implication of doing "multiple regression analysis", although your wording suggests it does. And something similar holds for using a log-log formulation as an alternative to level-level, or for that matter, to log-level. In each case underlying theory helps to justify these choices. Back in the day, "data mining" was a pejorative term for throwing everything at the wall and seeing what sticks. The problem is that doing so, and then choosing the most favorable results to report, eliminates the meaning of the statistical tests you choose to report.

    Beyond that, you seem to be unaware of Stata's factor variable notation, so that you can include x_1 and (x_1)^2 in your model by including c.x_1##c.x_1 as an independent variable. Since that notation is required to inform postestimation commands of the relationship between the two terms x_1 and (x_1)^2, it further suggests to me, along with your interest in running every possible model with little intervening analysis, that you are unaware of Stata's rich offerings of postestimation commands, such as margins, to help you understand your regression results.
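As a minimal sketch of the factor variable notation and margins mentioned above: the example below uses the nlswork dataset that Stata distributes through webuse, purely as a stand-in for the firm panel. The variables ln_wage and tenure, the clustered standard errors, and the grid of tenure values are illustrative assumptions, not choices taken from this thread.

```stata
* Stand-in panel shipped with Stata (requires internet access for webuse)
webuse nlswork, clear
xtset idcode year

* c.tenure##c.tenure expands to tenure and tenure squared;
* i.year adds year dummies
xtreg ln_wage c.tenure##c.tenure i.year, fe vce(cluster idcode)

* Average marginal effect of tenure; because of the ## notation,
* margins knows the squared term moves together with tenure
margins, dydx(tenure)

* Predictions over an arbitrary grid of tenure values, then a plot
margins, at(tenure = (0(5)20))
marginsplot
```

Had the square been created by hand (as with sq_x_1), margins would treat it as an unrelated variable and the marginal effect of the linear term alone would be reported.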

    My suggestion is that you start much simpler. Pick one of your outcome measures. Try to build the best model you can. See what happens when some remote firms have 0 branches of a particular bank within a 10km radius and you formulate a log-log model. Find the pitfalls in your approach. See what variables seem to matter and which do not. Then model another outcome measure and see to what extent your insights from the first model hold up. Do this manually, not with some sort of automation that directs you in a predetermined direction regardless of the results.
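The zero-branch pitfall in a log-log model can be seen on a handful of made-up observations. In this sketch the variable name n_branch_10km and its values are hypothetical.

```stata
* Toy data: number of branches within 10 km, including firms with none
clear
input n_branch_10km
0
0
1
3
7
end

* How many firms have no branch within the radius?
count if n_branch_10km == 0

* ln(0) is undefined, so Stata sets the log to missing for those firms
generate ln_branch = ln(n_branch_10km)
count if missing(ln_branch)
list
```

Any regression that uses ln_branch silently drops the observations where it is missing, so the log-log sample can differ from the level-level one.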

    I apologize for addressing questions other than those you asked about automating the process you have chosen to follow.

    PS - If I start with y = a + bx + cx^2 and then replace y and x with log(y) and log(x), it is not at all clear to me how I interpret the coefficient on (log(x))^2.


    • #3
      Hi William,

      Thank you very much for your reply. Everything you mentioned is very useful, so thanks a lot.

      As for "multiple regression analysis", I didn't mean that squaring implies multiple regression. I just wanted to say that I am not regressing the y-variable on one x-variable at a time, but on 2 x-variables at a time. I considered this relevant because I thought I had to loop over the independent variables in my x-list two at a time. Saying that the second independent variable is the square of the first was just an extra piece of information. Similarly, in mentioning log-log and level-level, I was trying to be as precise as possible in explaining what kind of data I have. In log-log, I would regress log(y) on log(x_1) and log(x_2); I wasn't thinking of using something like (log(x))^2. Sorry about the confusion caused by my terrible explanation.

      When you wrote:
      ...that you are unaware of Stata's rich offerings of postestimation commands, such as margins, to help you understand your regression results..
      You were absolutely right! I am completely unaware of the commands that you mentioned. The same goes for trying to do everything possible: I did so because I don't have a clue how to do what in Stata. There are so many options that I find it really difficult to decide what exactly I have to use in a specific case, and then what else is good to use and why. Could you please refer me to some further explanation of using postestimation commands?

      Further to your comments, I actually started with just a few variables (2 y, 2 x, 2 fe), but I thought I should then loop over “everything” in order to get a clue about the direction in which to continue (I thought it could give a clearer idea of what exactly my specification should look like). As you suggested, it wasn't a good approach. Namely, I didn't understand how loops work. So even though I was trying to regress "everything on everything", I didn't want to do it "in every possible way", and that is what the approach I used ended up doing. (I apologise for these layman's expressions.)

      Getting back to the code that I was struggling with: a friend of mine suggested avoiding the confusion of combining many foreach loops with forvalues. Instead of using forvalues for pairing, he suggested "decomposing" the x-list into two separate lists. It worked perfectly (including outreg2). But then I realised that too many loops do no good, since I get a bunch of muddled results, so I finally gave up on that and did it in a simpler way, with some more "copying and pasting", but focusing on producing only the results that I need or want and avoiding everything redundant.

      Again, many thanks for these specific comments, which are very useful for confirming whether I am doing anything right, as well as for the suggestions on changing my approach in order to make my work more efficient.
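The working code was not posted, so the following is only a sketch of how the "two separate lists" idea might look. It runs on a simulated toy panel (all values made up) with two outcomes and two proximity measures, and it uses estimates store and estimates table in place of the user-written outreg2. The year dummies and firm-clustered standard errors are assumptions for illustration.

```stata
* Simulated toy panel standing in for the real data: 100 firms, 2010-2018
clear
set seed 2019
set obs 900
generate id_firm = ceil(_n/9)
bysort id_firm: generate year = 2009 + _n
forvalues k = 1/2 {
    generate x_`k'    = runiform()*10
    generate sq_x_`k' = x_`k'^2
    generate y_`k'    = 1 + 0.5*x_`k' - 0.02*sq_x_`k' + rnormal()
}
xtset id_firm year

* Two parallel lists: the k-th word of xsq is the square of the k-th word of xlin
local ylist y_1 y_2
local xlin  x_1 x_2
local xsq   sq_x_1 sq_x_2
local n : word count `xlin'

foreach y of local ylist {
    forvalues k = 1/`n' {
        * Pick the k-th variable and its square from the two lists
        local v1 : word `k' of `xlin'
        local v2 : word `k' of `xsq'
        quietly xtreg `y' `v1' `v2' i.year, fe vce(cluster id_firm)
        estimates store m_`y'_`k'
    }
}

* Compare only the coefficients of interest across the stored models
estimates table m_*, keep(x_1 sq_x_1 x_2 sq_x_2) b se
```

Pairing by position keeps each variable with its own square, which is what the original forvalues i=1(2)16 construction was trying to achieve. Because the locals are defined in the same do-file, the whole block has to be run in one go.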



      • #4
        Chapter 20 of the Stata User's Guide PDF, included in your Stata installation and accessible from Stata's Help menu, is the basic introduction to the common features of Stata's estimation and postestimation commands. Then for any estimation command - xtreg in your case - help xtreg postestimation gives the specifics of the postestimation commands applicable after xtreg.
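As a small illustration of where that help file leads, the sketch below fits a fixed-effects model on the nlswork example dataset shipped with Stata and then tries a few postestimation commands. The dataset and the regressors age and tenure are chosen only for illustration.

```stata
* Open the list of postestimation commands available after xtreg
help xtreg postestimation

* Fit a simple fixed-effects model on a Stata example dataset
webuse nlswork, clear
xtset idcode year
xtreg ln_wage age tenure, fe

* A few of the tools described in that help file
test age tenure                // joint Wald test of the two coefficients
predict double xb_hat, xb      // linear prediction
predict double u_hat, u        // estimated fixed effect u_i
```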

        When I began using Stata in a serious way, I started, as have others here, by reading my way through the Getting Started with Stata manual relevant to my setup. Chapter 18 then gives suggested further reading, much of which is in the Stata User's Guide, and I worked my way through much of that reading as well. There are a lot of examples to copy and paste into Stata's do-file editor to run yourself, and better yet, to experiment with changing the options to see how the results change.

        All of these manuals are included as PDFs in the Stata installation (since version 11) and are accessible from within Stata - for example, through the PDF Documentation section of Stata's Help menu. The objective in doing the reading was not so much to master Stata as to be sure I'd become familiar with a wide variety of important basic techniques, so that when the time came that I needed them, I might recall their existence, if not the full syntax, and know how to find out more about them in the help files and PDF manuals.

        Stata supplies exceptionally good documentation that amply repays the time spent studying it - there's just a lot of it. The path I followed surfaces the things you need to know to get started in a hurry and to work effectively.