  • xtset: how to use both year and month as time variable

    Hi All,

    I am trying to run a panel data regression using the xtset command but am facing 2 problems. I have 5 years of monthly data, hence my dataset includes the variables 'Year' and 'Month'.

    This is what I used:

    xtset Company Year Month


    1. May I know if there is a way to use xtset with both the year and month variables? From the manual, it seems like xtset only allows 1 panel id and 1 time variable.

    I thought of combining the year and month variables into 1 variable e.g. Date but I was hoping to control for both year and month fixed effects.


    Because of problem 1, I changed the command as follows:

    xtset Company Date


    2. My panel variable is a string variable which contains both numerics and letters.

    When I tried to use that as my panel variable, Stata gives an error message "string variables not allowed".

    varlist: CompanyName: string variable not allowed


    I tried to destring it by using "destring Company, generate(company)"

    Another error occurred: "Company contains nonnumeric characters; no generate"

    A similar error message occurs if I try to use "destring Company, replace"


    Could someone kindly advise me on the 2 problems above? Thank you very much.


  • #2
    So, if you expect that there are cyclical "seasonal" effects associated with month, you can capture that with -xtset company- and specify indicator variables for the calendar months by including the factor variable i.month in your model. If you really want to capture the combination of month and year as a fixed effect, then use your date variable and forget about month and year.

    In any case, bear in mind that when you -xtset panelvar timevar- and then do some kind of -xt- regression, you automatically get fixed effects for panelvar included, but not for timevar.

    With regard to #2, there are a couple of possibilities:

    1. Company is in fact a string variable whose human-readable content consists of things like "AT&T" etc. In that case -destring- is inappropriate. You can either do -egen nCompany = group(Company)- or you can do -encode Company, gen(nCompany)- depending on whether you want to be able to look directly at nCompany and see which companies are which.

    2. Company is in fact supposed to already be a numeric id for companies. Then Stata is telling you that some of your data is not meeting your expectations. To find out which observations are incorrect, try -list Company if missing(real(Company))-. That will give you a list of the observations that do not have proper representations of numbers. Get those fixed, and then you'll be fine with -destring-.
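
    A minimal sketch of the two options in point 1, on made-up data. The company strings and the new variable name gCompany are illustrative assumptions, not taken from the original dataset:

    ```stata
    clear
    input str8 Company
    "AT&T"
    "AT&T"
    "DL"
    "UA"
    end

    * Option A: plain integer codes 1, 2, 3, ... with no labels attached
    egen gCompany = group(Company)

    * Option B: integer codes that carry the original strings as value labels
    encode Company, generate(nCompany)

    list Company gCompany nCompany            // nCompany displays the labels
    list Company gCompany nCompany, nolabel   // underlying numeric codes
    ```

    Either numeric variable can then serve as the panel variable in -xtset-; the -encode- version is usually easier to read in output because the labels are shown.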



    • #3
      Thank you very much for your help Clyde!

      #Problem 1:
      You are right, I am expecting cyclical seasonal effects associated with month and I am trying to capture that effect. I am hoping to capture the fixed effects (FE) of year and month separately, and also as an interaction term, i.e. it is as if I am running a regression with Year, Month and Year*Month as my explanatory variables.

      If I understood you correctly, did you mean that as long as I use -xtset Company- it will automatically capture the Year FE, Month FE and Year*Month FE? Or did you mean that I should use -xtset Company- and at the same time, create dummy variables for each of the years and months? Sorry I did not quite understand what you said about indicator variables and factor variables.



      Problem 2:
      Your first solution works perfectly!

      As for your second solution, I followed your Stata command and fixed those observations by dropping observations which have an empty "Company" cell, i.e. I assumed that means missing values? (though I thought missing values should be represented by a "." rather than just being left empty). However, the same error message occurs.

      But it's fine since your first solution works. Thank you!



      • #4
        The only thing that is automatically represented in a -xt[whatever], fe- model in Stata is the panel variable specified with -xtset-. Anything else you want in there you have to specify explicitly as a variable in the model.
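
        As a rough illustration of that point, here is a sketch on simulated data. All variable names and numbers are invented for the example; only the panel variable is absorbed by -fe-, while the month effects have to be listed explicitly as i.month:

        ```stata
        clear
        set seed 12345
        set obs 5                          // 5 hypothetical companies
        gen nCompany = _n
        expand 24                          // 24 monthly observations each
        bysort nCompany: gen t = _n
        gen month = mod(t - 1, 12) + 1     // calendar month 1..12

        gen x = rnormal()
        gen y = 0.5*x + 0.2*month + nCompany + rnormal()

        xtset nCompany                     // panel variable only, no time variable
        xtreg y x i.month, fe              // company FE implicit, month effects explicit
        ```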

        I don't know quite what you're trying to do in your modeling. If what you have is panel data in which a cohort of Companies is observed once per month, including year, month and yearXmonth effects will give you a saturated model: company, year, and month will uniquely identify all the observations. If you have multiple observations for companies within any given month and year, then you can, indeed, include effects for year, month, and yearXmonth. But, think about whether you want to do this. That has the potential to soak up a lot of degrees of freedom, and "seasonal" monthly effects in most situations do not vary much from year to year. So even if your data permits this specification, think about whether you really have any rationale for thinking that the cyclic seasonal effects actually change from year to year. I don't know what your project is about, so I can't help you. Actually given that this looks like economic or financial data, I probably couldn't help you with that even if I did know what it's about.

        With regard to your string variable Company, -destring- expects a string missing value to be coded as an empty string (i.e. ""). It will also accept values consisting entirely of blanks, but not "." So "." will cause -destring- to complain about non-numeric characters if you have "." as a value. When -destring- encounters "", it will convert it to Stata's system missing, which is displayed as a period when you list it or look in the data browser.

        It sounds like using -encode- has given you what you want, so I'm not sure it's worth your while pursuing the problems with -destring-ing Company. But if the source of your data is such that Company should be a valid number, then it means you have corrupted or erroneous data--and that you should definitely pursue.

        I would start, again, with
        Code:
        list Company if missing(real(Company))
        That will show you the values of Company that are not string representations of numbers, and where they are in the data set. If they look OK to the eye, then it probably means that they include non-printing characters that Stata can see, but which are either not displayed or displayed as whitespace. To ferret out the culprits, you can do:

        Code:
        charlist Company if missing(real(Company))
        return list
        [Note: -charlist- is written by Nick Cox and can be obtained from SSC if you don't already have it installed.]
        The output of the -return list- command will show you the ASCII codes of all the characters that appear in the offending observations of Company. You can then deal with them in some appropriate way (which might involve editing them with Stata string functions, or might entail going back over how your Stata data set was built and finding a bug, or finding that the original source of the data included these erroneous entries.)
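
        For a self-contained look at the first check, here is a toy version that uses only built-in functions. The data values and the flag variable notnum are invented for illustration; the extra condition on "" simply leaves empty strings out of the list:

        ```stata
        clear
        input str6 Company
        "101"
        "1O3"
        "N/A"
        ""
        "104"
        end

        * real() returns missing when the string is not a valid number
        gen byte notnum = missing(real(Company)) & Company != ""

        list Company if notnum     // shows "1O3" (letter O) and "N/A"
        ```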



        • #5
          Hi Clyde,

          Thank you for your explanation about pursuing the problems with -destring-ing Company.

          You're right - it definitely helps to put things in perspective if you know what I'm modelling... Basically, I am trying to find out whether winning an accolade leads to increased demand for an airline. Say, for example, if Alaska Airlines was voted the best airline in 2014, will more passengers switch to Alaska Airlines and lead to an increased demand?

          I have aggregated monthly data for the past 5 years, that is, average passenger traffic flow (Traffic) for each airline (variable name: Company) between a Departure Point (Depart) and Arrival Point (Arrive) and whether it is a direct flight. You mentioned that using Year, Month and Year*Month FE will give a saturated model. Does that still apply if I have many observations per Company within any given Year and Month (since there are many Departure and Arrival airport pairs)? Should I still be using Year, Month and Year*Month FE? In addition, I also plan to use Company FE, Depart FE, Arrive FE, and Depart*Arrive FE.

          How do you think I should run my model? I planned to regress Traffic on Company, Year, Month, Year*Month, Depart, Arrive, Depart*Arrive, "DirectFlight" dummy and "WonAccolade" dummy. I believe the best way to do it is probably by using the -xtreg- command. However, I am confused as to what to put for the "panel variable" and "time variable" under the xtset command.



          I started off by doing -xtset- in the following manner:

          generate Date = Year + Month
          xtset Company Date


          An error message occurred: "repeated time values within panel"



          I then tried simplifying it slightly:

          xtset Company
          xtreg Traffic Company i.Year*i.Month i.Depart*i.Arrive DirectFlight WonAccolade, fe


          Then, an error message occurred: "variable Year*i.Month not found r(111)"


          I then changed it to the following:

          xtset Company
          xi: regress Traffic Company i.Year*i.Month i.Depart*i.Arrive DirectFlight WonAccolade


          But it says "invalid syntax"


          Do you know which of the above commands I should be using and why? What time variable should I use for xtset (if any)? Should I use -xtreg ... fe- or -xi: regress ...-, and what's wrong with my above syntax? What should the correct code be instead?

          Thanks so much for your help!



          • #6
            Since you do have multiple observations for each company within each year month combination, the use of year, month, and yearXmonth variables will not give you a saturated model. But given your problem, I think the use of yearXmonth interactions is inadvisable. As far as I can see, there is no reason to think that the cyclic monthly effects will differ from one year to the next. Probably if I were in your shoes I would model time with c.date and i.month. That will allow for any general upward or downward trend in traffic over time and also capture month-to-month seasonal variation within years. (If seasonal effects are known to change year to year in airline traffic, then disregard my advice on this.)

            Also, I don't know much about the airline industry, but if there were any "shocks" to travel in your five year period, you need to represent those as well. (An example would be the markedly depressed air travel following 9/11/2001 that lasted for a long time; perhaps the financial crisis of 2009 had a similar effect?) If traffic has gone down and then up over the past five years (or up and then down) then consider using a quadratic representation of date: c.date##c.date

            As for how to -xtset- this data, you don't have a time variable that uniquely identifies observations within Company. So just -xtset nCompany-, where nCompany is the numericalized version of Company discussed earlier in the thread. The time variable in -xtset- is optional, and can only be used when it uniquely identifies observations within panel variable.
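
            One way to check whether a candidate time variable uniquely identifies observations within the panel variable is -isid- (or -duplicates report-). The sketch below uses invented data with a hypothetical route variable standing in for the departure/arrival pair:

            ```stata
            clear
            input nCompany year month route
            1 2014 1 1
            1 2014 1 2
            1 2014 2 1
            2 2014 1 1
            end

            * isid exits with an error if the variables do not uniquely identify rows
            capture isid nCompany year month
            display "isid return code: " _rc

            * shows how many company-month combinations are repeated
            duplicates report nCompany year month

            * with repeated months inside a company, declare the panel variable only
            xtset nCompany
            ```

            A nonzero return code from -isid- here corresponds to the situation in which -xtset- with a time variable reports "repeated time values within panel".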

            Another crucial point: -generate date = month + year- is legal syntax, but it will give you gibberish. That is not how Stata represents dates. I'm guessing that you have a variable called year, that is numeric and ranges from 2009 to 2014 (or some other 5 year range), and that you have another numeric variable month that is coded as 1 to 12 to represent Jan through Dec. To create a Stata date from that you need to do:

            Code:
            gen int date = ym(year, month)
            format date %tm
            This is the date variable that I am referring to in the first two paragraphs of this post. If you plan to do this kind of modeling on a regular basis, time spent reading the manual section on date functions will be well spent.
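
            To see concretely why adding the two variables fails, here is a small sketch on three invented rows. The variable name naive is made up for the example:

            ```stata
            clear
            input year month
            2013 2
            2014 1
            2014 2
            end

            gen naive = year + month          // Feb 2013 and Jan 2014 both give 2015
            gen int date = ym(year, month)    // months elapsed since January 1960
            format date %tm

            list year month naive date
            list year month naive date, nolabel
            display ym(2014, 1)               // 648 = 54 years * 12 months after 1960m1
            ```

            The numeric sum maps different months to the same value, whereas ym() gives consecutive integers for consecutive months, which is what c.date relies on.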

            Next there is the question of representing arrival and departure airports. Using arrival, departure, and their interaction terms as you propose will cover it. Only you know whether that is excessively complicated or whether some simpler specification would adequately capture these effects.

            Next, if you are using current Stata (version 13) or version 12 you should not be using xi: any more. For most purposes it has been replaced by the use of factor variables. See the online help and manual sections for those: -help fvvarlist-. The correct use of notation, with # or ##, not *, for interaction terms will eliminate one of the syntax errors you report. [If you are using non-current Stata, you are supposed to say so in your post.]

            So I'm guessing your model will start out as something like this:
            Code:
            xtreg Traffic c.date i.month i.Depart##i.Arrive i.DirectFlight i.WonAccolade, fe
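
            As a quick check of the factor-variable notation on toy data (simulated values, three hypothetical departure and arrival codes), a##b is shorthand for the two main effects plus the a#b interaction, so the two regressions below should fit identically:

            ```stata
            clear
            set seed 101
            set obs 180
            gen Depart = mod(_n, 3) + 1
            gen Arrive = mod(floor(_n/3), 3) + 1
            gen Traffic = 10*Depart + 5*Arrive + 2*Depart*Arrive + rnormal()

            * ## expands to main effects and interaction
            regress Traffic i.Depart##i.Arrive
            scalar r2_full = e(r2)

            * the same model written out term by term with #
            regress Traffic i.Depart i.Arrive i.Depart#i.Arrive
            display r2_full, e(r2)
            ```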



            • #7
              I am using the current Stata. I was not aware -xi:- was for earlier versions.

              Thank you for pointing out the crucial point - would have been crucial indeed: I forgot to mention that I have fixed it by converting year and month into string variables before using -generate date = year + month- to prevent them from adding the two variables numerically. I noticed that your method generates date as int type rather than string type. Does the type matter here?

              I tried with the code you proposed and faced this constraint:

              maxvar too small
              You have attempted to use an interaction with too many levels or attempted to fit a model with too many variables. You need to increase maxvar; it is currently 5000. Use
              set maxvar; see help maxvar.

              If you are using factor variables and included an interaction that has lots of missing cells, either increase maxvar or set emptycells drop to reduce the required matrix
              size; see help set emptycells.

              If you are using factor variables, you might have accidentally treated a continuous variable as a categorical, resulting in lots of categories. Use the c. operator on such
              variables.
              r(907);


              I then tried to overcome it by -set maxvar 32000- and another error message appeared: "no; data in memory would be lost". Is there any other way I can proceed to overcome this?


              I tried setting the panel var using -xtset- as "DepartArrive" (did this by generating DepartArrive after converting Depart and Arrive to long variables) instead of nCompany and it works. However, will the results be the same as what I tried estimating above? Am I changing the estimation in any way unknowingly? Basically, are the results identical for the following:

              1. -xtset nCompany-
              -xtreg Traffic c.date i.month i.Depart##i.Arrive i.DirectFlight i.WonAccolade, fe-
              2. -xtset DepartArrive-
              -xtreg Traffic c.date i.month i.nCompany i.DirectFlight i.WonAccolade, fe-


              By the way, does it matter whether to use date as a continuous variable i.e. does it affect the result if I used i.date instead?

              Also, I get the impression that string variables are not allowed anywhere in -xtset- and -xtreg-. All string variables must be converted. Is that right?

              Thanks so much!



              • #8
                As for the best way to represent date in your model, you need to think that through in terms of what is known about the history of airline traffic. My suggestion to use c.date assumes that the change over time is more or less a linear trend. But if that's not true, and there are erratic but important up and down fluctuations from year to year, then i.date would be a better specification. They are definitely different, and which you should use is a matter of what is going on in the industry, not a statistical question.

                I guess the number of airports and airlines here are such that your model, including all those interactions, is huge. So you need to increase maxvar, as Stata suggested. But in order to do that, you have to first clear out the data from memory. In other words, you need to start over, and set maxvar to a higher value before you load in the data again. You should also consider Stata's other recommendation to -set emptycells drop-. I suspect that there are a lot of Depart and Arrive combinations that simply don't occur--and you could free up a lot of memory by not having Stata represent those in the matrices it works with to do your regression. (Remember, the memory requirements for the covariance matrix go up as the square of the number of variables you are tracking in the model.)
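
                The order of operations would look roughly like this. It is a sketch that assumes Stata/SE or Stata/MP (where maxvar can be raised), and the file name is a hypothetical placeholder:

                ```stata
                clear all                     // maxvar cannot be changed with data in memory
                set maxvar 32000              // raise the limit first
                set emptycells drop           // skip Depart#Arrive cells that never occur
                * use "airline_panel.dta"     // hypothetical file name: reload the data afterwards
                ```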

                The two models you list near the end should be equivalent from the perspective of estimating the effect of winning an accolade. The constant terms will differ, and, of course, one will give you estimates of effects of route but not company, and the other the opposite. But presumably you aren't interested in those other effects anyway--they're just nuisance parameters for your purposes.
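
                A small simulation can make this concrete. Everything below is invented (4 airlines, 3 routes, 24 months, an accolade for airline 1 in the second year); it is only meant to show that the WonAccolade coefficient comes out the same whichever of the two variables is declared in -xtset-:

                ```stata
                clear
                set seed 42
                set obs 4
                gen nCompany = _n
                expand 3
                bysort nCompany: gen DepartArrive = _n
                expand 24
                bysort nCompany DepartArrive: gen t = _n
                gen int date = ym(2013, 1) + t - 1
                format date %tm
                gen month = mod(t - 1, 12) + 1
                gen byte WonAccolade = (nCompany == 1 & t > 12)
                gen Traffic = 100 + 5*WonAccolade + 0.3*t + 2*month + 10*DepartArrive + 20*nCompany + rnormal()

                * company absorbed by -fe-, route entered as indicators
                xtset nCompany
                xtreg Traffic c.date i.month i.DepartArrive i.WonAccolade, fe
                scalar b1 = _b[1.WonAccolade]

                * route absorbed by -fe-, company entered as indicators
                xtset DepartArrive
                xtreg Traffic c.date i.month i.nCompany i.WonAccolade, fe
                scalar b2 = _b[1.WonAccolade]

                display b1, b2
                ```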

                And, you are right, neither -xtset- nor -xtreg- wants anything to do with string variables. Actually, string variables play only a very limited role in Stata, as identifiers or labels for things.



                • #9
                  You're very helpful. Thanks so much Clyde!



                  • #10
                    Hi all, I'm new here and I'm very interested in this problem because I have a similar one.

                    I have a dataset with 1580 rows.
                    The dataset has this structure: 79 people make bids in 5 rounds for 4 different products.
                    So, in each row, I have an "ID" (1-79), a variable that identifies the "round" (1-5), a single "bid" column, a variable that identifies the product called "product" (1-4), and also 3 dummies called "product_2", "product_3" and "product_4". When all 3 dummies are equal to zero, the row and the bid refer to product number 1.
                    I'm interested in using my dataset as a panel, but with the command -xtset ID round- the observations are not uniquely identified: I have 4 rows for each combination of ID and round, so I can't proceed this way.
                    If I create a separate dataset for each product, I can't analyse the significance of the product dummy variables.
                    Is there a way to use my dataset as a panel without losing information about the products?
                    Is it a valid alternative to use the command -xtset ID-, include the product dummies, and create further dummies for the rounds, checking their significance in the model?

                    Thanks in advance
