  • Coding linear time trend with missing years

    I am familiar with creating linear time trends. For example, if I have study years 1996, 1997, 1998 and 1999, then the linear time trend variable will be coded 1 for 1996, 2 for 1997, 3 for 1998 and 4 for 1999.

    However, I am uncertain what to do when one year of data is missing, for example, if I have data only for study years 1996, 1998 and 1999, with the 1997 data missing.

    Do I code 1998 as 2 or 3 in the linear time trend variable?

  • #2
    Abioudun:
    welcome to this forum.
    What you describe as a linear trend resembles instead a categorical variable, where each year corresponds to a given level of the predictor (I would translate it into -i.year- using, as recommended, -fvvarlist- notation; see -help fvvarlist- and the related entry in the Stata .pdf manual that comes with your Stata).
    For a linear time trend, I would have thought of -c.year- (possibly with a squared term to investigate the presence of a turning point):
    Code:
    c.year##c.year
    If you have a missing year, any observation affected by it will be omitted by Stata; hence you will unavoidably have a hole in your dataset.
    However, if you have panel data, do not worry, because Stata can manage both balanced and unbalanced panel datasets.
    Kind regards,
    Carlo
    (Stata 18.0 SE)

    • #3
      In most situations I would code these as 1996 = 1, 1998 = 3, 1999 = 4. That's because in most situations what you want the time trend variable to represent is the elapsed time.

      An exception to this general principle might arise depending on why there is no 1997 data. If the effects you are analyzing were actually suspended or otherwise inoperative during 1997, then it would be more correct to code 1998 as year 2 and 1999 as year 3, because nothing actually happened in 1997.

      Edit: Crossed with #2, which covers other aspects of this situation.
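
      The two codings discussed in #3 can be sketched in Stata as follows. This is only an illustration: it assumes a numeric variable named year, and the names linearyear and consecyear are hypothetical, chosen here for the example.

      ```stata
      * Toy data with the 1997 gap (illustrative only, not the poster's data)
      clear
      input year
      1996
      1998
      1999
      end

      * Elapsed-time coding: 1996 -> 1, 1998 -> 3, 1999 -> 4 (the gap is preserved)
      generate linearyear = year - 1995

      * Consecutive coding: 1996 -> 1, 1998 -> 2, 1999 -> 3
      * (only sensible if the process was inoperative in 1997)
      egen consecyear = group(year)

      list year linearyear consecyear
      ```

      Subtracting a fixed origin keeps the spacing between years intact, whereas -egen, group()- numbers the distinct years consecutively and so closes the gap.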

      • #4
        Originally posted by Carlo Lazzaro:
        Abioudun:
        welcome to this forum.
        What you describe as a linear trend resembles instead a categorical variable, where each year corresponds to a given level of the predictor (I would translate it into -i.year- using, as recommended, -fvvarlist- notation; see -help fvvarlist- and the related entry in the Stata .pdf manual that comes with your Stata).
        For a linear time trend, I would have thought of -c.year- (possibly with a squared term to investigate the presence of a turning point):
        Code:
        c.year##c.year
        If you have a missing year, any observation affected by it will be omitted by Stata; hence you will unavoidably have a hole in your dataset.
        However, if you have panel data, do not worry, because Stata can manage both balanced and unbalanced panel datasets.

        Thanks. I am aware that c.year is a more straightforward alternative, but it doesn't work in my particular case. I have the following model:

        Code:
        reg i.statepolicy covariates i.state i.year state#c.linearyear state#c.linearyear#c.linearyear
        Using c.year instead of c.linearyear causes Stata to drop the state-specific quadratic trends (state#c.year#c.year).


        Originally posted by Clyde Schechter:
        In most situations I would code these as 1996 = 1, 1998 = 3, 1999 = 4. That's because in most situations what you want the time trend variable to represent is the elapsed time.

        An exception to this general principle might arise depending on why there is no 1997 data. If the effects you are analyzing were actually suspended or otherwise inoperative during 1997, then it would be more correct to code 1998 as year 2 and 1999 as year 3, because nothing actually happened in 1997.

        Edit: Crossed with #2, which covers other aspects of this situation.

        Clyde Schechter, thanks. This clears things up for me. I have a follow-up question. I coded 1996 = 1, 1998 = 3, 1999 = 4, and in a separate instance I coded 1996 = 5, 1998 = 7, 1999 = 8, and noticed that my results were essentially the same. Does this mean that you can code the 1st year as any number so long as there is a 1-unit increase for subsequent years? Or was this just a chance finding?

        • #5
          Does this mean that you can code the 1st year as any number so long as there is a 1-unit increase for subsequent years?
          Essentially, yes. If you took this to extremes and chose some extremely large starting number, you might introduce numerical instability into the estimation and get garbage instead of usable results. But, in principle, changing the starting year will not affect anything else in the results, except the constant term, which generally nobody cares about anyway.
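
          The algebra behind this can be written out as a short sketch. The symbols below are introduced here for illustration and are not from the thread: t is the original trend code, t' = t + k is the shifted code (k = 4 in the 1996 = 5 example), and alpha, beta, gamma are the constant, linear and quadratic coefficients.

          ```latex
          % Linear trend: substituting t = t' - k moves the shift into the constant only
          \[
            y = \alpha + \beta t + \varepsilon
              = (\alpha - \beta k) + \beta t' + \varepsilon
          \]
          % With a squared term, the linear coefficient is re-expressed as well
          \[
            y = \alpha + \beta t + \gamma t^{2} + \varepsilon
              = (\alpha - \beta k + \gamma k^{2}) + (\beta - 2\gamma k)\, t' + \gamma t'^{2} + \varepsilon
          \]
          ```

          In the purely linear case the slope is untouched and only the constant moves. When a squared trend term is included, as in the model in #4, the quadratic coefficient, the fitted values and the coefficients on the other covariates are still unchanged, but the reported linear-trend coefficient becomes beta - 2*gamma*k, and in that model the state-specific intercepts would presumably absorb the rest of the shift.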

          • #6
            See also https://www.stata-journal.com/sjpdf....iclenum=st0394 for a broader discussion of origins, conveniently shifted or compulsorily fixed, as the case may be. (@Clyde Schechter's contribution is documented therein.)

            • #7
              Clyde Schechter comments in #5 that "generally nobody cares about [the constant]". I take some issue with this. Here is an example of when someone might care: if the researcher wants to know whether a coefficient is "large enough to matter", the usual first test is subject-matter knowledge. In many cases, however (in my experience at least), there is insufficient subject-matter knowledge; comparing the coefficient to a meaningful constant is then a way to assess its real-world importance. In my work, we often put effort into ensuring that there is a meaningful constant for just this reason.

              • #8
                Rich Goldstein is right, as usual. I overstated my case.

                Nevertheless, it is often true that the constant is not important. In the model of this thread, there are numerous factor variables in play: their presence assures that the constant term is an unidentified parameter of the model, and the calculated value will depend on the particular choice of base categories for these factor variables. So unless one has made deliberate and meaningful choices for those base categories, the constant in this model should not be cared about, as it is just an artifact of the parameterization of the model. Of course, such deliberate and meaningful choices are, no doubt, part of what he means when he refers to "put[ting] effort into ensuring that there is a meaningful constant..."
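
                As a small illustration of how the constant depends on the base category, the sketch below uses the auto dataset that ships with Stata rather than the model in this thread; the choice of mpg and rep78 is purely for the example.

                ```stata
                * Example dataset shipped with Stata (not the poster's data)
                sysuse auto, clear

                * Default base level is the lowest value of rep78,
                * so _cons is the mean of mpg in that category
                regress mpg i.rep78

                * Same model with rep78 = 5 as the base level:
                * the fit is identical, but _cons is now the mean of mpg when rep78 == 5
                regress mpg ib5.rep78
                ```

                The two regressions have the same fitted values and the same R-squared; only the constant and the way the category contrasts are expressed change with the base level.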
