  • Referring to variables at specific dates in panel data

    Hi Stata Community,

    I was wondering if anyone knew how to refer to variables in specific years for data cleaning purposes.

    For a simple example, if in 2009 a survey respondent says their washing machine is 8 years old, I would like to fill in their replacement choice: "repl"=1 in 2001.

    In an ideal world, I would love to use syntax like:

    replace 2001.repl=1 if 2009.age == 8

    Since some of my data has only categories of years, this can get kind of complex:

    replace 2004.repl=1 if 2005.age== "0 to 1 years" & 2009.age == "5 to 9 years"

    Is any version of this possible with panel data? I have been unsuccessful at locating the appropriate syntax.

    Alecia Cassidy
    Last edited by Alecia Cassidy; 28 Sep 2014, 14:43.

  • #2
    Try -help encode-, and if that doesn't get you going in the right direction, come back with more details (data structure, an example of about twenty lines of data, etc.).

    Best,

    ben,



    • #3
      P.S. After re-reading: even after -encode- you still need to make some decisions. If the categories are numerous, then recoding each category to its mean is attractive (or simply using the values -encode- chose). But if there aren't that many categories, or there is an obvious curve to it, then setting it at the midpoint is iffy. There is no perfect solution, but be prepared to do sensitivity analyses. Regardless, I hope -encode- gets you partway there. You'll probably still have to do some recoding or generate new variables after -encode-, since -encode- numbers the categories in alphabetical order of their labels; unless each label begins with a number that sorts correctly, it may put things in an unexpected order.
      Last edited by ben earnhart; 28 Sep 2014, 15:13.



      • #4
        While your pseudocode is suggestive of what you want, you don't tell us enough about the data to get a concrete response. Key question: are your data in "long" format? That is, do you have a single variable repl, another single variable age, and a variable year, with multiple observations per person? Or do you have "wide" format, i.e. a single observation per person with variables named repl2001 repl2002 ... age2001 age2002 ... (and no year variable)?

        If your data are in long format, does each person have an observation every year, or can there be gaps between years for a person? And can a person have multiple observations in a single year?

        What I know you don't have are variables named 2001.repl, etc., because those aren't legal Stata variable names.
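
        For reference, if the data did turn out to be wide, a minimal sketch of converting them to long might look like the following. The names age2001, repl2001, etc. are the hypothetical ones from the description above, not names taken from the actual dataset.

        ```stata
        * Assumes a hypothetical wide layout: one observation per id, with
        * variables age2001 age2002 ... and repl2001 repl2002 ...
        reshape long age repl, i(id) j(year)

        * Declare the panel structure so that time-series operators can be used
        xtset id year
        ```

        After the reshape there is one observation per id and year, which is the layout the rest of the questions here assume.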

        Second key question: when you write the condition 2005.age == "0 to 1 years" & 2009.age == "5 to 9 years", do you really mean that both conditions have to be met (which is what the use of & will get you), or did you mean to set repl for year 2004 to 1 if either condition is met (which you can get from |, not &)? And when there are multiple conditions like this, what will you do if the responses for some years are missing?



        • #5
          Hi Clyde and Ben,

          Thanks so much for your responses. Sorry for not providing more information.

          My data are in long format. I have a single variable repl and another single variable age, and a variable year, with multiple observations per person. The consumers were asked the age of their washing machine in certain years. Their responses are under the age variable. So, consumer 1 said their washing machine was 8 years old in 2001 and 0 to 1 years old when surveyed in 2005.
          id  age           repl  year
          1                       1985
          1                       1986
          1                       1987
          1                       1988
          1                       1989
          1                       1990
          1                       1991
          1                       1992
          1                       1993
          1                       1994
          1                       1995
          1                       1996
          1                       1997
          1                       1998
          1                       1999
          1                       2000
          1   8                   2001
          1                       2002
          1                       2003
          1                       2004
          1   0 to 1 years        2005
          1                       2006
          1                       2007
          1                       2008
          1   5 to 9 years        2009
          1                       2010
          1                       2011
          As you can see, repl is missing. That's because I'm trying to write an algorithm to generate repl. In this particular case, the algorithm would ultimately lead to something along the lines of:
          id  age           repl  year
          1                 .     1985
          1                 .     1986
          1                 .     1987
          1                 .     1988
          1                 .     1989
          1                 .     1990
          1                 .     1991
          1                 .     1992
          1                 1     1993
          1                 0     1994
          1                 0     1995
          1                 0     1996
          1                 0     1997
          1                 0     1998
          1                 0     1999
          1                 0     2000
          1   8             .     2001
          1                 .     2002
          1                 .     2003
          1                 1     2004
          1   0 to 1 years  0     2005
          1                 0     2006
          1                 0     2007
          1                 0     2008
          1   5 to 9 years  0     2009
          1                 0     2010
          1                 0     2011

          I have the algorithm on paper, but just can't figure out how to refer to the variables by their years of observation in order to implement it.

          There can be gaps between years for a given person. For example, the survey year 2009 could be missing for a given person. Nobody has two observations for the same year. That doesn't necessarily mean I don't know the replacement decision in that year, though, since if the washing machine is said to be 20 years old in 2009, then I know that it was not replaced in 2005. I had been considering simply tossing out missing values or imputing if I can't determine the exact year. The point is that I think I need some easy way to reference the variables in specific years so that I can do rather complex imputations and such to back out the replacement years.

          I read up about encode and I'm not sure I understand what I should do with it. If both conditions in my pseudocode (2005.age == "0 to 1 years" & 2009.age == "5 to 9 years") are met, then I know the exact year of replacement must be 2004 (as long as the survey respondent has a crystal-clear memory). So it seems I would lose something by using encode to average the categories or take their midpoints when I could have the actual year. As for Clyde's question, I did mean both, because in that case the year is determined by the two conditions together. In some cases it won't be uniquely determined, but in lots of cases I can at least rule out certain years and impute among the rest.
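
          A minimal sketch of that fully determined case is below. It assumes age is stored as a string variable and that repl already exists. As far as I know, time-series operators cannot be applied directly to string variables, so the sketch first builds numeric indicators; is0to1 and is5to9 are hypothetical names introduced only for illustration.

          ```stata
          * Fill in missing years so every id has every year, then declare the panel
          fillin id year
          xtset id year

          * Numeric indicators for the two categorical responses (hypothetical names)
          gen byte is0to1 = (age == "0 to 1 years")
          gen byte is5to9 = (age == "5 to 9 years")

          * "0 to 1 years" one year ahead implies replacement in year t or t+1;
          * "5 to 9 years" five years ahead implies replacement between t-4 and t.
          * Together they leave only year t, the current observation.
          replace repl = 1 if F1.is0to1 == 1 & F5.is5to9 == 1
          ```

          For the example data, the observation for 2004 sees "0 to 1 years" in 2005 and "5 to 9 years" in 2009, so repl is set to 1 in 2004 without the year ever being hard-coded.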

          Does a method of referring to variables in specific years like what I described above exist?

          Again, thanks so much for all of your helpful comments!

          Alecia Cassidy





          • #6
            This is unbelievably complicated! To be honest, I don't have any sense of how to deal with the "0 to 1" or "5 to 9" issue. It will require an enormous amount of logic programming to determine when the conjunction of the various responses leads to a unique determination of when the thing was replaced. And my best guess is that there will be a large number of instances where you are left with just a range of possibilities, and I have no sense of how you plan to resolve those. But, if we had only exact ages, there is a fairly simple way to handle the issue of referencing the right years:

            Code:
            // FIRST ELIMINATE YEAR GAPS IN THE DATA
            fillin id year
            xtset id year
            
            // GET LEVELS OF AGE VARIABLE (PRETENDING IT IS EXACT)
            levelsof age, local(ages)
            
            // LOOP OVER AGES TO MARK THE REPLACEMENT YEAR
            foreach a of local ages {
                by id (year), sort: replace repl = 1 if F`a'.age== `a'
            }
            replace repl = 0 if missing(repl)
            
            // IF YOU WANT YOU CAN NOW ELIMINATE THE EMPTY RECORDS THAT WERE
            // CREATED UP TOP: BUT IT IS POSSIBLE THEY WILL ACTUALLY HAVE INSTANCES
            // OF repl == 1.
            drop if _fillin // optional

            Hope this helps.



            • #7
              On further consideration of my last post, the -by id (year), sort:- inside the foreach loop is not necessary. -replace repl = 1 if F`a'.age == `a'- without any by prefix will do exactly the same, and more efficiently I imagine.
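
              Putting that simplification together with the mixed responses shown in #5, a hedged sketch might look like this. It assumes age is a string variable, that the exact ages are whole numbers of years, and that -fillin id year- and -xtset id year- from #6 have already been run; age_exact is a hypothetical variable name.

              ```stata
              * real() converts "8" to 8 and returns missing for text
              * such as "0 to 1 years", so only exact ages survive
              gen age_exact = real(age)

              * Distinct exact ages (missing values are skipped by default)
              levelsof age_exact, local(ages)

              * An age of `a' reported `a' years ahead means replacement this year;
              * F`a' requires `a' to be an integer
              foreach a of local ages {
                  replace repl = 1 if F`a'.age_exact == `a'
              }
              ```

              The categorical responses are left untouched by this loop and would still need their own rules.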



              • #8
                Dear Clyde,

                This is such a clever use of the forward (lead) operator! Thanks so much. This was very, very helpful.

                I agree that it's going to be complicated. I already have thought of an exhaustive list of the completely determined cases (cases in which I know the exact year from combinations of survey responses in different years), which can probably be coded in a similar fashion to the above. In the not-completely-determined cases, it gets a little more tricky.

                Anyways, this is exactly what I needed to get started here. Thank you!

                Alecia



                • #9
                  You're welcome, and thanks for the kind words.

                  It dawns on me that you may face yet another problem that my code glosses over. Depending on how your data were collected, it is possible that there will be inconsistencies. If somebody says the thing is 5 years old in 2005 but says it was 8 years old in 2006, you have a problem. The code I wrote will mark repl = 1 in both 2000 and 1998. I'm not sure how you will resolve such inconsistencies (if they actually arise).
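
                  One possible way to flag such cases, sketched under the assumption that age is a string variable and that only exact responses are checked. This is just one option, not necessarily the best one; bought, running_max, and inconsistent are hypothetical names.

                  ```stata
                  * Implied purchase year from exact responses only; real() returns
                  * missing for text such as "0 to 1 years"
                  gen bought = year - real(age)

                  * Running maximum of the implied purchase year within id
                  * (max() ignores missing arguments unless all are missing)
                  bysort id (year): gen running_max = bought[1]
                  by id: replace running_max = max(running_max[_n-1], bought) if _n > 1

                  * Flag a response implying an earlier purchase than an earlier response did
                  gen byte inconsistent = !missing(bought) & bought < running_max
                  ```

                  In the example above, the 2005 response implies a purchase in 2000 and the 2006 response implies 1998, so the 2006 observation is flagged.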

                  Good luck!



                  • #10
                    Hi Clyde,
                    Yes, I think I will be able to go through and flag these using the forward operator as well, and then decide what to do about them.
                    Thanks!
                    Alecia
