  • How to tag new observations in a dataset

    Hi All,

    I have a dataset that resembles the following:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(id year y x)
    1 1990 123  32
    2 1990 321  23
    3 1990   3  23
    4 1990 213  23
    1 1991 213 123
    2 1991   3 123
    3 1991 123 213
    4 1991 123 123
    5 1991  23  23
    end

    In the above dataset, I have information on y and x by individual (identified by id) and year. There is a big jump in the number of new individuals in my data in 1991. As can be seen above, the observation with id=5 is new in 1991 (nonexistent in 1990). My results are quite sensitive to the inclusion of these new individuals. Suppose that 1990 is the "base" year, i.e. the year before which the set of individuals was more or less constant. Is there any way to tag new entrants relative to 1990? For instance, I would like to create a dummy variable where individual 5 in 1991 gets a value of 1 (new entrants get a value of 1), whereas individuals who already existed in the previous year get a value of 0.


    Best Wishes,
    CS

  • #2
    I'm not sure if this is the quickest or best way to do this, but this is how I'd do it:

    Code:
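    * Number the observations within each id, in year order
    bysort id (year): gen temp = _n
    gen present_before_1991 = 0
    * A 1991 observation that is not the first for its id has an earlier year
    replace present_before_1991 = 1 if year == 1991 & temp > 1
    * Spread the flag to every observation of that id
    bysort id: egen max_value = max(present_before_1991)
    replace present_before_1991 = 1 if max_value == 1
    drop temp max_value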
    Let me know if that works for you

    EDIT: I just noticed you asked for the dummy to be coded the opposite way. I've given individuals who were already present before 1991 a '1', while you asked for them to be tagged '0'. You can of course reverse that if you want.
    Last edited by Jesse Tielens; 01 Aug 2018, 07:06.



    • #3
      Hi Jesse!

      Many thanks - this works.



      • #4
        Some more technique:

        Code:
        bysort id (year) : gen is_new = year[1] >  1990
        bysort id (year) : gen entered_1991 = year[1] ==  1991
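        As a quick check, here is a minimal sketch that applies the two lines above to the example data from the first post. It relies on the fact that, once observations are sorted by year within id, year[1] is the earliest year in which that id appears. The list call at the end is only added for inspection and is not part of the original suggestion.

        ```stata
        * Example data copied from the first post
        clear
        input float(id year y x)
        1 1990 123  32
        2 1990 321  23
        3 1990   3  23
        4 1990 213  23
        1 1991 213 123
        2 1991   3 123
        3 1991 123 213
        4 1991 123 123
        5 1991  23  23
        end

        * year[1] is the first (earliest) year within each id after sorting by year
        bysort id (year) : gen is_new = year[1] > 1990
        bysort id (year) : gen entered_1991 = year[1] == 1991

        * Inspect the result panel by panel
        list id year is_new entered_1991, sepby(id)
        ```

        On this data only id 5 should be flagged, since every other id is already present in 1990.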



        • #5
          Originally posted by Nick Cox
          Some more technique:

          Code:
          bysort id (year) : gen is_new = year[1] > 1990
          bysort id (year) : gen entered_1991 = year[1] == 1991
          That's definitely a shorter and more elegant solution to Chimnay's problem.

          If you don't mind me asking, I've noticed in several of your comments that you use this syntax:
          Code:
          bysort id (year): .....
          With 'year' between parentheses. How is that command different from:
          Code:
          bysort id year: ...
          The manual seems to list your code as the correct one, but the output is identical?



          • #6
            Thanks, Nick Cox!!



            • #7
              There is a world of difference there. With

              Code:
              bysort id (year)
              the distinct groups are defined by id alone: within those groups observations are sorted by year. In very many panel problems, that is the kind of thing you often want.

              With

              Code:
              bysort id year
              the distinct groups are defined by id and year jointly. For many panel datasets with at most one observation for each identifier and time, that could define for each group at most one observation. It wouldn't bite unless you thought it specified a calculation comparing observations in each panel.

              See e.g. https://www.stata-journal.com/sjpdf....iclenum=pr0004 for a tutorial on by:.
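              To make the difference concrete, here is a small sketch using the example data from the first post. The variable names n_in_id, first_year, n_in_id_year and first_year_joint are made up for illustration. It records the group size (_N) and the first year seen in each group under both forms.

              ```stata
              * Example data copied from the first post
              clear
              input float(id year y x)
              1 1990 123  32
              2 1990 321  23
              3 1990   3  23
              4 1990 213  23
              1 1991 213 123
              2 1991   3 123
              3 1991 123 213
              4 1991 123 123
              5 1991  23  23
              end

              * Groups defined by id alone; observations sorted by year within id
              bysort id (year) : gen n_in_id = _N
              bysort id (year) : gen first_year = year[1]

              * Groups defined by id and year jointly
              bysort id year : gen n_in_id_year = _N
              bysort id year : gen first_year_joint = year[1]

              list id year n_in_id first_year n_in_id_year first_year_joint, sepby(id)
              ```

              With id (year), each panel is one group, so first_year is the entry year of that id. With id year, every group here holds a single observation, so year[1] is just that observation's own year, and a test such as year[1] > 1990 would flag every 1991 observation rather than only the new entrants.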



              • #8
                That's definitely an important distinction; it could prove very useful if I ever have a panel with multiple observations per year. Thanks!

                Comment


                • #9
                  I tried another route as well:

                  Code:
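                  * Flag non-missing observations in each of the two years
                  gen present1990=0
                  by ifscode, sort: replace present1990=1 if !missing(y) & year==1990
                  gen present1991=0
                  by ifscode, sort: replace present1991=1 if !missing(y) & year==1991
                  * Note: for the first observation of each ifscode, present1990[_n-1] is missing, so new is missing there
                  by ifscode (year), sort: gen new=present1991[_n]-present1990[_n-1]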
                  This works as well, but is definitely not as succinct as Nick's. Thanks Jesse Tielens as well!
