  • Finding the first occurrence of a specific value or greater

    Dear Stata users,
    I need help finding the first time that a specific value or greater appears in a variable.
    I have panel data looking something like this:

    ID NR VAR
    1 1 15
    1 2 3
    1 3 22
    1 4 17
    1 5 25

    I would like to mark the observation where VAR is 20 or greater for the first time, by each ID. In the case above it's the third observation. It might also be that for some IDs a value of 20 or greater doesn't exist. As I understand it, the egen function ifirst finds a specific value only and cannot be combined with >=.

    So I tried the command:
    egen ig20= ifirst(var), v(>=20) after by(ID)
    The error message is "option value () invalid", which of course has to do with the >=. Executing the code without >= works perfectly fine.

    Would you know how I could proceed instead?

    Best and thanks in advance,
    Ida

  • #2
    This is an FAQ. Recall that Statalist FAQ Advice advises you to look at the Stata FAQs before posting. Note that in Stata

    Code:
    search first occurrence
    finds the FAQ in question, so you had exactly the right keywords:

    FAQ . . . . . . . . . . . . . . . First and last occurrences in panel data
    . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox
    3/07 How can I identify first and last occurrences
    systematically in panel data?
    http://www.stata.com/support/faqs/data/firstoccur.html

    Anyway, it's just one line:

    Code:
    bysort ID (NR) : gen firstge20 = sum(VAR >= 20 & VAR < .) == 1
    Deconstruction:

    Code:
    VAR >= 20 & VAR < .
    returns 1 when true and 0 when false. Its running or cumulative sum tallies such occurrences. When the cumulative sum is 1, that marks the first occurrence. This is all done panelwise, by virtue of the bysort. Note the need to be cautious if VAR could be missing, as missing VAR counts as greater than 20 too.
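    As a minimal sketch of that running sum, the example data from #1 can be typed in directly and the indicator and its cumulative sum listed side by side. The variable names ge20 and runsum are made up for illustration and appear nowhere else in this thread.

```stata
* Example data from #1 (a single panel)
clear
input ID NR VAR
1 1 15
1 2 3
1 3 22
1 4 17
1 5 25
end

* Indicator (1 when VAR is 20 or more and not missing) and its
* running sum, computed within each ID in NR order
bysort ID (NR) : gen byte ge20 = VAR >= 20 & VAR < .
by ID : gen runsum = sum(ge20)

list, sepby(ID) noobs
```

    Here ge20 is 0, 0, 1, 0, 1 and runsum is 0, 0, 1, 1, 2: the running sum first reaches 1 at the third observation and stays at 1 until the next value of 20 or more.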

    The FAQ gives another solution, and there are yet others, but this one is worth understanding.
    Last edited by Nick Cox; 01 Jul 2015, 01:57.

    • #3
      Great! Thanks for the advice on both "search" and the code. Very helpful!

      • #4
        It's been 7 years since this post, but for anyone looking at this later: the approach suggested above is pretty good but erroneous for the poster's purpose (I had a similar issue to the poster).
        I am trying to find the first occurrence of heart attack (variable amics) for every patient (NRD_VisitLink or patno).

        Code:
        sort NRD_VisitLink NRD_DaysToEvent
        
        egen patno = group(NRD_VisitLink)
        label variable patno "Unique Patient Number"
        
        by NRD_VisitLink: generate admno = _n
        label variable admno "Patient's Admission Number"
        
        // by the approach in #2
        by NRD_VisitLink: gen firstamics = sum(amics==1)==1
        label variable firstamics "First AMICS admission"
        This yields the following:
        [Screenshot: Screenshot 2022-09-15 at 11.05.34 PM.jpg]
        The problem here is that it can label more than one observation as the first occurrence in certain cases: the running sum stays at 1 until the next observation with amics equal to 1, so any intervening admissions are flagged too.

        Now I only wanted to drop the observations for a patient that occurred BEFORE the first amics admission. So I solved this by generating a new variable that labels such "pre-admits" that are to be dropped.
        Code:
        by NRD_VisitLink: gen preadmit = sum(amics==1)==0
        which yields:
        [Screenshot: Screenshot 2022-09-15 at 11.18.30 PM.jpg]

        which is an appropriate solution to my problem. I can now run drop if preadmit == 1.

        However, I still have not figured out how to answer the poster's question specifically. Maybe a combination of the cumulative sum of amics and the observation number can yield a solution. Hopefully the above gives some leads to future readers.
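        One possible way to tighten the flag, offered as a sketch rather than a definitive answer, is to require that the condition holds in the current observation as well as that the running sum equals 1. The example uses the data from #1 plus a made-up second ID that never reaches 20; the variable names ge20 and firstge20 are only illustrative.

```stata
* Data from #1, plus a hypothetical ID 2 with no value of 20 or more
clear
input ID NR VAR
1 1 15
1 2 3
1 3 22
1 4 17
1 5 25
2 1 5
2 2 8
end

* Indicator for "20 or more and not missing"
bysort ID (NR) : gen byte ge20 = VAR >= 20 & VAR < .

* First occurrence: the condition is true here AND this is the first
* time the running sum reaches 1
by ID : gen byte firstge20 = ge20 & sum(ge20) == 1

list, sepby(ID) noobs
```

        With these data only the third observation of ID 1 is flagged, and no observation of ID 2 is flagged.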
        Last edited by Mohak Gupta; 15 Sep 2022, 21:42.

        • #5
          Mohak Gupta, you're right that the code in #2 yields multiple observations with the value 1, and so does not uniquely identify the first observation. However, the solution is mentioned in the link provided in #2.

          For the original poster, the correct code would be:
          Code:
          bysort ID (NR) : gen firstge20 = sum(VAR >= 20 & VAR < .) == 1 & sum(VAR[_n-1] >= 20 & VAR[_n-1] < .) == 0
          For you similarly, the one-line code would be:
          Code:
          bysort NRD_VisitLink (NRD_DaysToEvent): gen firstamics = sum(amics==1)==1 & sum(amics[_n-1] == 1) == 0
          which avoids the need to create the preadmit variable.
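          A quick way to check a one-liner like this is to count the flagged observations per panel and assert that there is never more than one. The sketch below does so for the first line of code, using the data from #1 plus a made-up second ID that contains a missing value and never reaches 20; the variable name nflag is only illustrative.

```stata
* Data from #1, plus a hypothetical ID 2 with a missing VAR
clear
input ID NR VAR
1 1 15
1 2 3
1 3 22
1 4 17
1 5 25
2 1 5
2 2 .
2 3 8
end

* Running sum is 1 now, and was still 0 up to the previous observation
bysort ID (NR) : gen firstge20 = sum(VAR >= 20 & VAR < .) == 1 & sum(VAR[_n-1] >= 20 & VAR[_n-1] < .) == 0

* Number of flagged observations per ID: should be 0 or 1
by ID : egen nflag = total(firstge20)
assert nflag <= 1

list, sepby(ID) noobs
```

          If the assert passes silently, no panel has more than one flagged observation. In these data ID 1 is flagged at NR 3 only and ID 2 is not flagged at all.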
          Last edited by Hemanshu Kumar; 16 Sep 2022, 01:31.

          • #6
            I guess I will take the comment "pretty good" as a compliment on my code, but "erroneous" is a strong word. It's too late to edit a 2015 post, but Mohak Gupta is right in #4, so that

            When the cumulative sum is first 1, that marks the first occurrence.
            would have been better wording. Naturally I agree also with Hemanshu Kumar in #5 that the larger point of #2 was "This is an FAQ" (so please read it!).
