Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Calculating unique entries on two variables basis

    Hello,

    My dataset looks something like below: (Data is confidential so I'm not using dataex command)

    Code:
    clear
    input double(id off_date offense)
    1 20129   41
    1 17183   22
    1 17183   12
    2 21642    7
    2 17275   12
    2 21642   41
    3 16490   41
    4 16502   12
    5 17279   22
    5 18385 37.8
    end
    The idea is that each individual (id) has an offense date (off_date) when they were arrested. Each arrest could have multiple offenses associated with it. This results in multiple observations (rows) of a single person getting arrested once, since the data is at offense level instead of offense date. I would like to create a variable that basically counts each arrest instead of each offense. I tried to do the bysort egen option, but it leads to numbering the offenses per offense date. What I would like is the offense dates per individual. I did try to look up responses on this but I'm afraid I'm looking up the wrong keywords or search text as I am unable to find a solution so far. Also, I can not use the keep if _n==1 option since I still need the whole dataset as is.

    Thank you,
    Tessie

  • #2
    Your question is not entirely clear. I'm going to interpret what you want as a variable that counts up distinct arrests within each id. That is, it starts over from 1 for each id. If that's not what you intended, please post back and I'll modify the code.
    Code:
    by id (off_date offense), sort: gen arrest_num = sum(off_date != off_date[_n-1])

    Comment


    • #3
      Hello Clyde,

      That was exactly what I needed. My apologies for not being clear with my question. Thank you so much for your help. This works great!

      Best,
      Tessie

      Comment


      • #4
        Hello Clyde,

        This is a slightly different question but similar enough to the first one, hence posting it here. My apologies if that is not allowed. Same dataset as before with the id, offense date and offenses list. I am now trying to calculate the number of offenses (or rather, I am trying to number the offenses per arrest) for each arrest of an individual. I am using the code provided below. Just wanted to confirm with you if it makes sense to do it this way.

        Code:
        egen id_date = group(id off_date)
        by id_date, sort: gen offense_num = _n
        Thanks,
        Tessie

        Comment


        • #5
          It mostly makes sense. The only problem with it is that id_date does not uniquely determine the sort order of the data: within the observations of a given id_date there are multiple possible orderings of the offenses themselves, and, consequently, the code generates a random and irreproducible result for offense_num. If it doesn't matter to you which offense gets numbered first, which next, and so on, and if you will never do anything with the offense_num variable that would rely on a particular ordering, then this doesn't matter and what you have is OK. But, as a general rule, it is better to specify a deterministic order. One way to do that would be to order the offenses in the numerical order of the coding of offense itself. Another way would be to preserve the original order in the data set. So
          Code:
          by id_date (offense), sort: gen offense_num = _n // ORDRED BY CODING OF OFFENSE VARIABLE
          // OR
          sort id_date, stable
          by id_date: gen offense_num = _n // PRESERVES ORIGINAL SORT ORDER WITHIN ARRESTS

          Comment


          • #6
            I see what you mean. I was mostly using it to get to the next step which was calculate the total number of offenses per arrest for which I've used the below code afterwards but I will revise my code to include the offense variable in the by , sort: option just to be systematic about it. Thank you, Clyde! I appreciate your help.
            Code:
            by id_date (offense), sort: egen offense_num = max(offense_count)
            Last edited by Tessie Krishna; 12 Jul 2023, 18:02. Reason: Accidental posting

            Comment


            • #7
              I see what you mean. I was mostly using it to get to the next step which was calculate the total number of offenses per arrest for which I've used the below code afterwards but I will revise my code to include the offense variable in the by , sort: option just to be systematic about it. Thank you, Clyde! I appreciate your help.
              Code:
              by id_date (offense), sort: egen offense_num = max(offense_count)

              Comment

              Working...
              X