Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Keep the duplicate observation based on the highest value of a variable

    Dear Statalist,

    I recently combined multiple files (filename_v1, filename_v2, etc.) into a single dataset and now I have duplicate observations (the datestamp variable should be enough but I would like to deal with duplicates from both datestamp and resp_email to be sure). I want to keep the duplicate observation from the file with the highest number at the end, since this may vary.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str19 datestamp str44 resp_email str7 filename
    "2020-06-18 09:53:22" "email" "FILE_v2"
    "2020-06-18 09:53:22" "email" "FILE_v3"
    end
    Here I would just like to keep the second observation, but if the files were "FILE_v4" and "FILE_v3" I would like to have only the FILE_v4 because it is indeed the one with the highest number.
    Last edited by Hugo Denis; 02 Feb 2023, 02:47.

  • #2
    Originally posted by Hugo Denis View Post
    Dear Statalist,

    I recently combined multiple files (filename_v1, filename_v2, etc.) into a single dataset and now I have duplicate observations (the datestamp variable should be enough but I would like to deal with duplicates from both datestamp and resp_email to be sure). I want to keep the duplicate observation from the file with the highest number at the end, since this may vary.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str19 datestamp str44 resp_email str7 filename
    "2020-06-18 09:53:22" "email" "FILE_v2"
    "2020-06-18 09:53:22" "email" "FILE_v3"
    end
    Here I would just like to keep the second observation, but if the files were "FILE_v4" and "FILE_v3" I would like to have only the FILE_v4 because it is indeed the one with the highest number.

    The code below only works with pulling out the maximum file number, not max of date stamps. This also assumes that data are likely ordered sequentially by respondents, not depicted in your sample.
    Code:
    // This portion may not apply to your data
    *--------------------------------------------------------
    gen long seq = _n in 1/2    // generate sequential per respondent for just these two observation
    gen patid      = 1 in 1/2        // generate id per respondent for just these observation
    order patid seq                // order both
    
    
    // Drop the bys if data are not structured by respondent 
    *--------------------------------------------------------
    gen filenum = real(ustrregexra(filename, "\D*","")) // delete all non-numbers
    bys patid: egen byte keep = max(filenum)            // tag the highest to keep
    drop if filenum < keep                                // drop observations not matching highest
    drop filenum keep patid seq                            // drop all newly generated variables.
    
    list

    Comment


    • #3
      I don't follow #2 as an answer to #1. My guess is that you want

      Code:
      bysort datestamp resp_email (filename) : keep if _n == _N
      which selects the last in each group, including the special case of the last of one.

      Comment


      • #4
        Hi there,
        I have 4 monthly datasets of firms over the period 2015-2019. Each dataset has about 15,000 obs and 14 variables. Here is an example of the original dataset, which contains firm id, sector code, year and month only.

        Code:
        * Example generated by -dataex-. For more info, type help dataex clear input byte(firm_id1 sector) int year byte month 1 1 2010 1 1 1 2010 2 1 1 2010 3 1 1 2011 1 1 1 2011 2 1 1 2011 3 2 1 2010 1 2 1 2010 2 2 1 2010 3 2 1 2011 1 2 1 2011 2 2 1 2011 3 3 1 2010 1 3 1 2010 2 3 1 2010 3 end
        I want to:
        1. expand this dataset
        2. Create two variables treat1 and treat2 such that treat 1 = 1 if a firm took a monthly action and is observed on that month. 0 otherwise
        treat 2 = 1 if a firm has consecutively taken action at least two times within a year>

        Here is what I am looking for

        Code:
        * Example generated by -dataex-. For more info, type help dataex clear input byte(firm_id1 sector) int year byte(month treat treat1 treat2) 1 1 2010 1 1 1 1 1 1 2010 2 1 1 1 1 1 2010 3 1 1 1 1 1 2011 1 1 1 1 1 1 2011 2 1 1 1 1 1 2011 3 1 1 1 2 1 2010 1 1 1 0 2 1 2010 2 1 1 0 2 1 2010 3 . 0 0 2 1 2011 1 1 1 0 2 1 2011 2 1 1 0 2 1 2011 3 . 0 0 3 1 2010 1 1 1 0 3 1 2010 2 . 0 0 3 1 2010 3 . 0 0 end
        3. Next, I would like to merge these four files and create a variable time that goes from january2015 to december2019 such that
        march2018 =0
        feb2018 = -1
        jan2018 = -2 ... until jan2015
        apr2018=1
        may2018 = 2 ... until dec 2019


        Can anyone knows how to handle these tasks?

        Comment


        • #5
          guyle Smith Please start a new thread with this question and include CODE delimiters.

          Comment


          • #6
            Thanks Nick. This is silly, but could you please tell me where to find the button for a new thread? I can't find it.

            Comment


            • #7
              Originally posted by guyle Smith View Post
              Thanks Nick. This is silly, but could you please tell me where to find the button for a new thread? I can't find it.
              At the top of the threads here is a button on which it is written "+ New Topic". You click on it to start a new thread.

              Comment

              Working...
              X