Keep the duplicate observation based on the highest value of a variable

Hugo Denis

Join Date: Apr 2022

Posts: 27
#1

Keep the duplicate observation based on the highest value of a variable

02 Feb 2023, 02:44

Dear Statalist,

I recently combined multiple files (filename_v1, filename_v2, etc.) into a single dataset and now I have duplicate observations (the datestamp variable should be enough but I would like to deal with duplicates from both datestamp and resp_email to be sure). I want to keep the duplicate observation from the file with the highest number at the end, since this may vary.

Code:

* Example generated by -dataex-. For more info, type help dataex clear input str19 datestamp str44 resp_email str7 filename "2020-06-18 09:53:22" "email" "FILE_v2" "2020-06-18 09:53:22" "email" "FILE_v3" end

Here I would just like to keep the second observation, but if the files were "FILE_v4" and "FILE_v3" I would like to have only the FILE_v4 because it is indeed the one with the highest number.

Last edited by Hugo Denis; 02 Feb 2023, 02:47.
Tags: None

Girish Venkataraman

Join Date: Dec 2021
Posts: 293

02 Feb 2023, 05:10

Originally posted by Hugo Denis View Post

Dear Statalist,

I recently combined multiple files (filename_v1, filename_v2, etc.) into a single dataset and now I have duplicate observations (the datestamp variable should be enough but I would like to deal with duplicates from both datestamp and resp_email to be sure). I want to keep the duplicate observation from the file with the highest number at the end, since this may vary.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str19 datestamp str44 resp_email str7 filename
"2020-06-18 09:53:22" "email" "FILE_v2"
"2020-06-18 09:53:22" "email" "FILE_v3"
end

Here I would just like to keep the second observation, but if the files were "FILE_v4" and "FILE_v3" I would like to have only the FILE_v4 because it is indeed the one with the highest number.

The code below only works with pulling out the maximum file number, not max of date stamps. This also assumes that data are likely ordered sequentially by respondents, not depicted in your sample.

Code:

// This portion may not apply to your data
*--------------------------------------------------------
gen long seq = _n in 1/2    // generate sequential per respondent for just these two observation
gen patid      = 1 in 1/2        // generate id per respondent for just these observation
order patid seq                // order both


// Drop the bys if data are not structured by respondent 
*--------------------------------------------------------
gen filenum = real(ustrregexra(filename, "\D*","")) // delete all non-numbers
bys patid: egen byte keep = max(filenum)            // tag the highest to keep
drop if filenum < keep                                // drop observations not matching highest
drop filenum keep patid seq                            // drop all newly generated variables.

list

Comment

Nick Cox

Join Date: Mar 2014

Posts: 36058
#3

02 Feb 2023, 05:52

I don't follow #2 as an answer to #1. My guess is that you want

Code:

bysort datestamp resp_email (filename) : keep if _n == _N

which selects the last in each group, including the special case of the last of one.
Comment
guyle Smith

Join Date: Feb 2023

Posts: 14
#4

02 Feb 2023, 05:58

Hi there,
I have 4 monthly datasets of firms over the period 2015-2019. Each dataset has about 15,000 obs and 14 variables. Here is an example of the original dataset, which contains firm id, sector code, year and month only.

Code:
* Example generated by -dataex-. For more info, type help dataex clear input byte(firm_id1 sector) int year byte month 1 1 2010 1 1 1 2010 2 1 1 2010 3 1 1 2011 1 1 1 2011 2 1 1 2011 3 2 1 2010 1 2 1 2010 2 2 1 2010 3 2 1 2011 1 2 1 2011 2 2 1 2011 3 3 1 2010 1 3 1 2010 2 3 1 2010 3 end
I want to:
1. expand this dataset
2. Create two variables treat1 and treat2 such that treat 1 = 1 if a firm took a monthly action and is observed on that month. 0 otherwise
treat 2 = 1 if a firm has consecutively taken action at least two times within a year>

Here is what I am looking for

Code:
* Example generated by -dataex-. For more info, type help dataex clear input byte(firm_id1 sector) int year byte(month treat treat1 treat2) 1 1 2010 1 1 1 1 1 1 2010 2 1 1 1 1 1 2010 3 1 1 1 1 1 2011 1 1 1 1 1 1 2011 2 1 1 1 1 1 2011 3 1 1 1 2 1 2010 1 1 1 0 2 1 2010 2 1 1 0 2 1 2010 3 . 0 0 2 1 2011 1 1 1 0 2 1 2011 2 1 1 0 2 1 2011 3 . 0 0 3 1 2010 1 1 1 0 3 1 2010 2 . 0 0 3 1 2010 3 . 0 0 end
3. Next, I would like to merge these four files and create a variable time that goes from january2015 to december2019 such that
march2018 =0
feb2018 = -1
jan2018 = -2 ... until jan2015
apr2018=1
may2018 = 2 ... until dec 2019

Can anyone knows how to handle these tasks?
Comment
Nick Cox

Join Date: Mar 2014

Posts: 36058
#5

02 Feb 2023, 06:00

guyle Smith Please start a new thread with this question and include CODE delimiters.
Comment
guyle Smith

Join Date: Feb 2023

Posts: 14
#6

02 Feb 2023, 07:05

Thanks Nick. This is silly, but could you please tell me where to find the button for a new thread? I can't find it.
Comment
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#7

02 Feb 2023, 09:08

Originally posted by guyle Smith View Post

Thanks Nick. This is silly, but could you please tell me where to find the button for a new thread? I can't find it.

At the top of the threads here is a button on which it is written "+ New Topic". You click on it to start a new thread.
Comment

Announcement