
  • Defining duplicates through extracting day from date-time variable

    Hi all,

    I have scoured Statalist for some time but have not found an answer to this question.

    First of all, here is my data:

    Code:
    clear
    input long id double tstamp float semid_min str11 exam
    197143 1908625531000 1 "F_SF_ZK1"  
    197143 1908622351000 1 "F_SF_ZK1"  
    197143 1908625712000 1 "F_SF_ZK2"  
    197143 1908622431000 1 "F_SF_ZK2"  
    197143 1939971189000 2 "F_SF_HK1"  
    197143 1940086597000 2 "F_SF_HK1"  
    197143 1939713046000 2 "F_SF_PP"  
    197143 1940086743000 2 "F_SF_PP"  
    314527 1940087881000 1 "F_SF_HK1"  
    314527 1940087931000 1 "F_SF_HK1"  
    314527 1940088143000 1 "F_SF_PP"  
    314527 1940087991000 1 "F_SF_PP"  
    314527 1938762660000 1 "F_SF_ZK1"  
    361889 1877076918000 1 "F_SF_ZK1"  
    361889 1877077864000 1 "F_SF_ZK1"  
    361889 1892276506000 2 "F_SF_HK1"  
    361889 1892131466000 2 "F_SF_HK1"  
    361889 1892276389000 2 "F_SF_HK1"  
    361889 1892130877000 2 "F_SF_HK1"  
    361889 1894871347000 2 "F_SF_PP"  
    361889 1894871401000 2 "F_SF_PP"  
    361889 1894871610000 2 "F_SF_PP"  
    361889 1894871030000 2 "F_SF_PP"  
    361889 1908532174000 3 "F_SF_AP"  
    361889 1908532234000 3 "F_SF_AP"
    end
    The tstamp variable appears like this (e.g.):

    24jun2020 12:46:30
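
    (For context, tstamp is stored as a Stata %tc clock value, i.e. milliseconds since 01jan1960, which is why the raw numbers are so large. Displaying one of the values above with the clock format confirms this:

    Code:
    display %tc 1908622351000

    which shows 24jun2020 12:52:31.)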

    I want to reshape observations from long to wide (I know a wide layout is rarely recommended, but unfortunately I need one in this case). This is my reshape code:

    Code:
    reshape wide grade extry sem tstamp, ///
                         i(id semid_min) j(exam) string
    As is visible, however, exam takes duplicate values within combinations of id and semid_min, which prevents the reshape.
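
    Incidentally, running Stata's reshape error command immediately after the failed reshape lists exactly those observations whose exam values are not unique within i(id semid_min):

    Code:
    reshape error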

    Two exams that have been passed on the same day by the same person (as indicated by variables tstamp and id respectively) are genuine duplicates and must be dropped.

    Therefore, I would like to create a variable identifying duplicates as defined by id and tstamp, and then drop these duplicate observations. Any ideas?
    Last edited by Maxence Morlet; 19 Oct 2021, 03:17.

  • #2
    Perhaps this will start you in a useful direction.
    Code:
    . format %tc tstamp 
    
    . generate day = dofc(tstamp)
    
    . format day %td
    
    . sort id day exam tstamp
    
    . by id day exam, sort: generate count = _N
    
    . list if count>1, sepby(id day exam)
    
         +-----------------------------------------------------------------------+
         |     id               tstamp   semid_~n       exam         day   count |
         |-----------------------------------------------------------------------|
      1. | 197143   24jun2020 12:52:31          1   F_SF_ZK1   24jun2020       2 |
      2. | 197143   24jun2020 13:45:31          1   F_SF_ZK1   24jun2020       2 |
         |-----------------------------------------------------------------------|
      3. | 197143   24jun2020 12:53:51          1   F_SF_ZK2   24jun2020       2 |
      4. | 197143   24jun2020 13:48:32          1   F_SF_ZK2   24jun2020       2 |
         |-----------------------------------------------------------------------|
     10. | 314527   23jun2021 17:18:01          1   F_SF_HK1   23jun2021       2 |
     11. | 314527   23jun2021 17:18:51          1   F_SF_HK1   23jun2021       2 |
         |-----------------------------------------------------------------------|
     12. | 314527   23jun2021 17:19:51          1    F_SF_PP   23jun2021       2 |
     13. | 314527   23jun2021 17:22:23          1    F_SF_PP   23jun2021       2 |
         |-----------------------------------------------------------------------|
     14. | 361889   25jun2019 10:15:18          1   F_SF_ZK1   25jun2019       2 |
     15. | 361889   25jun2019 10:31:04          1   F_SF_ZK1   25jun2019       2 |
         |-----------------------------------------------------------------------|
     16. | 361889   16dec2019 15:54:37          2   F_SF_HK1   16dec2019       2 |
     17. | 361889   16dec2019 16:04:26          2   F_SF_HK1   16dec2019       2 |
         |-----------------------------------------------------------------------|
     18. | 361889   18dec2019 08:19:49          2   F_SF_HK1   18dec2019       2 |
     19. | 361889   18dec2019 08:21:46          2   F_SF_HK1   18dec2019       2 |
         |-----------------------------------------------------------------------|
     20. | 361889   17jan2020 09:03:50          2    F_SF_PP   17jan2020       4 |
     21. | 361889   17jan2020 09:09:07          2    F_SF_PP   17jan2020       4 |
     22. | 361889   17jan2020 09:10:01          2    F_SF_PP   17jan2020       4 |
     23. | 361889   17jan2020 09:13:30          2    F_SF_PP   17jan2020       4 |
         |-----------------------------------------------------------------------|
     24. | 361889   23jun2020 11:49:34          3    F_SF_AP   23jun2020       2 |
     25. | 361889   23jun2020 11:50:34          3    F_SF_AP   23jun2020       2 |
         +-----------------------------------------------------------------------+
    
    . duplicates report id day exam
    
    Duplicates in terms of id day exam
    
    --------------------------------------
       Copies | Observations       Surplus
    ----------+---------------------------
            1 |            5             0
            2 |           16             8
            4 |            4             3
    --------------------------------------
    
    . duplicates drop id day exam, force
    
    Duplicates in terms of id day exam
    
    (11 observations deleted)
    
    .
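
    Note that duplicates drop with the force option keeps the first observation of each group in the current sort order; since the data were sorted on tstamp within id day exam above, that is the earliest attempt of the day. To make that choice explicit, an equivalent alternative is:

    Code:
    bysort id day exam (tstamp): keep if _n == 1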

    • #3
      Worked wonders, and I learned something too. Thanks, William!
