looking at past activity by group

Ylenia Curci

Join Date: Sep 2017
Posts: 72

looking at past activity by group

20 Oct 2021, 10:45

Hi there,
for each observation for which the variable selected==1,I want to create a variable that is =1 if, by firmid, there is at least another observation for which the variables typetextdef==1 & nao_t==0 that has an older date and that date is before 31 December 2010, =2 if by firm id, there is at least another observation for which the variables typetextdef==1 & nao_t==0 that has a older date and that date is after 1st of Jan 2011, and =3 if by firm id, there are at least two observations for which the variables typetextdef==1 & nao_t==0 that have older dates, and these dates are one before 31 December 2010 and the other is after 1st of Jan 2011.
So with the example below, for firmid "00708002100012" the variable should be 2, for firmid "01545059600026" it should be 2, for firmid "06150024500016" it should be 1, for firmid "06220100900222" it should be 3 and so on

I hope it is clear.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float selected str14 firmid str8 date float(typetextdef nao_t)
1 "00705004000018" "20151023" 1 0
1 "00708002100012" "20141224" 1 0
0 "00708002100012" "20121005" 1 0
0 "00708002100012" "20111202" 1 0
0 "00708002100012" "20110705" 4 1
0 "00708002100012" "20090706" 4 1
1 "01545059600026" "20160307" 1 1
0 "01545059600026" "20130926" 1 0
1 "01585179300131" "20160407" 1 1
0 "01585179300131" "20140115" 1 1
0 "01585179300131" "20120628" 1 0
1 "01625002900739" "20120627" 1 0
0 "01625002900739" "20090317" 1 0
0 "01625002900739" "20070514" 1 1
1 "01698006200018" "20111201" 1 1
0 "01698006200018" "20100203" 1 1
1 "03652015300028" "20160921" 6 0
0 "03652015300028" "20150703" 6 0
1 "04705017400029" "20160802" 1 0
1 "05480095800248" "20160520" 6 0
0 "05480095800248" "20151023" 6 0
0 "05480095800248" "20140213" 1 1
1 "05480095800503" "20160520" 6 0
0 "05480095800503" "20151023" 6 0
0 "05480095800503" "20140213" 1 1
1 "05480654200012" "20160421" 1 0
0 "05480654200012" "20160421" 1 0
0 "05480654200012" "20150203" 1 1
0 "05480654200012" "20130403" 4 1
0 "05480654200012" "20121019" 7 0
0 "05480654200012" "20120320" 1 0
1 "05480654203602" "20160421" 1 0
0 "05480654203602" "20160421" 1 0
0 "05480654203602" "20150203" 1 1
0 "05480654203602" "20130403" 4 1
0 "05480654203602" "20121019" 7 0
0 "05480654203602" "20120320" 1 0
1 "05550190200039" "20160325" 1 1
0 "05550190200039" "20150402" 1 1
0 "05550190200039" "20140409" 1 1
0 "05550190200039" "20110331" 1 1
0 "05550190200039" "20090416" 4 1
1 "05580925500015" "20160627" 6 0
0 "05580925500015" "20150401" 6 0
1 "05650171100115" "20131219" 1 0
0 "05650171100115" "20120719" 1 0
0 "05650171100115" "20081218" 1 1
1 "05680065901435" "20161208" 1 0
0 "05680065901435" "20121205" 1 0
0 "05680065901435" "20090115" 1 1
0 "05680065901435" "20070130" 1 1
1 "05720137800023" "20150527" 1 1
0 "05720137800023" "20141010" 6 0
0 "05720137800023" "20120612" 1 1
0 "05720137800023" "20110407" 1 1
1 "05750274200661" "20111223" 1 1
0 "05750274200661" "20111223" 1 0
0 "05750274200661" "20110523" 1 1
1 "05780615000017" "20140127" 1 0
1 "05820018900015" "20131209" 1 0
0 "05820018900015" "20110330" 1 1
0 "05820018900015" "20090708" 1 1
1 "05980406200053" "20150925" 1 0
0 "05980406200053" "20120713" 1 0
1 "06020012800019" "20111219" 1 0
0 "06020012800019" "20100218" 1 1
1 "06080139600023" "20121127" 1 0
1 "06150024500016" "20161222" 1 0
0 "06150024500016" "20161222" 1 0
0 "06150024500016" "20160513" 1 1
0 "06150024500016" "20140228" 1 1
0 "06150024500016" "20130306" 1 1
0 "06150024500016" "20120330" 1 1
0 "06150024500016" "20110414" 1 1
0 "06150024500016" "20070713" 1 0
0 "06220100900222" "20161214" 7 0
1 "06220100900222" "20151216" 1 1
0 "06220100900222" "20141219" 1 0
0 "06220100900222" "20140117" 1 1
0 "06220100900222" "20140117" 1 1
0 "06220100900222" "20140117" 1 1
0 "06220100900222" "20131219" 1 1
0 "06220100900222" "20130118" 1 1
0 "06220100900222" "20130118" 1 1
0 "06220100900222" "20130118" 1 1
0 "06220100900222" "20130118" 1 1
0 "06220100900222" "20130118" 1 1
0 "06220100900222" "20120111" 1 1
0 "06220100900222" "20120111" 1 1
0 "06220100900222" "20120111" 1 1
0 "06220100900222" "20120111" 1 1
0 "06220100900222" "20111215" 1 0
0 "06220100900222" "20100113" 1 0
end

Thank you,

Ylenia

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#2

20 Oct 2021, 13:50

Your description of the conditions you want to apply is not consistent with the output you say you want. Since any given firmid may have both observations with selected = 1 and selcted = 0, and because only those with selected = 1 are to be given a value of 1, 2, or 3 for this new variable, it is wrong and confusing to speak of results for a firmid--there will be different results for different observations within firmid. And even if a firmid had only selected = 1 observations, it is possible that some of those will meet different conditions from others The most sense I can make out of this is the code below, which I think implements the conditions you describe. You also don't say what you want the new variable to be if there are no other observations meeting the stipulated conditions or if selected = 0. The code below sets the value of the new variable to 0 in these situations, but you can easily modify that.

Code:

// CREATE A REAL STATA DATE VARIABLE gen _date = daily(date, "YMD") assert missing(_date) == missing(date) format _date %td drop date rename _date date sort firmid date gen long obs_no = _n tempfile candidates preserve keep if typetextdef == 1 & nao_t == 0 drop typetextdef nao_t selected list if firmid == "00708002100012" save `candidates' restore rangejoin date . date using `candidates', by(firmid) by obs_no, sort: egen early = max(date_U < td(31dec2010)) by obs_no, sort: egen late = max(date_U > td(1jan2010) & !missing(date_U)) gen byte wanted = 0 replace wanted = 1 if early & !late & selected replace wanted = 2 if late & !early & selected replace wanted = 3 if late & early & selected by obs_no, sort: keep if _n == 1 drop *_U

-rangejoin- is written by Robert Picard and is available from SSC. To use it, you must also install -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC.

Added: I also note that your example data has many observations that are entirely duplicates. Perhaps there are other variables in your full data set that distinguish them. Or perhaps there is good reason that certain observations appear more than once. But usually there is not, and it usually means that something has gone wrong in the data management creating that data set. While it is easy enough to eliminate pure duplicate observations with -duplicates drop-, unless you have an explanation for why they are, and should be, there, you should step back and review the creation of this data set. This is often a warning that other, more serious, errors have crept in and need to be fixed before analysis proceeds.

Last edited by Clyde Schechter; 20 Oct 2021, 14:01.
Comment
Ylenia Curci

Join Date: Sep 2017

Posts: 72
#3

21 Oct 2021, 01:11

Thank you Clyde for your answer. Yes, my message was not clear enough, I doubted it.
The dataset is a collection of documents signed by firms, so there is an unique document identifier, which explains why you see duplicates that are actually not duplicates at all. A firm can sign many different documents, even in the same day. The final dataset I will working on will be composed only by the selected documents, so I do not really care what happens to the variable wanted if selected==0, as I will drop those observations.
So what I need is to understand if the selected document followed other documents signed by the same firm and, if any exists, if they were signed before or after (or both) 2010.
Hence, I would like the variable wanted to be . if there are no previous documents, 1 if there is one or more documents all signed before 2010, 2 if there is one or more documents all signed after 2010, 3 if there are at least two documents, one signed before and the other after 2010.
I hope is clear now.

Last edited by Ylenia Curci; 21 Oct 2021, 01:20.
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30117

21 Oct 2021, 09:43

The final dataset I will working on will be composed only by the selected documents, so I do not really care what happens to the variable wanted if selected==0, as I will drop those observations.

This is actually a very major departure from the original description, and will lead to different results. The code in #2 only calculates wanted = 1, 2, or 3 for documents with selected = 1. But in deciding whether a previous document exists satisfying the criteria, documents with selected = 0 were included. So be clear on your intent: if you want the search for previous documents to include those that are not selected = 1, then you can't just work with the set of selected documents. It may well be that the only earlier documents meeting the specified criteria have selected = 0.

Anyway, assuming that you really do want to disregard the selected == 0 documents altogether--not allow them to serve as earlier documents that might meet criteria, the code becomes simpler:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input long document_id str14 firmid str8 date float(typetextdef nao_t)
 1 "00705004000018" "20151023" 1 0
 2 "00708002100012" "20141224" 1 0
 3 "01545059600026" "20160307" 1 1
 4 "01585179300131" "20160407" 1 1
 5 "01625002900739" "20120627" 1 0
 6 "01698006200018" "20111201" 1 1
 7 "03652015300028" "20160921" 6 0
 8 "04705017400029" "20160802" 1 0
 9 "05480095800248" "20160520" 6 0
10 "05480095800503" "20160520" 6 0
11 "05480654200012" "20160421" 1 0
12 "05480654203602" "20160421" 1 0
13 "05550190200039" "20160325" 1 1
14 "05580925500015" "20160627" 6 0
15 "05650171100115" "20131219" 1 0
16 "05680065901435" "20161208" 1 0
17 "05720137800023" "20150527" 1 1
18 "05750274200661" "20111223" 1 1
19 "05780615000017" "20140127" 1 0
20 "05820018900015" "20131209" 1 0
21 "05980406200053" "20150925" 1 0
22 "06020012800019" "20111219" 1 0
23 "06080139600023" "20121127" 1 0
24 "06150024500016" "20161222" 1 0
25 "06220100900222" "20151216" 1 1
end

//  CREATE A REAL STATA DATE VARIABLE
gen _date = daily(date, "YMD")
assert missing(_date) == missing(date)
format _date %td
drop date
rename _date date

sort firmid date

tempfile candidates
preserve
keep if typetextdef == 1 & nao_t == 0
drop typetextdef nao_t
save `candidates'
restore

rangejoin date . date using `candidates', by(firmid)
by document_id, sort: egen early = max(date_U < td(31dec2010))
by document_id, sort: egen late = max(date_U > td(1jan2010) & !missing(date_U))

gen byte wanted = 1 if early & !late
replace wanted = 2 if late & !early
replace wanted = 3 if late & early

by document_id, sort: keep if _n == 1
drop *_U

Note that I have revised the example data. First, observations with selected = 0 have been dropped, and so has the variable selected, as it is now always 1 and carries no information. I also created a document_id variable. In using this code, you should replace all references to document_id by the actual name of your document identifier variable. With the existence of a unique document identifier variable there is no need for the variable obs_no that the code in #2 created and used. The document identifier serves this purpose.

Announcement

looking at past activity by group

Comment

Comment

Comment