Creating new variables (change_1,change_n) according to changes in variable A and/or B

Laura Hill

Join Date: Aug 2021
Posts: 42

Creating new variables (change_1,change_n) according to changes in variable A and/or B

15 Mar 2023, 06:36

Dear all,

I was hoping someone would be able to help me figure out the following:

I want to create new variables (change_1, change_2...change_n), that contains the date of when a change in the team happened (someone left/joined).
The variables used to track this are 'AppointmentDate' (joined) and 'ResignationDate' (left).
Whenever a change happens, I want to create a new variable 'change_1', 'change_2'...., and it contains either the appointment or resignation date.

BvdIdNumber = the company number (used as a grouping variable).
UCI = employee number

It could be that some firms only have 1 change vs others having multiple during their lifecycle, therefore that I wanted to not only look at fixed time periods (eg. years) as for some firms this means no change happened, while for others many team changes have occurred.
Additionally, if possible, I would only like to track changes as of 1 year (potentially 1/2 year) after the company's start (DateofIncorporation) not to count the founding team as a change.
Small side note: There are still some duplicates in the data (aka same employee number and same appointment/resignation date), do I need to get rid of these first?

My goal with this is to create a wide dataset that I can then, after removing individual-level variables, turn into a long format to perform a survival analysis (stcox).
Rather than using only years or months to identify team composition, I would also try my hand at looking at changes and calculating gender/age/nationality diversity after each change (perhaps using stsplit).
I would then based on the date in the variable calculate a diversity measure for all people present at that time.

Would anyone know how to start coding this or point me in the right direction?
I have not been able to find what I am looking for, but it could also be me not using the right vocab.

Thanks in advance for your help!
Below is an excerpt of my data using dataex

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str16 BvdIdNumber str10 DateOfIncorporation float(FirmFailure2 DateFailure) str10(DMUci DMAppointmentDate DMResignationDate) float DMGenderDummy str10 DMBirthdate str27 Nationality
"AT9070350951" "2016"       1     . "P039706490" "04/05/2016" ""           0 "1978"       "Austria"       
"AT9070350951" "2016"       1     . "P039706490" "04/05/2016" ""           0 "1978"       "Austria"       
"AT9070350951" "2016"       1 22410 "P039706490" "04/05/2016" ""           0 "1978"       "Austria"       
"AT9070350951" "2016"       1 22410 "P206368021" "04/05/2016" ""           0 "1961"       "Austria"       
"AT9070350951" "2016"       1 22410 "P039706490" "04/05/2016" ""           0 "1978"       "Austria"       
"AT9070350951" "2016"       1     . "P206368021" "04/05/2016" ""           0 "1961"       "Austria"       
"AT9070422953" "11/12/2017" 1 21844 "P301627755" "18/09/2018" "16/09/2019" 0 "10/03/1995" "Austria"       
"AT9070422953" "11/12/2017" 1 21844 "P300272480" "24/01/2018" ""           0 "17/04/1993" "Austria"       
"AT9110939024" "08/01/2016" 1 21825 "P117216063" "04/02/2016" ""           0 "1968"       "Germany"       
"AT9110939024" "08/01/2016" 1 21825 "P001968694" "04/02/2016" ""           0 "30/03/1962" "Germany"       
"BE0536436625" "28/06/2013" 1 20438 "P009351640" "31/01/2015" "16/12/2015" 0 "1967"       "Canada"        
"BE0536436625" "28/06/2013" 1 20438 "P009351700" "31/01/2015" "16/12/2015" 0 "1960"       "United Kingdom"
"BE0536436625" "28/06/2013" 1     . "P009351640" "31/01/2015" "16/12/2015" 0 "1967"       "Canada"        
"BE0536436625" "28/06/2013" 1 20438 "P048688796" "31/01/2015" "16/12/2015" 0 "1971"       "United Kingdom"
"BE0536436625" "28/06/2013" 1     . "P009351700" "31/01/2015" "16/12/2015" 0 "1960"       "United Kingdom"
"BE0536436625" "28/06/2013" 1     . "P048688796" "31/01/2015" "16/12/2015" 0 "1971"       "United Kingdom"
"BE0536436625" "28/06/2013" 1 20438 "P014424187" "28/06/2013" ""           0 "1956"       "Canada"        
"BE0536436625" "28/06/2013" 1     . "P014424187" "28/06/2013" ""           0 "1956"       "Canada"        
"BE0555980046" "10/07/2014" 1 21236 "P039659380" "05/08/2015" "21/02/2018" 0 "1965"       "United Kingdom"
"BE0555980046" "10/07/2014" 1 21236 "P076509994" "10/07/2014" ""           0 "11/11/1970" "France"        
"BE0656918246" "22/06/2016" 1 21081 "P349820174" "22/06/2016" "23/06/2020" 0 "06/11/1980" "Bulgaria"      
"BE0656918246" "22/06/2016" 1 21081 "P141888237" "24/11/2016" ""           0 "31/07/1989" "Bulgaria"      
"BE0656918246" "22/06/2016" 1 21081 "P349820174" "22/06/2016" "23/06/2020" 0 "06/11/1980" "Bulgaria"      
"BE0656918246" "22/06/2016" 1 21081 "P141888237" "24/11/2016" ""           0 "31/07/1989" "Bulgaria"      
"BE0657890622" "30/06/2016" 1 21720 "P077578372" "30/06/2016" "20/06/2019" 0 "19/11/1964" "France"        
"BE0657890622" "30/06/2016" 1 21720 "P350651532" "30/06/2016" "20/06/2019" 1 "10/11/1965" "France"        
"BE0664736644" "17/10/2016" 1 21181 "P365060689" "17/10/2016" ""           0 "22/01/1981" "Netherlands"   
"BE0664736644" "17/10/2016" 1 21181 "P438431943" "17/10/2016" ""           1 "11/12/1991" "Romania"       
"BE0664799594" "18/10/2016" 1 21906 "P006533913" "18/10/2016" "23/12/2019" 0 "16/02/1968" "United Kingdom"
"BE0664799594" "18/10/2016" 1 21906 "P263565204" "18/10/2016" "23/12/2019" 0 "24/08/1939" "Greece"        
"BE0664799594" "18/10/2016" 1 21906 "P006533911" "18/10/2016" "23/12/2019" 0 "29/07/1969" "Greece"        
"BE0808687713" "23/12/2008" 1     . "P068611593" "02/01/2011" "03/10/2017" 0 "15/01/1968" "Denmark"       
"BE0808687713" "23/12/2008" 1 19837 "P068611593" "02/01/2011" "03/10/2017" 0 "15/01/1968" "Denmark"       
end
format %td DateFailure

Best regards,
Laura

Tags: dates, survival analysis, variable creation

Clyde Schechter

Join Date: Apr 2014

Posts: 30065
#2

15 Mar 2023, 10:24

Code:

// REPLACE STRING VARIABLES THAT LOOK LIKE DATES WITH STATA NUMERIC DATE VARIABLES foreach v of varlist DMAppointmentDate DMResignationDate { gen _`v' = daily(`v', "DMY"), after(`v') assert missing(_`v') == missing(`v') format _`v' %td drop `v' } gen _DateOfIncorporation = daily(DateOfIncorporation, "DMY"), after(DateOfIncorporation) replace _DateOfIncorporation = mdy(1, 1, real(DateOfIncorporation)) /// if missing(_DateOfIncorporation) assert missing(_DateOfIncorporation) == missing(DateOfIncorporation) format _DateOfIncorporation %td drop DateOfIncorporation rename _* * frame put BvdIdNumber DMAppointmentDate DMResignationDate DateOfIncorporation, /// into(changes) frame change changes gen `c(obs_t)' obs_no = _n reshape long DM@Date, i(obs_no) j(event) string drop if missing(DMDate) drop obs_no event duplicates drop drop if DMDate - DateOfIncorporation < 365 // DON'T COUNT CHANGES IN FIRST YEAR by BvdIdNumber (DMDate), sort: gen _j = _n rename DMDate change_ reshape wide change_, i(BvdIdNumber) j(_j)

At the end of this code, the original data will still be in the default frame, except that the useless string dates have been replaced by Stata numeric dates. The contents of frame changes will be a reduced data set containing each firm's BvdIdNumber and a series of change_* variables containing the corresponding appointment and resignation dates. Change events occurring within 365 days of incorporation are excluded.

Your DateOfIncorporation variable is sometimes given as a complete date, and sometimes as just the year. For those where only the year is given, I have imputed an incorporation date of 1 January of that year. If that doesn't seem appropriate to you, you can change that accordingly.

Note that if two or more change events occur on the same date, they are counted as a single change

You do not have to drop the duplicate observations from your data in order for this particular code to work properly. But it is seldom a good idea to use a data set that has purely duplicate observations. In fact, just having such a data set usually means that something went wrong in the data management that created it. So I urge you to carefully review how this data set was created, and fix any errors you find in the process. If you cannot find any errors (other than perhaps forgetting to drop duplicate observations at a point where they were created without error) then -duplicates drop- will get rid of the duplicates for you easily.
1 like
Comment
Laura Hill

Join Date: Aug 2021

Posts: 42
#3

15 Mar 2023, 11:09

Clyde Schechter, thank you for your help and advice! Also thanks for pointing out the issues with my date variable!

I am not yet versed in the use of frames, but will read up about it now, as it seems quite useful for regular use (until now I have been saving each dataset separately in a different file).
However, trying to run the file, I immediately get the error "command frame is unrecognized".

I will look into this, but in the meantime, I ran your code on the dataex sample, and it produced the desired outcome!
Thank you very much!

Best regards,
Laura
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30065

15 Mar 2023, 12:40

However, trying to run the file, I immediately get the error "command frame is unrecognized".

That almost certainly means you are using a version of Stata from before version 16, which is when frames were introduced. The Forum FAQ does ask people to explicitly state in their posts what version of Stata they are using if it is not the current one, which is presently 17.

Here is a revised version of the code that does not rely on frames. As a side effect, the original data is lost. Since the original data set contains those dysfunctional string date variables, what I suggest you do is, at the point indicated below by the comment, -save- the data in a new data set. Then, when you need to go back to the individual level data, but with numeric date variables, you can. At the end of the code, the data set of BvdIdNumbers and associated change variables is in memory.

Code:

//  REPLACE STRING VARIABLES THAT LOOK LIKE DATES WITH STATA NUMERIC DATE VARIABLES
foreach v of varlist DMAppointmentDate DMResignationDate {
    gen _`v' = daily(`v', "DMY"), after(`v')
    assert missing(_`v') == missing(`v')
    format _`v' %td
    drop `v'
}

gen _DateOfIncorporation = daily(DateOfIncorporation, "DMY"), after(DateOfIncorporation)
replace _DateOfIncorporation = mdy(1, 1, real(DateOfIncorporation)) ///
    if missing(_DateOfIncorporation)
assert missing(_DateOfIncorporation) == missing(DateOfIncorporation)
format _DateOfIncorporation %td
drop DateOfIncorporation
rename _* *
// I SUGGEST YOU STORE THIS DATA IN A FILE NOW


keep BvdIdNumber DMAppointmentDate DMResignationDate DateOfIncorporation
gen `c(obs_t)' obs_no = _n
reshape long DM@Date, i(obs_no) j(event) string
drop if missing(DMDate)
drop obs_no event
duplicates drop
drop if DMDate - DateOfIncorporation < 365 // DON'T COUNT CHANGES IN FIRST YEAR
by BvdIdNumber (DMDate), sort: gen _j = _n
rename DMDate change_
reshape wide change_, i(BvdIdNumber) j(_j)

Comment

Laura Hill

Join Date: Aug 2021

Posts: 42
#5

17 Mar 2023, 04:57

Hi Clyde Schechter, sorry for not mentioning it, I completely forgot. I checked and indeed my organization uses Stata 15.1. Will make sure I mention this in the future, thanks for making me aware!

Thanks for the revised code!
Comment

Announcement

Creating new variables (change_1,change_n) according to changes in variable A and/or B

Comment

Comment

Comment

Comment