creating a dummy based on the first(oldest) entry for each company ID

Yusra Noorwali

Join Date: Mar 2022

Posts: 23
#1

creating a dummy based on the first(oldest) entry for each company ID

03 Mar 2023, 07:31

I have this data:

----------------------- copy starting from the next line -----------------------
[CODE]
* Example generated by -dataex-. To install: ssc install dataex
clear
input long COM_NAME int ROUND_DATE long STAGE
1 22592 2
1 22592 2
2 22420 3
2 22420 3
2 22818 3
2 22818 3
2 22818 3
2 22818 3
2 22818 3
3 19205 6
4 21192 2
4 21192 2
4 22195 3
4 22195 3
4 22195 3
4 22195 3
4 22712 3
4 22712 3
4 22712 3
4 22712 3
5 19926 4
6 22834 2
6 22834 2
6 22834 2
6 22834 2
6 22834 2
6 22834 2
7 15809 1
8 18995 2
8 20515 3
8 20515 3
8 21444 3
8 22362 4
8 22362 4
8 22362 4
end
format %tdnn/dd/CCYY ROUND_DATE
label values COM_NAME COM_N
label def COM_N 1 "&Charge GmbH", modify
label def COM_N 2 "&Open Gifts Ltd", modify
label def COM_N 3 "+Plugg Srl", modify
label def COM_N 4 "+Simple.Fr SAS", modify
label def COM_N 5 "004 GmbH", modify
label def COM_N 6 "011h Sustainable Construction SL", modify
label def COM_N 7 "02 Netherlands BV", modify
label def COM_N 8 "07446749 Ltd", modify
label def COM_N 10 "1-2-3TV GmbH", modify
label def COM_N 11 "10-Vins SAS", modify
label def COM_N 12 "100 Percent IT Ltd", modify
label def COM_N 15 "1000mercis SA", modify
label def COM_N 16 "1001 Listes", modify
label def COM_N 19 "100world AG", modify
label def COM_N 20 "10Start GmbH", modify
label def COM_N 21 "10Tacle Studios AG", modify

I want to create a dummy variable ‘Early’ that equals 1 only if the ‘STAGE’ variable in the first (oldest) entry (raw) only is categorized as ‘Startup/Seed’ or ‘Early stage’ and zero otherwise (means if STAGE = 2 or 6). In other words, this is how the file should look like :

----------------------- copy starting from the next line -----------------------
[CODE]
* Example generated by -dataex-. To install: ssc install dataex
clear
input long COM_NAME int ROUND_DATE long STAGE float Early
1 22592 2 1
1 22592 2 1
2 22420 3 0
2 22420 3 0
2 22818 3 0
2 22818 3 0
2 22818 3 0
2 22818 3 0
2 22818 3 0
3 19205 6 1
4 21192 2 1
4 21192 2 1
4 22195 3 1
4 22195 3 1
4 22195 3 1
4 22195 3 1
4 22712 3 1
4 22712 3 1
4 22712 3 1
4 22712 3 1
5 19926 4 0
6 22834 2 1
6 22834 2 1
6 22834 2 1
6 22834 2 1
6 22834 2 1
6 22834 2 1
7 15809 1 0
8 18995 2 1
8 20515 3 1
8 20515 3 1
8 21444 3 1
8 22362 4 1
8 22362 4 1
8 22362 4 1
end
format %tdnn/dd/CCYY ROUND_DATE
label values COM_NAME COM_N
label def COM_N 1 "&Charge GmbH", modify
label def COM_N 2 "&Open Gifts Ltd", modify
label def COM_N 3 "+Plugg Srl", modify
label def COM_N 4 "+Simple.Fr SAS", modify
label def COM_N 5 "004 GmbH", modify
label def COM_N 6 "011h Sustainable Construction SL", modify
label def COM_N 7 "02 Netherlands BV", modify
label def COM_N 8 "07446749 Ltd", modify
label def COM_N 10 "1-2-3TV GmbH", modify
label def COM_N 11 "10-Vins SAS", modify
label def COM_N 12 "100 Percent IT Ltd", modify
label def COM_N 15 "1000mercis SA", modify
label def COM_N 16 "1001 Listes", modify
label def COM_N 19 "100world AG", modify
label def COM_N 20 "10Start GmbH", modify
label def COM_N 21 "10Tacle Studios AG", modify

I tried to create other data file that has these variables only: COM_NAME and ROUND_DATE and STAGE. Then I used these commands to keep the first observation only:

Code:

sort COM_NAME ROUND_DATE by COM_NAME: keep if _n==1

then merge this data file with the original file using "m:1" command

but then the code keeps only 18354 observations in COM_NAME while I know from my data that there should be 25890 observations (unique company names).

is there any other easier accurate way to do what I need?

Many thanks[
Tags: None

Ken Chui

Join Date: Aug 2014
Posts: 1063

03 Mar 2023, 07:54

Basing on your code, should the target be like this?

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input long COM_NAME int ROUND_DATE long STAGE float Early
1 22592 2 1
1 22592 2 0
2 22420 3 1
2 22420 3 0
2 22818 3 1
2 22818 3 0
2 22818 3 0
2 22818 3 0
2 22818 3 0
3 19205 6 1
4 21192 2 1
4 21192 2 0
4 22195 3 1
4 22195 3 0
4 22195 3 0
4 22195 3 0
4 22712 3 1
4 22712 3 0
4 22712 3 0
4 22712 3 0
5 19926 4 1
6 22834 2 1
6 22834 2 0
6 22834 2 0
6 22834 2 0
6 22834 2 0
6 22834 2 0
7 15809 1 1
8 18995 2 1
8 20515 3 1
8 20515 3 0
8 21444 3 1
8 22362 4 1
8 22362 4 0
8 22362 4 0
end
format %tdnn/dd/CCYY ROUND_DATE
label values COM_NAME COM_N
label def COM_N 1 "&Charge GmbH", modify
label def COM_N 2 "&Open Gifts Ltd", modify
label def COM_N 3 "+Plugg Srl", modify
label def COM_N 4 "+Simple.Fr SAS", modify
label def COM_N 5 "004 GmbH", modify
label def COM_N 6 "011h Sustainable Construction SL", modify
label def COM_N 7 "02 Netherlands BV", modify
label def COM_N 8 "07446749 Ltd", modify

It's unclear how that "Early" came about in thread #1.

Also, I'm confused by this:

I want to create a dummy variable ‘Early’ that equals 1 only if the ‘STAGE’ variable in the first (oldest) entry (raw) only is categorized as ‘Startup/Seed’ or ‘Early stage’ and zero otherwise (means if STAGE = 2 or 6).

This is not what the "Early" shows. For example, the first company is in stage 2, but its "Early" is both "1". Yet, if the quoted rule above is correct, shouldn't that be "0"?

I'd recommend revising the question to make it clear. Especially the rules.

Comment

Yusra Noorwali

Join Date: Mar 2022

Posts: 23
#3

03 Mar 2023, 09:03

Hi Ken,

no the example you showed is not what I want

basically, I want my dummy variable (Early) to be equal to 1 in all observations with the same COM_NAME if the STAGE variable == 2 or 6 in the first COM_NAME observation (row).

COM_NAME represents company ID, each company could have many observations because each observation(row) represents an investment in the company by a particular investor. the variable (Early) should capture if the company was at an early STAGE when it first received investment

This is not what the "Early" shows. For example, the first company is in stage 2, but its "Early" is both "1". Yet if the quoted rule above is correct, shouldn't that be "0"?

No, because the variable should capture if the STAGE ie Early on the first time the company received the investment, that's why the variable should = 1 in all observations if the first observation for the company has the STAGE variable = 2 or 6

is that more clear?

Last edited by Yusra Noorwali; 03 Mar 2023, 09:07.
Comment

Ken Chui

Join Date: Aug 2014
Posts: 1063

03 Mar 2023, 14:24

See if this does the job:

Code:

bysort COM_NAME ROUND_DATE: gen Include = inlist(STAGE, 2, 6) & _n == 1
bysort COM_NAME: egen Early = max(Include)
drop Include

Comment

Yusra Noorwali

Join Date: Mar 2022

Posts: 23
#5

06 Mar 2023, 05:40

Hello Ken,

your code works!
I cannot thank you enough!

Best wishes
Comment

Announcement