Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • creating a dummy based on the first(oldest) entry for each company ID

    I have this data:


    ----------------------- copy starting from the next line -----------------------
    [CODE]
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long COM_NAME int ROUND_DATE long STAGE
    1 22592 2
    1 22592 2
    2 22420 3
    2 22420 3
    2 22818 3
    2 22818 3
    2 22818 3
    2 22818 3
    2 22818 3
    3 19205 6
    4 21192 2
    4 21192 2
    4 22195 3
    4 22195 3
    4 22195 3
    4 22195 3
    4 22712 3
    4 22712 3
    4 22712 3
    4 22712 3
    5 19926 4
    6 22834 2
    6 22834 2
    6 22834 2
    6 22834 2
    6 22834 2
    6 22834 2
    7 15809 1
    8 18995 2
    8 20515 3
    8 20515 3
    8 21444 3
    8 22362 4
    8 22362 4
    8 22362 4
    end
    format %tdnn/dd/CCYY ROUND_DATE
    label values COM_NAME COM_N
    label def COM_N 1 "&Charge GmbH", modify
    label def COM_N 2 "&Open Gifts Ltd", modify
    label def COM_N 3 "+Plugg Srl", modify
    label def COM_N 4 "+Simple.Fr SAS", modify
    label def COM_N 5 "004 GmbH", modify
    label def COM_N 6 "011h Sustainable Construction SL", modify
    label def COM_N 7 "02 Netherlands BV", modify
    label def COM_N 8 "07446749 Ltd", modify
    label def COM_N 10 "1-2-3TV GmbH", modify
    label def COM_N 11 "10-Vins SAS", modify
    label def COM_N 12 "100 Percent IT Ltd", modify
    label def COM_N 15 "1000mercis SA", modify
    label def COM_N 16 "1001 Listes", modify
    label def COM_N 19 "100world AG", modify
    label def COM_N 20 "10Start GmbH", modify
    label def COM_N 21 "10Tacle Studios AG", modify


    I want to create a dummy variable ‘Early’ that equals 1 only if the ‘STAGE’ variable in the first (oldest) entry (raw) only is categorized as ‘Startup/Seed’ or ‘Early stage’ and zero otherwise (means if STAGE = 2 or 6). In other words, this is how the file should look like :




    ----------------------- copy starting from the next line -----------------------
    [CODE]
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long COM_NAME int ROUND_DATE long STAGE float Early
    1 22592 2 1
    1 22592 2 1
    2 22420 3 0
    2 22420 3 0
    2 22818 3 0
    2 22818 3 0
    2 22818 3 0
    2 22818 3 0
    2 22818 3 0
    3 19205 6 1
    4 21192 2 1
    4 21192 2 1
    4 22195 3 1
    4 22195 3 1
    4 22195 3 1
    4 22195 3 1
    4 22712 3 1
    4 22712 3 1
    4 22712 3 1
    4 22712 3 1
    5 19926 4 0
    6 22834 2 1
    6 22834 2 1
    6 22834 2 1
    6 22834 2 1
    6 22834 2 1
    6 22834 2 1
    7 15809 1 0
    8 18995 2 1
    8 20515 3 1
    8 20515 3 1
    8 21444 3 1
    8 22362 4 1
    8 22362 4 1
    8 22362 4 1
    end
    format %tdnn/dd/CCYY ROUND_DATE
    label values COM_NAME COM_N
    label def COM_N 1 "&Charge GmbH", modify
    label def COM_N 2 "&Open Gifts Ltd", modify
    label def COM_N 3 "+Plugg Srl", modify
    label def COM_N 4 "+Simple.Fr SAS", modify
    label def COM_N 5 "004 GmbH", modify
    label def COM_N 6 "011h Sustainable Construction SL", modify
    label def COM_N 7 "02 Netherlands BV", modify
    label def COM_N 8 "07446749 Ltd", modify
    label def COM_N 10 "1-2-3TV GmbH", modify
    label def COM_N 11 "10-Vins SAS", modify
    label def COM_N 12 "100 Percent IT Ltd", modify
    label def COM_N 15 "1000mercis SA", modify
    label def COM_N 16 "1001 Listes", modify
    label def COM_N 19 "100world AG", modify
    label def COM_N 20 "10Start GmbH", modify
    label def COM_N 21 "10Tacle Studios AG", modify


    I tried to create other data file that has these variables only: COM_NAME and ROUND_DATE and STAGE. Then I used these commands to keep the first observation only:

    Code:
    sort COM_NAME ROUND_DATE
    by COM_NAME: keep if _n==1

    then merge this data file with the original file using "m:1" command

    but then the code keeps only 18354 observations in COM_NAME while I know from my data that there should be 25890 observations (unique company names).

    is there any other easier accurate way to do what I need?

    Many thanks[

  • #2
    Basing on your code, should the target be like this?

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long COM_NAME int ROUND_DATE long STAGE float Early
    1 22592 2 1
    1 22592 2 0
    2 22420 3 1
    2 22420 3 0
    2 22818 3 1
    2 22818 3 0
    2 22818 3 0
    2 22818 3 0
    2 22818 3 0
    3 19205 6 1
    4 21192 2 1
    4 21192 2 0
    4 22195 3 1
    4 22195 3 0
    4 22195 3 0
    4 22195 3 0
    4 22712 3 1
    4 22712 3 0
    4 22712 3 0
    4 22712 3 0
    5 19926 4 1
    6 22834 2 1
    6 22834 2 0
    6 22834 2 0
    6 22834 2 0
    6 22834 2 0
    6 22834 2 0
    7 15809 1 1
    8 18995 2 1
    8 20515 3 1
    8 20515 3 0
    8 21444 3 1
    8 22362 4 1
    8 22362 4 0
    8 22362 4 0
    end
    format %tdnn/dd/CCYY ROUND_DATE
    label values COM_NAME COM_N
    label def COM_N 1 "&Charge GmbH", modify
    label def COM_N 2 "&Open Gifts Ltd", modify
    label def COM_N 3 "+Plugg Srl", modify
    label def COM_N 4 "+Simple.Fr SAS", modify
    label def COM_N 5 "004 GmbH", modify
    label def COM_N 6 "011h Sustainable Construction SL", modify
    label def COM_N 7 "02 Netherlands BV", modify
    label def COM_N 8 "07446749 Ltd", modify
    It's unclear how that "Early" came about in thread #1.

    Also, I'm confused by this:

    I want to create a dummy variable ‘Early’ that equals 1 only if the ‘STAGE’ variable in the first (oldest) entry (raw) only is categorized as ‘Startup/Seed’ or ‘Early stage’ and zero otherwise (means if STAGE = 2 or 6).
    This is not what the "Early" shows. For example, the first company is in stage 2, but its "Early" is both "1". Yet, if the quoted rule above is correct, shouldn't that be "0"?

    I'd recommend revising the question to make it clear. Especially the rules.

    Comment


    • #3
      Hi Ken,

      no the example you showed is not what I want

      basically, I want my dummy variable (Early) to be equal to 1 in all observations with the same COM_NAME if the STAGE variable == 2 or 6 in the first COM_NAME observation (row).

      COM_NAME represents company ID, each company could have many observations because each observation(row) represents an investment in the company by a particular investor. the variable (Early) should capture if the company was at an early STAGE when it first received investment

      This is not what the "Early" shows. For example, the first company is in stage 2, but its "Early" is both "1". Yet if the quoted rule above is correct, shouldn't that be "0"?
      No, because the variable should capture if the STAGE ie Early on the first time the company received the investment, that's why the variable should = 1 in all observations if the first observation for the company has the STAGE variable = 2 or 6

      is that more clear?
      Last edited by Yusra Noorwali; 03 Mar 2023, 09:07.

      Comment


      • #4
        See if this does the job:

        Code:
        bysort COM_NAME ROUND_DATE: gen Include = inlist(STAGE, 2, 6) & _n == 1
        bysort COM_NAME: egen Early = max(Include)
        drop Include

        Comment


        • #5
          Hello Ken,

          your code works!
          I cannot thank you enough!

          Best wishes

          Comment

          Working...
          X