Extracting the first year of data available for one variable

Fimi Karimi

Join Date: Jun 2023

Posts: 71
#1

Extracting the first year of data available for one variable

16 Jun 2023, 20:50

Hello everyone,
I need to calculate the founding year of each company based on the first year of data available for Total Asset. For example for Company with ID 001000 founding year is 1961 and for company with ID 001111 the founding year is 1986.

ID Year Total Asset
001000 1961 44.025
001000 1962 38.586
001000 1963 23.905
001000 1964 1.416
001000 1965 2.31
001000 1966 2.43
001000 1967 2.456
001000 1968 5.922
001111 1985 .
001111 1986 33.45
001111 1987 29.33
001111 1988 19.907
001111 1989 21.771
001111 1990 25.638

Could you please tell me which command I can use to extract founding year for each company in my sample?
Thanks in advance.
Tags: None

Hemanshu Kumar

Join Date: Mar 2015
Posts: 1396

16 Jun 2023, 22:17

Welcome to Statalist, Fimi! Please go through the FAQ for advice on how to most effectively post questions to Statalist. Read especially #12, which covers amongst other things, how you should use the dataex command to provide data examples. That would be much better than copy-pasting the data the way you did.

Given data as follows:

Code:

clear
input str6 ID int Year float Total_Asset
"001000" 1961 44.025
"001000" 1962 38.586
"001000" 1963 23.905
"001000" 1964 1.416
"001000" 1965 2.31
"001000" 1966 2.43
"001000" 1967 2.456
"001000" 1968 5.922
"001111" 1985 .
"001111" 1986 33.45
"001111" 1987 29.33
"001111" 1988 19.907
"001111" 1989 21.771
"001111" 1990 25.638
end

you just need to do

Code:

egen wanted = min(cond(!missing(Total_Asset), Year, .)), by(ID)

which gives you

Code:

. list, noobs sep(0) abbrev(20)

  +--------------------------------------+
  |     ID   Year   Total_Asset   wanted |
  |--------------------------------------|
  | 001000   1961        44.025     1961 |
  | 001000   1962        38.586     1961 |
  | 001000   1963        23.905     1961 |
  | 001000   1964         1.416     1961 |
  | 001000   1965          2.31     1961 |
  | 001000   1966          2.43     1961 |
  | 001000   1967         2.456     1961 |
  | 001000   1968         5.922     1961 |
  | 001111   1985             .     1986 |
  | 001111   1986         33.45     1986 |
  | 001111   1987         29.33     1986 |
  | 001111   1988        19.907     1986 |
  | 001111   1989        21.771     1986 |
  | 001111   1990        25.638     1986 |
  +--------------------------------------+

Last edited by Hemanshu Kumar; 16 Jun 2023, 22:23.

Comment

Fimi Karimi

Join Date: Jun 2023

Posts: 71
#3

16 Jun 2023, 23:30

Thank you so much for yourcomments, Hemanshu. I really appreciate your help.
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35694

17 Jun 2023, 00:19

For more on Hemanshu Kumar's method. see Section 9 in https://journals.sagepub.com/doi/pdf...867X1101100210

Note also how a search for FAQs yields pointers:

Code:

. search first , faq

Search of official help files, FAQs, Examples, and Stata Journals

FAQ     . . . . . . . . . . . . . . . . . . . . . . . Generating the last date
        . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  N. J. Cox
        7/11    How can I generate a variable containing the last of
                several dates?
                http://www.stata.com/support/faqs/data-management/
                generating-last-date/

FAQ     . . . . . . . . . . . . . . . First and last occurrences in panel data
        . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  N. J. Cox
        3/07    How can I identify first and last occurrences
                systematically in panel data?
                http://www.stata.com/support/faqs/data-management/
                first-and-last-occurrences/

Comment

Nick Cox

Join Date: Mar 2014
Posts: 35694

17 Jun 2023, 03:34

The method in #2 is a great method -- which I have publicised too -- but it might lead a reader to wonder how one could work out such a solution for yourself.

Here is another approach, which may or may not appeal. The first idea is that we need to segregate missing values, which makes it easier to pick the smallest (first) value of year for each panel. With an indicator for missing total assets, the observations with missing values can be sorted to the end of each block of observations for each panel.

The code catches the extreme case of a panel holding missing values only, presumably not much use for research, but nothing in the question implies that you might not have such a case.

Code:

clear
input str6 ID int Year float Total_Asset
"001000" 1961 44.025
"001000" 1962 38.586
"001000" 1963 23.905
"001000" 1964 1.416
"001000" 1965 2.31
"001000" 1966 2.43
"001000" 1967 2.456
"001000" 1968 5.922
"001111" 1985 .
"001111" 1986 33.45
"001111" 1987 29.33
"001111" 1988 19.907
"001111" 1989 21.771
"001111" 1990 25.638
end

* get an indicator (0 for not missing, 1 for missing) 
gen ismissing = missing(Total_Asset)

* sort to segregate missings 
bysort ID (ismissing Year) : gen first = cond(Total_Asset[1] < ., Year[1], .) 

list, sepby(ID)

* sort back to more usual ordr 
sort ID Year 

list, sepby(ID)

Here are the listings:

Code:

. list, sepby(ID)

     +---------------------------------------------+
     |     ID   Year   Total_~t   ismiss~g   first |
     |---------------------------------------------|
  1. | 001000   1961     44.025          0    1961 |
  2. | 001000   1962     38.586          0    1961 |
  3. | 001000   1963     23.905          0    1961 |
  4. | 001000   1964      1.416          0    1961 |
  5. | 001000   1965       2.31          0    1961 |
  6. | 001000   1966       2.43          0    1961 |
  7. | 001000   1967      2.456          0    1961 |
  8. | 001000   1968      5.922          0    1961 |
     |---------------------------------------------|
  9. | 001111   1986      33.45          0    1986 |
 10. | 001111   1987      29.33          0    1986 |
 11. | 001111   1988     19.907          0    1986 |
 12. | 001111   1989     21.771          0    1986 |
 13. | 001111   1990     25.638          0    1986 |
 14. | 001111   1985          .          1    1986 |
     +---------------------------------------------+
 
. list, sepby(ID)

     +---------------------------------------------+
     |     ID   Year   Total_~t   ismiss~g   first |
     |---------------------------------------------|
  1. | 001000   1961     44.025          0    1961 |
  2. | 001000   1962     38.586          0    1961 |
  3. | 001000   1963     23.905          0    1961 |
  4. | 001000   1964      1.416          0    1961 |
  5. | 001000   1965       2.31          0    1961 |
  6. | 001000   1966       2.43          0    1961 |
  7. | 001000   1967      2.456          0    1961 |
  8. | 001000   1968      5.922          0    1961 |
     |---------------------------------------------|
  9. | 001111   1985          .          1    1986 |
 10. | 001111   1986      33.45          0    1986 |
 11. | 001111   1987      29.33          0    1986 |
 12. | 001111   1988     19.907          0    1986 |
 13. | 001111   1989     21.771          0    1986 |
 14. | 001111   1990     25.638          0    1986 |
     +---------------------------------------------+

That is still a little tricksy. Consider this line

Code:

bysort ID (ismissing Year) : gen first = cond(Total_Asset[1] < ., Year[1], .)

We could have gone

Code:

bysort ID (ismissing Year) : gen first = Year[1] if Total_Asset[1] < .

but then we would need to spread the value in the first observation to all the others. That is a standard copying down as explained at https://www.stata.com/support/faqs/d...issing-values/

Code:

bysort ID (ismissing Year) : replace first = first[_n-1] if missing(first)

Here just as the so-called subscript

Code:

[_n-1]

indicates under by: the previous observation in the same panel, so also

Code:

[1]

indicates under by: the first observation in the same panel.

Comment

Fimi Karimi

Join Date: Jun 2023

Posts: 71
#6

18 Jun 2023, 14:43

Nick, Thank you for your complete explanations. It was very helpful
Comment

Announcement