Counting unique observations by group, before a certain date

Ben James

Join Date: May 2016
Posts: 19

Counting unique observations by group, before a certain date

16 May 2016, 08:53

Dear Statalist,

I want to count the unique number of projects done by a company before the first_idate ("experience"). How do I get from my inputs to my desired output?

Thank you in advance!

Input:

Code:

clear
input float(company fund idate project iamount first_idate)
1 1 2000 1 10 2000
1 2 2001 2 9 2001
1 2 2002 2 8 2001
2 1 2003 1 4 2003
2 1 2004 1 3 2003
3 1 2003 1 12 2003
3 3 2004 2 14 2004
3 4 2006 3 22 2006
4 1 2000 5 15 2000
4 2 2000 6 6 2000
end

Output:

Code:

   clear
  input float(company fund idate project iamount first_idate experience)
  1 1 2000 1 10 2000 0
  1 2 2001 2 9 2001 1
  1 2 2002 2 8 2001 0
  2 1 2003 1 4 2003 0
  2 1 2004 1 3 2003 0
  3 1 2003 1 12 2003 0
  3 3 2004 2 14 2004 1
  3 4 2006 3 22 2006 2
  4 1 2000 5 15 2000 0
  4 2 2000 6 6 2000 0
  end

Last edited by Ben James; 16 May 2016, 09:07.

Tags: None

Nick Cox

Join Date: Mar 2014
Posts: 35726

16 May 2016, 09:22

Ben:

You seem to have asked this three times, once here and twice within http://www.statalist.org/forums/foru...inifs-function

Please read http://www.statalist.org/forums/help#adviceextras Section 1 on why asking the same question repeatedly and rapidly is a bad idea and why you should try to clarify any question which is not understood by rewriting it.

I guess I don't understand this question either because the code I came up with produces answers of zero for every possibility.

My wild guesses:

1. Unique here doesn't mean what dictionaries say, occurring once only. It means distinct. See also http://www.stata-journal.com/sjpdf.h...iclenum=dm0042

2. Before the first date must mean idate < first_idate

So I tried this, but as said get all zeros.

Code:

clear
input float(company fund idate project iamount first_idate)
1 1 2000 1 10 2000
1 2 2001 2 9 2001
1 2 2002 2 8 2001
2 1 2003 1 4 2003
2 1 2004 1 3 2003
3 1 2003 1 12 2003
3 3 2004 2 14 2004
3 4 2006 3 22 2006
4 1 2000 5 15 2000
4 2 2000 6 6 2000
end

egen tag = tag(company project) 
gen tocount = idate < first_idate 
egen ndistinct = total(tag * tocount), by(company) 

list , sepby(company) 

     +----------------------------------------------------------------------------------+
     | company   fund   idate   project   iamount   first_~e   tag   tocount   ndisti~t |
     |----------------------------------------------------------------------------------|
  1. |       1      1    2000         1        10       2000     1         0          0 |
  2. |       1      2    2001         2         9       2001     1         0          0 |
  3. |       1      2    2002         2         8       2001     0         0          0 |
     |----------------------------------------------------------------------------------|
  4. |       2      1    2003         1         4       2003     1         0          0 |
  5. |       2      1    2004         1         3       2003     0         0          0 |
     |----------------------------------------------------------------------------------|
  6. |       3      1    2003         1        12       2003     1         0          0 |
  7. |       3      3    2004         2        14       2004     1         0          0 |
  8. |       3      4    2006         3        22       2006     1         0          0 |
     |----------------------------------------------------------------------------------|
  9. |       4      1    2000         5        15       2000     1         0          0 |
 10. |       4      2    2000         6         6       2000     1         0          0 |
     +----------------------------------------------------------------------------------+

Comment

Ben James

Join Date: May 2016

Posts: 19
#3

16 May 2016, 09:58

Hi Nick,
I’m just new to the stata list forums, my apologies.
Seeing from the link you gave, indeed it is distinct values I am referring to, which can be seen as “those left over after all duplicates have been removed”.

Before the first_idate means: idate of all observations in that group which are <first_idate

In accordance with what you said, let me try to rephrase/explain why the experience values should not be zero.

In line:
1 2 2001 2 9 2001
The experience value should be 1, as we have company 1 in 2000 doing project 1.

In line:
3 4 2006 3 22 2006
The experience value should equate to 2, as we have company 3 doing both project 1 in 2003 and project 2 in 2004.

In line:
1 2 2002 2 8 2001
I realized I made a mistake, the experience value should again be 1, as we have company 1 in 2000 doing project 1.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35726
#4

16 May 2016, 12:06

Thanks for the reply, but #1 and #2 seem to match my previous understanding, so I am no further forward.

You don't seem to mention fund or iamount anywhere.

If no one else can make sense of this, we need a simpler account.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#5

16 May 2016, 12:17

Before the first_idate means: idate of all observations in that group which are <first_idate

There are no observations in the sample data for which idate<first_idate.
Comment
Ben James

Join Date: May 2016

Posts: 19
#6

17 May 2016, 03:12

I hope I do not repeat myself, but since there is a lot of unclarity about the question, I have responded to both of the above posts, and included an extra example:

PART 1)

Originally posted by William Lisowski View Post

There are no observations in the sample data for which idate<first_idate.

If we take a look at observation 2 (OBS2) and OBS3 from Nick’s first post in this thread, you see that the first_idate is 2001 for both. In the company group of 1 (which is OBS1 until OBS3), however, there is one observation with an idate of 2000 (OBS1), which is below the first_idate of OBS2 and OBS3 of 2001. Of those observations (in this case only OBS1), there is one distinct project (with a value of 1).

Therefore, the experience variable would yield 1 for OBS2 and OBS3.
For OBS1: since there are no observations below 2000 in the company group of 1, experience would yield 0.

PART 2)

Originally posted by Nick Cox View Post

Thanks for the reply, but #1 and #2 seem to match my previous understanding, so I am no further forward.

You don't seem to mention fund or iamount anywhere.

If no one else can make sense of this, we need a simpler account.

iamount:
is indeed not part of the question.

fund:
is not directly part of the question, but links up to first_idate; as each fund is assigned one first_idate. Sorry if this was unclear.

first_idate:
is the very first date corresponding to an investment of that fund. By definition each fund only has one idate, as this is the very date an investment was ever done by that fund.

experience
To get the experience variable, I want to count all distinct projects for all observations where the idate for in that company group is smaller than the first_idate of that observation.
(I do not want to compare the observation of the idate in that particular line to the observation of the first_idate in that same line, as by definition, first_idate will always be lower or equal to idate, as t is the very first investment of that fund)

PART 3)
PS. In my example counting:

all observations where the idate for in that company group is smaller than the first_idate of that observation. -versus-
all distinct projects for all observations where the idate for in that company group is smaller than the first_idate of that observation.

both lead to the same outcome.
However, changing observation 6 to:

6. | 2 2 2006 1 12 2006

would lead to the idate of observation 5 and 6 to be below the first_idate of new observation 6. These are two observations. However, since the projects are both 1, and I want the distinct number of projects, new observation 6 would yield an experience value of 1.
Comment
Ben James

Join Date: May 2016

Posts: 19
#7

17 May 2016, 06:56

Could this question be solved with a loop? (Since its seems I have to loop over all the values in each company and see if the (idate of group)<(first_idate of individual observation) separately)
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35726
#8

17 May 2016, 07:47

Just to note that I looked at this again and failed once more to grasp it all.
Comment
Ben James

Join Date: May 2016

Posts: 19
#9

17 May 2016, 08:38

Let me try again completely.

Code:

input firm fund idate f_idate 1 1 1999 1999 1 1 2001 1999 1 1 2002 1999 1 2 2000 2000 1 2 2004 2000 2 1 1980 1980 2 1 1981 1980 2 1 1982 1980 2 2 1985 1985 2 2 1987 1985 end

We have investment firms (firms), which have a number of funds. This could be for instance KKR as a firm, and KKR I and KKR II as its funds. (A firm can be seen as a overarching company, and a fund as a type of miniature company within the firm.).

A particular firm gains investment experience throughout the years, by investing in different investments (each observation in my data is an investment transaction).

I would like to measure this investment experience when a fund has made its first investment (f_idate). The number of investments made before a f_idate shows me how many companies the firm has invested in, before it starts this new fund.

For instance fund 2 of firm 1 has a starting date of 2000 (f_idate). The previous fund of firm 1 has has made several investments, however not all of these are before the f_idate of the newest fund. Only the first observation, has an idate of 1999. That means that when fund2 of firm1 was started in 2000, the firm has only made 1 previous investment before, in 1999. Therefore, my new variable of experience would be equal to 0 for all the "fund 1 firm1" observations and equal to 1 for all the "fund 2 firm 1" observations.

Thank you again for your help and thoughts,

Ben
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30119
#10

17 May 2016, 12:15

I think I understand what you want. If so, this should do it:

Code:

clear* input firm fund idate f_idate 1 1 1999 1999 1 1 2001 1999 1 1 2002 1999 1 2 2000 2000 1 2 2004 2000 2 1 1980 1980 2 1 1981 1980 2 1 1982 1980 2 2 1985 1985 2 2 1987 1985 end gen experience = . levelsof firm, local(firms) foreach f of local firms { levelsof fund if firm == `f', local(funds) foreach g of local funds { summ f_idate if firm == `f' & fund == `g', meanonly local cutoff = r(mean) count if idate < `cutoff' & firm == `f' replace experience = r(N) if firm == `f' & fund == `g' } } list, noobs sepby(firm fund)

This is not a very "Stata-ish" solution to the problem as it comes close to looping over observations. My gut tells me that there should be a more elegant solution using -by:-, but I struggled in vain to find it. The difficulty arises because the result is calculated by fund, but the calculation must refer to observations of the firm that are not necessarily associated with that fund. So perhaps this is the best that can be done. I'd be eager to see if anybody comes up with something better.
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

#11

17 May 2016, 12:40

Here's a potential solution using rangestat (from SSC:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(firm fund idate f_idate)
1 1 1999 1999
1 1 2001 1999
1 1 2002 1999
1 2 2000 2000
1 2 2004 2000
2 1 1980 1980
2 1 1981 1980
2 1 1982 1980
2 2 1985 1985
2 2 1987 1985
end

* identify investment start
bysort firm fund: gen tocount = idate == f_idate

* experience from the beginning of time up to the year before the fund's first date
gen low = 0
gen high = f_idate - 1
rangestat (sum) experience = tocount, interval(idate low high) by(firm)
replace experience = 0 if mi(experience)

list, sepby(firm fund)

Comment

Nick Cox

Join Date: Mar 2014
Posts: 35726

#12

17 May 2016, 12:44

Clyde's solution yields this:

Code:

  +------------------------------------------+
  | firm   fund   idate   f_idate   experi~e |
  |------------------------------------------|
  |    1      1    1999      1999          0 |
  |    1      1    2001      1999          0 |
  |    1      1    2002      1999          0 |
  |------------------------------------------|
  |    1      2    2000      2000          1 |
  |    1      2    2004      2000          1 |
  |------------------------------------------|
  |    2      1    1980      1980          0 |
  |    2      1    1981      1980          0 |
  |    2      1    1982      1980          0 |
  |------------------------------------------|
  |    2      2    1985      1985          3 |
  |    2      2    1987      1985          3 |
  +------------------------------------------+

Is 3 right?

His code encouraged me to have another go. Here it is. My trick, which is what you can call it if it works, is to ignore observations after a fund starts, which just get in the way.

Code:

clear
input firm fund idate f_idate
1 1 1999 1999
1 1 2001 1999
1 1 2002 1999
1 2 2000 2000
1 2 2004 2000
2 1 1980 1980
2 1 1981 1980
2 1 1982 1980
2 2 1985 1985
2 2 1987 1985
end

bysort firm fund (idate) : gen touse = _n == 1
bysort touse firm (idate) : gen exp = _n - 1 if touse
bysort firm fund (idate) : replace exp = sum(exp)
drop touse
list , sepby(firm fund)

     +-------------------------------------+
     | firm   fund   idate   f_idate   exp |
     |-------------------------------------|
  1. |    1      1    1999      1999     0 |
  2. |    1      1    2001      1999     0 |
  3. |    1      1    2002      1999     0 |
     |-------------------------------------|
  4. |    1      2    2000      2000     1 |
  5. |    1      2    2004      2000     1 |
     |-------------------------------------|
  6. |    2      1    1980      1980     0 |
  7. |    2      1    1981      1980     0 |
  8. |    2      1    1982      1980     0 |
     |-------------------------------------|
  9. |    2      2    1985      1985     1 |
 10. |    2      2    1987      1985     1 |
     +-------------------------------------+

Last edited by Nick Cox; 17 May 2016, 12:46.

Comment

Ben James

Join Date: May 2016

Posts: 19
#13

17 May 2016, 13:33

Thank you so much Clyde, Robert and Nick.

Clyde that is what I am looking for!

The value 3 for observation 9 and 10 (observation numbers from post #12) is indeed correct, as the idate of observation 6 (OBS6), OBS7 and OBS8 are all below the f_idate 1985.

Some remarks:
Expanding the code to my sample, I get a “type mismatch error” (r109). I am guessing this is because my funds and firms variables are non-numeric (strings). How would this be solved?

This is very helpful already. I hope it does not make things even more complex, but if possible I want to construct two additional variables which are almost the same as ‘experience’
exp_usd
same as experience, except instead of counting investments, it adds up all the investment amounts made (iamount)

(This is thus an alternative way to measure experience, based on dollars invested)

exp_distinct
same as experience, except it only counts the distinct number of companies in the company variable (company)

(A fund can invest multiple times in the same company. This variable thus only counts the number of distinct companies invested in)

Code:

clear* input str2(firm fund) idate f_idate iamount str2(company) f1 v1 1999 1999 3 c1 f1 v1 2001 1999 5 c2 f1 v1 2002 1999 7 c3 f1 v2 2000 2000 12 c4 f1 v2 2004 2000 2 c5 f2 v1 1980 1980 5 c6 f2 v1 1981 1980 8 c6 f2 v1 1982 1980 9 c6 f2 v2 1985 1985 12 c7 f2 v2 1987 1985 8 c8 end
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30119

#14

17 May 2016, 14:28

I think this does it:

Code:

clear*
input str2(firm fund) idate f_idate iamount str2(company)
f1 v1 1999 1999 3 c1
f1 v1 2001 1999 5 c2
f1 v1 2002 1999 7 c3
f1 v2 2000 2000 12 c4
f1 v2 2004 2000 2 c5
f2 v1 1980 1980 5 c6
f2 v1 1981 1980 8 c6
f2 v1 1982 1980 9 c6
f2 v2 1985 1985 12 c7
f2 v2 1987 1985 8 c8
end

gen experience = .
gen exp_used = .
gen exp_distinct = .
levelsof firm, local(firms)
foreach f of local firms {
    levelsof fund if firm == `"`f'"', local(funds)
    foreach g of local funds {
        summ f_idate if firm == `"`f'"' & fund == `"`g'"', meanonly
        local cutoff = r(mean)
        count if idate < `cutoff' & firm == `"`f'"'
        replace experience = r(N) if firm == `"`f'"' & fund == `"`g'"'
        summ iamount if idate < `cutoff' & firm == `"`f'"', meanonly
        replace exp_used = r(sum) if firm == `"`f'"' & fund == `"`g'"'
        distinct company if idate < `cutoff' & firm == `"`f'"'
        replace exp_distinct = `r(ndistinct)' if firm == `"`f'"' & fund == `"`g'"'        
    }
}

list, noobs sepby(firm fund)

Note: The calculation of exp_distinct uses Nick Cox's distinct.ado. Unless you already have it installed, you will need to get it by running -ssc install distinct-.

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35726
#15

17 May 2016, 15:45

distinct is by Gary Longton and myself. The latest version should be downloaded from the Stata Journal files.
Comment

Announcement

Counting unique observations by group, before a certain date

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment