Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Counting unique observations by group, before a certain date

    Dear Statalist,

    I want to count the unique number of projects done by a company before the first_idate ("experience"). How do I get from my inputs to my desired output?

    Thank you in advance!


    Input:
    Code:
    clear
    input float(company fund idate project iamount first_idate)
    1 1 2000 1 10 2000
    1 2 2001 2 9 2001
    1 2 2002 2 8 2001
    2 1 2003 1 4 2003
    2 1 2004 1 3 2003
    3 1 2003 1 12 2003
    3 3 2004 2 14 2004
    3 4 2006 3 22 2006
    4 1 2000 5 15 2000
    4 2 2000 6 6 2000
    end
    Output:
    Code:
       clear
      input float(company fund idate project iamount first_idate experience)
      1 1 2000 1 10 2000 0
      1 2 2001 2 9 2001 1
      1 2 2002 2 8 2001 0
      2 1 2003 1 4 2003 0
      2 1 2004 1 3 2003 0
      3 1 2003 1 12 2003 0
      3 3 2004 2 14 2004 1
      3 4 2006 3 22 2006 2
      4 1 2000 5 15 2000 0
      4 2 2000 6 6 2000 0
      end
    Last edited by Ben James; 16 May 2016, 09:07.

  • #2
    Ben:

    You seem to have asked this three times, once here and twice within http://www.statalist.org/forums/foru...inifs-function

    Please read http://www.statalist.org/forums/help#adviceextras Section 1 on why asking the same question repeatedly and rapidly is a bad idea and why you should try to clarify any question which is not understood by rewriting it.

    I guess I don't understand this question either because the code I came up with produces answers of zero for every possibility.

    My wild guesses:

    1. Unique here doesn't mean what dictionaries say, occurring once only. It means distinct. See also http://www.stata-journal.com/sjpdf.h...iclenum=dm0042

    2. Before the first date must mean idate < first_idate

    So I tried this, but as said get all zeros.

    Code:
    clear
    input float(company fund idate project iamount first_idate)
    1 1 2000 1 10 2000
    1 2 2001 2 9 2001
    1 2 2002 2 8 2001
    2 1 2003 1 4 2003
    2 1 2004 1 3 2003
    3 1 2003 1 12 2003
    3 3 2004 2 14 2004
    3 4 2006 3 22 2006
    4 1 2000 5 15 2000
    4 2 2000 6 6 2000
    end
    
    egen tag = tag(company project) 
    gen tocount = idate < first_idate 
    egen ndistinct = total(tag * tocount), by(company) 
    
    list , sepby(company) 
    
         +----------------------------------------------------------------------------------+
         | company   fund   idate   project   iamount   first_~e   tag   tocount   ndisti~t |
         |----------------------------------------------------------------------------------|
      1. |       1      1    2000         1        10       2000     1         0          0 |
      2. |       1      2    2001         2         9       2001     1         0          0 |
      3. |       1      2    2002         2         8       2001     0         0          0 |
         |----------------------------------------------------------------------------------|
      4. |       2      1    2003         1         4       2003     1         0          0 |
      5. |       2      1    2004         1         3       2003     0         0          0 |
         |----------------------------------------------------------------------------------|
      6. |       3      1    2003         1        12       2003     1         0          0 |
      7. |       3      3    2004         2        14       2004     1         0          0 |
      8. |       3      4    2006         3        22       2006     1         0          0 |
         |----------------------------------------------------------------------------------|
      9. |       4      1    2000         5        15       2000     1         0          0 |
     10. |       4      2    2000         6         6       2000     1         0          0 |
         +----------------------------------------------------------------------------------+

    Comment


    • #3
      Hi Nick,
      I’m just new to the stata list forums, my apologies.
      1. Seeing from the link you gave, indeed it is distinct values I am referring to, which can be seen as “those left over after all duplicates have been removed”.
      2. Before the first_idate means: idate of all observations in that group which are <first_idate
      In accordance with what you said, let me try to rephrase/explain why the experience values should not be zero.

      In line:
      1 2 2001 2 9 2001
      The experience value should be 1, as we have company 1 in 2000 doing project 1.

      In line:
      3 4 2006 3 22 2006
      The experience value should equate to 2, as we have company 3 doing both project 1 in 2003 and project 2 in 2004.

      In line:
      1 2 2002 2 8 2001
      I realized I made a mistake, the experience value should again be 1, as we have company 1 in 2000 doing project 1.

      Comment


      • #4
        Thanks for the reply, but #1 and #2 seem to match my previous understanding, so I am no further forward.

        You don't seem to mention fund or iamount anywhere.

        If no one else can make sense of this, we need a simpler account.

        Comment


        • #5
          Before the first_idate means: idate of all observations in that group which are <first_idate
          There are no observations in the sample data for which idate<first_idate.

          Comment


          • #6
            I hope I do not repeat myself, but since there is a lot of unclarity about the question, I have responded to both of the above posts, and included an extra example:


            PART 1)
            Originally posted by William Lisowski View Post

            There are no observations in the sample data for which idate<first_idate.
            If we take a look at observation 2 (OBS2) and OBS3 from Nick’s first post in this thread, you see that the first_idate is 2001 for both. In the company group of 1 (which is OBS1 until OBS3), however, there is one observation with an idate of 2000 (OBS1), which is below the first_idate of OBS2 and OBS3 of 2001. Of those observations (in this case only OBS1), there is one distinct project (with a value of 1).

            Therefore, the experience variable would yield 1 for OBS2 and OBS3.
            For OBS1: since there are no observations below 2000 in the company group of 1, experience would yield 0.


            PART 2)
            Originally posted by Nick Cox View Post
            Thanks for the reply, but #1 and #2 seem to match my previous understanding, so I am no further forward.

            You don't seem to mention fund or iamount anywhere.

            If no one else can make sense of this, we need a simpler account.

            iamount:
            is indeed not part of the question.

            fund:
            is not directly part of the question, but links up to first_idate; as each fund is assigned one first_idate. Sorry if this was unclear.

            first_idate:
            is the very first date corresponding to an investment of that fund. By definition each fund only has one idate, as this is the very date an investment was ever done by that fund.

            experience
            To get the experience variable, I want to count all distinct projects for all observations where the idate for in that company group is smaller than the first_idate of that observation.
            (I do not want to compare the observation of the idate in that particular line to the observation of the first_idate in that same line, as by definition, first_idate will always be lower or equal to idate, as t is the very first investment of that fund)


            PART 3)
            PS. In my example counting:

            all observations where the idate for in that company group is smaller than the first_idate of that observation. -versus-
            all distinct projects for all observations where the idate for in that company group is smaller than the first_idate of that observation.


            both lead to the same outcome.
            However, changing observation 6 to:

            6. | 2 2 2006 1 12 2006

            would lead to the idate of observation 5 and 6 to be below the first_idate of new observation 6. These are two observations. However, since the projects are both 1, and I want the distinct number of projects, new observation 6 would yield an experience value of 1.

            Comment


            • #7
              Could this question be solved with a loop? (Since its seems I have to loop over all the values in each company and see if the (idate of group)<(first_idate of individual observation) separately)

              Comment


              • #8
                Just to note that I looked at this again and failed once more to grasp it all.

                Comment


                • #9
                  Let me try again completely.

                  Code:
                  input firm fund idate f_idate
                  1 1 1999 1999
                  1 1 2001 1999
                  1 1 2002 1999
                  1 2 2000 2000
                  1 2 2004 2000
                  2 1 1980 1980
                  2 1 1981 1980
                  2 1 1982 1980
                  2 2 1985 1985
                  2 2 1987 1985
                  end
                  We have investment firms (firms), which have a number of funds. This could be for instance KKR as a firm, and KKR I and KKR II as its funds. (A firm can be seen as a overarching company, and a fund as a type of miniature company within the firm.).

                  A particular firm gains investment experience throughout the years, by investing in different investments (each observation in my data is an investment transaction).

                  I would like to measure this investment experience when a fund has made its first investment (f_idate). The number of investments made before a f_idate shows me how many companies the firm has invested in, before it starts this new fund.

                  For instance fund 2 of firm 1 has a starting date of 2000 (f_idate). The previous fund of firm 1 has has made several investments, however not all of these are before the f_idate of the newest fund. Only the first observation, has an idate of 1999. That means that when fund2 of firm1 was started in 2000, the firm has only made 1 previous investment before, in 1999. Therefore, my new variable of experience would be equal to 0 for all the "fund 1 firm1" observations and equal to 1 for all the "fund 2 firm 1" observations.

                  Thank you again for your help and thoughts,

                  Ben

                  Comment


                  • #10
                    I think I understand what you want. If so, this should do it:

                    Code:
                    clear*
                    input firm fund idate f_idate
                    1 1 1999 1999
                    1 1 2001 1999
                    1 1 2002 1999
                    1 2 2000 2000
                    1 2 2004 2000
                    2 1 1980 1980
                    2 1 1981 1980
                    2 1 1982 1980
                    2 2 1985 1985
                    2 2 1987 1985
                    end
                    
                    gen experience = .
                    levelsof firm, local(firms)
                    foreach f of local firms {
                        levelsof fund if firm == `f', local(funds)
                        foreach g of local funds {
                            summ f_idate if firm == `f' & fund == `g', meanonly
                            local cutoff = r(mean)
                            count if idate < `cutoff' & firm == `f'
                            replace experience = r(N) if firm == `f' & fund == `g'
                        }
                    }
                    
                    list, noobs sepby(firm fund)
                    This is not a very "Stata-ish" solution to the problem as it comes close to looping over observations. My gut tells me that there should be a more elegant solution using -by:-, but I struggled in vain to find it. The difficulty arises because the result is calculated by fund, but the calculation must refer to observations of the firm that are not necessarily associated with that fund. So perhaps this is the best that can be done. I'd be eager to see if anybody comes up with something better.

                    Comment


                    • #11
                      Here's a potential solution using rangestat (from SSC:

                      Code:
                      * Example generated by -dataex-. To install: ssc install dataex
                      clear
                      input float(firm fund idate f_idate)
                      1 1 1999 1999
                      1 1 2001 1999
                      1 1 2002 1999
                      1 2 2000 2000
                      1 2 2004 2000
                      2 1 1980 1980
                      2 1 1981 1980
                      2 1 1982 1980
                      2 2 1985 1985
                      2 2 1987 1985
                      end
                      
                      * identify investment start
                      bysort firm fund: gen tocount = idate == f_idate
                      
                      * experience from the beginning of time up to the year before the fund's first date
                      gen low = 0
                      gen high = f_idate - 1
                      rangestat (sum) experience = tocount, interval(idate low high) by(firm)
                      replace experience = 0 if mi(experience)
                      
                      list, sepby(firm fund)

                      Comment


                      • #12
                        Clyde's solution yields this:

                        Code:
                          +------------------------------------------+
                          | firm   fund   idate   f_idate   experi~e |
                          |------------------------------------------|
                          |    1      1    1999      1999          0 |
                          |    1      1    2001      1999          0 |
                          |    1      1    2002      1999          0 |
                          |------------------------------------------|
                          |    1      2    2000      2000          1 |
                          |    1      2    2004      2000          1 |
                          |------------------------------------------|
                          |    2      1    1980      1980          0 |
                          |    2      1    1981      1980          0 |
                          |    2      1    1982      1980          0 |
                          |------------------------------------------|
                          |    2      2    1985      1985          3 |
                          |    2      2    1987      1985          3 |
                          +------------------------------------------+
                        Is 3 right?

                        His code encouraged me to have another go. Here it is. My trick, which is what you can call it if it works, is to ignore observations after a fund starts, which just get in the way.

                        Code:
                        clear
                        input firm fund idate f_idate
                        1 1 1999 1999
                        1 1 2001 1999
                        1 1 2002 1999
                        1 2 2000 2000
                        1 2 2004 2000
                        2 1 1980 1980
                        2 1 1981 1980
                        2 1 1982 1980
                        2 2 1985 1985
                        2 2 1987 1985
                        end
                        
                        bysort firm fund (idate) : gen touse = _n == 1
                        bysort touse firm (idate) : gen exp = _n - 1 if touse
                        bysort firm fund (idate) : replace exp = sum(exp)
                        drop touse
                        list , sepby(firm fund)
                        
                             +-------------------------------------+
                             | firm   fund   idate   f_idate   exp |
                             |-------------------------------------|
                          1. |    1      1    1999      1999     0 |
                          2. |    1      1    2001      1999     0 |
                          3. |    1      1    2002      1999     0 |
                             |-------------------------------------|
                          4. |    1      2    2000      2000     1 |
                          5. |    1      2    2004      2000     1 |
                             |-------------------------------------|
                          6. |    2      1    1980      1980     0 |
                          7. |    2      1    1981      1980     0 |
                          8. |    2      1    1982      1980     0 |
                             |-------------------------------------|
                          9. |    2      2    1985      1985     1 |
                         10. |    2      2    1987      1985     1 |
                             +-------------------------------------+
                        .
                        Last edited by Nick Cox; 17 May 2016, 12:46.

                        Comment


                        • #13
                          Thank you so much Clyde, Robert and Nick.

                          Clyde that is what I am looking for!

                          The value 3 for observation 9 and 10 (observation numbers from post #12) is indeed correct, as the idate of observation 6 (OBS6), OBS7 and OBS8 are all below the f_idate 1985.

                          Some remarks:
                          1. Expanding the code to my sample, I get a “type mismatch error” (r109). I am guessing this is because my funds and firms variables are non-numeric (strings). How would this be solved?
                          2. This is very helpful already. I hope it does not make things even more complex, but if possible I want to construct two additional variables which are almost the same as ‘experience’
                            1. exp_usd
                              1. same as experience, except instead of counting investments, it adds up all the investment amounts made (iamount)
                              2. (This is thus an alternative way to measure experience, based on dollars invested)
                            2. exp_distinct
                              1. same as experience, except it only counts the distinct number of companies in the company variable (company)
                              2. (A fund can invest multiple times in the same company. This variable thus only counts the number of distinct companies invested in)
                          Code:
                          clear*
                          input str2(firm fund) idate f_idate iamount str2(company)
                          f1 v1 1999 1999 3 c1
                          f1 v1 2001 1999 5 c2
                          f1 v1 2002 1999 7 c3
                          f1 v2 2000 2000 12 c4
                          f1 v2 2004 2000 2 c5
                          f2 v1 1980 1980 5 c6
                          f2 v1 1981 1980 8 c6
                          f2 v1 1982 1980 9 c6
                          f2 v2 1985 1985 12 c7
                          f2 v2 1987 1985 8 c8
                          end

                          Comment


                          • #14
                            I think this does it:

                            Code:
                            clear*
                            input str2(firm fund) idate f_idate iamount str2(company)
                            f1 v1 1999 1999 3 c1
                            f1 v1 2001 1999 5 c2
                            f1 v1 2002 1999 7 c3
                            f1 v2 2000 2000 12 c4
                            f1 v2 2004 2000 2 c5
                            f2 v1 1980 1980 5 c6
                            f2 v1 1981 1980 8 c6
                            f2 v1 1982 1980 9 c6
                            f2 v2 1985 1985 12 c7
                            f2 v2 1987 1985 8 c8
                            end
                            
                            gen experience = .
                            gen exp_used = .
                            gen exp_distinct = .
                            levelsof firm, local(firms)
                            foreach f of local firms {
                                levelsof fund if firm == `"`f'"', local(funds)
                                foreach g of local funds {
                                    summ f_idate if firm == `"`f'"' & fund == `"`g'"', meanonly
                                    local cutoff = r(mean)
                                    count if idate < `cutoff' & firm == `"`f'"'
                                    replace experience = r(N) if firm == `"`f'"' & fund == `"`g'"'
                                    summ iamount if idate < `cutoff' & firm == `"`f'"', meanonly
                                    replace exp_used = r(sum) if firm == `"`f'"' & fund == `"`g'"'
                                    distinct company if idate < `cutoff' & firm == `"`f'"'
                                    replace exp_distinct = `r(ndistinct)' if firm == `"`f'"' & fund == `"`g'"'        
                                }
                            }
                            
                            list, noobs sepby(firm fund)
                            Note: The calculation of exp_distinct uses Nick Cox's distinct.ado. Unless you already have it installed, you will need to get it by running -ssc install distinct-.

                            Comment


                            • #15
                              distinct is by Gary Longton and myself. The latest version should be downloaded from the Stata Journal files.

                              Comment

                              Working...
                              X