Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Count number of obs according to condition

    Dear statalister,

    I am facing a difficult problem for me to solve on stata.
    I have data on footballer and football clubs as follow:

    FootBaller id / Club id / Date
    1 / 1 / 1995
    1 / 3 / 1999
    1 / 6 / 2002
    2 / 3 / 1994
    2 / 6 / 2005

    etc..

    Date is actually the year when the footballer joined a new club. My goal is to compute for each entry the number of different clubs attended in the last 5 years (for exemple here, value for _n = 1 is 1, 2 for _n = 2, 2 for _n = 2 etc...

    One solution would be looping over each obs and counting, unfortunatly I have more than a million observations and this would take forever...

    Thanks a lot for suggestions

    Totonio

  • #2
    If I understand you correctly, combining by with egen, total should solve the problem. This is an example that counts the number of clubs a player attended between 1998 and 2006 (none in your sample data would meet the "last 5 years" criterion since it is 2014)
    Code:
    clear
    input pid cid year
    1 1 1995
    1 3 1999
    1 6 2002
    2 3 1994
    2 6 2005
    end
    gen c=0
    replace c=1 if (year>1998 & year<2006)
    bysort pid: egen nocount=total(c)
    drop c
    list

    Comment


    • #3
      Aspen is right, but I think he has misunderstood what the original poster (please re-register with your real first and last name) intended. Since I'm feeling telepathic today , I'll venture a guess that what he wants Iin each observation is the number of clubs that the footballer attended in the five years preceding Date. I'm feeling less telepathic as to whether, say, for 1995, this includes 1990 or only goes back through 1991. In the code below I assume that 1990 would be included, but it is easy enough to change it to the other way if I got that wrong.

      Code:
      clear*
      input FootBaller Club Date
      1 1 1995
      1 3 1999
      1 6 2002
      2 3 1994
      2 6 2005
      end
      list
      
      gen int wanted = .
      levelsof Date, local(dates)
      foreach d of local dates {
          egen current = total(inrange(Date, `d'-5, `d')), by(FootBaller)
          replace wanted = current if Date == `d'
          drop current
      }
      
      sort FootBaller Date
      list
      It does include a loop over the values of Date--but presumably there aren't more than a few dozen of those, so that shouldn't be too bad.

      Comment


      • #4
        Hi,
        Here is a solution that does not involve looping over observations. If you have more than one entry for a player in a given year this soultion would not work. You may have to aggregate those cases to a single year observation before you use this solution. You can change the number of lags from 5 years to any in the loop.

        Code:
        tsset Fid Date
        gen played = 1
        gen played5 = 0
        forval i = 1/5 {
            gen played_`i' = L`i'.played
            replace played_`i' = 0 if played_`i' == .
            replace played5 = played5 + played_`i'
        }   
        drop played_*
        Abraham

        Comment


        • #5
          Dear Clyde,

          You are absolutly right about what I asked. And btw, sorry about the pseudo, I am changing it.

          Thanks a lot for your answer.

          Dear Abraham,

          Sorry but I have multiple entry in the same id*year. But I will think about adapting your code if Clyde's takes too long to run.

          Best

          Tony

          Comment

          Working...
          X