Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Loop over a variable to count repetitions of observations based on group and time

    Dear Statalisters,

    I am using Stata/MP 16.1 to analyze a large dataset (~100 mil obs.) of patent citations. The goal is to count how many times any given patent is cited by a firm in the last 5 years. I have 3 primary variables -- 'firm' (unique id for each firm in dataset), 'year' (the year in which the firm cited the patent), and 'ref' (id of patent cited by a firm in a year). I am trying to create a new variable 'reps' (no. of repeat citations), based on how many times a patent cited by a firm in a given year is cited by the same firm in the last 5 years. It's an unbalanced panel structure of firms and years, where some firms start and end in different years and may miss years.

    The data example is:
    clear
    input str40 firm int year str15 ref
    "org_001uxXJc9ahwSUaQgf2B" 2018 "4195266"
    "org_001uxXJc9ahwSUaQgf2B" 2018 "4210814"
    "org_001uxXJc9ahwSUaQgf2B" 2018 "7811477"
    "org_001uxXJc9ahwSUaQgf2B" 2018 "8022620"
    "org_001uxXJc9ahwSUaQgf2B" 2018 "8791455"
    "org_001uxXJc9ahwSUaQgf2B" 2019 "10042199"
    "org_001uxXJc9ahwSUaQgf2B" 2019 "8098358"
    "org_001uxXJc9ahwSUaQgf2B" 2019 "8550032"
    "org_001uxXJc9ahwSUaQgf2B" 2019 "8969106"
    "org_001uxXJc9ahwSUaQgf2B" 2019 "9093669"
    "org_001uxXJc9ahwSUaQgf2B" 2019 "9129923"
    "org_004MZT9Rx9WZqD4aLlea" 2020 "3180728"
    "org_006JvugQuZ7ZQwoFIIXI" 2020 "10203459"
    "org_00Bxpjs1eOC1SUnzixOZ" 2020 "10044678"
    "org_00Bxpjs1eOC1SUnzixOZ" 2020 "10044678"
    "org_00Bxpjs1eOC1SUnzixOZ" 2020 "10044678"
    "org_00FbWgcvLslVHVdnQoDe" 2019 "4796608"
    "org_00Ieu37WCr43N9Htd0sC" 1979 "3077677"
    "org_00Ieu37WCr43N9Htd0sC" 1980 "1809786"
    "org_00Ieu37WCr43N9Htd0sC" 1981 "1485863"
    "org_00Ieu37WCr43N9Htd0sC" 1983 "1879982"
    "org_00JpELbQnS0nBSO8KxRe" 1976 "1046062"
    "org_00KJ0SlLWuG2bRDqOamx" 2019 "5445373"
    "org_00OnX12PN29N9aQBBUel" 2020 "3195425"
    "org_00P7CxipBoXJEnCe0R7z" 2018 "6056237"
    "org_00SN9Z5CIxJYKvx9IGoR" 2019 "2023452"
    "org_00SN9Z5CIxJYKvx9IGoR" 2019 "4361993"
    "org_00T8Ckgd3lSpwSJEAm2R" 1980 "2197783"
    "org_00ddO5EIFfpuJQyPm0O8" 2020 "10058766"
    "org_00ddO5EIFfpuJQyPm0O8" 2020 "10058766"
    "org_00fsXVFSWShpv9zdv5hL" 1977 "3054197"
    "org_00guCeEyJC09wDtob9Eu" 1980 "1465968"
    "org_00hIRgZuc8JuA39Ed8hh" 2018 "4930516"
    "org_00qOxTpjqQLDdo35Otvr" 1983 "1189921"
    "org_00qjpDOu8a19j4FT13Qr" 2019 "3964482"
    "org_00sCUp5IVdIvjTmGs1fZ" 1980 "2602536"
    "org_00sClfPnBJwuELJPUgyb" 2020 "3964111"
    "org_00ty0HOZV2r6rJPHmKKj" 1976 "2362653"
    "org_00ujnLmXQYDXJTahBhdH" 2020 "3199171"
    "org_00vZyJwPtLwEuoHMTa9N" 2018 "4959319"
    "org_00wCodDb16ONzXYBn9uC" 1976 "3043782"
    "org_00wCodDb16ONzXYBn9uC" 1978 "1100519"
    "org_00wCodDb16ONzXYBn9uC" 1979 "3165421"
    "org_00wCodDb16ONzXYBn9uC" 1980 "2470199"
    "org_00wCodDb16ONzXYBn9uC" 1980 "3185584"
    "org_00wCodDb16ONzXYBn9uC" 1981 "3352759"
    "org_00wCodDb16ONzXYBn9uC" 1981 "3933590"
    "org_00wCodDb16ONzXYBn9uC" 1983 "3904361"
    "org_00ykbCxnQCkKjckVkE2T" 2018 "5525884"
    "org_0137m2K2JKuqYJCjnz31" 1982 "2773482"
    "org_014iIXTfIUaGipvX9f4u" 1977 "2337064"
    "org_016BcVTB2fr5HhcP62rH" 2018 "1099921"
    "org_017F5bZ7kTisGPyXMO6E" 2019 "5039832"
    "org_01A5iBxw0fQXt50uc2Eu" 1980 "2758231"
    "org_01A5iBxw0fQXt50uc2Eu" 1982 "3797307"
    "org_01EGbEGACKS2x8f4v4MK" 1976 "2797201"
    "org_01EGbEGACKS2x8f4v4MK" 1980 "2045452"
    "org_01GO5JWFLqOntP8Hax10" 2019 "4310523"
    "org_01HZGOOIUfbNcpFJ7mLs" 2019 "5643874"
    "org_01Jp9tjqUCyp71xno5UM" 1981 "1501047"
    "org_01KerRMM81Q3voALZLNH" 1976 "3581276"
    "org_01KerRMM81Q3voALZLNH" 1977 "1383269"
    "org_01KerRMM81Q3voALZLNH" 1977 "3457831"
    "org_01KerRMM81Q3voALZLNH" 1977 "3785037"
    "org_01KerRMM81Q3voALZLNH" 1981 "1636381"
    "org_01KerRMM81Q3voALZLNH" 1982 "1188667"
    "org_01LLN0TuMf61h8De1Mgr" 2018 "1996539"
    "org_01Njig2rTpu0Qzwq2Yge" 2020 "7308423"
    "org_01NyPoAYDK9Gxr8FLEfd" 2019 "1247034"



    I'm working on the following syntax. However, I'm not confident that this is the right code to achieve the goal. Basically, I need to loop over the variable 'ref' (cited patent) to check each observation and see whether it appears again for the same firm within the last 5 years. If it does, I need to update the variable 'reps' with the count of repetitions. I appreciate any help on correcting my syntax.

    gen reps = .
    local N = _N
    set trace on
    forval i = 1/`N'{
    bysort firm year: count if ref == ref[`i'] & (year[`i'] - year) <= 5
    replace reps = r(N) in `i'
    }
    set trace off

  • #2
    You can do it that way, but easier and faster is:

    Code:
    gen long obs_no = _n
    rangestat (count) wanted = obs_no, by(firm ref) interval(year -4 0)
    -rangestat- is written by Robert Picard, Nick Cox, and Roberto Ferrer; it is available from SSC.

    Note: I assume by "last five years" you mean the current year and the four immediately preceding years. If you mean starting with the preceding year and going back to the fifth preceding year, change the -interval()- option to -interval(year -5 -1)-.

    Note: If years are skipped because a firm simply doesn't cite any patents in a given year, that is not a problem. But if there is really missing data, then realize that the best you can hope to do is count the number of repeat citations that appear in your data set--the real number may be larger and there is no statistical way around that problem.

    Comment


    • #3
      Clyde,

      Thanks a lot for your reply and that amazing solution! I installed and checked out the rangestat package. It's a great option! Also, given the size of my dataset, rangestat works much faster compared to running loops!

      Comment

      Working...
      X