Dear Statalisters,
I am using Stata/MP 16.1 to analyze a large dataset (~100 mil obs.) of patent citations. The goal is to count how many times any given patent is cited by a firm in the last 5 years. I have 3 primary variables -- 'firm' (unique id for each firm in dataset), 'year' (the year in which the firm cited the patent), and 'ref' (id of patent cited by a firm in a year). I am trying to create a new variable 'reps' (no. of repeat citations), based on how many times a patent cited by a firm in a given year is cited by the same firm in the last 5 years. It's an unbalanced panel structure of firms and years, where some firms start and end in different years and may miss years.
The data example is:
clear
input str40 firm int year str15 ref
"org_001uxXJc9ahwSUaQgf2B" 2018 "4195266"
"org_001uxXJc9ahwSUaQgf2B" 2018 "4210814"
"org_001uxXJc9ahwSUaQgf2B" 2018 "7811477"
"org_001uxXJc9ahwSUaQgf2B" 2018 "8022620"
"org_001uxXJc9ahwSUaQgf2B" 2018 "8791455"
"org_001uxXJc9ahwSUaQgf2B" 2019 "10042199"
"org_001uxXJc9ahwSUaQgf2B" 2019 "8098358"
"org_001uxXJc9ahwSUaQgf2B" 2019 "8550032"
"org_001uxXJc9ahwSUaQgf2B" 2019 "8969106"
"org_001uxXJc9ahwSUaQgf2B" 2019 "9093669"
"org_001uxXJc9ahwSUaQgf2B" 2019 "9129923"
"org_004MZT9Rx9WZqD4aLlea" 2020 "3180728"
"org_006JvugQuZ7ZQwoFIIXI" 2020 "10203459"
"org_00Bxpjs1eOC1SUnzixOZ" 2020 "10044678"
"org_00Bxpjs1eOC1SUnzixOZ" 2020 "10044678"
"org_00Bxpjs1eOC1SUnzixOZ" 2020 "10044678"
"org_00FbWgcvLslVHVdnQoDe" 2019 "4796608"
"org_00Ieu37WCr43N9Htd0sC" 1979 "3077677"
"org_00Ieu37WCr43N9Htd0sC" 1980 "1809786"
"org_00Ieu37WCr43N9Htd0sC" 1981 "1485863"
"org_00Ieu37WCr43N9Htd0sC" 1983 "1879982"
"org_00JpELbQnS0nBSO8KxRe" 1976 "1046062"
"org_00KJ0SlLWuG2bRDqOamx" 2019 "5445373"
"org_00OnX12PN29N9aQBBUel" 2020 "3195425"
"org_00P7CxipBoXJEnCe0R7z" 2018 "6056237"
"org_00SN9Z5CIxJYKvx9IGoR" 2019 "2023452"
"org_00SN9Z5CIxJYKvx9IGoR" 2019 "4361993"
"org_00T8Ckgd3lSpwSJEAm2R" 1980 "2197783"
"org_00ddO5EIFfpuJQyPm0O8" 2020 "10058766"
"org_00ddO5EIFfpuJQyPm0O8" 2020 "10058766"
"org_00fsXVFSWShpv9zdv5hL" 1977 "3054197"
"org_00guCeEyJC09wDtob9Eu" 1980 "1465968"
"org_00hIRgZuc8JuA39Ed8hh" 2018 "4930516"
"org_00qOxTpjqQLDdo35Otvr" 1983 "1189921"
"org_00qjpDOu8a19j4FT13Qr" 2019 "3964482"
"org_00sCUp5IVdIvjTmGs1fZ" 1980 "2602536"
"org_00sClfPnBJwuELJPUgyb" 2020 "3964111"
"org_00ty0HOZV2r6rJPHmKKj" 1976 "2362653"
"org_00ujnLmXQYDXJTahBhdH" 2020 "3199171"
"org_00vZyJwPtLwEuoHMTa9N" 2018 "4959319"
"org_00wCodDb16ONzXYBn9uC" 1976 "3043782"
"org_00wCodDb16ONzXYBn9uC" 1978 "1100519"
"org_00wCodDb16ONzXYBn9uC" 1979 "3165421"
"org_00wCodDb16ONzXYBn9uC" 1980 "2470199"
"org_00wCodDb16ONzXYBn9uC" 1980 "3185584"
"org_00wCodDb16ONzXYBn9uC" 1981 "3352759"
"org_00wCodDb16ONzXYBn9uC" 1981 "3933590"
"org_00wCodDb16ONzXYBn9uC" 1983 "3904361"
"org_00ykbCxnQCkKjckVkE2T" 2018 "5525884"
"org_0137m2K2JKuqYJCjnz31" 1982 "2773482"
"org_014iIXTfIUaGipvX9f4u" 1977 "2337064"
"org_016BcVTB2fr5HhcP62rH" 2018 "1099921"
"org_017F5bZ7kTisGPyXMO6E" 2019 "5039832"
"org_01A5iBxw0fQXt50uc2Eu" 1980 "2758231"
"org_01A5iBxw0fQXt50uc2Eu" 1982 "3797307"
"org_01EGbEGACKS2x8f4v4MK" 1976 "2797201"
"org_01EGbEGACKS2x8f4v4MK" 1980 "2045452"
"org_01GO5JWFLqOntP8Hax10" 2019 "4310523"
"org_01HZGOOIUfbNcpFJ7mLs" 2019 "5643874"
"org_01Jp9tjqUCyp71xno5UM" 1981 "1501047"
"org_01KerRMM81Q3voALZLNH" 1976 "3581276"
"org_01KerRMM81Q3voALZLNH" 1977 "1383269"
"org_01KerRMM81Q3voALZLNH" 1977 "3457831"
"org_01KerRMM81Q3voALZLNH" 1977 "3785037"
"org_01KerRMM81Q3voALZLNH" 1981 "1636381"
"org_01KerRMM81Q3voALZLNH" 1982 "1188667"
"org_01LLN0TuMf61h8De1Mgr" 2018 "1996539"
"org_01Njig2rTpu0Qzwq2Yge" 2020 "7308423"
"org_01NyPoAYDK9Gxr8FLEfd" 2019 "1247034"
I'm working on the following syntax. However, I'm not confident that this is the right code to achieve the goal. Basically, I need to loop over the variable 'ref' (cited patent) to check each observation and see whether it appears again for the same firm within the last 5 years. If it does, I need to update the variable 'reps' with the count of repetitions. I appreciate any help on correcting my syntax.
gen reps = .
local N = _N
set trace on
forval i = 1/`N'{
bysort firm year: count if ref == ref[`i'] & (year[`i'] - year) <= 5
replace reps = r(N) in `i'
}
set trace off
I am using Stata/MP 16.1 to analyze a large dataset (~100 mil obs.) of patent citations. The goal is to count how many times any given patent is cited by a firm in the last 5 years. I have 3 primary variables -- 'firm' (unique id for each firm in dataset), 'year' (the year in which the firm cited the patent), and 'ref' (id of patent cited by a firm in a year). I am trying to create a new variable 'reps' (no. of repeat citations), based on how many times a patent cited by a firm in a given year is cited by the same firm in the last 5 years. It's an unbalanced panel structure of firms and years, where some firms start and end in different years and may miss years.
The data example is:
clear
input str40 firm int year str15 ref
"org_001uxXJc9ahwSUaQgf2B" 2018 "4195266"
"org_001uxXJc9ahwSUaQgf2B" 2018 "4210814"
"org_001uxXJc9ahwSUaQgf2B" 2018 "7811477"
"org_001uxXJc9ahwSUaQgf2B" 2018 "8022620"
"org_001uxXJc9ahwSUaQgf2B" 2018 "8791455"
"org_001uxXJc9ahwSUaQgf2B" 2019 "10042199"
"org_001uxXJc9ahwSUaQgf2B" 2019 "8098358"
"org_001uxXJc9ahwSUaQgf2B" 2019 "8550032"
"org_001uxXJc9ahwSUaQgf2B" 2019 "8969106"
"org_001uxXJc9ahwSUaQgf2B" 2019 "9093669"
"org_001uxXJc9ahwSUaQgf2B" 2019 "9129923"
"org_004MZT9Rx9WZqD4aLlea" 2020 "3180728"
"org_006JvugQuZ7ZQwoFIIXI" 2020 "10203459"
"org_00Bxpjs1eOC1SUnzixOZ" 2020 "10044678"
"org_00Bxpjs1eOC1SUnzixOZ" 2020 "10044678"
"org_00Bxpjs1eOC1SUnzixOZ" 2020 "10044678"
"org_00FbWgcvLslVHVdnQoDe" 2019 "4796608"
"org_00Ieu37WCr43N9Htd0sC" 1979 "3077677"
"org_00Ieu37WCr43N9Htd0sC" 1980 "1809786"
"org_00Ieu37WCr43N9Htd0sC" 1981 "1485863"
"org_00Ieu37WCr43N9Htd0sC" 1983 "1879982"
"org_00JpELbQnS0nBSO8KxRe" 1976 "1046062"
"org_00KJ0SlLWuG2bRDqOamx" 2019 "5445373"
"org_00OnX12PN29N9aQBBUel" 2020 "3195425"
"org_00P7CxipBoXJEnCe0R7z" 2018 "6056237"
"org_00SN9Z5CIxJYKvx9IGoR" 2019 "2023452"
"org_00SN9Z5CIxJYKvx9IGoR" 2019 "4361993"
"org_00T8Ckgd3lSpwSJEAm2R" 1980 "2197783"
"org_00ddO5EIFfpuJQyPm0O8" 2020 "10058766"
"org_00ddO5EIFfpuJQyPm0O8" 2020 "10058766"
"org_00fsXVFSWShpv9zdv5hL" 1977 "3054197"
"org_00guCeEyJC09wDtob9Eu" 1980 "1465968"
"org_00hIRgZuc8JuA39Ed8hh" 2018 "4930516"
"org_00qOxTpjqQLDdo35Otvr" 1983 "1189921"
"org_00qjpDOu8a19j4FT13Qr" 2019 "3964482"
"org_00sCUp5IVdIvjTmGs1fZ" 1980 "2602536"
"org_00sClfPnBJwuELJPUgyb" 2020 "3964111"
"org_00ty0HOZV2r6rJPHmKKj" 1976 "2362653"
"org_00ujnLmXQYDXJTahBhdH" 2020 "3199171"
"org_00vZyJwPtLwEuoHMTa9N" 2018 "4959319"
"org_00wCodDb16ONzXYBn9uC" 1976 "3043782"
"org_00wCodDb16ONzXYBn9uC" 1978 "1100519"
"org_00wCodDb16ONzXYBn9uC" 1979 "3165421"
"org_00wCodDb16ONzXYBn9uC" 1980 "2470199"
"org_00wCodDb16ONzXYBn9uC" 1980 "3185584"
"org_00wCodDb16ONzXYBn9uC" 1981 "3352759"
"org_00wCodDb16ONzXYBn9uC" 1981 "3933590"
"org_00wCodDb16ONzXYBn9uC" 1983 "3904361"
"org_00ykbCxnQCkKjckVkE2T" 2018 "5525884"
"org_0137m2K2JKuqYJCjnz31" 1982 "2773482"
"org_014iIXTfIUaGipvX9f4u" 1977 "2337064"
"org_016BcVTB2fr5HhcP62rH" 2018 "1099921"
"org_017F5bZ7kTisGPyXMO6E" 2019 "5039832"
"org_01A5iBxw0fQXt50uc2Eu" 1980 "2758231"
"org_01A5iBxw0fQXt50uc2Eu" 1982 "3797307"
"org_01EGbEGACKS2x8f4v4MK" 1976 "2797201"
"org_01EGbEGACKS2x8f4v4MK" 1980 "2045452"
"org_01GO5JWFLqOntP8Hax10" 2019 "4310523"
"org_01HZGOOIUfbNcpFJ7mLs" 2019 "5643874"
"org_01Jp9tjqUCyp71xno5UM" 1981 "1501047"
"org_01KerRMM81Q3voALZLNH" 1976 "3581276"
"org_01KerRMM81Q3voALZLNH" 1977 "1383269"
"org_01KerRMM81Q3voALZLNH" 1977 "3457831"
"org_01KerRMM81Q3voALZLNH" 1977 "3785037"
"org_01KerRMM81Q3voALZLNH" 1981 "1636381"
"org_01KerRMM81Q3voALZLNH" 1982 "1188667"
"org_01LLN0TuMf61h8De1Mgr" 2018 "1996539"
"org_01Njig2rTpu0Qzwq2Yge" 2020 "7308423"
"org_01NyPoAYDK9Gxr8FLEfd" 2019 "1247034"
I'm working on the following syntax. However, I'm not confident that this is the right code to achieve the goal. Basically, I need to loop over the variable 'ref' (cited patent) to check each observation and see whether it appears again for the same firm within the last 5 years. If it does, I need to update the variable 'reps' with the count of repetitions. I appreciate any help on correcting my syntax.
gen reps = .
local N = _N
set trace on
forval i = 1/`N'{
bysort firm year: count if ref == ref[`i'] & (year[`i'] - year) <= 5
replace reps = r(N) in `i'
}
set trace off
Comment