Hello,
The aim of my code below is to extract patent information contained in the 'tls201_part01.csv' 'tls201_part02.csv' 'tls201_part03.csv' files for a list of patent applications collected in 'temp_209_PVblock.dta'. I am performing the matching based on the application identifier 'appln_id', which uniquely identifies observations in both 'temp_209_PVblock.dta' and the 'tls201' files. Every 'appln_id' that is contained in 'temp_209_PVblock.dta' should be contained in 'tls201' (I cannot directly check this easily, but I am fairly confident that it is true).
Normally, I would just import the csv files and match these with 'temp_209_PVblock.dta' - keeping only the matched elements - however, the csvs are very large (with around 50 million observations each), which prompted me to carry out the matching process in bits with the rowrange option.
The problem that I am experiencing is once I run the full code below, it seems that 'temp_209_PVblock.dta' contains application identifiers for which no matches were created. Do you have any ideas how could this happen?
Thank you,
Gabor
Here you can see the values taken by 'appln_id':
The aim of my code below is to extract patent information contained in the 'tls201_part01.csv' 'tls201_part02.csv' 'tls201_part03.csv' files for a list of patent applications collected in 'temp_209_PVblock.dta'. I am performing the matching based on the application identifier 'appln_id', which uniquely identifies observations in both 'temp_209_PVblock.dta' and the 'tls201' files. Every 'appln_id' that is contained in 'temp_209_PVblock.dta' should be contained in 'tls201' (I cannot directly check this easily, but I am fairly confident that it is true).
Normally, I would just import the csv files and match these with 'temp_209_PVblock.dta' - keeping only the matched elements - however, the csvs are very large (with around 50 million observations each), which prompted me to carry out the matching process in bits with the rowrange option.
The problem that I am experiencing is once I run the full code below, it seems that 'temp_209_PVblock.dta' contains application identifiers for which no matches were created. Do you have any ideas how could this happen?
Thank you,
Gabor
Code:
local N=1
local i=1
quietly {
while `N'>0 {
local k=`i'*1000000
local b=`k'-999999
import delimited using "raw/tls201_part01.csv", varnames(1) encoding(ISO-8859-1)clear rowrange(`b':`k')
local N=_N
merge 1:1 appln_id using "temp/temp_209_PVblock.dta", keep(match) nogenerate
save "temp/temp_201_PVblock`i'_full.dta", replace
local i=`i'+1
global end_section=`i'
}
}
local N=1
local i=1
quietly {
while `N'>0 {
local k=`i'*1000000
local b=`k'-999999
import delimited using "raw/tls201_part02.csv", varnames(1) encoding(ISO-8859-1)clear rowrange(`b':`k')
local N=_N
merge 1:1 appln_id using "temp/temp_209_PVblock.dta", keep(match) nogenerate
local j=`i'+$end_section
save "temp/temp_201_PVblock`j'_full.dta", replace
local i=`i'+1
global end=`j'
}
}
*52
global end_section=$end
local N=1
local i=1
quietly {
while `N'>0 {
local k=`i'*1000000
local b=`k'-999999
import delimited using "raw/tls201_part03.csv", varnames(1) encoding(ISO-8859-1)clear rowrange(`b':`k')
local N=_N
merge 1:1 appln_id using "temp/temp_209_PVblock.dta", keep(match) nogenerate
local j=`i'+$end_section
save "temp/temp_201_PVblock`j'_full.dta", replace
local i=`i'+1
}
}
use "temp/temp_201_PVblock1_full.dta", clear
forval i=2/118 {
append using "temp/temp_201_PVblock`i'_full.dta"
}
save "temp/temp_201_USPVblock_prefamily.dta", replace
Code:
. su appln_id, det
appln_id
-------------------------------------------------------------
Percentiles Smallest
1% 2680387 240
5% 1.40e+07 376
10% 2.61e+07 636 Obs 794,245
25% 3.95e+07 647 Sum of wgt. 794,245
50% 3.77e+08 Mean 2.95e+08
Largest Std. dev. 2.06e+08
75% 4.81e+08 5.45e+08
90% 5.19e+08 5.45e+08 Variance 4.23e+16
95% 5.32e+08 5.45e+08 Skewness -.356266
99% 5.42e+08 5.45e+08 Kurtosis 1.363318

Comment