Disappearing ids while matching with rowrange

Gabor Mugge

Join Date: Apr 2021
Posts: 30

Disappearing ids while matching with rowrange

12 Apr 2022, 10:11

Hello,

The aim of my code below is to extract patent information contained in the 'tls201_part01.csv' 'tls201_part02.csv' 'tls201_part03.csv' files for a list of patent applications collected in 'temp_209_PVblock.dta'. I am performing the matching based on the application identifier 'appln_id', which uniquely identifies observations in both 'temp_209_PVblock.dta' and the 'tls201' files. Every 'appln_id' that is contained in 'temp_209_PVblock.dta' should be contained in 'tls201' (I cannot directly check this easily, but I am fairly confident that it is true).
Normally, I would just import the csv files and match these with 'temp_209_PVblock.dta' - keeping only the matched elements - however, the csvs are very large (with around 50 million observations each), which prompted me to carry out the matching process in bits with the rowrange option.
The problem that I am experiencing is once I run the full code below, it seems that 'temp_209_PVblock.dta' contains application identifiers for which no matches were created. Do you have any ideas how could this happen?

Thank you,
Gabor

Code:

local N=1
local i=1
quietly {
while `N'>0 {
local k=`i'*1000000
local b=`k'-999999
import delimited using "raw/tls201_part01.csv", varnames(1) encoding(ISO-8859-1)clear rowrange(`b':`k')
local N=_N
merge 1:1 appln_id using "temp/temp_209_PVblock.dta", keep(match) nogenerate
save "temp/temp_201_PVblock`i'_full.dta", replace
local i=`i'+1
global end_section=`i'
}
}

local N=1
local i=1

quietly {
while `N'>0 {
local k=`i'*1000000
local b=`k'-999999
import delimited using "raw/tls201_part02.csv", varnames(1) encoding(ISO-8859-1)clear rowrange(`b':`k')
local N=_N
merge 1:1 appln_id using "temp/temp_209_PVblock.dta", keep(match) nogenerate
local j=`i'+$end_section
save "temp/temp_201_PVblock`j'_full.dta", replace
local i=`i'+1
global end=`j'
}
}

*52
global end_section=$end

local N=1
local i=1

quietly {
while `N'>0 {
local k=`i'*1000000
local b=`k'-999999
import delimited using "raw/tls201_part03.csv", varnames(1) encoding(ISO-8859-1)clear rowrange(`b':`k')
local N=_N
merge 1:1 appln_id using "temp/temp_209_PVblock.dta", keep(match) nogenerate
local j=`i'+$end_section
save "temp/temp_201_PVblock`j'_full.dta", replace
local i=`i'+1
}
}

use "temp/temp_201_PVblock1_full.dta", clear
forval i=2/118 {
append using "temp/temp_201_PVblock`i'_full.dta"
}
save "temp/temp_201_USPVblock_prefamily.dta", replace

Here you can see the values taken by 'appln_id':

Code:

. su appln_id, det

                          appln_id
-------------------------------------------------------------
      Percentiles      Smallest
 1%      2680387            240
 5%     1.40e+07            376
10%     2.61e+07            636       Obs             794,245
25%     3.95e+07            647       Sum of wgt.     794,245

50%     3.77e+08                      Mean           2.95e+08
                        Largest       Std. dev.      2.06e+08
75%     4.81e+08       5.45e+08
90%     5.19e+08       5.45e+08       Variance       4.23e+16
95%     5.32e+08       5.45e+08       Skewness       -.356266
99%     5.42e+08       5.45e+08       Kurtosis       1.363318

Tags: None

Gabor Mugge

Join Date: Apr 2021

Posts: 30
#2

12 Apr 2022, 13:42

This is now resolved. The problem is unrelated to the matching, sorry.
Comment

Announcement

Disappearing ids while matching with rowrange

Comment