Hello Stata Forum,
I have job-posting data with the variables listed below.
In the last line of the first code block below, I call a user-defined program, ilcluster, which should run once for each group of cclusterid.
I want ilcluster to be a byable program defined in an ado-file; its code is the second code block below.
But even when I run it without the by prefix, I get an invalid syntax error, r(198). I'm new to Stata programming and to ado-files, and I'm stuck. Any help is appreciated!
Variables:
- employer_original -- unprocessed employer name
- emp -- employer name after trimming spaces, converting to lowercase, and removing common suffixes
- clusterid -- group ID assigned by a clustering algorithm that fuzzy-matches 'emp'
- sector -- two-digit NAICS sector code of the job posting
Code:
order employer_original emp clusterid sector diff_sector
gsort -diff_sector clusterid sector

* First drop all duplicates that have the same "employer_original", "emp", and "sector"
duplicates drop employer_original emp sector, force

* Drop if clusterid or emp is missing
drop if mi(emp) | mi(clusterid)

* 1 -- Based on the normalized Levenshtein distance of emp, generate a new clusterid for those with diff_sector == 1
strgroup emp if diff_sector == 1, gen(newclusterid) threshold(0.3) first norm(longer) force
// Increasing the threshold results in more dissimilar strings being grouped into the same cluster, and vice versa.
// There are trade-offs. Think of this as the maximum dissimilarity we can allow for different firm names, regardless of their sector.

qui sum clusterid
local maxclusterid = r(max) // store max clusterid in a local
gen cclusterid = cond(diff_sector == 1, newclusterid + `maxclusterid', clusterid) // newclusterid values are shifted so that they do not overlap with the old clusterid values where diff_sector == 0
order employer_original emp clusterid cclusterid sector diff_sector

* 2 -- Combine sector and new clusterid to create clusters
levelsof cclusterid, local(cclusterids)
matrix input D = (0,1,5,5,5,2,3,5,5,5,5,5,5,5,5,5,5,2,5,5\1,0,4,4,3,3,5,4,5,5,5,5,5,5,5,5,5,5,5,5\5,4,0,2,2,4,5,2,5,5,3,5,4,2,5,5,5,4,5,3\5,4,2,0,4,5,5,3,5,5,3,5,5,5,5,5,5,5,5,5\5,3,2,4,0,3,3,3,5,5,4,5,5,5,5,5,5,3,5,5\2,3,4,5,3,0,1,2,5,5,4,5,5,5,5,5,5,5,5,5\3,5,5,5,3,1,0,4,4,4,5,4,3,5,5,5,3,1,5,5\5,4,2,3,3,2,4,0,5,5,5,5,5,5,5,5,5,5,5,5\5,5,5,5,5,5,4,5,0,2,4,1,1,5,2,3,2,4,5,2\5,5,5,5,5,5,4,5,2,0,3,2,2,5,4,5,4,5,5,4\5,5,3,3,4,4,5,5,4,3,0,5,5,4,5,5,5,1,5,5\5,5,5,5,5,5,4,5,1,2,5,0,1,4,1,3,2,5,5,2\5,5,4,5,5,5,3,5,1,2,5,1,0,4,2,5,5,5,5,2\5,5,2,5,5,5,5,5,5,5,4,4,4,0,5,5,5,5,5,3\5,5,5,5,5,5,5,5,2,4,5,1,2,5,0,5,3,5,5,4\5,5,5,5,5,5,5,5,3,5,5,3,5,5,5,0,5,5,5,5\5,5,5,5,5,5,3,5,2,4,5,2,5,5,3,5,0,5,5,5\2,5,4,5,3,5,1,5,4,5,1,5,5,5,5,5,5,0,5,5\5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,0,3\5,5,3,5,5,5,5,5,2,4,5,2,2,3,4,5,5,5,3,0) // industry distance matrix
matrix rownames D = "11" "21" "22" "23" "31-33" "42" "44-45" "48-49" "51" "52" "53" "54" "55" "56" "61" "62" "71" "72" "81" "92"
matrix colnames D = "11" "21" "22" "23" "31-33" "42" "44-45" "48-49" "51" "52" "53" "54" "55" "56" "61" "62" "71" "72" "81" "92"

qui sum cclusterid
local maxcclusterid = r(max)
gen ncclusterid = .
order ncclusterid, af(cclusterid)
bys cclusterid: ilcluster
Code:
program define ilcluster, sortpreserve byable(recall)
    version 18.0
    set trace on
    local n = _N
    qui sum cclusterid
    local maxcclusterid = r(max)

    * Loop through each combination of observations within the cluster
    forval i = 1/`n'{
        forval j = 1/`n'{
            local emp1 = emp[`i']
            local emp2 = emp[`j']
            local sector1 = sector[`i']
            local sector2 = sector[`j']

            * Calculate the normalized Levenshtein distance and store it in local
            local emp_len1 = strlen("`emp1'")
            local emp_len2 = strlen("`emp2'")
            local maxstrlen = max(`emp_len1',`emp_len2')
            qui ustrdist "`emp1'" "`emp2'"
            local levensthein = r(d)/`maxstrlen' // normalized Levenshtein distance using the max string length
            local industhry_dist = D["`sector1'","`sector2'"]
            local t = `industry_dist'*`levensthein'
            if `t' >= 0.5 {
                replace ncclusterid[`i'] = cclusterid[`i'] + `maxcclusterid'
            }
        }
    }
end
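To make clear what the inner loop is meant to compute, here is a minimal sketch of the pairwise score, written in Python only so that the arithmetic is easy to check outside Stata. It is not the ado-file and does not reproduce the error. The function names and the dictionary layout are hypothetical. The three distances are copied from matrix D above, and 0.5 is the cutoff used in the `if` condition.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Edit distance divided by the length of the longer string."""
    longer = max(len(a), len(b))
    return levenshtein(a, b) / longer if longer else 0.0


# Three entries copied from matrix D in the post; the dict layout is hypothetical.
INDUSTRY_DIST = {("51", "54"): 1, ("52", "54"): 2, ("11", "22"): 5}


def score(emp1: str, sector1: str, emp2: str, sector2: str) -> float:
    """Industry distance times normalized Levenshtein distance (t in the ado-file)."""
    if sector1 == sector2:
        return 0.0  # the diagonal of D is 0
    key = (sector1, sector2)
    if key not in INDUSTRY_DIST:
        key = (sector2, sector1)  # the entries used here are symmetric in D
    return INDUSTRY_DIST[key] * normalized_levenshtein(emp1, emp2)


if __name__ == "__main__":
    # Made-up employer names, used only to illustrate the arithmetic.
    print(score("acme corp", "52", "acme co", "54"))
    print(score("acme corp", "11", "acme co", "22"))
```

With these made-up names, the edit distance is 2 and the longer name has 9 characters, so the normalized distance is 2/9. For sectors 52 and 54 the score is 2 * 2/9, which is below 0.5. For sectors 11 and 22 it is 5 * 2/9, which is above 0.5 and so would trigger the reassignment.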
Best,
Yifeng