Invalid syntax r(198) when trying to use user-written ado file

Yifeng Deng

Join Date: Sep 2023
Posts: 2

23 Sep 2023, 19:48

Hello Stata Forum,

I have job-posting data with the following variables:

employer_original -- unprocessed employer name
emp -- employer name after trimming spaces, converting to lowercase, and removing common suffixes
clusterid -- group ID from a clustering algorithm that fuzzy-matches 'emp'
sector -- two-digit NAICS sector code of the job posting

The main do-file is:

Code:

order employer_original emp clusterid sector diff_sector
gsort -diff_sector clusterid sector 

* First, drop all duplicates that have the same employer_original, emp, and sector
duplicates drop employer_original emp sector, force

* drop if missing clusterid or emp 
drop if mi(emp) | mi(clusterid)

* 1 -- Based on the normalized Levenshtein distance of emp, generate a new clusterid for observations with diff_sector == 1
* Raising the threshold groups more dissimilar strings into the same cluster, and lowering it does the opposite, so there is a tradeoff.
* Think of it as the maximum dissimilarity allowed between firm names, regardless of sector.
strgroup emp if diff_sector == 1, gen(newclusterid) threshold(0.3) first norm(longer) force

qui sum clusterid
local maxclusterid = r(max) // store max clusterid in local

* Offset newclusterid by the largest clusterid so the new IDs cannot overlap with the existing IDs where diff_sector == 0
gen cclusterid = cond(diff_sector == 1, newclusterid + `maxclusterid', clusterid)

order employer_original emp clusterid cclusterid sector diff_sector

* 2 -- Combine sector and new clusterid to create clusters
levelsof cclusterid, local(cclusterids)

* Industry distance matrix
matrix input D = ( ///
    0,1,5,5,5,2,3,5,5,5,5,5,5,5,5,5,5,2,5,5 \ ///
    1,0,4,4,3,3,5,4,5,5,5,5,5,5,5,5,5,5,5,5 \ ///
    5,4,0,2,2,4,5,2,5,5,3,5,4,2,5,5,5,4,5,3 \ ///
    5,4,2,0,4,5,5,3,5,5,3,5,5,5,5,5,5,5,5,5 \ ///
    5,3,2,4,0,3,3,3,5,5,4,5,5,5,5,5,5,3,5,5 \ ///
    2,3,4,5,3,0,1,2,5,5,4,5,5,5,5,5,5,5,5,5 \ ///
    3,5,5,5,3,1,0,4,4,4,5,4,3,5,5,5,3,1,5,5 \ ///
    5,4,2,3,3,2,4,0,5,5,5,5,5,5,5,5,5,5,5,5 \ ///
    5,5,5,5,5,5,4,5,0,2,4,1,1,5,2,3,2,4,5,2 \ ///
    5,5,5,5,5,5,4,5,2,0,3,2,2,5,4,5,4,5,5,4 \ ///
    5,5,3,3,4,4,5,5,4,3,0,5,5,4,5,5,5,1,5,5 \ ///
    5,5,5,5,5,5,4,5,1,2,5,0,1,4,1,3,2,5,5,2 \ ///
    5,5,4,5,5,5,3,5,1,2,5,1,0,4,2,5,5,5,5,2 \ ///
    5,5,2,5,5,5,5,5,5,5,4,4,4,0,5,5,5,5,5,3 \ ///
    5,5,5,5,5,5,5,5,2,4,5,1,2,5,0,5,3,5,5,4 \ ///
    5,5,5,5,5,5,5,5,3,5,5,3,5,5,5,0,5,5,5,5 \ ///
    5,5,5,5,5,5,3,5,2,4,5,2,5,5,3,5,0,5,5,5 \ ///
    2,5,4,5,3,5,1,5,4,5,1,5,5,5,5,5,5,0,5,5 \ ///
    5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,0,3 \ ///
    5,5,3,5,5,5,5,5,2,4,5,2,2,3,4,5,5,5,3,0)

matrix rownames D = "11" "21" "22" "23" "31-33" "42" "44-45" "48-49" "51" "52" "53" "54" "55" "56" "61" "62" "71" "72" "81" "92"
matrix colnames D = "11" "21" "22" "23" "31-33" "42" "44-45" "48-49" "51" "52" "53" "54" "55" "56" "61" "62" "71" "72" "81" "92"

qui sum cclusterid
local maxcclusterid = r(max)

gen ncclusterid = .
order ncclusterid, af(cclusterid)

bys cclusterid: ilcluster

The last line of the code calls a user-written program that should run once for each group of cclusterid.

I want to write it as a byable program in an ado-file. Here is the code:

Code:

program define ilcluster, sortpreserve byable(recall)
    version 18.0
    set trace on
    local n = _N
    qui sum cclusterid
    local maxcclusterid = r(max)

* Loop through each combination of observations within the cluster
    forval i = 1/`n'{
        forval j = 1/`n'{
            local emp1 = emp[`i']
            local emp2 = emp[`j']
            local sector1 = sector[`i']
            local sector2 = sector[`j']
            
            * Calculate the normalized Levenshtein distance and store it in local
            local emp_len1 = strlen("`emp1'")
            local emp_len2 = strlen("`emp2'")
            local maxstrlen = max(`emp_len1',`emp_len2')
            qui ustrdist "`emp1'" "`emp2'"
            local levensthein = r(d)/`maxstrlen' // normalized Levenshtein distance using the max string length
            local industhry_dist = D["`sector1'","`sector2'"]
            local t = `industry_dist'*`levensthein'
            
            if `t' >= 0.5 {
                replace ncclusterid[`i'] = cclusterid[`i'] + `maxcclusterid' 
            }            
        }
    }
end

Even when I run it without the by prefix, I get the invalid syntax r(198) error. I'm new to Stata programming and ado-files, and I'm stuck. Any help is appreciated!

Best,
Yifeng

Tags: ADO, byable, invalid syntax

Nick Cox

Join Date: Mar 2014

Posts: 35775
#2

24 Sep 2023, 05:41

Code:

local industhry_dist = D["`sector1'","`sector2'"]
local t = `industry_dist'*`levensthein'

The first local macro needs to be named industry_dist to be used in the second.
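
To see why the name matters: a local macro that was never defined expands to an empty string, so the second command is left with nothing to the left of the multiplication sign. A minimal sketch, meant to be run from a do-file, with made-up values (3 and 0.25 are placeholders, not numbers from the thread):

```stata
* Hypothetical values, for illustration only
local industhry_dist = 3            // misspelled name, as in the posted program
local levensthein = 0.25

* The correctly spelled industry_dist was never defined, so it expands to nothing
display "[`industry_dist']"         // prints []

* The expression below therefore starts with the multiplication sign
capture noisily local t = `industry_dist'*`levensthein'
display "return code: " _rc         // nonzero, because the expression is incomplete
```

Stata does not warn about references to undefined macros, so a misspelling like this only shows up when the expanded command fails to parse.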

Last edited by Nick Cox; 24 Sep 2023, 05:47.
Yifeng Deng

Join Date: Sep 2023

Posts: 2
#3

25 Sep 2023, 10:58

Originally posted by Nick Cox

Code:

local industhry_dist = D["`sector1'","`sector2'"]
local t = `industry_dist'*`levensthein'

The first local macro needs to be named industry_dist to be used in the second.

Many thanks, Nick. However, I fixed the typo and the problem persists. This is my first time writing an ado-file, so I apologize if some of the mistakes seem stupid.
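
Separately from the macro name, one other line in the posted program may be worth checking, although the thread does not confirm it as the cause: replace expects a plain variable name to the left of the equals sign, so a subscript such as ncclusterid[`i'] is not accepted there. A single observation is normally targeted with the in qualifier. A minimal sketch with made-up data (the variable x and the value 10 are hypothetical):

```stata
* Made-up data: three observations and an empty variable x
clear
set obs 3
generate x = .

* Assign to observation 2 only, using the in qualifier
replace x = 10 in 2

list x
```

Inside a loop over i, the same idea would be written with in `i' at the end of the replace command.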
Nick Cox

Join Date: Mar 2014

Posts: 35775
#4

25 Sep 2023, 12:04

Where does the error occur? You have set trace on so we should be able to see where the program stops.
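
One way to narrow this down, sketched here with a hypothetical program called tracedemo that contains a deliberately incomplete expression: turn tracing on around the call, limit the trace depth so that only the program's own lines are echoed, and look at the last command shown before the error message. That line is usually the one that failed.

```stata
* tracedemo is a hypothetical program with a deliberately incomplete expression
capture program drop tracedemo
program define tracedemo
    local a = 1
    local b = `a' +
end

set trace on
set tracedepth 1              // trace tracedemo itself, not the programs it calls
capture noisily tracedemo     // capture lets the do-file continue after the error
set trace off
display "return code: " _rc
```

With the trace set outside the program, the set trace on line inside ilcluster can be removed once the problem is found.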