
  • Invalid syntax r(198) when trying to use user-written ado file

    Hello Stata Forum,

    I have job-posting data with the following variables.
    • employer_original -- unprocessed employer name
    • emp -- employer name after trimming spaces, converting to lowercase, and removing common suffixes
    • clusterid -- cluster ID assigned by a clustering algorithm that fuzzy-matches 'emp' into groups
    • sector -- two digit NAICS sector code of that job posting
    The main do-file is

    Code:
    order employer_original emp clusterid sector diff_sector
    gsort -diff_sector clusterid sector 
    
    * first drop all duplicates that have the same "employer_original", "emp", and "sector"
    duplicates drop employer_original emp sector, force
    
    * drop if missing clusterid or emp 
    drop if mi(emp) | mi(clusterid)
    
    * 1 -- Based on the normalized Levenshtein distance of emp, generate a new clusterid for those with diff_sector == 1
    strgroup emp if diff_sector == 1, gen(newclusterid) threshold(0.3) first norm(longer) force // Increasing the threshold groups more dissimilar strings into the same cluster, and vice versa, so there is a tradeoff. Think of this as the maximum dissimilarity allowed between firm names, regardless of their sector.
    
    qui sum clusterid
    local maxclusterid = r(max) // store max clusterid in local
    
    gen cclusterid = cond(diff_sector == 1, newclusterid + `maxclusterid', clusterid) // offset newclusterid so it does not overlap with the existing clusterid values where diff_sector == 0
    
    order employer_original emp clusterid cclusterid sector diff_sector
    
    * 2 -- Combine sector and new clusterid to create clusters
    levelsof cclusterid, local(cclusterids)
    
    matrix input D = (0,1,5,5,5,2,3,5,5,5,5,5,5,5,5,5,5,2,5,5\1,0,4,4,3,3,5,4,5,5,5,5,5,5,5,5,5,5,5,5\5,4,0,2,2,4,5,2,5,5,3,5,4,2,5,5,5,4,5,3\5,4,2,0,4,5,5,3,5,5,3,5,5,5,5,5,5,5,5,5\5,3,2,4,0,3,3,3,5,5,4,5,5,5,5,5,5,3,5,5\2,3,4,5,3,0,1,2,5,5,4,5,5,5,5,5,5,5,5,5\3,5,5,5,3,1,0,4,4,4,5,4,3,5,5,5,3,1,5,5\5,4,2,3,3,2,4,0,5,5,5,5,5,5,5,5,5,5,5,5\5,5,5,5,5,5,4,5,0,2,4,1,1,5,2,3,2,4,5,2\5,5,5,5,5,5,4,5,2,0,3,2,2,5,4,5,4,5,5,4\5,5,3,3,4,4,5,5,4,3,0,5,5,4,5,5,5,1,5,5\5,5,5,5,5,5,4,5,1,2,5,0,1,4,1,3,2,5,5,2\5,5,4,5,5,5,3,5,1,2,5,1,0,4,2,5,5,5,5,2\5,5,2,5,5,5,5,5,5,5,4,4,4,0,5,5,5,5,5,3\5,5,5,5,5,5,5,5,2,4,5,1,2,5,0,5,3,5,5,4\5,5,5,5,5,5,5,5,3,5,5,3,5,5,5,0,5,5,5,5\5,5,5,5,5,5,3,5,2,4,5,2,5,5,3,5,0,5,5,5\2,5,4,5,3,5,1,5,4,5,1,5,5,5,5,5,5,0,5,5\5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,0,3\5,5,3,5,5,5,5,5,2,4,5,2,2,3,4,5,5,5,3,0) // industry distance matrix
    
    matrix rownames D = "11" "21" "22" "23" "31-33" "42" "44-45" "48-49" "51" "52" "53" "54" "55" "56" "61" "62" "71" "72" "81" "92"
    matrix colnames D = "11" "21" "22" "23" "31-33" "42" "44-45" "48-49" "51" "52" "53" "54" "55" "56" "61" "62" "71" "72" "81" "92"
    
    qui sum cclusterid
    local maxcclusterid = r(max)
    
    gen ncclusterid = .
    order ncclusterid, af(cclusterid)
    
    bys cclusterid: ilcluster
    In the last line of the code, I'm trying to call a user-written program that should run once for each group of cclusterid.

    I want to write a byable program in an ado-file. Here is the code:

    Code:
    program define ilcluster, sortpreserve byable(recall)
        version 18.0
        set trace on
        local n = _N
        qui sum cclusterid
        local maxcclusterid = r(max)
    
    * Loop through each combination of observations within the cluster
        forval i = 1/`n'{
            forval j = 1/`n'{
                local emp1 = emp[`i']
                local emp2 = emp[`j']
                local sector1 = sector[`i']
                local sector2 = sector[`j']
                
                * Calculate the normalized Levenshtein distance and store it in local
                local emp_len1 = strlen("`emp1'")
                local emp_len2 = strlen("`emp2'")
                local maxstrlen = max(`emp_len1',`emp_len2')
                qui ustrdist "`emp1'" "`emp2'"
                local levensthein = r(d)/`maxstrlen' // normalized Levenshtein distance using the max string length
                local industhry_dist = D["`sector1'","`sector2'"]
                local t = `industry_dist'*`levensthein'
                
                if `t' >= 0.5 {
                    replace ncclusterid[`i'] = cclusterid[`i'] + `maxcclusterid' 
                }            
            }
        }
    end
    But even when I run it without the by prefix, I get the invalid syntax r(198) error. I'm new to Stata programming and ado-files, and I'm stuck. Any help is appreciated!

    Best,
    Yifeng
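
On the by-group part of the question: with byable(recall), Stata calls the program once per group but leaves the whole dataset in memory, so `_N` is the total number of observations and the program has to restrict itself to the current group through a `touse` marker. The following is a minimal sketch of that pattern, not a rewrite of ilcluster. The program name `grpcount` and the variables `g` and `cnt` are hypothetical and exist only for illustration.

```stata
capture program drop grpcount
program define grpcount, byable(recall)
    version 18.0
    syntax varname [if] [in]
    * touse is 1 only for observations in the current by-group;
    * novarlist keeps observations where the target variable is missing
    marksample touse, novarlist
    quietly count if `touse'
    local n_group = r(N)
    quietly replace `varlist' = `n_group' if `touse'
    * _N is still the size of the full dataset here
    display "group size = `n_group', _N = " _N
end

* toy data: two groups of sizes 2 and 3
clear
set obs 5
generate g = cond(_n <= 2, 1, 2)
generate cnt = .
bysort g: grpcount cnt
list g cnt
```

A loop of the form `forval i = 1/`n'' with `local n = _N` would therefore visit every observation in the dataset on each call, not only those in the current group.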

  • #2
    Code:
    local industhry_dist = D["`sector1'","`sector2'"]  
    
    local t = `industry_dist'*`levensthein'
    The first local macro needs to be named industry_dist to be used in the second.
    Last edited by Nick Cox; 24 Sep 2023, 05:47.
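
The reason a misspelled macro name produces a syntax error rather than a "not found" message is that an undefined local macro expands to an empty string, so the expression loses its left operand. A small sketch of that behavior follows; the numeric values are illustrative only and do not come from the thread.

```stata
* misspelled name, as in the original program
local industhry_dist = 5
local levensthein = 0.25

* `industry_dist' is undefined, so Stata is asked to evaluate "* .25"
capture local t = `industry_dist' * `levensthein'
local rc_bad = _rc
display "return code with the misspelled macro: `rc_bad'"

* with a consistent spelling the product evaluates normally
local industry_dist = 5
local t = `industry_dist' * `levensthein'
display "t = `t'"
```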



    • #3
      Originally posted by Nick Cox
      Code:
      local industhry_dist = D["`sector1'","`sector2'"]
      
      local t = `industry_dist'*`levensthein'
      The first local macro needs to be named industry_dist to be used in the second.
      Many thanks, Nick. However, I fixed the typo and the problem seems to persist. This is my first time writing an ado-file, so I apologize if some mistakes seem stupid.
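
One candidate for the remaining error, offered as a guess because the trace output is not shown in the thread: `replace` does not accept an observation subscript on the variable being assigned, so a line such as `replace ncclusterid[`i'] = ...` is not valid syntax. A single observation is targeted with the `in` qualifier instead. The sketch below uses hypothetical toy variables `x` and `y` to show the difference.

```stata
clear
set obs 3
generate x = _n
generate y = .

* a subscript on the left-hand side is rejected
capture replace y[2] = x[2] + 10
local rc_bad = _rc
display "return code with a subscripted target: `rc_bad'"

* restrict the assignment to observation 2 with -in-
replace y = x[2] + 10 in 2
list
```

In the posted program the analogous change would be to move the subscript into an `in `i'' qualifier, assuming the intent is to update only observation `i'.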



      • #4
        Where does the error occur? You have set trace on so we should be able to see where the program stops.
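
A sketch of one way to capture where a program stops follows. It uses a hypothetical stand-in program, `demo_fail`, because ilcluster needs the poster's data. It also issues `discard` first, since Stata keeps a previously loaded ado-file definition in memory, and edits to the file are not picked up until the old definition is dropped.

```stata
* drop automatically loaded programs so edited ado-files are re-read
discard

capture program drop demo_fail
program define demo_fail
    version 18.0
    display "step 1 runs"
    error 198              // stand-in for the failing line
    display "never reached"
end

set tracedepth 1           // trace only the top-level program
set trace on
capture noisily demo_fail
local rc = _rc
set trace off
display "program stopped with return code `rc'"
```

The last traced line before the error message is the one that failed. Turning trace on from the calling do-file, rather than inside the program, keeps the ado-file itself free of debugging settings.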
