  • Hitting levelsof limit

    Hi,

    My question relates to the maximum macro length with levelsof. I hope it is not redundant: I went through the previous posts on levelsof but could not find a solution that addresses the issue I'm facing. I have a dataset containing 8,000 firm names, some of which include a location. For example:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str165 comp_extracted
    "Abbontiakoon Mines "                                    
    "Aberdeen Commercial "                                    
    "Aberdeen Gas "                                                                        
    "Aberdeen Steam "                                        
    "Aberdeen Leith and Clyde Steam Shipping "                
    "Abford Estates"                                          
    "Abosso "                                                
    "Aboukir Company "                                                                    
    "Agricultural Hall "                                                              
    "Alexandra (Newport) Dock "                                                            
    "Alliance and Dublin Consumers Gas"                      
    "Alliance and Dublin Consumers Gas Consolidated Ordinary "                  
    end
    For each firm name, I want to extract the location and store it into a variable called city. To do this, I downloaded a separate dataset from geonames.org that lists all city names worldwide, and ran the following code:

    Code:
    clear all
    use "$user/Firm_names", clear
    
    // Store a list of cities in a macro using levelsof
    preserve
    clear all
    use "/$user/company_names/geonames-all-cities-with-a-population-1000.dta", clear
    keep if Population > 40000
    levelsof City, local(cityname_list_temp)
    restore
    
    // Extract cities from firm names
    gen city=""
    foreach keyword of local cityname_list_temp{
    local pattern "\(?\b`keyword'\b\)?"
    quietly replace city = ustrregexs(0) if ustrregexm(comp_extracted, "`pattern'")
    }
    The code works as intended when I consider only cities with more than 40,000 inhabitants. However, when I consider all cities, I get the following error: macro substitution results in line that is too long r(920). This makes sense, as -help levelsof- warns: "levelsof may hit the limits imposed by Stata. However, it is typically used when the number of distinct values of varname is modest." How can I work around this constraint in my case? I tried Method 1 suggested by Nick Cox in the FAQ https://www.stata.com/support/faqs/data-management/try-all-values-with-foreach/index.html but I am not sure how it would apply here, given that the list of cities I want to extract is in a separate dataset.

    Any suggestion is welcome, many thanks for the help!
    Last edited by Maia DEBS; 21 Jul 2023, 09:06.

  • #2
    Code:
    help limits
    for limits of macros.
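
The limit in force for the current session can also be read directly. A minimal sketch, assuming only the documented c-class values c(maxvar) and c(macrolen); the numbers shown will differ by Stata flavor and by the current maxvar setting:

```stata
* Inspect the limits that apply in this session
display "maxvar:   " c(maxvar)
display "macrolen: " c(macrolen)
```

Comparing c(macrolen) before and after changing maxvar shows whether the macro limit actually moved.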

    I downloaded a separate dataset from geonames.org that lists all city names worldwide, and ran the following code:
    How many cities are listed here? You may look at

    Code:
    help cross
    but this is very memory intensive, as it forms every combination of the observations in one dataset with the observations in a second dataset. It may be better to break the task into several smaller tasks or to define several local macros.
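
As a rough sketch of the -cross- route on toy data: every firm is paired with every city, each pair is tested with a word-boundary regex, and one row per firm is kept. The three cities, the three firms, and the variables firm_id and hit are made up for illustration. City names containing regex metacharacters would need escaping first, which this sketch does not do.

```stata
* Toy city list (stands in for the geonames file)
clear
input str40 City
"Aberdeen"
"Dublin"
"Newport"
end
tempfile cities
save `cities'

* Toy firm list with a hypothetical identifier
clear
input str60 comp_extracted
"Aberdeen Gas"
"Alexandra (Newport) Dock"
"Abford Estates"
end
gen long firm_id = _n

* Form all firm x city pairs (3 x 3 = 9 rows here)
cross using `cities'

* Flag pairs where the city appears as a whole word in the firm name
gen byte hit = ustrregexm(comp_extracted, "\b" + City + "\b")
gen city = City if hit

* Keep one row per firm, preferring a matching pair when there is one
bysort firm_id (hit): keep if _n == _N
keep firm_id comp_extracted city
list
```

The number of rows after -cross- is the product of the two dataset sizes, so this only scales if the city list is processed in blocks.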
    Last edited by Andrew Musau; 21 Jul 2023, 09:10.


    • #3
      Dear Andrew,

      Thank you so much for your swift reply.

      How many cities are listed here?
      About 120,000 cities. Following your advice, I checked -help limits- and I used

      Code:
        set maxvar 120000
      which should in principle significantly increase the maximum macro length. However, I now get a different error message:

      Code:
      Sichuan"' `"Miaojie"' `"Miaoyu"' `"Miaozi"' `"Miass"' `"Mibu"' `"Michalovce"' `"Micheng"'
      >  `"MichiganCity"' `"Michurinsk"' `"Middelburg"' `"Middlesbrough"' `"Middleton"' `"Middl
      > etown"' `"Midelt"' `"Midland"' `"Midori"' `"Midoun"' `"Midrand"' `"Midsayap"' `"Midvale
      > "' `"MidwestCity"' `"Midyat"' `"Mielec"' `"Miercurea invalid name
      r(198);
      Do you know why this is? I also tried to break the task into several tasks:

      Code:
      clear all
      use "$user/Firm_names.dta"
      
      * Use levelsof to store all city names in macros
      preserve
      clear all
      use "/$user/company_names/geonames-all-cities-with-a-population-1000.dta"
      egen bucket = cut(Population), group(10)
      forvalues i=0/9{
      levelsof City if bucket==`i', local(cityname_list_temp`i')
      }
      restore
      
      * Extract cities from firm names 
      gen city=""
      foreach local_list in cityname_list_temp0 cityname_list_temp1 cityname_list_temp2 cityname_list_temp3 cityname_list_temp4 cityname_list_temp5 cityname_list_temp6 cityname_list_temp7 cityname_list_temp8 cityname_list_temp9 {
      foreach keyword of local `local_list' {
      local pattern "\(?\b`keyword'\b\)?"
      quietly replace city = ustrregexs(0) if ustrregexm(comp_extracted, "`pattern'")
      }
      }
      The loop runs smoothly over the first two macros, until it hits the third macro and stops with the following error message:

      Code:
      Grossdorf"' `"Ugljan"' `"Uhland"' `"Ulea"' `"UleiladelCampo"' `"Ulenurme"' `"Ulla"' `"Ull
      > ava"' `"Ulvik"' `"UmatacVillage"' `"UmmelQutuf"' `"Unanov"' `"Uncastillo"' `"Ungerdorf"
      > ' `"Ungerhausen"' `"UnidadGrajalesINFONAVIT"' `"UnidadHabitacionalMarianoMatamoros"' `"
      > UnidosAvanzamos"' `"UnionAgropecuariosLazaroCardenasdelNorte"' `"UniondeAzuero"' `"Unte
      > reisenfeld"' `"Unternberg"' `"Unverre"' `"UnyLelant"' `"UpperBearCreek"' `"UpperLake"'
      > `"Urbancrest"' `"Urbe"' `"Urdorf invalid name
      r(198);
      Again, any insights as to what the issue is here?

      Many thanks for your help!

      Last edited by Maia DEBS; 24 Jul 2023, 16:16.


      • #4
        Provide a data example.

        Code:
        clear all
        use "/$user/company_names/geonames-all-cities-with-a-population-1000.dta"
        dataex City if regexm(City, "^Mie") | regexm(City, "^Ur")


        • #5
          Yes, here is the original data:

          Code:
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input str65 City
          "Miechów"              
          "Miechów Charsznica"    
          "Mieders"                
          "Miedes de Atienza"      
          "Miedziana Góra"        
          "Miedzichowo"            
          "Miedzna"                
          "Miedzno"                
          "Miedźna"              
          "Miehikkälä"          
          "Miehlen"                
          "Miejsce Piastowe"      
          "Miejska Górka"        
          "Mielec"                
          "Mielen-boven-Aalst"    
          "Mieleszyn"              
          "Mielkendorf"            
          "Mielno"                
          "Mieming"                
          "Miengo"                
          "Mier"                  
          "Mier y Noriega"        
          "Miercurea Nirajului"    
          "Miercurea Sibiului"    
          "Miercurea-Ciuc"        
          "Mieres"                
          "Mierlo"                
          "Mieroszów"            
          "Mierzęcice"            
          "Miesbach"              
          "Miesenbach"            
          "Miesenbach bei Birkfeld"
          "Mieste"                
          "Mieszkowice"            
          "Mietingen"              
          "Mietoinen"              
          "Mieussy"                
          "Mieza"                  
          "Mieścisko"            
          "Ura Vajgurore"          
          "Urachiche"              
          "Uracoa"                
          "Urago d'Oglio"          
          "Urakawa"                
          "Ural"                  
          "Uralets"                
          "Uralla"                
          "Uralo-Kavkaz"          
          "Urambo"                
          "Uramita"                
          "Uran"                  
          "Urangan"                
          "Urania"                
          "Uras"                  
          "Urasoe"                
          "Urasqui"                
          "Uravakonda"            
          "Uray"                  
          "Urayasu"                
          "Urazovka"              
          "Urazovo"                
          "Urb. Santo Domingo"    
          "Urbach"                
          "Urbach-Überdorf"      
          "Urbana"                
          "Urbancrest"            
          "Urbandale"              
          "Urbania"                
          "Urbano Santos"          
          "Urbar"                  
          "Urbe"                  
          "Urbino"                
          "Urbisaglia"            
          "Urbiztondo"            
          "Urca"                  
          "Urcos"                  
          "Urcuit"                
          "Urda"                  
          "Urdaneta"              
          "Urdari"                
          "Urdazubi / Urdax"      
          "Urdgol"                
          "Urdiales del Páramo"  
          "Urdinarrain"            
          "Urdoma"                
          "Urdorf"                
          "Urdorf / Bodenfeld"    
          "Urdorf / Moos"          
          "Urdorf / Oberurdorf"    
          "Urduña / Orduña"      
          "Urechcha"              
          "Urecheni"              
          "Urecheşti"            
          "Urecho"                
          "Urek’i"              
          "Urengoy"                
          "Uren’"                
          "Ures"                  
          "Ureshino"              
          "Ureshinomachi-shimojuku"
          end
          I didn't include it in my original post, but I run the command...

          Code:
          replace City= ustrregexra(ustrnormalize(City, "nfd" ), "\p{Mark}", "" )
          ...right before running:
          Code:
           levelsof City, local(cityname_list_temp)
          This is to strip diacritical marks (accents) from the city names before matching the datasets.
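
A small illustration of what that normalization step does, using four names taken from the data example above. It decomposes accented letters and drops the combining marks; punctuation such as "/" is not a mark and is left untouched.

```stata
clear
input str20 City
"Miechów"
"Miehikkälä"
"Urecheşti"
"Urdazubi / Urdax"
end

* Decompose to NFD, then delete the combining marks
replace City = ustrregexra(ustrnormalize(City, "nfd"), "\p{Mark}", "")
list
```

After the replace, the first three names contain only unaccented letters, while the slash in the fourth name is still there.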


          • #6
            This works for me with the example dataset.

            Code:
            replace City= ustrregexra(ustrnormalize(City, "nfd" ), "\p{Mark}", "" )
            levelsof City, local(cityname_list_temp)
            
            
            gen city=""
            foreach local_list in cityname_list_temp{
                foreach keyword of local `local_list' {
                    local pattern "\(?\b`keyword'\b\)?"
                    quietly replace city = ustrregexs(0) if ustrregexm(City, "`pattern'")
                }
            }
            My only guess is that you are hitting limits. I do not see the downside of having a large number of macros. Try

            Code:
            use "/$user/company_names/geonames-all-cities-with-a-population-1000.dta", clear
            macro drop _all
            forval i= 1/1200{
                levelsof City if ceil(_n/100)==`i', local(city_list`i')
            }
            use "$user/Firm_names.dta", clear
            replace comp_extracted= ustrregexra(ustrnormalize(comp_extracted, "nfd" ), "\p{Mark}", "" )
            gen city=""
            forval i=1/1200{
                foreach city of local city_list`i'{
                    local pattern "\(?\b`city'\b\)?"
                    quietly replace city = ustrregexs(0) if ustrregexm(comp_extracted, "`pattern'")
                }
            }
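
A possible alternative that avoids large macros altogether is to hold the city names in a Mata string vector and fetch one name per iteration. This is only a sketch on toy data: the vector name CITY and the local ncities are arbitrary, and the toy lists stand in for the real files. Note that -clear all- or -mata clear- would erase the vector, so the sketch uses plain -clear-.

```stata
* Toy city list (stands in for the geonames file)
clear
input str40 City
"Aberdeen"
"Dublin"
"Newport"
end

* Copy the city names into a Mata string column vector
mata: CITY = st_sdata(., "City")

* Toy firm list (stands in for Firm_names.dta)
clear
input str60 comp_extracted
"Aberdeen Gas"
"Alexandra (Newport) Dock"
"Abford Estates"
end

* Loop over the vector, one city per iteration, so no macro holds the full list
mata: st_local("ncities", strofreal(rows(CITY)))
gen city = ""
forvalues i = 1/`ncities' {
    mata: st_local("keyword", CITY[`i'])
    local pattern "\(?\b`keyword'\b\)?"
    quietly replace city = ustrregexs(0) if ustrregexm(comp_extracted, "`pattern'")
}
list
```

Each local holds a single city name, so the macro-length limit no longer depends on how many cities there are.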
            Last edited by Andrew Musau; 25 Jul 2023, 12:17.


            • #7
              Many thanks for your reply, it solved the issue of maximum macro length. However, I kept getting the error message mentioned in my previous reply until I found the mistake: it lies in the special characters. My guess is that the special characters were interacting with the regular expression. Adding these lines of code solved the issue:

              Code:
               
               use "/$user/company_names/geonames-all-cities-with-a-population-1000.dta", clear
               macro drop _all
               replace City = subinstr(City, "(", "",.)
               replace City = subinstr(City, ")", "",.)
               replace City = subinstr(City, "/", "",.)
               replace City = subinstr(City, "`", "",.)
               replace City= ustrregexra(ustrnormalize(City, "nfd" ), "\p{Mark}", "" )
               forval i= 1/1200{
                   levelsof City if ceil(_n/100)==`i', local(city_list`i')
               }
