
  • local with semi-colon using #delimit

    Hello,

    I am cleaning a large dataset & now need to make some manual changes to a string variable so that it is in the correct format to split & reshape.

    I want to build a local in the format <new>|<old>, use gettoken to separate the pairs, and then run the replace command. Some of the strings are very long, so I have changed the delimiter in order to split the definition over several lines. An extract of the do-file code for the local is as follows (there are many more strings to be changed):

    Code:
    * syntax convention: `" <new>|<old> <new>|<old> [...] "'
    #delimit ;
    
    local change `"
    "Pre-registration, USA, Europe.|Pre-registration, USA and Europe."
    "Marketed, UK. Phase III, USA.|Marketed, UK, Phase III, USA."
    "Registered, UK. Pre-registration, Worldwide.|Registered, UK; Pre-registration, Worldwide."
    "' ;
    
    #delimit cr
    The code works fine until I reach a string with a semicolon where I get the error "invalid syntax". Is there a way to overcome this using this current method? Is it possible to change the delimiter to something else?

    Alternatively, I can add to the local in each line using local change `" `change' "new text|old text" "' , but I would prefer the first method for readability of the do file. I am working on a PC in Stata/MP 14.2.

    Thank you for any help you can provide.

    Best wishes,
    Bryony

  • #2
    Two alternatives: using a local for the semicolon, or reading the definitions from a file:
    Code:
    local colon = ";"
    
    #delimit ;
    
    local change `"
    "Pre-registration, USA, Europe.|Pre-registration, USA and Europe."
    "Marketed, UK. Phase III, USA.|Marketed, UK, Phase III, USA."
    "Registered, UK. Pre-registration, Worldwide.|Registered, UK`colon' Pre-registration, Worldwide."
    "' ;
    
    #delimit cr
    
    di `"`change'"'
    
    * make example text file
    tempfile changetext
    local OK = filewrite("`changetext'",`"`change'"',1) 
    type `changetext'
    
    * read textfile
    
    local change2 = fileread("`changetext'")
    
    assert `"`change'"' == `"`change2'"'
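
    For completeness, a minimal sketch of how the resulting local could then be consumed with gettoken and replace, as described in #1. The variable name status and the toy data are hypothetical, and the pairs follow the <new>|<old> convention. Under the default delimiter a semicolon inside the local needs no special handling, so the local is defined on one line here to keep the example self-contained.

    ```stata
    * Toy data; the variable name "status" is hypothetical
    clear
    input str60 status
    "Pre-registration, USA and Europe."
    "Registered, UK; Pre-registration, Worldwide."
    "Launched, Japan."
    end

    * Same <new>|<old> convention as in #1
    local change `" "Pre-registration, USA, Europe.|Pre-registration, USA and Europe." "Registered, UK. Pre-registration, Worldwide.|Registered, UK; Pre-registration, Worldwide." "'

    * Peel off one quoted pair at a time; gettoken strips the outer quotes
    gettoken pair change : change
    while `"`pair'"' != "" {
        gettoken new old : pair, parse("|")   // text before the "|"
        gettoken sep old : old, parse("|")    // drop the "|" itself
        replace status = `"`new'"' if status == `"`old'"'
        gettoken pair change : change
    }

    list status
    ```

    Fetching the next pair at the end of the loop body means the loop stops cleanly even if only blanks remain in the local after the last pair.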



    • #3
      Hi - thank you for your really useful answer. The first method, using the local containing the semicolon, works really well.

      I am having a bit of trouble understanding & implementing the second method & would like to get my head around it. Would you be able to explain any further?

      Thank you again for your time,
      Bryony



      • #4
        Hi, you could save the definitions in a text file (change.txt):

        "Pre-registration, USA, Europe.|Pre-registration, USA and Europe."
        "Marketed, UK. Phase III, USA.|Marketed, UK, Phase III, USA."
        "Registered, UK. Pre-registration, Worldwide.|Registered, UK; Pre-registration, Worldwide."


        Then read the definitions and parse:
        Code:
        ********************************************************************************
        * assumption: text file with definitions (change.txt) 
        ********************************************************************************
        
        version 14
        
        type "change.txt"
        
        ********************************************************************************
        * split using gettoken
        ********************************************************************************
        
        local change = ustrregexra( fileread("change.txt"), "\r\n", "" )
        
        tokenize `"`change'"'
            
        qui forvalues i = 1/1000 {
        
            if ( "``i''" == "" ) {
            
                continue, break
            }
            
            gettoken from to : `i' , parse("|")
            gettoken  sep to : to  , parse("|")
            
            noi di _n "from:  `from'" _n "  to:  `to'"
        }
        Results:
        Code:
        . ********************************************************************************
        . * assumption: text file with definitions (change.txt) 
        . ********************************************************************************
        . 
        . type "change.txt"
         "Pre-registration, USA, Europe.|Pre-registration, USA and Europe." 
         "Marketed, UK. Phase III, USA.|Marketed, UK, Phase III, USA." 
         "Registered, UK. Pre-registration, Worldwide.|Registered, UK; Pre-registration, Worldwide." 
        
        . 
        . ********************************************************************************
        . * split using gettoken
        . ********************************************************************************
        . 
        . local change = ustrregexra( fileread("change.txt"), "\r\n", "" )
        
        . 
        . tokenize `"`change'"'
        
        .         
        . qui forvalues i = 1/1000 {
        
        from:  Pre-registration, USA, Europe.
          to:  Pre-registration, USA and Europe.
        
        from:  Marketed, UK. Phase III, USA.
          to:  Marketed, UK, Phase III, USA.
        
        from:  Registered, UK. Pre-registration, Worldwide.
          to:  Registered, UK; Pre-registration, Worldwide.
        
        . 
        end of do-file
        The splitting of the "|" separated pairs may alternatively be done using a regex or using substr():
        Code:
        ********************************************************************************
        * split using regexm()
        ********************************************************************************
        
        local change = ustrregexra( fileread("change.txt"), "\r\n", "" )
        
        tokenize `"`change'"'
            
        qui forvalues i = 1/1000 {
        
            if ( "``i''" == "" ) {
            
                continue, break
            }
            
            local ismatch = regexm("``i''", "^(.*)[|](.*)$" )
            local from = regexs(1) /* 1. subexpression of regexm() */ 
            local to = regexs(2)   /* 2. subexpression of regexm() */ 
            
            noi di _n "from:  `from'" _n "  to:  `to'"
        }
        
        ********************************************************************************
        * split using substr() 
        ********************************************************************************
        
        local change = ustrregexra( fileread("change.txt"), "\r\n", "" )
        
        tokenize `"`change'"'
            
        qui forvalues i = 1/1000 {
        
            if ( "``i''" == "" ) { 
            
                continue, break
            }
            
            local pair "``i''"
            local sep = "|"
            
            local from = substr( "`pair'", 1 , strpos("`pair'", "`sep'" ) - 1 )
            local to   = substr( "`pair'", strpos("`pair'", "`sep'") + 1, . )
        
            noi di _n "from:  `from'" _n "  to:  `to'"
        }
        
        ********************************************************************************
        
        exit
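
        One caveat on the fileread() step used in the blocks above: the pattern "\r\n" matches Windows line endings only, which fits the PC mentioned in #1. If the definitions file might come from another system, a slightly more defensive variant could look like the sketch below. The tempfile name defs and the two toy definitions are hypothetical and only there to make the example self-contained.

        ```stata
        * Toy definitions file with LF-only line endings (hypothetical content)
        tempfile defs
        local rc = filewrite("`defs'", `" "new one|old one" "' + char(10) + `" "new two|old; two" "', 1)

        * "\r?\n" matches both CRLF and LF; replacing with a space keeps the pairs apart
        local change = ustrregexra(fileread("`defs'"), "\r?\n", " ")

        * fileread() signals failure through its return value, not through an error
        if strpos(`"`change'"', "fileread() error") == 1 {
            display as error `"`change'"'
            exit 601
        }

        tokenize `"`change'"'
        display `"`1'"'
        display `"`2'"'
        ```

        Any of the three splitting loops above can then run unchanged on the tokenized pairs.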



        • #5
          Thank you so much for such a comprehensive reply - this is really useful!
