  • Remove duplicate special characters in a string

    Dear Stata users,

    I am trying to separate addresses inside a string variable into new observations, like:

    from:

    "address11 // address12 // address13"
    "address21 // address22 // address23"

    to:

    address11
    address12
    address13
    address21
    address22
    address23

    The real problem is that the separator ("//") is not constant across observations. My dataset actually looks like this:

    "address11 //////// address12 / address13"
    "address21 ///// address22 /// address23"
    "address31 // address32 // address33"

    I am now trying to make the number of "/" characters constant across rows so that I can then separate the addresses. In other words, I want to convert the dataset above into one like the first example.

    I tried using the moss command from SSC, but I would be very grateful for any ideas on how to do this.

    Thank you in advance,
    Paulo
    Last edited by Paulo Matos; 06 Sep 2018, 18:40.

  • #2
    Hi Paulo,

    Welcome to Statalist.

    The first part of your problem just requires the use of the subinstr() function. For the second part, you can use a combination of split and stack to get what you want.

    Here's some code:

    Code:
    clear
    input str42 var1
    "address11 //////// address12 / address13"
    "address21 ///// address22 /// address23"
    "address31 // address32 // address33"    
    end
    
    *Remove the / characters
    replace var1 = subinstr(var1, "/", "",.)
    
    *Split the variable by empty space and remove the original
    split var1
    drop var1
    
    *Stack the three variables created by split into a single variable
    stack var11 var12 var13, into(var1) clear
    drop _stack
    
    *Now display the output
    sort var1
    list, clean
    
              var1  
      1.   address11  
      2.   address12  
      3.   address13  
      4.   address21  
      5.   address22  
      6.   address23  
      7.   address31  
      8.   address32  
      9.   address33
    Note the use of code delimiters above to display the code and the data in this forum. Please use them in the future, and please share data using the dataex command (ssc install dataex).
    Last edited by Chris Larkin; 06 Sep 2018, 20:40.


    • #3
      If there are also spaces in the address strings, you could do:
      Code:
      split var1, p("/ ")
      stack var11 var12 var13, into(var1) clear
      replace var1 = subinstr(var1, "/", "",.)
      replace var1 = trim(var1)
      drop _stack


      • #4
        Another solution for replacing several occurrences of the separating character by a single instance is to do so via a regular expression.

        Here's my shot at the problem:
        Code:
        clear
        input str42 var1
        "address11 //////// address12 / address13"
        "address21 ///// address22 /// address23"
        "address31 // address32 // address33"    
        end
        
        * replace several consecutive occurrences of "/" with a single "/"
        replace var1=ustrregexra(var1,"/+","/")
        list
        
        * split into several variables, remove original
        split var1 , parse("/")
        drop var1
        list
        
        * reshape to long format
        generate id=_n
        reshape long var1@ , i(id) j(addressno)
        list
        This also creates an id variable and an enumerator for the addresses per observation, which might be useful later on.

        Regards
        Bela
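
        A possible variation on the regular-expression approach, offered only as a sketch: after splitting on "/", the pieces still carry the blanks that surrounded the separators. Assuming the separators are always runs of "/" with optional blanks on either side, the pattern can absorb those blanks too, so that spaces inside an address are left alone. The variable names addr, id and addressno are chosen here for illustration.

        ```stata
        clear
        input str42 var1
        "address11 //////// address12 / address13"
        "address21 ///// address22 /// address23"
        "address31 // address32 // address33"
        end

        * collapse each run of "/" together with surrounding blanks into one "/"
        replace var1 = ustrregexra(var1, " */+ *", "/")

        * split on "/" into addr1, addr2, ... and remove the original
        split var1, parse("/") generate(addr)
        drop var1

        * reshape to long format: one address per observation
        generate id = _n
        reshape long addr, i(id) j(addressno)

        * drop empty slots from rows with fewer addresses than the widest row
        drop if missing(addr)
        list, sepby(id)
        ```

        Unlike stack, reshape long does not need the list of split variables typed out, which may help when the number of addresses per row is not known in advance.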


        • #5
          You can also find multiple occurrences of substrings that do not contain the delimiter with moss (from SSC):

          Code:
          clear
          input str42 var1
          "address11 //////// address12 / address13"
          "address21 ///// address22 /// address23"
          "address31 // address32 // address33"    
          end
          
          moss var1, match("([^/]+)") regex
          list _match*
          and the results:
          Code:
          . list _match*
          
               +---------------------------------------+
               |    _match1       _match2      _match3 |
               |---------------------------------------|
            1. | address11     address12     address13 |
            2. | address21     address22     address23 |
            3. | address31     address32     address33 |
               +---------------------------------------+
          
          .
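
          To get from the wide _match* variables to one address per observation, as asked in the original post, a reshape along the following lines might work. This is a sketch that assumes moss has been installed from SSC (ssc install moss) and that only the _match* variables are needed afterwards; id, addressno and address are names chosen here for illustration.

          ```stata
          clear
          input str42 var1
          "address11 //////// address12 / address13"
          "address21 ///// address22 /// address23"
          "address31 // address32 // address33"
          end

          moss var1, match("([^/]+)") regex

          * keep only the matched substrings plus a row identifier
          generate id = _n
          keep id _match*

          * one address per observation
          reshape long _match, i(id) j(addressno)

          * remove blanks left next to the delimiters, then drop empty matches
          replace _match = trim(_match)
          drop if _match == ""
          rename _match address
          list, sepby(id)
          ```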


          • #6
            Thank you so much, everyone, for the quick answers! All the advice was very useful!
