  • How can I remove spaces in string variable

    Hello everybody!

    I have a string variable named "X", which contains a series of numerical codes (e.g., 243 563453 21, 354 44 6, 23435 67, etc.).
    I need to remove all the spaces from each of these values.

    Now, I have tried the standard subinstr function:

    replace X = subinstr(X, " ", "", .)

    but apparently this works only with letters, not with numeric characters. Could you please help me? I can't find the correct code.

    Many thanks!

    Kodi

  • #2
    Can you provide an example? It should not matter if the string contains numbers and not words.


    • #3
      Kodi: I'm not sure why you are encountering a problem with subinstr. This worked fine for me:
      Code:
      . gen str X="243 563453 21, 354 44 6, 23435 67"
      
      . list X in 1/5
      
           +-----------------------------------+
           |                                 X |
           |-----------------------------------|
        1. | 243 563453 21, 354 44 6, 23435 67 |
        2. | 243 563453 21, 354 44 6, 23435 67 |
        3. | 243 563453 21, 354 44 6, 23435 67 |
        4. | 243 563453 21, 354 44 6, 23435 67 |
        5. | 243 563453 21, 354 44 6, 23435 67 |
           +-----------------------------------+
      
      . replace X=subinstr(X," ","",.)
      (100 real changes made)
      
      . list X in 1/5
      
           +----------------------------+
           |                          X |
           |----------------------------|
        1. | 24356345321,354446,2343567 |
        2. | 24356345321,354446,2343567 |
        3. | 24356345321,354446,2343567 |
        4. | 24356345321,354446,2343567 |
        5. | 24356345321,354446,2343567 |
           +----------------------------+


      • #4
        Perhaps it just looks like a space but is another character.

        Code:
        . di "Stata" uchar(160) "is great"
        Stata is great
        The first space is uchar(160). The second is a common or garden space.

        chartab (SSC) by the inimitable Robert Picard is the best tool I know for checking for problem characters. It superseded charlist (SSC) by someone else.

        Code:
        .  clear
        
        . set obs 100
        number of observations (_N) was 0, now 100
        
        . gen problem  = "Stata" + uchar(160) + "is great"
        
        . chartab problem
        
           decimal  hexadecimal   character |     frequency    unique name
        ------------------------------------+----------------------------------------
                32       \u0020             |           100    SPACE
                83       \u0053       S     |           100    LATIN CAPITAL LETTER S
                97       \u0061       a     |           300    LATIN SMALL LETTER A
               101       \u0065       e     |           100    LATIN SMALL LETTER E
               103       \u0067       g     |           100    LATIN SMALL LETTER G
               105       \u0069       i     |           100    LATIN SMALL LETTER I
               114       \u0072       r     |           100    LATIN SMALL LETTER R
               115       \u0073       s     |           100    LATIN SMALL LETTER S
               116       \u0074       t     |           300    LATIN SMALL LETTER T
               160       \u00a0             |           100    NO-BREAK SPACE
        ------------------------------------+----------------------------------------
        
                                            freq. count   distinct
        ASCII characters              =           1,300          9
        Multibyte UTF-8 characters    =             100          1
        Unicode replacement character =               0          0
        Total Unicode characters      =           1,400         10
        https://www.statalist.org/forums/for...equency-counts
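
        If chartab does report a NO-BREAK SPACE, one possible fix is to pass that character to subinstr() explicitly, alongside the ordinary space. The following is only a minimal sketch on made-up data: the three observations and the value in X are assumptions for illustration, not the original poster's dataset.

        ```stata
        * Illustrative data (an assumption): a code containing one no-break
        * space, uchar(160), and one ordinary space
        clear
        set obs 3
        generate X = "243" + uchar(160) + "563453 21"

        * Remove the no-break space first, then the ordinary space
        replace X = subinstr(X, uchar(160), "", .)
        replace X = subinstr(X, " ", "", .)

        list X
        ```

        Running chartab X afterwards is a simple way to check that neither kind of space is left.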


        • #5
          Thank you Nick! You are the man!
          Kodi


          • #6
            Good to hear, but what was the precise problem?


            • #7
              Thank you Nick for pointing out chartab (SSC). I've used it to locate multibyte UTF-8 characters within a string variable in my dataset that are not removed by ustrltrim. They appear as blank space with list, clean.
              How would one go about removing those multibyte UTF-8 characters?


              • #8
                I'm not sure about this, but perhaps this example will start you in a useful direction. It removes all UTF-8 characters other than the single-byte (ASCII) characters.
                Code:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input str4 text
                "abc" 
                "déf"
                "ghi" 
                end
                generate new = ustrregexra(text,"[^\u0000-\u007F]","")
                list, clean
                Code:
                . list, clean
                
                       text   new  
                  1.    abc   abc  
                  2.    déf    df  
                  3.    ghi   ghi


                • #9
                  Very kind of you to help, William.
                  Running chartab after your code confirms success in removing characters that look like a space.


                  • #10
                    Code:
                    generate new = ustrregexra(textvar,"\p{Z}","")
                    https://www.regular-expressions.info/unicode.html

                    http://jkorpela.fi/chars/spaces.html
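
                    A small sketch of what the \p{Z} pattern does, on made-up data (the variable text and its two values are assumptions for illustration): it should strip ordinary and no-break spaces while leaving accented letters such as é in place, unlike the ASCII-only pattern in #8.

                    ```stata
                    * Illustrative data (an assumption): one value with an ordinary
                    * space, one value with an accented letter
                    clear
                    input str10 text
                    "a bc"
                    "déf"
                    end

                    * Append a no-break space, uchar(160), plus "x" to the first value
                    replace text = text + uchar(160) + "x" in 1

                    * \p{Z} matches the Unicode separator categories (spaces of all kinds)
                    generate new = ustrregexra(text, "\p{Z}", "")
                    list, clean
                    ```

                    Tabs and line feeds are classed as control characters rather than separators, so they would need a separate pattern.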
