How can I remove spaces in string variable

Kodi Hannon

Join Date: Feb 2015

Posts: 81
#1


04 Dec 2019, 06:01

Hello everybody!

I have a string variable named "X", which contains a series of numerical codes (e.g., 243 563453 21, 354 44 6, 23435 67).
I need to remove all the spaces from each of these values.

Now, I have tried the standard subinstr function:

replace X = subinstr(X, " ", "", .)

but apparently this works only with letters, not with numeric characters. Could you please help me? I can't find the correct code.

Many thanks!

Kodi
Andrew Musau

Join Date: Oct 2014

Posts: 10192
#2

04 Dec 2019, 06:17

Can you provide an example? It should not matter whether the string contains digits rather than letters.

John Mullahy

Join Date: Dec 2016

Posts: 751
#3

04 Dec 2019, 06:21

Kodi: I'm not sure why you are encountering a problem with subinstr. This worked fine for me:

Code:

. gen str X="243 563453 21, 354 44 6, 23435 67"

. list X in 1/5

     +-----------------------------------+
     |                                 X |
     |-----------------------------------|
  1. | 243 563453 21, 354 44 6, 23435 67 |
  2. | 243 563453 21, 354 44 6, 23435 67 |
  3. | 243 563453 21, 354 44 6, 23435 67 |
  4. | 243 563453 21, 354 44 6, 23435 67 |
  5. | 243 563453 21, 354 44 6, 23435 67 |
     +-----------------------------------+

. replace X=subinstr(X," ","",.)
(100 real changes made)

. list X in 1/5

     +----------------------------+
     |                          X |
     |----------------------------|
  1. | 24356345321,354446,2343567 |
  2. | 24356345321,354446,2343567 |
  3. | 24356345321,354446,2343567 |
  4. | 24356345321,354446,2343567 |
  5. | 24356345321,354446,2343567 |
     +----------------------------+


Nick Cox

Join Date: Mar 2014

Posts: 35698
#4

04 Dec 2019, 06:39

Perhaps it just looks like a space but is another character.

Code:

. di "Stata" uchar(160) "is great"
Stata is great

The first space is uchar(160). The second is a common or garden space.

chartab (SSC) by the inimitable Robert Picard is the best tool I know for checking for problem characters. It superseded charlist (SSC) by someone else.

Code:

.  clear

. set obs 100
number of observations (_N) was 0, now 100

. gen problem  = "Stata" + uchar(160) + "is great"

. chartab problem

   decimal  hexadecimal   character |     frequency    unique name
------------------------------------+----------------------------------------
        32       \u0020             |           100    SPACE
        83       \u0053       S     |           100    LATIN CAPITAL LETTER S
        97       \u0061       a     |           300    LATIN SMALL LETTER A
       101       \u0065       e     |           100    LATIN SMALL LETTER E
       103       \u0067       g     |           100    LATIN SMALL LETTER G
       105       \u0069       i     |           100    LATIN SMALL LETTER I
       114       \u0072       r     |           100    LATIN SMALL LETTER R
       115       \u0073       s     |           100    LATIN SMALL LETTER S
       116       \u0074       t     |           300    LATIN SMALL LETTER T
       160       \u00a0             |           100    NO-BREAK SPACE
------------------------------------+----------------------------------------

                                    freq. count   distinct
ASCII characters              =           1,300          9
Multibyte UTF-8 characters    =             100          1
Unicode replacement character =               0          0
Total Unicode characters      =           1,400         10

https://www.statalist.org/forums/for...equency-counts
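
Once chartab has identified the culprit, the original subinstr() call can be pointed at that character instead. What follows is a minimal sketch, assuming the invisible character is the no-break space uchar(160), as in the example above. The toy value is made up for illustration, and any other code point reported by chartab would be substituted in the same way.

```stata
* Toy data (hypothetical): one code containing a no-break space and an ordinary space
clear
set obs 1
generate X = "243" + uchar(160) + "563453 21"

* Remove the no-break space first, then ordinary spaces
replace X = subinstr(X, uchar(160), "", .)
replace X = subinstr(X, " ", "", .)
list X
```

Running chartab on X again afterwards is a quick way to confirm that only digits remain.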


Kodi Hannon

Join Date: Feb 2015

Posts: 81
#5

04 Dec 2019, 06:51

Thank you Nick! You are the man!
Kodi
Nick Cox

Join Date: Mar 2014

Posts: 35698
#6

04 Dec 2019, 08:29

Good to hear, but what was the precise problem?
Michael McCulloch

Join Date: Jul 2025

Posts: 24
#7

26 Apr 2020, 17:15

Thank you Nick for pointing out chartab (SSC). I've used it to locate multibyte UTF-8 characters within a string variable in my dataset that are not removed by ustrltrim(). They appear as blank space with list, clean.
How would one go about removing those multibyte UTF-8 characters?

William Lisowski

Join Date: Dec 2014

Posts: 10150
#8

26 Apr 2020, 18:37

I'm not sure about this, but perhaps this example will start you in a useful direction. It removes every character outside the single-byte (ASCII) range \u0000-\u007F.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str4 text
"abc" 
"déf"
"ghi" 
end
generate new = ustrregexra(text,"[^\u0000-\u007F]","")
list, clean

Code:

. list, clean

       text   new  
  1.    abc   abc  
  2.    déf    df  
  3.    ghi   ghi
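
As the output shows, this pattern also deletes legitimate accented letters such as é. One way to see which observations would be affected before overwriting anything is to compare byte length with character length. strlen() counts bytes and ustrlen() counts Unicode characters, so for valid UTF-8 text the two differ only when a value contains multibyte characters. The sketch below uses the same toy data; the variable name has_multibyte is made up for illustration.

```stata
clear
input str4 text
"abc"
"déf"
"ghi"
end

* strlen() counts bytes, ustrlen() counts Unicode characters;
* they disagree only where a multibyte character is present
generate byte has_multibyte = strlen(text) != ustrlen(text)
list if has_multibyte, clean
```

Only the flagged observations would change under the ASCII-only replacement, so they can be inspected first.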


Michael McCulloch

Join Date: Jul 2025

Posts: 24
#9

26 Apr 2020, 19:01

Very kind of you to help, William.
Running chartab after your code confirms success in removing characters that look like a space.
Bjarte Aagnes

Join Date: Apr 2014

Posts: 783
#10

27 Apr 2020, 13:39

Code:

generate new = ustrregexra(textvar,"\p{Z}","")

https://www.regular-expressions.info/unicode.html

http://jkorpela.fi/chars/spaces.html
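
\p{Z} is the Unicode "separator" category, which covers the ordinary space and the no-break space among others, but not letters such as é. The sketch below contrasts it with the ASCII-only pattern from William Lisowski's post. The toy value and the variable names ascii_only and no_separators are assumptions for illustration.

```stata
* Toy value (hypothetical): accented letter, no-break space, ordinary space
clear
set obs 1
generate text = "déf" + uchar(160) + "12 34"

* ASCII-only pattern: drops é and the no-break space, keeps the ordinary space
generate ascii_only = ustrregexra(text, "[^\u0000-\u007F]", "")

* Separator pattern: drops both kinds of space, keeps é
generate no_separators = ustrregexra(text, "\p{Z}", "")
list, clean
```

Note that \p{Z} also removes ordinary spaces, which is what the original question asked for. It does not cover tabs, which Unicode classes as control characters.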
