replacing error characters

Paula de Souza Leao Spinola

Join Date: Jun 2015

Posts: 384
#1

22 Sep 2019, 13:39

I would like to know how to make reference to this character which is displayed in the data editor as a square but is somehow read as �.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str13 registro
"86057��"
"150737�"
"0019634�"
"��"
"11415�"
"0264745�"
"18226�0"
"278145�"
end

I would like to replace it with "". The code below does not work.

Code:

. replace registro = subinstr(registro,"�","",.)
(0 real changes made)

Clyde Schechter

Join Date: Apr 2014

Posts: 30122
#2

22 Sep 2019, 13:52

When I use your -dataex- to load your data into my Stata, the offending character appears as �. By running

Code:

replace registro = subinstr(registro, "�", "", .)

these characters are eliminated from variable registro.

The way to get � into your -replace- command is to copy it from your -dataex- and paste it into the -replace- command.

Paula de Souza Leao Spinola

Join Date: Jun 2015
Posts: 384
#3

22 Sep 2019, 14:11

Hi Clyde Schechter. Something really strange is happening. When I run this -subinstr- command on my original dataset, nothing is replaced. However, when I run the output of -dataex- and then run the -subinstr- command, the replacement is done properly. The problem is that I have to run it on my original dataset, because it is a huge one. Would you have any strategy in mind to get around this?

Running the -subinstr- command from my original dataset

Code:

. replace registro = subinstr(registro, "�", "", .)
(0 real changes made)

. 
. dataex in 1/8

----------------------- copy starting from the next line -----------------------


* Example generated by -dataex-. To install: ssc install dataex
clear
input str13 registro
"105747�" 
"182935�" 
"87967�"  
"85437�"  
"185027�" 
"�"       
"1947707�"
"114017�" 
end
------------------ copy up to and including the previous line ------------------

Listed 8 out of 92 observations

. 
end of do-file

.

Running the -dataex- output and then the -subinstr- command

Code:

. clear

. input str13 registro

          registro
  1. "105747�" 
  2. "182935�" 
  3. "87967�"  
  4. "85437�"  
  5. "185027�" 
  6. "�"       
  7. "1947707�"
  8. "114017�" 
  9. end

. 
. replace registro = subinstr(registro, "�", "", .)
(8 real changes made)

. 
end of do-file

.


Mike Lacy

Join Date: Apr 2014

Posts: 2417
#4

22 Sep 2019, 15:47

Hi Paula and Clyde,

Am I correct that it is not known in advance what characters are "good?" That is, all that is known is that there are some bad nonprinting characters, also of unknown codes? If so, I have a thought that is somewhere between brute force and automation. What about using -hexdump- on the file to see if it's possible to come up with a reasonably short list of the numeric codes of the problem characters? Nick Cox's -charlist- (from SSC) also could be useful in this regard. Other possibilities using -fileread()- on the original file also are possible. (Read the file into a strL, check each character.)
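As a rough sketch of how the numeric codes might be found from inside Stata, without any user-written command, the bytes stored in a single observation can be printed with Mata's ascii() function. This assumes the variable is named registro, as in the thread, and that observation 1 contains one of the problem characters; it has not been run against the original data.

```stata
* Sketch: show the byte codes stored in observation 1 of registro.
* Digits appear as 48-57; anything else is a candidate "bad" code.
mata: ascii(st_sdata(1, "registro"))
```

A code above 127 suggests either an extended ASCII byte or part of a multi-byte UTF-8 sequence, which matters for whether -char()- is the right tool.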

With a list of the numeric codes of the bad characters, one could do something like:

Code:

local badnum = "6 9 186 224"
foreach num of local badnum {
    replace registro = subinstr(registro, char(`num'), "", .)
}

If the list of "good" characters is short, and the list of bad ones is long, then some other approach would be in order, such as:

Code:

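* Characters to keep, written without separators so that a space does not count as "good"
local goodchar = "12345abc"
gen str newregistro = ""
gen str1 s = ""
* Loop up to the longest value in the data, not just the length in observation 1
gen int len = length(registro)
summarize len, meanonly
forval i = 1/`r(max)' {
    replace s = substr(registro, `i', 1)
    * strpos() does a literal lookup, so no character is treated as a regex operator
    replace newregistro = newregistro + s if s != "" & strpos("`goodchar'", s) > 0
}
drop s len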

The preceding is untested, but should be in the right direction. And, someone who is comfortable with regular expressions could probably come up with something more elegant.

Red Owl

Join Date: Nov 2016
Posts: 127
#5

22 Sep 2019, 18:06

Perhaps this would work.

Code:

clear
input str8 var1
  "105747�" 
  "182935�" 
  "87967�"  
  "85437�"  
  "185027�" 
  "�"       
  "1947707�"
  "114017�" 
end

list

destring var1, gen(newvar1) force ignore("�")

list

Red Owl
Stata/IC 16.0 (Windows 10, 64-bit)


Clyde Schechter

Join Date: Apr 2014

Posts: 30122
#6

22 Sep 2019, 19:23

Re #3 and #4: The basic idea in #4 is sound, but -charlist- may not do the job. It was written for version 9, long before Unicode. If Unicode characters are causing this problem, -charlist- will not help. Robert Picard's -chartab- (available from SSC) is more up to date for this purpose. Similarly, if the offending characters are Unicode, then the code in #4 will, I think, need to use -uchar()- instead of -char()-.

Re #5: On its surface, Red Owl's approach does not look more likely to succeed than my suggestion in #2. That suggestion failed because the character printed by -dataex- is apparently not the actual character in the data. (In the screenshot of the data, it appears as an empty square, whereas -dataex- shows a diamond with a question mark inside it.) But it will probably work if you copy one of those empty squares to the clipboard and then paste it into the -destring- command in place of the diamond with the question mark.

Another approach might be to use the (unicode) regular expression functions. I know there is some way in regular expression syntax to code for the removal of non-numeric characters, but I'm not entirely sure what it is; I use regular expressions very rarely and always have to spend a long time looking them up.
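A minimal sketch of the regular-expression idea, assuming Stata 14 or later and assuming that registro should contain digits only (neither is confirmed in the thread). The new variable name registro_clean is made up for illustration. How the Unicode regex functions treat bytes that are not valid UTF-8 has not been tested here, so the result should be inspected before the original variable is overwritten.

```stata
* Sketch: keep only the digits 0-9 and drop every other character.
* registro_clean is a hypothetical name; the type is inferred as string.
generate registro_clean = ustrregexra(registro, "[^0-9]", "")

* Compare the original and cleaned values side by side.
list registro registro_clean in 1/8
```

Unlike -destring-, this keeps the result as a string, so leading zeros such as those in "0019634" are preserved.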
Mike Lacy

Join Date: Apr 2014

Posts: 2417
#7

22 Sep 2019, 20:06

I hadn't checked out -chartab- before now. It's very thorough and could be quite helpful here, though if it's just a matter of wanting numeric characters only, the regular expression approach would be trivial for a regular user of them (which I am not, either).

However, my approach here (absent any prior information on why there are problem characters in the file) would be to use -chartab- and probably -hexdump- to see if I could figure out what's going on before I would be confident enough to just delete the "junk."
Rich Goldstein

Join Date: Mar 2014

Posts: 4470
#8

23 Sep 2019, 06:57

I'm a little confused, so I suggest stepping back and using the -hexdump- command to see what Stata sees. Read the help file first, because you will definitely want one or both of the options -analyze- and -tabulate-. Depending on what is shown, you might want to follow up with the -filefilter- command rather than one of the strategies above.
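A rough sketch of that workflow, assuming the data come from a raw text file. The file names original.txt and cleaned.txt are hypothetical, and the byte value 160 is only a placeholder for whatever code -hexdump- actually reports. The intent of to("") is to delete the byte; check -help filefilter- to confirm that an empty replacement pattern is accepted in your version.

```stata
* Sketch: summarize and tabulate the bytes in the raw file (hypothetical name).
hexdump "original.txt", analyze tabulate

* Sketch: write a copy with one byte value removed.
* \160d is the decimal-code pattern syntax; 160 is a placeholder value.
filefilter "original.txt" "cleaned.txt", from("\160d") to("") replace
```

Writing to a new file leaves the original untouched, so the two can be compared before the cleaned copy is imported.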
Paula de Souza Leao Spinola

Join Date: Jun 2015

Posts: 384
#9

24 Sep 2019, 09:53

FYI: #5 worked out!!
Clyde Schechter

Join Date: Apr 2014

Posts: 30122
#10

26 Sep 2019, 09:10

Thank you for closing the thread by showing a successful solution to the problem.