Merging on string variables that seem identical, but no merges produced due to differences in str19 and str47

Chinmay Sharma

Join Date: Nov 2015

Posts: 351
#1

Merging on string variables that seem identical, but no merges produced due to differences in str19 and str47

20 Jun 2018, 08:31

Hi All,

I currently have two datasets. In the first dataset, I have information on country level variables. The countries are coded as string variables and take the following form, for Afghanistan for instance:

----------------------- copy starting from the next line -----------------------

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str47 country " Afghanistan" end

------------------ copy up to and including the previous line ------------------

As can be noticed the string type is a str47. I wish to perform a 1:m merge with using data, with country as the mergng variable. In the using data, the same country, Afghanista, is inputted as:

----------------------- copy starting from the next line -----------------------

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str19 country "Afghanistan" end

------------------ copy up to and including the previous line ------------------

As can be seen, Afghanistan is spelt the exact same way, but is now coded as a str19 variable. As such, when I perform a 1:,m merge using the using data, I get precisely 0 merges. Is there a solution to this problem?

Thanks!
Tags: None
Chinmay Sharma

Join Date: Nov 2015

Posts: 351
#2

20 Jun 2018, 09:10

I think it might actually be the space right before Afghanistan that is causing the issue. I'll look into it.
Comment
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#3

20 Jun 2018, 09:20

If you have many situations like this, maybe - strtrim() - will be helpful.

Best regards,

Marcos
1 like
Comment
Chinmay Sharma

Join Date: Nov 2015

Posts: 351
#4

20 Jun 2018, 09:21

Thank you. I'll look into it.
Comment
Rich Goldstein

Join Date: Mar 2014

Posts: 4471
#5

20 Jun 2018, 09:23

always a good idea to use the trim functions before trying something like this; to get rid of leading and or trailing blanks, use "strtrim" (or its unicode friend if needed); if you think you might also have multiple internal blanks and want to trim those to just one, look at "stritrim" - another string function; you can use

Code:

help function

and click on "string functions" to find the help files
Comment
Chinmay Sharma

Join Date: Nov 2015

Posts: 351
#6

20 Jun 2018, 09:50

I did use strltrim, but to no avail. Number of matches produced continue to be 0. Could this be because of the Str47 vs Str19 nature of the string variables?
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

20 Jun 2018, 10:51

Your problem has nothing to do with str47 v. str19 - Stata is (really, most languages are) smart enough to deal with that difference itself, not by forcing the user to make changes.

Since you don't show us your code and output, the best I can do is show you my code, which works, and leave it to you to compare it to your code and find your problem.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str19 country
"Afghanistan"
end
replace country = strtrim(country)
tempfile using
save `using'
* Example generated by -dataex-. To install: ssc install dataex
clear
input str47 country
" Afghanistan"
end
replace country = strtrim(country)
merge 1:m country using `using'

Code:

. merge 1:m country using `using'

    Result                           # of obs.
    -----------------------------------------
    not matched                             0
    matched                                 1  (_merge==3)
    -----------------------------------------

Comment

Chinmay Sharma

Join Date: Nov 2015

Posts: 351
#8

20 Jun 2018, 11:41

Thanks a lot for your solution- I cannot post more data because of confidentiality reasons. I tried the solution, but I have noticed the problem. Take for example:

clear
input str47 country
" Canada"
end
[/CODE]

I type to remove the space:

Code:

replace country=strtrim(country)

However, I obtain: a message saying that 0 real changes have been made, after which point,I still obtain data of the form:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str47 country
" Canada"
end
[/CODE]

I am not sure if this has to do with other observations in the dataset, but it seems to me that the strtrim command does not make a difference in my dataset.
Comment
Jorrit Gosens

Join Date: Jan 2015

Posts: 1019
#9

20 Jun 2018, 11:49

Surely you shoudl be able to post actual data on the country names. Just select that variable only by doing

Code:

dataex country

Potentially there are leading/trailing characters that are not recognized by strtrim. The exact data example and code you post in #8 work as they should.
Comment

Chinmay Sharma

Join Date: Nov 2015
Posts: 351

#10

20 Jun 2018, 11:54

Here is the country \data:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str47 country
" Canada"       
" Norway"       
" Indonesia"    
" Greenland"    
" Russia"       
" Philippines"  
" Japan"        
" Australia"    
" United States"
" Antarctica"   
end

If I type:

Code:

replace country=strtrim(country)

I still obtain:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str47 country
" Canada"       
" Norway"       
" Indonesia"    
" Greenland"    
" Russia"       
" Philippines"  
" Japan"        
" Australia"    
" United States"
" Antarctica"   
end

Comment

Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#11

20 Jun 2018, 11:55

Have you tried this?

Code:

. gen myvar = strtrim(country)

P.S.: your - replace - command works fine to me.

Last edited by Marcos Almeida; 20 Jun 2018, 11:59.

Best regards,

Marcos
Comment
Chinmay Sharma

Join Date: Nov 2015

Posts: 351
#12

20 Jun 2018, 11:56

Yes, I did. I tried both the gen and the replace option, but to no avail.
Comment

Marcos Almeida

Join Date: Apr 2014
Posts: 4047

#13

20 Jun 2018, 12:11

This is the output we get for the replace command:

Code:

. replace country = strtrim(country)
(10 real changes made)

Look at the output below:

Code:

. input str47 country

                                             country
  1. " Canada"       
  2. " Norway"       
  3. " Indonesia"    
  4. " Greenland"    
  5. " Russia"       
  6. " Philippines"  
  7. " Japan"        
  8. " Australia"    
  9. " United States"
 10. " Antarctica"   
 11. end

. des

Contains data
  obs:            10                          
 vars:             1                          
 size:           470                          
------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------------------------------------
country         str47   %47s                  
------------------------------------------------------------------------------------------------------------
Sorted by: 
     Note: Dataset has changed since last saved.

. compress country
  variable country was str47 now str14
  (330 bytes saved)

. des

Contains data
  obs:            10                          
 vars:             1                          
 size:           140                          
------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------------------------------------
country         str14   %14s                  
------------------------------------------------------------------------------------------------------------
Sorted by: 
     Note: Dataset has changed since last saved.

  . replace country = strtrim(country)
(10 real changes made)

. des

Contains data
  obs:            10                          
 vars:             1                          
 size:           140                          
------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------------------------------------
country         str14   %14s                  
------------------------------------------------------------------------------------------------------------
Sorted by: 
     Note: Dataset has changed since last saved.

. compress country
  variable country was str14 now str13
  (10 bytes saved)

. des

Contains data
  obs:            10                          
 vars:             1                          
 size:           130                          
------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------------------------------------
country         str13   %13s                  
------------------------------------------------------------------------------------------------------------
Sorted by: 
     Note: Dataset has changed since last saved.

Best regards,

Marcos

Comment

William Lisowski

Join Date: Dec 2014

Posts: 10150
#14

20 Jun 2018, 12:18

Jorrit's diagnosis in post #9 seems likely to be correct. However that data was created, the country variable was created with a leading character which is not a space and is apparently a non-printing character.

If every observation for the variable country in that dataset has a leading "space" like this, then

Code:

replace country = substr(country,2,.)

will eliminate the leading character whatever it is. But if some have it and others don't, you will need a different approach.

I would be concerned about the remainder of your data in that dataset, however.

Added in edit: Crossed with Marcos's post #13. In this case, I think we are being misled by the Statalist forum software, which I expect is replacing some nonprintable characters with a space.

And on further consideration, it is possible that using ustrtrim() rather than strtrim() would solve the problem; it has a more expansive definition of "whitespace" characters to be trimmed.

And another edit: The tobytes() function allows us to find out what the first character actually is. With the sample data from post #1:

Code:

. input str47 country country 1. " Afghanistan" 2. end . generate str40 b = tobytes(substr(country,1,4)) . list, clean noobs country b Afghanistan \d032\d065\d102\d103

we see that the first character of the variable as it appears in the forum is a space, followed by an uppercase A, then lowercase f and g. I hypothesize that the same code run on the original data will show something other than \d032 for the first character. If it is a 9, 10, 11, 12, or 13, then ustrtrim() will take care of it.

Last edited by William Lisowski; 20 Jun 2018, 12:33.
1 like
Comment
Chinmay Sharma

Join Date: Nov 2015

Posts: 351
#15

20 Jun 2018, 15:11

Thanks a lot! The ustrtrim() command worked with no issues.
Comment

Announcement