Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging on string variables that seem identical, but no merges produced due to differences in str19 and str47

    Hi All,

    I currently have two datasets. In the first dataset, I have information on country level variables. The countries are coded as string variables and take the following form, for Afghanistan for instance:


    ----------------------- copy starting from the next line -----------------------
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str47 country
    " Afghanistan"
    end
    ------------------ copy up to and including the previous line ------------------


    As can be noticed the string type is a str47. I wish to perform a 1:m merge with using data, with country as the mergng variable. In the using data, the same country, Afghanista, is inputted as:


    ----------------------- copy starting from the next line -----------------------
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str19 country
    "Afghanistan"
    end
    ------------------ copy up to and including the previous line ------------------


    As can be seen, Afghanistan is spelt the exact same way, but is now coded as a str19 variable. As such, when I perform a 1:,m merge using the using data, I get precisely 0 merges. Is there a solution to this problem?

    Thanks!

  • #2
    I think it might actually be the space right before Afghanistan that is causing the issue. I'll look into it.

    Comment


    • #3
      If you have many situations like this, maybe - strtrim() - will be helpful.
      Best regards,

      Marcos

      Comment


      • #4
        Thank you. I'll look into it.

        Comment


        • #5
          always a good idea to use the trim functions before trying something like this; to get rid of leading and or trailing blanks, use "strtrim" (or its unicode friend if needed); if you think you might also have multiple internal blanks and want to trim those to just one, look at "stritrim" - another string function; you can use
          Code:
          help function
          and click on "string functions" to find the help files

          Comment


          • #6
            I did use strltrim, but to no avail. Number of matches produced continue to be 0. Could this be because of the Str47 vs Str19 nature of the string variables?

            Comment


            • #7
              Your problem has nothing to do with str47 v. str19 - Stata is (really, most languages are) smart enough to deal with that difference itself, not by forcing the user to make changes.

              Since you don't show us your code and output, the best I can do is show you my code, which works, and leave it to you to compare it to your code and find your problem.
              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input str19 country
              "Afghanistan"
              end
              replace country = strtrim(country)
              tempfile using
              save `using'
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input str47 country
              " Afghanistan"
              end
              replace country = strtrim(country)
              merge 1:m country using `using'
              Code:
              . merge 1:m country using `using'
              
                  Result                           # of obs.
                  -----------------------------------------
                  not matched                             0
                  matched                                 1  (_merge==3)
                  -----------------------------------------

              Comment


              • #8
                Thanks a lot for your solution- I cannot post more data because of confidentiality reasons. I tried the solution, but I have noticed the problem. Take for example:

                clear
                input str47 country
                " Canada"
                end
                [/CODE]

                I type to remove the space:
                Code:
                replace country=strtrim(country)
                However, I obtain: a message saying that 0 real changes have been made, after which point,I still obtain data of the form:

                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input str47 country
                " Canada"
                end
                [/CODE]

                I am not sure if this has to do with other observations in the dataset, but it seems to me that the strtrim command does not make a difference in my dataset.

                Comment


                • #9
                  Surely you shoudl be able to post actual data on the country names. Just select that variable only by doing
                  Code:
                  dataex country
                  Potentially there are leading/trailing characters that are not recognized by strtrim. The exact data example and code you post in #8 work as they should.

                  Comment


                  • #10
                    Here is the country \data:

                    Code:
                    * Example generated by -dataex-. To install: ssc install dataex
                    clear
                    input str47 country
                    " Canada"       
                    " Norway"       
                    " Indonesia"    
                    " Greenland"    
                    " Russia"       
                    " Philippines"  
                    " Japan"        
                    " Australia"    
                    " United States"
                    " Antarctica"   
                    end
                    If I type:
                    Code:
                    replace country=strtrim(country)
                    I still obtain:


                    Code:
                    * Example generated by -dataex-. To install: ssc install dataex
                    clear
                    input str47 country
                    " Canada"       
                    " Norway"       
                    " Indonesia"    
                    " Greenland"    
                    " Russia"       
                    " Philippines"  
                    " Japan"        
                    " Australia"    
                    " United States"
                    " Antarctica"   
                    end

                    Comment


                    • #11
                      Have you tried this?

                      Code:
                      . gen myvar = strtrim(country)
                      P.S.: your - replace - command works fine to me.
                      Last edited by Marcos Almeida; 20 Jun 2018, 11:59.
                      Best regards,

                      Marcos

                      Comment


                      • #12
                        Yes, I did. I tried both the gen and the replace option, but to no avail.

                        Comment


                        • #13
                          This is the output we get for the replace command:

                          Code:
                          . replace country = strtrim(country)
                          (10 real changes made)
                          Look at the output below:

                          Code:
                          . input str47 country
                          
                                                                       country
                            1. " Canada"       
                            2. " Norway"       
                            3. " Indonesia"    
                            4. " Greenland"    
                            5. " Russia"       
                            6. " Philippines"  
                            7. " Japan"        
                            8. " Australia"    
                            9. " United States"
                           10. " Antarctica"   
                           11. end
                          
                          . des
                          
                          Contains data
                            obs:            10                          
                           vars:             1                          
                           size:           470                          
                          ------------------------------------------------------------------------------------------------------------
                                        storage   display    value
                          variable name   type    format     label      variable label
                          ------------------------------------------------------------------------------------------------------------
                          country         str47   %47s                  
                          ------------------------------------------------------------------------------------------------------------
                          Sorted by: 
                               Note: Dataset has changed since last saved.
                          
                          . compress country
                            variable country was str47 now str14
                            (330 bytes saved)
                          
                          . des
                          
                          Contains data
                            obs:            10                          
                           vars:             1                          
                           size:           140                          
                          ------------------------------------------------------------------------------------------------------------
                                        storage   display    value
                          variable name   type    format     label      variable label
                          ------------------------------------------------------------------------------------------------------------
                          country         str14   %14s                  
                          ------------------------------------------------------------------------------------------------------------
                          Sorted by: 
                               Note: Dataset has changed since last saved.
                          
                            . replace country = strtrim(country)
                          (10 real changes made)
                          
                          . des
                          
                          Contains data
                            obs:            10                          
                           vars:             1                          
                           size:           140                          
                          ------------------------------------------------------------------------------------------------------------
                                        storage   display    value
                          variable name   type    format     label      variable label
                          ------------------------------------------------------------------------------------------------------------
                          country         str14   %14s                  
                          ------------------------------------------------------------------------------------------------------------
                          Sorted by: 
                               Note: Dataset has changed since last saved.
                          
                          . compress country
                            variable country was str14 now str13
                            (10 bytes saved)
                          
                          . des
                          
                          Contains data
                            obs:            10                          
                           vars:             1                          
                           size:           130                          
                          ------------------------------------------------------------------------------------------------------------
                                        storage   display    value
                          variable name   type    format     label      variable label
                          ------------------------------------------------------------------------------------------------------------
                          country         str13   %13s                  
                          ------------------------------------------------------------------------------------------------------------
                          Sorted by: 
                               Note: Dataset has changed since last saved.
                          Best regards,

                          Marcos

                          Comment


                          • #14
                            Jorrit's diagnosis in post #9 seems likely to be correct. However that data was created, the country variable was created with a leading character which is not a space and is apparently a non-printing character.

                            If every observation for the variable country in that dataset has a leading "space" like this, then
                            Code:
                            replace country = substr(country,2,.)
                            will eliminate the leading character whatever it is. But if some have it and others don't, you will need a different approach.

                            I would be concerned about the remainder of your data in that dataset, however.

                            Added in edit: Crossed with Marcos's post #13. In this case, I think we are being misled by the Statalist forum software, which I expect is replacing some nonprintable characters with a space.

                            And on further consideration, it is possible that using ustrtrim() rather than strtrim() would solve the problem; it has a more expansive definition of "whitespace" characters to be trimmed.

                            And another edit: The tobytes() function allows us to find out what the first character actually is. With the sample data from post #1:
                            Code:
                            . input str47 country
                            
                                                                         country
                              1. " Afghanistan"
                              2. end
                            
                            . generate str40 b = tobytes(substr(country,1,4))
                            
                            . list, clean noobs
                            
                                     country                      b  
                                 Afghanistan   \d032\d065\d102\d103
                            we see that the first character of the variable as it appears in the forum is a space, followed by an uppercase A, then lowercase f and g. I hypothesize that the same code run on the original data will show something other than \d032 for the first character. If it is a 9, 10, 11, 12, or 13, then ustrtrim() will take care of it.
                            Last edited by William Lisowski; 20 Jun 2018, 12:33.

                            Comment


                            • #15
                              Thanks a lot! The ustrtrim() command worked with no issues.

                              Comment

                              Working...
                              X