Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Matching two string variables

    Hi,
    In the original dataset, var1 would have more than 200K values and var2 have more than 14K values (all strings).

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str4 var1 str5 var2
    "R560" "A00 " 
    "T393" "A000 "
    "J219" "A001 "
    "P59"  "A009 "
    "K092" "A01 " 
    "R509" "A010 "
    "G379" "A011 "
    "A162" "A012 "
    "P073" "A013 "
    "F209" "A014 "
    "Z523" "A02 " 
    "C169" "K092 "
    "H663" "A021 "
    "S835" "A022 "
    "Z501" "A028 "
    "B209" "A029 "
    "F192" "A03 " 
    "D414" "A030 "
    "C419" "A031 "
    "J039" "A032 "
    "U071" "A033 "
    "Z000" "A038 "
    "G129" "A039 "
    "J189" "A04 " 
    "E14"  "A040 "
    "K810" "A041 "
    "R560" "A042 "
    "G060" "A043 "
    "G473" "A044 "
    "P073" "A045 "
    "F209" "A046 "
    "D70"  "A047 "
    "C719" "A048 "
    "L021" "A049 "
    "M659" "A05 " 
    "Z501" "A050 "
    "A527" "A051 "
    "F151" "A052 "
    "N270" "A053 "
    "C910" "A054 "
    "U071" "A058 "
    "U071" "A059 "
    "K297" "A162 " 
    "Z922" "A060 "
    "J219" "A061 "
    "J219" "A062 "
    "S024" "A063 "
    "N390" "A064 "
    "Q039" "A065 "
    "G473" "A066 "
    end

    I want Stata to remember the string values of var2 and check if there is any occasion where there is a match in var1 then create a dummy variable and assign value of 1 to it. I tried the following:

    Code:
    levelsof var2 , local (codes)
    gen Flag =0
    foreach n of local codes {
        replace Flag =1 if ( var1 == "`n'")
        }
    But for some reason this does not work in terms of results.

    Is there anyway around to achieve it? I also tried egen any match but it says in the error msg that var1 is a string variable and cannot be included in the varlist!

  • #2
    You have trailing spaces in "var2". So eliminate these using

    Code:
    help strtrim
    or include this in the command

    Code:
    replace Flag =1 if (var1 == strtrim("`n'"))

    Comment


    • #3
      Thanks Andrew, we moving to right direction, but there are some instances where the command did not get the match string values especially those that starts with letter Y, V and W. Something odd and I cannot figure it out!

      Comment


      • #4
        Provide a dataex example of such cases.

        Comment


        • #5
          What I am trying to find is if there is any string values in var1 that is outside the range of string values of var2.

          Comment


          • #6
            var2 contains over 14K values and I copy and past it from excel file, and its not attached to any sort of id or obs unlike to var1.

            Comment


            • #7
              Indeed, that is clear. You state in #3 that there are cases where your code in #1 with modifications for trailing blanks fails. Show us such cases. Another way to do this is:

              Code:
              replace var1=strtrim(var1)
              frame put var2, into(match)
              frame match{
                  replace var2= strtrim(var2)
                  contract var2, nomiss
              }
              frlink m:1 var1, frame(match var2)
              *VAR1 MATCHED
              browse var1 match if !missing(match)
              You may switch var1 and var2 if you wish to find matches in the opposite direction.

              Comment


              • #8
                Thanks again, but I got same results as before, which I know is not correct. I wish I can share an example of the data but it is difficult to do so as it is not row-wise therefore it will not be good for comparison unless someone have the whole data which is not practical!

                Comment


                • #9
                  You can select a sample of the data, e.g.,

                  Code:
                  dataex in 2700/2730
                  for observations between and including the observation numbers 2700 and 2730. A guess is that what you think are spaces are other characters, in which case substitute


                  Code:
                  strtrim()
                  with its Unicode counterpart

                  Code:
                  ustrtrim()
                  Last edited by Andrew Musau; 08 Jun 2023, 16:54.

                  Comment


                  • #10
                    I think you are right about the list itself, because I opened a new stata file and entered the string values manually and then run the command and it have been picked up. How can I make sure the string values are intact, how can I use the both codes that you suggest?

                    Comment


                    • #11
                      I do not know specifically what solved your problem. The only way to find out is by presenting a reproducible example which will allow someone to investigate this.

                      Comment


                      • #12
                        I figured out the issue, it was var2 since it has all the list values. This codes would works
                        Code:
                         
                         levelsof var2 , local (codes) gen Flag =0 foreach n of local codes {     replace Flag =1 if ( var1 == "`n'")     }
                        But how to exclude string values that starts with initials V*, W*, X*, Y* from local codes before looping through var2? I tried replace Flag =1 if ( var1 == "`n'") & !inlist(var2, "V*", "W*", "X*", "Y*") but it did not work

                        Comment


                        • #13
                          If I remove all values in var2 that starts with initials VWXY, the results will be correct, but I don't want to remove those values from var2, so is there any way to overcome this issue?

                          Thanks

                          Comment


                          • #14
                            #12

                            Code:
                            & !ustrregexm(var2, "^V|^W|^X|^Y")
                            #13 - see #11.

                            Comment

                            Working...
                            X