  • compare substrings in same var - same same, but different

    Hi.

    I have a problem with a large list of more than 19,000 towns. More than 1,200 of them have duplicate names (dup != 0); the duplicates may be in the same region ("AB") or in a different one.
    The aim is to identify the codes of all duplicates located outside region "AB" whose name also belongs to at least one town in region "AB" within the respective "group" of duplicates.
    In the following example, the desired codes would be:
    12345
    12346
    (both are located outside region "AB", but share the name "Village" with the town with code 12344, which is located in region "AB").

    Code:
    input code str25 name str2 region
    12344 "Village" "AB"    
    12345 "Village" "TB"    
    12346 "Village" "DC"    
    12347 "Brisbane City" "AB"    
    12348 "Torrento" "TB"    
    12349 "Torrento" "TB"    
    12350 "Brisbane City" "AB"
    12351 "Swanlake" "DC"
    12352 "Island" "WR"
    end
    
    duplicates tag name, gen(dup)
    How may I do this with Stata commands instead of manually in Excel or something similar?
    Is that clear?
    Thank you for reading (and some reply)
    Using Stata 16.1
    Extractions (-dataex-) of the data I'm working with is impossible, sorry!

  • #2
    The following works for your example; however, I can imagine more complicated cases where it might not work, so be careful.
    Code:
    sort name region
    * flag a town if a direct neighbour has the same name but a different region
    gen byte wanted = (name==name[_n-1] & region!=region[_n-1]) | (name==name[_n+1] & region!=region[_n+1])


    • #3
      Yes, that looks appropriate.
      Since the code uses both _n-1 and _n+1, it works for duplicates both "before" and "after" the town in region "AB", right?
      I ask because I don't know whether the duplicates follow or precede (or both) the one in region "AB", since the data are sorted by name first.
      But it would not work for more than one duplicate before or after, respectively? I don't know how many duplicates exist for any one name.
      Last edited by Franz Gerbig; 23 May 2019, 06:59.

      • #4
        If I type -20/+20 instead of -1/+1, it should work for up to 40 duplicates (20 before and 20 after the one in "AB"), right?

        • #5
          Re #4: no, that is not how subscripting works in Stata. If you think you have more than 2 consecutive observations with the same name, you need to be clearer about what is going on. Note that if you have 3 repeats of a name with 3 different regions, the code I gave in #2 will mark all 3 with a "1". Please try the code, examine your results, and report back on anything that is not what you want (again showing data via -dataex- as you did above).
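          To illustrate the point about subscripting: an explicit subscript such as name[_n-1] refers to one single observation, not to a range of observations. The following is a minimal sketch on toy data taken from #1; the variable names prev_name and prev2_name are made up for illustration.

          ```stata
          * Sketch: what explicit subscripts return (toy data, not the real list)
          clear
          input code str25 name str2 region
          12344 "Village" "AB"
          12345 "Village" "TB"
          12346 "Village" "DC"
          12348 "Torrento" "TB"
          end

          sort name region code

          * name[_n-1] is the value of name in the single observation one row above;
          * for the first observation there is no such row, so the result is ""
          gen str25 prev_name = name[_n-1]

          * name[_n-2] looks exactly two rows above and ignores the row in between
          gen str25 prev2_name = name[_n-2]

          list, sepby(name)
          ```

          By the same logic, name[_n-20] would compare each town only with the one town 20 rows above it, not with all 20 towns in between.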


          • #6
            It works fine with the example data:
            Code:
            clear
            input code str25 name str2 region
            12348 "Torrento" "TB"    
            12345 "Village" "AB"    
            12344 "Village" "TB"    
            12346 "Village" "DC"    
            12347 "Brisbane City" "AB"    
            12349 "Torrento" "TB"    
            12350 "Brisbane City" "AB"
            12351 "Swanlake" "DC"
            12352 "Island" "WR"
            12359 "Village" "WR"
            12366 "Village" "TB"
            end
            
            duplicates tag name, gen(dup)
            
            sort name region
            gen byte wanted = 1 if (name==name[_n-1] & region!=region[_n-1]) | (name==name[_n+1] & region!=region[_n+1])
            fre wanted // fre is a community-contributed command: ssc install fre
            fre code if wanted == 1
            
            gen wanted2 = wanted
            replace wanted2 = 0 if region == "AB" //only duplicates out of "AB" wanted
            fre wanted2
            fre code if wanted2 == 1 //fine
            However, it does not work with the (confidential) real data: more duplicates (there are towns with more than 90 duplicates ...) should be coded wanted = 1 (and wanted2 = 1, of course).
            I don't know what is wrong or how to illustrate it.
            Last edited by Franz Gerbig; 23 May 2019, 09:16.
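            One possible explanation, offered as a guess because the real data cannot be shown: after sort name region, the code from #2 only compares each town with its direct neighbours. A town whose neighbours on both sides have the same name and the same region is therefore not flagged, which can happen as soon as three or more same-name towns share a region. Below is a minimal sketch of a group-wise alternative, assuming the goal is exactly as described in #1. The variable names has_ab and wanted3 are hypothetical, and code 12367 is invented for the toy data.

            ```stata
            clear
            input code str25 name str2 region
            12345 "Village" "AB"
            12344 "Village" "TB"
            12366 "Village" "TB"
            12367 "Village" "TB"
            12346 "Village" "DC"
            12347 "Brisbane City" "AB"
            12350 "Brisbane City" "AB"
            12348 "Torrento" "TB"
            12349 "Torrento" "TB"
            12352 "Island" "WR"
            end

            * neighbour comparison from #2, kept for contrast
            sort name region code
            gen byte wanted = (name==name[_n-1] & region!=region[_n-1]) | (name==name[_n+1] & region!=region[_n+1])

            * has_ab is 1 for every town whose name occurs at least once in region "AB"
            bysort name: egen byte has_ab = max(region == "AB")

            * wanted3 marks towns outside "AB" that share a name with a town in "AB",
            * regardless of how many duplicates a name has or where they are sorted
            gen byte wanted3 = has_ab == 1 & region != "AB"

            list code name region wanted wanted3, sepby(name)
            ```

            In this toy data, codes 12366 and 12367 get wanted = 0 from the neighbour comparison but wanted3 = 1 from the group-wise flag. Because has_ab is computed once per name, the result does not depend on the sort order within a name.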
