Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Identifying and replacing duplicate values across multiple string variables

    Hi. My data has each row identified by an id variable and multiple language string variables that list languages spoken for that id. However, the same language may be duplicated across the language variables. I want to retain each language in just the first variable instance that a language occurs in and replace all duplicates with NA. Is there a way I can write loop to run on all the language variables, look for duplicates across all the variables for each row and replace duplicates with a "NA"? For example, in the second row below, language1 will be Hindi" but all others would be NA

    Code:
    input double performer_id str8(language1 language2 language3 language4)
    81016542895 "Kannada"  "NA"       "NA"       "NA"     
    81013244989 "Hindi "   "Hindi"    "Hindi"    "Hindi " 
    81013267181 "Hindi"    "Hindi"    "NA"       "NA"     
    81009910893 "Hindi"    "Gujarati" "Marathi"  "Punjabi"
    81013228751 "Punjabi"  "Punjabi"  "NA"       "NA"     
    81015437069 "Hindi "   "Hindi"    "Gujarati" "Marathi"
    81010739006 "Hindi"    "Hindi "   "Gujarati" "Marathi"
    81015437069 "Hindi "   "Hindi"    "Gujarati" "Marathi"
    81011361270 "Hindi"    "Hindi"    "Hindi "   "Kannada"
    81013269951 "Kannada"  "Gujarati" "NA"       "NA"     
    81029734638 "Hindi"    "NA"       "NA"       "NA"

  • #2
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input double performer_id str8(language1 language2 language3 language4)
    81016542895 "Kannada" "NA"       "NA"       "NA"     
    81013244989 "Hindi "  "Hindi"    "Hindi"    "Hindi " 
    81013267181 "Hindi"   "Hindi"    "NA"       "NA"     
    81009910893 "Hindi"   "Gujarati" "Marathi"  "Punjabi"
    81013228751 "Punjabi" "Punjabi"  "NA"       "NA"     
    81015437069 "Hindi "  "Hindi"    "Gujarati" "Marathi"
    81010739006 "Hindi"   "Hindi "   "Gujarati" "Marathi"
    81015437069 "Hindi "  "Hindi"    "Gujarati" "Marathi"
    81011361270 "Hindi"   "Hindi"    "Hindi "   "Kannada"
    81013269951 "Kannada" "Gujarati" "NA"       "NA"     
    81029734638 "Hindi"   "NA"       "NA"       "NA"     
    end
    format performer_id %12.0f
    generate seq = _n
    reshape long language, i(seq performer_id) j(lnum)
    replace language = trim(language)
    by seq language (lnum), sort: replace language = "NA" if _n>1
    replace lnum = 9 if language=="NA"
    by seq (lnum), sort: replace lnum = _n
    reshape wide
    list, clean abbreviate(16)
    Code:
    . list, clean abbreviate(16)
    
           seq   performer_id   language1   language2   language3   language4  
      1.     1    81016542895     Kannada          NA          NA          NA  
      2.     2    81013244989       Hindi          NA          NA          NA  
      3.     3    81013267181       Hindi          NA          NA          NA  
      4.     4    81009910893       Hindi    Gujarati     Marathi     Punjabi  
      5.     5    81013228751     Punjabi          NA          NA          NA  
      6.     6    81015437069       Hindi    Gujarati     Marathi          NA  
      7.     7    81010739006       Hindi    Gujarati     Marathi          NA  
      8.     8    81015437069       Hindi    Gujarati     Marathi          NA  
      9.     9    81011361270       Hindi     Kannada          NA          NA  
     10.    10    81013269951     Kannada    Gujarati          NA          NA  
     11.    11    81029734638       Hindi          NA          NA          NA
    The example code gets rid of leading and trailing blanks in the languages (see performer_id 81013244989) and ensures that the NA's all come at the end of the list. The inclusion of the variable seq is needed because your performer_id is not distinct (performer_id 81015437069 appears twice in your data).

    Comment


    • #3
      Thanks William Lisowski. That worked perfectly!

      Comment

      Working...
      X