  • Find occurrences of string across multiple variables

    Hi,

    I want to count the occurrences of a particular string across multiple string variables, by different groups. There is a large number of variables with various names, so they might have to be renamed first, but I don't know how to do that either. That also makes it difficult for me to use the reshape command, even though reshaping might be a solution as well.

    Here's an example:
    ID  V1               V2               V3               count_relevant_string
    1   relevant_string  relevant_string  Gibberish word   2
    2   Gibberish word   Gibberish word   relevant_string  1
    3   Gibberish word   Gibberish word   Gibberish word   0
    Where I want to create the variable count_relevant_string.

    Thanks in advance
    Jonatan

  • #2
    Your example implies that you're testing for equality:

    Code:
    gen wanted = 0
    
    quietly foreach v in V1 V2 V3 {
         replace wanted = wanted + (`v' == "relevant_string")
    }
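
    If the variables have many different names, the same loop can run over every string variable without listing them by hand. What follows is only a sketch under assumptions: it treats every string variable in the dataset as a candidate, keeps the exact-equality test, and uses a hypothetical grouping variable called group for the per-group total.

    ```stata
    * Sketch: count exact matches across all string variables, whatever their names.
    * Assumption: every string variable in memory should be searched.
    ds, has(type string)            // leaves the names of all string variables in r(varlist)
    local strvars `r(varlist)'

    gen wanted = 0
    quietly foreach v of local strvars {
        replace wanted = wanted + (`v' == "relevant_string")
    }

    * Assumption: "group" is a hypothetical grouping variable, not in the original example.
    egen wanted_by_group = total(wanted), by(group)
    ```

    Drop the last line if a per-observation count is all that is needed.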


    • #3
      Jonatan, Nick's code is probably the fastest & easiest way to get what you want. I'm just adding some alternate solutions (and ideas that might also help you along your way):

      You might also find the posts here and here to be helpful

      Code:
      * Could do rename v* word*
      * See "help rename group" and "help renvar" for help with renaming groups of variables
      
      * Example data shared via -dataex- (created with: dataex id v1 v2 v3). To install: ssc install dataex
      clear
      input byte id str15(v1 v2 v3)
      1 "relevant_string" "relevant_string" "Gibberish word"
      2 "Gibberish word"  "Gibberish word"  "relevant_string"
      3 "Gibberish word"  "Gibberish word"  "Gibberish word"
      end
      * ------------------ copy up to and including the previous line ------------------
      
      
      egen count_non_blanks = rownonmiss(v1-v3), strok  // counts non-blanks within v1-v3
      ssc install egenmore  // in case you don't have it already
      egen distinct = rowsvals(v*)   // counts unique values within v1 v2 v3, ignores missing
      
      
      . list, noobs abbrev(18)
      
        +----------------------------------------------------------------------------------------+
        | id                v1                v2                v3   count_non_blanks   distinct |
        |----------------------------------------------------------------------------------------|
        |  1   relevant_string   relevant_string    Gibberish word                  3          2 |
        |  2    Gibberish word    Gibberish word   relevant_string                  3          2 |
        |  3    Gibberish word    Gibberish word    Gibberish word                  3          1 |
        +----------------------------------------------------------------------------------------+
      
      * Reshaping to long
      drop count_non_blanks distinct
      reshape long v, i(id) j(word)
      bysort id (word): gen count = _N
      gen is_relevant = (strpos(v, "relevant_string") > 0)
      egen count_if_relevant = total(is_relevant), by(id)
      
      . list, sepby(id) noobs abbrev(18)
      
        +-----------------------------------------------------------------------+
        | id   word                 v   count   is_relevant   count_if_relevant |
        |-----------------------------------------------------------------------|
        |  1      1   relevant_string       3             1                   2 |
        |  1      2   relevant_string       3             1                   2 |
        |  1      3    Gibberish word       3             0                   2 |
        |-----------------------------------------------------------------------|
        |  2      1    Gibberish word       3             0                   1 |
        |  2      2    Gibberish word       3             0                   1 |
        |  2      3   relevant_string       3             1                   1 |
        |-----------------------------------------------------------------------|
        |  3      1    Gibberish word       3             0                   0 |
        |  3      2    Gibberish word       3             0                   0 |
        |  3      3    Gibberish word       3             0                   0 |
        +-----------------------------------------------------------------------+
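
      If the real variables do not share a stub like v, one possible way to prepare them for reshape is to rename them as a group first. This is only a sketch, starting again from the wide data: it assumes every string variable should be included, that id uniquely identifies observations, and that the names word and position (both hypothetical) are not already in use.

      ```stata
      * Sketch: give arbitrarily named string variables a common stub, then reshape.
      ds, has(type string)                     // names of all string variables go to r(varlist)
      rename (`r(varlist)') word#, addnumber   // first variable becomes word1, second word2, ...
      reshape long word, i(id) j(position)

      * Same counting step as above, now on the long data
      gen is_relevant = (word == "relevant_string")
      egen count_if_relevant = total(is_relevant), by(id)
      ```

      The equality test here is stricter than the strpos() check above; swap it back in if substring matches should count.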
      Last edited by David Benson; 13 Feb 2019, 14:05.
