Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Identify where in variable list a certain value appears

    Sorry for the basic question.

    I have a dataset with 100-variable list from v1-v100 with >10k observations. I would like to create a new column to indicate if a certain value is present for each observation and if it does, which variable this value belongs to.

    For example, if the value of interest is "water", for observation #1, if "water" appears on v2, I would like to assign 2 for observation #1....

    Any value appears only once in v1-v100, if any.

    Thank you, everyone.

  • #2
    This may help. In particular, you need capture if you have a mix of numeric and string variables and want to search for a string, and vice versa.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float v1 str5 v2 float v3
    1 "air"        42
    2 "water"     666
    3 "fire"  3.14159
    4 "earth" 2.71828
    end
    
    gen found = . 
    
    forval j = 1/3 { 
        capture replace found = `j' if v`j' == "water"  
    }
    
    list 
    
    
         +------------------------------+
         | v1      v2        v3   found |
         |------------------------------|
      1. |  1     air        42       . |
      2. |  2   water       666       2 |
      3. |  3    fire   3.14159       . |
      4. |  4   earth   2.71828       . |
         +------------------------------+
    For "column" read "variable". "Toto, I have a feeling we're not in Excel any more". https://www.youtube.com/watch?v=uPnfuczOWb8

    I would recommend not sticking with uninformative variable names v1-v100 for any serious analysis.

    Comment


    • #3
      Thank you so much for your reply. Your codes make a lot of sense and seem working just fine but the copycat version somehow did not work on my Stata. Maybe because mine is a Stata 14?
      I agree with you. v1-100 is less ideal; I used it here as an example....
      In fact, the variable list I'm using is from a healthcare billing code-based database so the var list is named as I9_DX`j' with `j' from 1-40 aka "I9_DX1" to "I9_DX40".
      All variables are string format. Essentially each value is assigned based on the diagnosis they use from a taxonomy tree to group clusters of diagnoses. For example, 995 means sepsis in the diagnosis book and based on the cause of sepsis the value entered in each variable can be anywhere between 995.1 - 995.100 (and the . will be omitted so they look as 9951 to 995100).
      And I am interested in counting all of those values starting with 995.

      Anyway, I tried writing my code as below based on what you suggested. I got an error message for invalid syntax (see below)

      gen found = .
      forval j = `I9_DX`j'' of varlist I9_DX1-I9_DX40 {
      capture replace found = `j' if substr(`I9_DX`j'', 1,3) == "995"
      }

      r(198) error for this syntax

      I don't know what I did wrong....

      Comment


      • #4
        Try something like the following modification.
        Code:
        quietly generate str found = ""
        forvalues j = 1/40 {
            quietly replace found = strtrim(found + " `j'") if substr(I9_DX`j', 1, 3) == "995"
        }
        assert !strpos(found, " ")
        The extra stuff is to check your assumption that there is at most one I9_DXj variable that contains a 995 diagnosis.

        The following is an alternative approach that will also check the validity of your assumption.
        Code:
        tempfile tmpfil
        generate int obs_nr = _n
        quietly save `tmpfil'
        
        quietly reshape long I9_DX, i(obs_nr) j(found)
        quietly keep if substr(I9_DX, 1, 3) == "995"
        merge 1:1 obs_nr using `tmpfil', assert(match using) nogenerate

        Comment


        • #5
          In addition to Joseph Coveney's solution note the following errors:

          1. Mushing foreach and forvalues together in the same statement.

          2. Referring to a set of local macros that are never defined. The code refers in turn to local macros I9_DX1 l9_DX2 and so forth, but in each case the appropriate reference is just the variable name.

          You could use foreach here.

          Code:
          gen found = ""
          
          foreach v of var l9DX*  { 
              capture replace found = "`v'" if substr(`v', 1, 3) == "995"  
          }
          But that code does nothing to check the assumption of uniqueness, which as Joseph urges should surely be tested.

          Comment

          Working...
          X