Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Create a variable (0/1) based on multiple variables containing string

    Hello,

    I want to tell stata to create a variable giving a code of 0/1 based on around 200 variables containing string. If any values within those 200 variables contain F20 or F21 I want the new variable to be 1. Any advice greatly received. I want it to ignore any missing values but if all 200 values are missing I would want the new variable to give a missing value not a 0.

  • #2
    Give this a try. I've chosen transparent code over quick code. Didn't have anything to test it on, though. I hope you have some easy way to make YourVarlist from your 200 variables, e.g., wild cards.
    Code:
    // Make a temporary variable containing all your variables as one big string.
    egen strL temp200 = concat(YourVarlist)
    // Check for the strings of interest
    gen byte code = inlist(upper(temp200), "F20", "F21")  // F or f
    //  Check for all 200 being missing.  I use strtrim()) because I assume that
    // if one of your variables had nothing but spaces, you'd treat it as missing.
    replace code = . if strtrim(temp200) == ""
    drop temp200

    Comment


    • #3
      The lack of examples of your data leave us with unanswered questions.

      Which of the following strings containing F20 are possible in your data, and which meet your criteria for setting the new variable to 1?
      Code:
      "F20"
      "This is F20"
      "F5 F20 F31"
      "F5,F20,F31"
      "F201"
      "F20A"
      Even the best descriptions of data are no substitute for an actual example of the data. In order to get a helpful response, you need to show some example data.

      Be sure to use the dataex command to do this. If you are running version 15.1 or a fully updated version 14.2, dataex is already part of your official Stata installation. If not, run ssc install dataex to get it. Either way, run help dataex and read the simple instructions for using it. Note that you can use it to select just a few typical variables from your 200, and you can set the number of observations to less than the default 100 if fewer observations are necessary to make your point. dataex will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

      When asking for help with code, always show example data. When showing example data, always use dataex.


      Comment


      • #4
        Mike's code will work, I am just offering an alternative method:

        Code:
        *** I created some toy data
        dataex *  // Example shared via  -dataex-. To install: ssc install dataex
        clear
        input byte id str5 var1 str8 var2 str11 var3 str14 var4 str10 var5
        1 "Blank" "F20, F21" "Other words" "something else" "F1, F2, F3"
        2 "F1"    "F22"      "F203"        "F50, F51"       "F1, F2, F3"
        3 "F1"    ""         "F15, F16"    "F50, F51"       "F1, F2, F3"
        4 ""      ""         ""            ""               ""          
        end
        
        ssc install egenmore  // needed for the egen, incss()  
        egen has_F20 = incss(var1-var5), sub("F20") i    //  1 if "F20 occurs within any of the variables in a list of string variables. "i" makes it case-insensitive.
        // NOTE: Looking for substring "F20" means it will also count "F203" and "F2040" as matches
        egen count_notblank = rownonmiss(var1-var5), strok
        egen count_miss = rowmiss(var1-var5)
        
        . list, noobs abbrev(14)
        
          +-----------------------------------------------------------------------------------------------------------+
          | id    var1       var2          var3             var4         var5   has_F20   count_notblank   count_miss |
          |-----------------------------------------------------------------------------------------------------------|
          |  1   Blank   F20, F21   Other words   something else   F1, F2, F3         1                5            0 |
          |  2      F1        F22          F203         F50, F51   F1, F2, F3         1                5            0 |
          |  3      F1                 F15, F16         F50, F51   F1, F2, F3         0                4            1 |
          |  4                                                                        0                0            5 |
          +-----------------------------------------------------------------------------------------------------------+

        Comment


        • #5
          The devil is in the details, and "contains" is a weasel word. Let's continue to consider that it need not mean equals. Let's throw into the pot a loop solution without insisting (or even implying) that it's any better than other ideas, but merely making the point that a loop over variables is surely needed one way or another. I had forgotten about incss(). I have stolen david Benson's nice example.

          Code:
          clear
          input byte id str5 var1 str8 var2 str11 var3 str14 var4 str10 var5
          1 "Blank" "F20, F21" "Other words" "something else" "F1, F2, F3"
          2 "F1"    "F22"      "F203"        "F50, F51"       "F1, F2, F3"
          3 "F1"    ""         "F15, F16"    "F50, F51"       "F1, F2, F3"
          4 ""      ""         ""            ""               ""          
          end
          
          gen wanted = .
          
          foreach v of var var* {
              replace wanted = 1 if strpos(`v', "F20") | strpos(`v', "F21")
              replace wanted = 0 if !missing(`v') & missing(wanted)
          }
          
          list
          
               +----------------------------------------------------------------------------+
               | id    var1       var2          var3             var4         var5   wanted |
               |----------------------------------------------------------------------------|
            1. |  1   Blank   F20, F21   Other words   something else   F1, F2, F3        1 |
            2. |  2      F1        F22          F203         F50, F51   F1, F2, F3        1 |
            3. |  3      F1                 F15, F16         F50, F51   F1, F2, F3        0 |
            4. |  4                                                                       . |
               +----------------------------------------------------------------------------+
          The logic here is that the variable is born missing, and we never change our minds if we never see a non-empty string.

          We will change our minds irreversibly to 1 if we see either substring F20 F21 and provisionally to 0 if we see anything else.
          Last edited by Nick Cox; 21 Oct 2019, 13:29.

          Comment


          • #6
            Hi Nick, thank you that code has worked for me. Can I ask if it is possible to use the same code when trying to select numbers? So for example the part of your code would read:

            Code:
             replace wanted = 1 if strpos(`v', "7629") | strpos(`v', "7729")
            . Would I need to use a different piece of code to strpos? Thanks

            Comment


            • #7
              Indeed. strpos() only works with strings. As for what you need for trying to select numbers, perhaps you could spell out what that means?

              Comment


              • #8
                Sorry to be clearer:

                I have a lot of medications but they have all been coded as numbers, I want to create one variable to say whether a person is on treatment for a condition. The variable will be 1 to say they are on treatment or 0 to say they or not, or missing if all values are missing. There are many variables with lots of medications and some people may be on different medication but still treating the same condition. So while one person is on one drug and another is on another drug, if it is treating the same condition the created variable will give them a score of 1. So I want to do exactly as you have shown with the above code but with numbers instead. So variables may include: medication1 medication 2 medication 3 and values within those variables may be like 1234 7629 5483

                Comment


                • #9
                  So, the recipe might look more like


                  Code:
                   
                   gen wanted = .  foreach v of var var* {     replace wanted = 1 if inlist(`v', 1234, 7629, 5483)     replace wanted = 0 if !missing(`v') & missing(wanted) }  list
                  It seems that you might need a basic tutorial on functions, as for example that at https://www.stata-journal.com/articl...article=dm0058

                  Comment

                  Working...
                  X