
  • Generating a new variable that equals 1 if another variable starts with a specific character

    Hi there,

    I have been searching for a way to generate a new variable that equals 1 if var4, for example, starts with the character "5", and 0 otherwise.

    Let's suppose the variable contains values like this:

    5194
    532
    1865
    8659
    3450
    5531

    What I want is a new variable that flags all observations in which this variable starts with 5, regardless of what comes next or how many digits it has. So, in this case, the new variable should equal 1 for 3 observations.

    Any help is appreciated!

  • #2
    I am not sure whether your original variable is string or numeric. Here I assume numeric, and convert it to string. If it is already string, that line of code becomes redundant.

    I show two ways of producing what you want, one using regular expressions, the other not. Use whichever you like.

    Code:
    clear
    input int var4
    5194
    532
    1865
    8659
    3450
    5531
    end
    
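    * Convert to string; skip this line if var4 is already a string variable
    gen var4_str = string(var4)
    * Regular expression: ^ anchors the match at the start of the string
    gen byte wanted1 = ustrregexm(var4_str,"^5")
    * No regular expression: is the first occurrence of "5" at position 1?
    gen byte wanted2 = (strpos(var4_str,"5") == 1)
    drop var4_str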
    which produces:
    Code:
    . list, noobs sep(0)
      +--------------------------+
      | var4   wanted1   wanted2 |
      |--------------------------|
      | 5194         1         1 |
      |  532         1         1 |
      | 1865         0         0 |
      | 8659         0         0 |
      | 3450         0         0 |
      | 5531         1         1 |
      +--------------------------+
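
    As a quick check, the two indicators can be compared and the flagged observations counted. This is a minimal sketch that re-enters the example data so that it runs on its own; with these six values, three observations start with 5.

    ```stata
    clear
    input int var4
    5194
    532
    1865
    8659
    3450
    5531
    end

    gen var4_str = string(var4)
    gen byte wanted1 = ustrregexm(var4_str, "^5")
    gen byte wanted2 = (strpos(var4_str, "5") == 1)
    drop var4_str

    * The two approaches should agree on every observation
    assert wanted1 == wanted2

    * Count the flagged observations; the count is left in r(N)
    count if wanted1 == 1
    ```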
    Last edited by Hemanshu Kumar; 03 Oct 2022, 05:55.


    • #3
      Thanks, Kumar. I will try these commands and see the results. What if the string contains letters rather than digits? And what if the variable is numeric rather than string: what would the command be then?

      Thanks, I appreciate your help.


      • #4
        What if the string contains letters rather than digits?
        That makes no difference to the code.

        And what if the variable is numeric rather than string: what would the command be then?
        I don't understand. The code is written assuming that var4 is numeric. If it is a string, you don't need to create var4_str; just generate the wanted variable directly from var4.
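
        To illustrate both points, here is a minimal sketch with a string variable that holds letters. The variable name code and its values are hypothetical, made up for this example; because the variable is already a string, no conversion step is needed.

        ```stata
        clear
        input str6 code
        "A12"
        "B7"
        "a55"
        "AB3"
        end

        * Flag values whose first character is "A"
        * Matching is case sensitive, so "a55" is not flagged
        gen byte wanted_re  = ustrregexm(code, "^A")
        gen byte wanted_sub = (substr(code, 1, 1) == "A")

        * Both approaches should give the same indicator
        assert wanted_re == wanted_sub
        list, noobs sep(0)
        ```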


        • #5
          Thanks for your reply. What I meant, Kumar, is this: if the original variable is numeric and I don't want to convert it to string, what command would do the same thing directly on the numeric variable?

          Also, for learning: I checked the documentation for the ustrregexm function and could not find the use of ^ before the string.

          Thanks in advance!


          • #6
            If the values of var4 were all the same length, you could use a numerical approach (divide by 1000 and take the floor). Since the lengths differ, the string approach makes sense. #2 gets it done with ease.
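
            For positive integers of varying length, one purely numeric sketch is to scale each value by a power of 10 chosen with log10(), so that only the leading digit remains before the decimal point. This is an illustration under that assumption, not something proposed in the thread, and the names wanted_num and wanted_str are made up here. For zero, negative, or missing values the leading digit comes out missing, and the comparison with 5 then evaluates to 0.

            ```stata
            clear
            input int var4
            5194
            532
            1865
            8659
            3450
            5531
            end

            * floor(log10(5194)) = 3, the number of digits after the first one
            * 5194/10^3 = 5.194, and floor() (not ceil()) keeps the leading digit 5
            gen byte wanted_num = floor(var4 / 10^floor(log10(var4))) == 5

            * Cross-check against a string-based indicator
            gen byte wanted_str = (substr(string(var4), 1, 1) == "5")
            assert wanted_num == wanted_str

            * If every value had exactly four digits, this alone would do:
            * gen byte wanted4 = floor(var4/1000) == 5
            ```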


            • #7
              To avoid generating the intermediate string variable, you could do:
              Code:
              gen byte wanted1 = ustrregexm(string(var4),"^5")
              Also, for learning: I checked the documentation for the ustrregexm function and could not find the use of ^ before the string.
              In that regular expression, the ^ anchors the match at the start of the string, so the pattern matches only if the string begins with 5. See for instance this Stata FAQ or this guide.
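
              The effect of the anchor can be seen by running the pattern with and without it on a value that contains a 5 but does not begin with one. A minimal sketch using string literals:

              ```stata
              * Without the anchor, a 5 anywhere in the string counts as a match
              display ustrregexm("1865", "5")     // 1
              * With ^, the 5 must be the first character
              display ustrregexm("1865", "^5")    // 0
              display ustrregexm("5194", "^5")    // 1
              ```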


              • #8
                Originally posted by George Ford
                If the values of var4 were all the same length, you could use a numerical approach (divide by 1000 and take the floor). Since the lengths differ, the string approach makes sense. #2 gets it done with ease.
                Yes, I agree, but I wanted to know whether there is a different command for purely numeric variables. Thanks for your input, George!

                Thanks all for your suggestions!


                • #9
                  I am happy to try to use regular expression syntax if it seems needed, but I always think of trying more elementary functions first. This attitude arises personally because I got accustomed to using what Stata provided by way of string functionality before it introduced any support for regular expressions. (Conversely, in my favourite text editor Vim I use regular expressions just about daily.)

                  In this case, consider

                  Code:
                  clear
                  input int var4
                  5194
                  532
                  1865
                  8659
                  3450
                  5531
                  end
                  
                  gen wanted = substr(strofreal(var4), 1, 1) == "5"
                  
                  list , sepby(wanted)
                  
                       +---------------+
                       | var4   wanted |
                       |---------------|
                    1. | 5194        1 |
                    2. |  532        1 |
                       |---------------|
                    3. | 1865        0 |
                    4. | 8659        0 |
                    5. | 3450        0 |
                       |---------------|
                    6. | 5531        1 |
                       +---------------+
                  Last edited by Nick Cox; 03 Oct 2022, 09:43.


                  • #10
                    Originally posted by Nick Cox in #9

                    Thanks, Cox, that's helpful, and it adds to my arsenal of syntax!
