Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to split numeric variables based on rule

    Hi Stata folks, I am working on a dataset where each ID is associated with a numeric value comprised of 0 and 1s. What I need to do now is to extract part of the numeric values based on a rule. For instance, for a value of 1001100, I need to extract the last three digits (100); for a value of 1010110, I need to extract the last two digit (10); for a value of 1011000, I need to extract the last four digits (1000). The rule is to extract the part where the last 1 was seen (together with the 0s behind the last 1s). Is there a way to achieve this using stata? Thanks!

  • #2
    See nsplit: ssc install nsplit
    Jorge Eduardo Pérez Pérez
    www.jorgeperezperez.com

    Comment


    • #3
      Hi Jorge, thanks for your reply. i am aware of the nsplit command but i don't think it helps in my situation. basically i need stata to find the last 1 in the value, but do you know how?

      Comment


      • #4
        This could be a one liner, I think, but it's more readable to me on multiple lines. I presume your numbers are integers stored as longs. I think it could be done with doubles with some finesse, but I thought I'd start here:
        Code:
        clear
        input long num
        1000100
        1111010
        10101000
        end
        //
        gen s = string(num, "%12.0f")
        gen byte pos1 = strpos(strreverse(s), "1")
        gen numlast = real(substr(s,-pos1, pos1))
        list

        Comment


        • #5
          EDIT: This crossed with Mike's post #4. My approach is very similar to his (I went back and added his numbers to make sure it worked).

          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input long id
           1001100
           1010110
           1011000
           1011110
           1001101
           1010111
           1011001
           1000100
           1111010
          10101000
          end
          Code:
          tostring(id), gen(id_string)
          egen count1 = noccur(id_string), string(1)  // count the number of "1" in id_string (I ended up not using it)
          gen last1 = strrpos(id_string, "1")  // list position of the last "1" in id_string
          gen id_new = substr(id_string, last1, .)
          destring id_new, replace  // this and prior line could be combined into single line as in Mike Lacy's code
          
          
          . list, abbrev(12) noobs
          
            +------------------------------------------------+
            |       id   id_string   count1   last1   id_new |
            |------------------------------------------------|
            |  1001100     1001100        3       5      100 |
            |  1010110     1010110        4       6       10 |
            |  1011000     1011000        3       4     1000 |
            |  1011110     1011110        5       6       10 |
            |  1001101     1001101        4       7        1 |
            |------------------------------------------------|
            |  1010111     1010111        5       7        1 |
            |  1011001     1011001        4       7        1 |
            |  1000100     1000100        2       5      100 |
            |  1111010     1111010        5       6       10 |
            | 10101000    10101000        3       5     1000 |
            +------------------------------------------------+

          Comment


          • #6
            Hi David and Mike, thank you both so much for this. This's exactly what I need.

            Can I ask a follow-up question? If I also need to count the total number of 0s in between 1s, what should I do then? For instance, using David's sample data, the first ID has a total of two 0s between 1s, the second ID has a total of two 0s between 1s, ...and ID 1001101 has a total of three 0s in between 1s.

            Comment


            • #7
              So, would that be the same as finding the position of the last "1", and then counting the number of 0's to the left of it?
              If that's the case, then this should work (using the same sample data as in post #5.):

              Edited to add: egen, noccur is part of egenmore (SSC). (SSC install egenmore)

              Code:
              gen last1 =  strrpos(id_string, "1")  // list position of the last "1" in id_string (from prior post)
              gen id_new2 = substr(id_string, 1, last1)  // create new variable that goes from beginning to last "1" in id_string
              egen count0 = noccur(id_new2), string(0)  // counting number of zeros in that truncated number
              
              . list id id_new2 count0
              
                   +-----------------------------+
                   |       id   id_new2   count0 |
                   |-----------------------------|
                1. |  1001100     10011        2 |
                2. |  1010110    101011        2 |
                3. |  1011000      1011        1 |
                4. |  1011110    101111        1 |
                5. |  1001101   1001101        3 |
                   |-----------------------------|
                6. |  1010111   1010111        2 |
                7. |  1011001   1011001        3 |
                8. |  1000100     10001        3 |
                9. |  1111010    111101        1 |
               10. | 10101000     10101        2 |
                   +-----------------------------+
              Last edited by David Benson; 13 Dec 2018, 17:05.

              Comment


              • #8
                Addressing #6. If you want the longest substring starting and ending with 1, then moss (SSC) offers a solution. Then a way to count characters is to compare (a) the length of the string and (b) the length the string would be if that character were removed. So the numbers of zeros in 11001 is the length of that (5) minus the length of 111 after the 0s have been removed. For more see https://www.stata-journal.com/sjpdf....iclenum=dm0056

                Here it all is in a nutshell:

                Code:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input long id
                 1001100
                 1010110
                 1011000
                 1011110
                 1001101
                 1010111
                 1011001
                 1000100
                 1111010
                10101000
                end
                
                gen sid = string(id, "%12.0f") 
                
                moss sid, match("(1.*1)") regex 
                
                gen n0 = length(_match1) - length(subinstr(_match1, "0", "", .)) 
                
                list, sep(0) 
                
                     +-----------------------------------------------------+
                     |       id        sid   _count   _match1   _pos1   n0 |
                     |-----------------------------------------------------|
                  1. |  1001100    1001100        1     10011       1    2 |
                  2. |  1010110    1010110        1    101011       1    2 |
                  3. |  1011000    1011000        1      1011       1    1 |
                  4. |  1011110    1011110        1    101111       1    1 |
                  5. |  1001101    1001101        1   1001101       1    3 |
                  6. |  1010111    1010111        1   1010111       1    2 |
                  7. |  1011001    1011001        1   1011001       1    3 |
                  8. |  1000100    1000100        1     10001       1    3 |
                  9. |  1111010    1111010        1    111101       1    1 |
                 10. | 10101000   10101000        1     10101       1    2 |
                     +-----------------------------------------------------+

                Comment


                • #9
                  Dear Stata folks,

                  I have the following datasets (in string format) that I want to add on all numbers and get the total. For example 11111 =5, 11=2 etc. I have used the following command for splitting each of number
                  gen first_num = real(ustrregexs(1)) if ustrregexm(dosage,"([\d]+)") and then add on. Appreciate help is the total can be retrieved directly without splitting first. Thanks in advance.
                  11111
                  11
                  11
                  13
                  1111
                  5051
                  11
                  113
                  2
                  12
                  655
                  1
                  1
                  11

                  Comment


                  • #10
                    You could loop:


                    Code:
                    * Example generated by -dataex-. To install: ssc install dataex
                    clear
                    input str5 var1
                    "11111"
                    "11"  
                    "11"  
                    "13"  
                    "1111"
                    "5051"
                    "11"  
                    "113"  
                    "2"    
                    "12"  
                    "655"  
                    "1"    
                    "1"    
                    "11"  
                    end
                    
                    gen length= length(var1)
                    qui sum length
                    gen wanted=0
                    forval i=1/ `r(max)'{
                        qui replace wanted= wanted+ real(substr(var1, `i', 1)) if `i'<=length
                    }
                    Res.:

                    Code:
                    . l
                    
                         +-------------------------+
                         |  var1   length   wanted |
                         |-------------------------|
                      1. | 11111        5        5 |
                      2. |    11        2        2 |
                      3. |    11        2        2 |
                      4. |    13        2        4 |
                      5. |  1111        4        4 |
                         |-------------------------|
                      6. |  5051        4       11 |
                      7. |    11        2        2 |
                      8. |   113        3        5 |
                      9. |     2        1        2 |
                     10. |    12        2        3 |
                         |-------------------------|
                     11. |   655        3       16 |
                     12. |     1        1        1 |
                     13. |     1        1        1 |
                     14. |    11        2        2 |
                         +-------------------------+

                    Comment


                    • #11
                      Dear Andrew,
                      Thank you very much. It works perfect!
                      Best

                      Comment


                      • #12
                        Dear Andrew,
                        Thanks. Just a further question on the number that start with zero. For an example:
                        var1: 05051 length: 4 wanted: 0.5 +0.5 +1 =2 . The zero number was not captured before in my previous example. Appreciate any help. Thanks

                        Comment


                        • #13
                          The rule you propose is not clear. What are the values for

                          1. 012010
                          2. 5103
                          3. 6540

                          ?

                          Comment


                          • #14
                            Dear Andrew,
                            Thanks for your suggestion. My apologies for creating unclear post. The example data is as below.
                            The suggested command works well for most of observations. However for obs 2 (i.e start with 0.5 tablet), the total number that I wanted is 1 (i.e 0.5 +0.5 +1=1) but the total number generated (tndd) is 11 because "0" in front of 05051 was missing in the previous example. For the last obs (2 tablet morning and evening for pain), the wanted total number is 4 but the "2 in the evening" was missing in this free text variable. Any help is highly appreciated.
                            Best

                            gen length= length(var1) qui sum length gen wanted=0 forval i=1/ `r(max)'{ qui replace wanted= wanted+ real(substr(var1, `i', 1)) if `i'<=length }

                            dos_ndd var1 length tndd wanted
                            1 capsule , 1 capsule , 1 capsule , 1 capsule pain reliever 1111 4 4 4
                            0.5 tablet morning, 0.5 tablet evening for pain for 1 week after operation 05051 5 11 1
                            1 tablet morning at , 1 tablet evening at for pain 11 2 2 2
                            1 capsule 1-3 times per day as needed when treatment with alvedon and oxycodone is insufficient 113 3 3 3
                            2 tablet morning and evening for pain 2 1 2 4

                            Comment


                            • #15
                              Originally posted by Ches Zin View Post
                              1 capsule 1-3 times per day as needed when treatment with alvedon and oxycodone is insufficient 113 3 3 3
                              2 tablet morning and evening for pain 2 1 2 4
                              Sorry, if you are mixing the number of tablets and frequency, there is no way to distinguish between these. Given the sequence "3 2", this may mean 3 tablets and 2 tablets = 5 tablets or 3 tablets twice daily= 6 tablets. You have to go back and create a more coherent coding system.



                              Comment

                              Working...
                              X