How to split numeric variables based on rule

Man Yang

Join Date: Mar 2016

Posts: 183
#1

How to split numeric variables based on rule

13 Dec 2018, 13:19

Hi Stata folks, I am working on a dataset where each ID is associated with a numeric value comprised of 0 and 1s. What I need to do now is to extract part of the numeric values based on a rule. For instance, for a value of 1001100, I need to extract the last three digits (100); for a value of 1010110, I need to extract the last two digit (10); for a value of 1011000, I need to extract the last four digits (1000). The rule is to extract the part where the last 1 was seen (together with the 0s behind the last 1s). Is there a way to achieve this using stata? Thanks!
Tags: None
Jorge Eduardo Perez Perez

Join Date: Mar 2014

Posts: 429
#2

13 Dec 2018, 13:22

See nsplit: ssc install nsplit

Jorge Eduardo Pérez Pérez
www.jorgeperezperez.com
Comment
Man Yang

Join Date: Mar 2016

Posts: 183
#3

13 Dec 2018, 13:39

Hi Jorge, thanks for your reply. i am aware of the nsplit command but i don't think it helps in my situation. basically i need stata to find the last 1 in the value, but do you know how?
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2414
#4

13 Dec 2018, 14:28

This could be a one liner, I think, but it's more readable to me on multiple lines. I presume your numbers are integers stored as longs. I think it could be done with doubles with some finesse, but I thought I'd start here:

Code:

clear input long num 1000100 1111010 10101000 end // gen s = string(num, "%12.0f") gen byte pos1 = strpos(strreverse(s), "1") gen numlast = real(substr(s,-pos1, pos1)) list
Comment

David Benson

Join Date: Oct 2018
Posts: 489

13 Dec 2018, 14:41

EDIT: This crossed with Mike's post #4. My approach is very similar to his (I went back and added his numbers to make sure it worked).

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long id
 1001100
 1010110
 1011000
 1011110
 1001101
 1010111
 1011001
 1000100
 1111010
10101000
end

Code:

tostring(id), gen(id_string)
egen count1 = noccur(id_string), string(1)  // count the number of "1" in id_string (I ended up not using it)
gen last1 = strrpos(id_string, "1")  // list position of the last "1" in id_string
gen id_new = substr(id_string, last1, .)
destring id_new, replace  // this and prior line could be combined into single line as in Mike Lacy's code


. list, abbrev(12) noobs

  +------------------------------------------------+
  |       id   id_string   count1   last1   id_new |
  |------------------------------------------------|
  |  1001100     1001100        3       5      100 |
  |  1010110     1010110        4       6       10 |
  |  1011000     1011000        3       4     1000 |
  |  1011110     1011110        5       6       10 |
  |  1001101     1001101        4       7        1 |
  |------------------------------------------------|
  |  1010111     1010111        5       7        1 |
  |  1011001     1011001        4       7        1 |
  |  1000100     1000100        2       5      100 |
  |  1111010     1111010        5       6       10 |
  | 10101000    10101000        3       5     1000 |
  +------------------------------------------------+

Comment

Man Yang

Join Date: Mar 2016

Posts: 183
#6

13 Dec 2018, 15:59

Hi David and Mike, thank you both so much for this. This's exactly what I need.

Can I ask a follow-up question? If I also need to count the total number of 0s in between 1s, what should I do then? For instance, using David's sample data, the first ID has a total of two 0s between 1s, the second ID has a total of two 0s between 1s, ...and ID 1001101 has a total of three 0s in between 1s.
Comment

David Benson

Join Date: Oct 2018
Posts: 489

13 Dec 2018, 16:46

So, would that be the same as finding the position of the last "1", and then counting the number of 0's to the left of it?
If that's the case, then this should work (using the same sample data as in post #5.):

Edited to add: egen, noccur is part of egenmore (SSC). (SSC install egenmore)

Code:

gen last1 =  strrpos(id_string, "1")  // list position of the last "1" in id_string (from prior post)
gen id_new2 = substr(id_string, 1, last1)  // create new variable that goes from beginning to last "1" in id_string
egen count0 = noccur(id_new2), string(0)  // counting number of zeros in that truncated number

. list id id_new2 count0

     +-----------------------------+
     |       id   id_new2   count0 |
     |-----------------------------|
  1. |  1001100     10011        2 |
  2. |  1010110    101011        2 |
  3. |  1011000      1011        1 |
  4. |  1011110    101111        1 |
  5. |  1001101   1001101        3 |
     |-----------------------------|
  6. |  1010111   1010111        2 |
  7. |  1011001   1011001        3 |
  8. |  1000100     10001        3 |
  9. |  1111010    111101        1 |
 10. | 10101000     10101        2 |
     +-----------------------------+

Last edited by David Benson; 13 Dec 2018, 17:05.

Comment

Nick Cox

Join Date: Mar 2014
Posts: 35683

13 Dec 2018, 18:20

Addressing #6. If you want the longest substring starting and ending with 1, then moss (SSC) offers a solution. Then a way to count characters is to compare (a) the length of the string and (b) the length the string would be if that character were removed. So the numbers of zeros in 11001 is the length of that (5) minus the length of 111 after the 0s have been removed. For more see https://www.stata-journal.com/sjpdf....iclenum=dm0056

Here it all is in a nutshell:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long id
 1001100
 1010110
 1011000
 1011110
 1001101
 1010111
 1011001
 1000100
 1111010
10101000
end

gen sid = string(id, "%12.0f") 

moss sid, match("(1.*1)") regex 

gen n0 = length(_match1) - length(subinstr(_match1, "0", "", .)) 

list, sep(0) 

     +-----------------------------------------------------+
     |       id        sid   _count   _match1   _pos1   n0 |
     |-----------------------------------------------------|
  1. |  1001100    1001100        1     10011       1    2 |
  2. |  1010110    1010110        1    101011       1    2 |
  3. |  1011000    1011000        1      1011       1    1 |
  4. |  1011110    1011110        1    101111       1    1 |
  5. |  1001101    1001101        1   1001101       1    3 |
  6. |  1010111    1010111        1   1010111       1    2 |
  7. |  1011001    1011001        1   1011001       1    3 |
  8. |  1000100    1000100        1     10001       1    3 |
  9. |  1111010    1111010        1    111101       1    1 |
 10. | 10101000   10101000        1     10101       1    2 |
     +-----------------------------------------------------+

Comment

Ches Zin

Join Date: Jun 2023

Posts: 20
#9

18 Jul 2023, 03:04

Dear Stata folks,

I have the following datasets (in string format) that I want to add on all numbers and get the total. For example 11111 =5, 11=2 etc. I have used the following command for splitting each of number
gen first_num = real(ustrregexs(1)) if ustrregexm(dosage,"([\d]+)") and then add on. Appreciate help is the total can be retrieved directly without splitting first. Thanks in advance.

11111

11

11

13

1111

5051

11

113

2

12

655

1

1

11
Comment

Andrew Musau

Join Date: Oct 2014
Posts: 10190

#10

18 Jul 2023, 03:17

You could loop:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str5 var1
"11111"
"11"  
"11"  
"13"  
"1111"
"5051"
"11"  
"113"  
"2"    
"12"  
"655"  
"1"    
"1"    
"11"  
end

gen length= length(var1)
qui sum length
gen wanted=0
forval i=1/ `r(max)'{
    qui replace wanted= wanted+ real(substr(var1, `i', 1)) if `i'<=length
}

Res.:

Code:

. l

     +-------------------------+
     |  var1   length   wanted |
     |-------------------------|
  1. | 11111        5        5 |
  2. |    11        2        2 |
  3. |    11        2        2 |
  4. |    13        2        4 |
  5. |  1111        4        4 |
     |-------------------------|
  6. |  5051        4       11 |
  7. |    11        2        2 |
  8. |   113        3        5 |
  9. |     2        1        2 |
 10. |    12        2        3 |
     |-------------------------|
 11. |   655        3       16 |
 12. |     1        1        1 |
 13. |     1        1        1 |
 14. |    11        2        2 |
     +-------------------------+

Comment

Ches Zin

Join Date: Jun 2023

Posts: 20
#11

18 Jul 2023, 05:33

Dear Andrew,
Thank you very much. It works perfect!
Best
Comment
Ches Zin

Join Date: Jun 2023

Posts: 20
#12

18 Jul 2023, 11:52

Dear Andrew,
Thanks. Just a further question on the number that start with zero. For an example:
var1: 05051 length: 4 wanted: 0.5 +0.5 +1 =2 . The zero number was not captured before in my previous example. Appreciate any help. Thanks
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10190
#13

18 Jul 2023, 15:24

The rule you propose is not clear. What are the values for

1. 012010
2. 5103
3. 6540

?
Comment
Ches Zin

Join Date: Jun 2023

Posts: 20
#14

19 Jul 2023, 01:30

Dear Andrew,
Thanks for your suggestion. My apologies for creating unclear post. The example data is as below.
The suggested command works well for most of observations. However for obs 2 (i.e start with 0.5 tablet), the total number that I wanted is 1 (i.e 0.5 +0.5 +1=1) but the total number generated (tndd) is 11 because "0" in front of 05051 was missing in the previous example. For the last obs (2 tablet morning and evening for pain), the wanted total number is 4 but the "2 in the evening" was missing in this free text variable. Any help is highly appreciated.
Best

gen length= length(var1) qui sum length gen wanted=0 forval i=1/ `r(max)'{ qui replace wanted= wanted+ real(substr(var1, `i', 1)) if `i'<=length }

dos_ndd var1 length tndd wanted
1 capsule , 1 capsule , 1 capsule , 1 capsule pain reliever 1111 4 4 4
0.5 tablet morning, 0.5 tablet evening for pain for 1 week after operation 05051 5 11 1
1 tablet morning at , 1 tablet evening at for pain 11 2 2 2
1 capsule 1-3 times per day as needed when treatment with alvedon and oxycodone is insufficient 113 3 3 3
2 tablet morning and evening for pain 2 1 2 4
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10190
#15

19 Jul 2023, 03:25

Originally posted by Ches Zin View Post

1 capsule 1-3 times per day as needed when treatment with alvedon and oxycodone is insufficient 113 3 3 3
2 tablet morning and evening for pain 2 1 2 4

Sorry, if you are mixing the number of tablets and frequency, there is no way to distinguish between these. Given the sequence "3 2", this may mean 3 tablets and 2 tablets = 5 tablets or 3 tablets twice daily= 6 tablets. You have to go back and create a more coherent coding system.
Comment

Announcement