Extracting the numbers after repeated characters in a string variable

Mayra Pineda

Join Date: Jan 2019

Posts: 13
#1

Extracting the numbers after repeated characters in a string variable

21 Sep 2019, 10:50

Hello,

I am trying to obtain the distance between two geographic locations using public transportation. I used OpenTripPlanner to do so, and it allowed me to generate a string variable that contains this information. However, this is a string variable that not only contains the distance but other information on the trip taken between those two locations — for example, the route taken, transportation modes, etc.

The issue is that the string variable does not contain information on the total distance traveled. It contains information on each one of the "legs," traveled between the two points. In other words, for each one of the transportation modes taken between the two points, I have a distance. Therefore, I need to add up all those "sub-distances" to obtain the total traveled distance.

The string variable contains around 40,000 characters in each cell. However, it contains repeated characters that allow identifying a "sub-distance." Whenever the string variable shows the text ""realTime": false, distance":"" the numbers after this text refer to the distance traveled using a transportation mode.

The first thing that I tried to do was to use the command "regexs," in the following way:

Code:

gen dist=regexs(0) if regexm(store_5am, `""realTime":false,"distance":[0-9]*"')

Where "store_5am," is the string variable. Although that instruction does work, the issue is that the information corresponds to the first "sub-distance," but it does not help me to obtain the other sub-distances.

So, what I tried to do next was to split the string variable into several variables, assuming the split would happen every time Stata runs into the text ""realTime": false, distance":"" Then, I would have a new set of string variables, and I would extract the numbers after the text in each one of these variables.

The following are the instructions I define to split the string variable and then extract the "sub-distance" from each new variable generated after the split:

Code:

split store_5am, parse(""realTime":false,") gen(d_) forvalues n=1(1)15{ capture gen dist_`n'=regexs(0) if regexm(d_`n', `""distance":[0-9]*"') }

The code above assumes that there are up to 15 times in which Stata could run into the text ""realTime": false, distance":"", but they can be less than that.

The issue with the instructions above is that I thought Stata would generate 15 or fewer variables with the split command since it can run into the text ""realTime": false, distance":"" up to 15 times. However, when I execute the split command, I run into the error:

no room to add more variables
Up to 32,767 variables are currently allowed, which is the maximum you can set; see help memory.

So, it seems the command “split” is not generating up to 15 variables, as I thought it would do.

The following is an example of my data. As I mentioned before, the string variable contains around 40,000 characters, but I cannot generate an example the way the data is using dataex. So, this is a version of my data for one observation and information on two "sub-distances."

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input byte id str243 store_5am 1 `"{"requestParameters":{"date":"04-18-2018","legs":[{"startTime":1524045600000,"endTime":1524046887000," "realTime":false,"distance":1596.257,"}][{startTime":1524047233000,"endTime":1524048303000,"realTime":false,"distance":11730.58983892094,"}]"' end

So, could you could help me to come up with a better approach to obtain all the "sub-distances," such that I can calculate the total distance traveled between the two locations?

Thank you!
Tags: None

Andrew Musau

Join Date: Oct 2014
Posts: 10216

21 Sep 2019, 15:08

You can use the -subinstr()- function to eliminate matched text. I modify your matching command to pick out the full distance (i.e., including what follows the decimal point) below.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte id str243 store_5am
1 `"{"requestParameters":{"date":"04-18-2018","legs":[{"startTime":1524045600000,"endTime":1524046887000," "realTime":false,"distance":1596.257,"}][{startTime":1524047233000,"endTime":1524048303000,"realTime":false,"distance":11730.58983892094,"}]"'
end

gen remainder= store_5am
forval i=1/15{
      gen dist`i'=regexs(0) if regexm(remainder, `""realTime":false,"distance":[0-9\.0-9]*"')
      replace remainder= subinstr(remainder,dist`i',"",.)
}

Res.:

Code:

   +-------------------------------------------------------------------------------------------+
     | id                                  dist1                                           dist2 |
     |-------------------------------------------------------------------------------------------|
  1. |  1   "realTime":false,"distance":1596.257   "realTime":false,"distance":11730.58983892094 |
     +-------------------------------------------------------------------------------------------+

Comment

Bjarte Aagnes

Join Date: Apr 2014
Posts: 785

22 Sep 2019, 07:35

Code:

********************************************************************************
split store_5am , parse(`""realTime":false,"distance":"') gen(distance)
drop distance1
rename (distance#) (distance#), renumber 

foreach v of varlist distance* {

    tempname s    
    rename `v' `s'
    gen `v' = real(substr(`s', 1, -1 + strpos(`s', ",")))
    drop `s'
}
********************************************************************************

Comment

Mayra Pineda

Join Date: Jan 2019

Posts: 13
#4

22 Sep 2019, 09:40

Thank you, Andrew and Bjarte! This is exactly what I was looking for.
Comment

Bjarte Aagnes

Join Date: Apr 2014
Posts: 785

22 Sep 2019, 10:06

I think there is on detail in #2 that should be changed:

Code:

replace remainder= subinstr(remainder,dist`i',"",.)

will replace all occurances of identical distances. To remove only the first use

Code:

replace remainder= subinstr(remainder, dist`i',"", 1)

Below this change is implemented. I also use a nongready regex.

Code:

clear all
timer clear


* Example generated by -dataex-. To install: ssc install dataex
clear
input byte id str243 store_5am
1 `"{"requestParameters":{"date":"04-18-2018","legs":[{"startTime":1524045600000,"endTime":1524046887000,""realTime":false,"distance":1596.257,"}][{startTime":1524047233000,"endTime":1524048303000,"realTime":false,"distance":11730.58983892094,"}]"'
end

replace store_5am = 40000 * "A" + 7 * store_5am
expand 1000

preserve

gen remainder = store_5am

timer on 1
********************************************************************************

forval i=1/30 {
      
      if ! ustrregexm(remainder, `""realTime":false,"distance":([0-9\.]+?),"') {
      
        continue, break
        
      }
      
      gen dist`i'= ustrregexs(1) if ustrregexm(remainder, `""realTime":false,"distance":([0-9\.]+?),"')
        
      replace remainder= subinstr(remainder, dist`i',"", 1)
}

destring dist* , replace

********************************************************************************
timer off 1

rename (dist*)(distance*)
keep dist*
datasignature
scalar ds_A = "`r(datasignature)'"

restore

timer on 2
********************************************************************************
split store_5am , parse(`""realTime":false,"distance":"') gen(distance)
drop distance1
rename (distance#) (distance#), renumber

foreach v of varlist distance* {

    tempname s    
    rename `v' `s'
    gen `v' = real(substr(`s', 1, -1 + strpos(`s', ",")))
    drop `s'
}
********************************************************************************
timer off 2

keep distance*
datasignature
assert  ds_A == "`r(datasignature)'"

timer list

Code:

. assert  ds_A == "`r(datasignature)'"

.
. timer list
   1:      7.11 /        1 =       7.1060
   2:      0.19 /        1 =       0.1920

Last edited by Bjarte Aagnes; 22 Sep 2019, 10:14.

Comment

Andrew Musau

Join Date: Oct 2014

Posts: 10216
#6

22 Sep 2019, 14:57

I think there is on detail in #2 that should be changed:
Code:
replace remainder= subinstr(remainder,dist`i',"",.)
will replace all occurances of identical distances. To remove only the first use
Code:
replace remainder= subinstr(remainder, dist`i',"", 1)

That's correct. Thanks Bjarte.
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35724

23 Sep 2019, 01:42

Late to the party here but consider also moss (SSC):

Code:

clear
input byte id str243 store_5am
1 `"{"requestParameters":{"date":"04-18-2018","legs":[{"startTime":1524045600000,"endTime":1524046887000," "realTime":false,"distance":1596.257,"}][{startTime":1524047233000,"endTime":1524048303000,"realTime":false,"distance":11730.58983892094,"}]"'
end

moss store, match(`"distance":([0-9]*\.[0-9]*)"') regex

list _match*

     +------------------------------+
     |  _match1             _match2 |
     |------------------------------|
  1. | 1596.257   11730.58983892094 |
     +------------------------------+

Announcement