Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Extracting the numbers after repeated characters in a string variable

    Hello,

    I am trying to obtain the distance between two geographic locations using public transportation. I used OpenTripPlanner to do so, and it allowed me to generate a string variable that contains this information. However, this is a string variable that not only contains the distance but other information on the trip taken between those two locations — for example, the route taken, transportation modes, etc.

    The issue is that the string variable does not contain information on the total distance traveled. It contains information on each one of the "legs," traveled between the two points. In other words, for each one of the transportation modes taken between the two points, I have a distance. Therefore, I need to add up all those "sub-distances" to obtain the total traveled distance.

    The string variable contains around 40,000 characters in each cell. However, it contains repeated characters that allow identifying a "sub-distance." Whenever the string variable shows the text ""realTime": false, distance":"" the numbers after this text refer to the distance traveled using a transportation mode.

    The first thing that I tried to do was to use the command "regexs," in the following way:

    Code:
    gen dist=regexs(0) if regexm(store_5am, `""realTime":false,"distance":[0-9]*"')
    Where "store_5am," is the string variable. Although that instruction does work, the issue is that the information corresponds to the first "sub-distance," but it does not help me to obtain the other sub-distances.

    So, what I tried to do next was to split the string variable into several variables, assuming the split would happen every time Stata runs into the text ""realTime": false, distance":"" Then, I would have a new set of string variables, and I would extract the numbers after the text in each one of these variables.

    The following are the instructions I define to split the string variable and then extract the "sub-distance" from each new variable generated after the split:

    Code:
    split store_5am, parse(""realTime":false,") gen(d_)
    
    forvalues n=1(1)15{
    capture gen dist_`n'=regexs(0) if regexm(d_`n', `""distance":[0-9]*"')
    }
    The code above assumes that there are up to 15 times in which Stata could run into the text ""realTime": false, distance":"", but they can be less than that.

    The issue with the instructions above is that I thought Stata would generate 15 or fewer variables with the split command since it can run into the text ""realTime": false, distance":"" up to 15 times. However, when I execute the split command, I run into the error:


    no room to add more variables
    Up to 32,767 variables are currently allowed, which is the maximum you can set; see help memory.

    So, it seems the command “split” is not generating up to 15 variables, as I thought it would do.

    The following is an example of my data. As I mentioned before, the string variable contains around 40,000 characters, but I cannot generate an example the way the data is using dataex. So, this is a version of my data for one observation and information on two "sub-distances."

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte id str243 store_5am
    1 `"{"requestParameters":{"date":"04-18-2018","legs":[{"startTime":1524045600000,"endTime":1524046887000," "realTime":false,"distance":1596.257,"}][{startTime":1524047233000,"endTime":1524048303000,"realTime":false,"distance":11730.58983892094,"}]"'
    end
    So, could you could help me to come up with a better approach to obtain all the "sub-distances," such that I can calculate the total distance traveled between the two locations?



    Thank you!







  • #2
    You can use the -subinstr()- function to eliminate matched text. I modify your matching command to pick out the full distance (i.e., including what follows the decimal point) below.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte id str243 store_5am
    1 `"{"requestParameters":{"date":"04-18-2018","legs":[{"startTime":1524045600000,"endTime":1524046887000," "realTime":false,"distance":1596.257,"}][{startTime":1524047233000,"endTime":1524048303000,"realTime":false,"distance":11730.58983892094,"}]"'
    end
    
    gen remainder= store_5am
    forval i=1/15{
          gen dist`i'=regexs(0) if regexm(remainder, `""realTime":false,"distance":[0-9\.0-9]*"')
          replace remainder= subinstr(remainder,dist`i',"",.)
    }

    Res.:

    Code:
       +-------------------------------------------------------------------------------------------+
         | id                                  dist1                                           dist2 |
         |-------------------------------------------------------------------------------------------|
      1. |  1   "realTime":false,"distance":1596.257   "realTime":false,"distance":11730.58983892094 |
         +-------------------------------------------------------------------------------------------+

    Comment


    • #3
      Code:
      ********************************************************************************
      split store_5am , parse(`""realTime":false,"distance":"') gen(distance)
      drop distance1
      rename (distance#) (distance#), renumber 
      
      foreach v of varlist distance* {
      
          tempname s    
          rename `v' `s'
          gen `v' = real(substr(`s', 1, -1 + strpos(`s', ",")))
          drop `s'
      }
      ********************************************************************************

      Comment


      • #4
        Thank you, Andrew and Bjarte! This is exactly what I was looking for.

        Comment


        • #5
          I think there is on detail in #2 that should be changed:
          Code:
          replace remainder= subinstr(remainder,dist`i',"",.)
          will replace all occurances of identical distances. To remove only the first use
          Code:
          replace remainder= subinstr(remainder, dist`i',"", 1)
          Below this change is implemented. I also use a nongready regex.

          Code:
          clear all
          timer clear
          
          
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input byte id str243 store_5am
          1 `"{"requestParameters":{"date":"04-18-2018","legs":[{"startTime":1524045600000,"endTime":1524046887000,""realTime":false,"distance":1596.257,"}][{startTime":1524047233000,"endTime":1524048303000,"realTime":false,"distance":11730.58983892094,"}]"'
          end
          
          replace store_5am = 40000 * "A" + 7 * store_5am
          expand 1000
          
          preserve
          
          gen remainder = store_5am
          
          timer on 1
          ********************************************************************************
          
          forval i=1/30 {
                
                if ! ustrregexm(remainder, `""realTime":false,"distance":([0-9\.]+?),"') {
                
                  continue, break
                  
                }
                
                gen dist`i'= ustrregexs(1) if ustrregexm(remainder, `""realTime":false,"distance":([0-9\.]+?),"')
                  
                replace remainder= subinstr(remainder, dist`i',"", 1)
          }
          
          destring dist* , replace
          
          ********************************************************************************
          timer off 1
          
          rename (dist*)(distance*)
          keep dist*
          datasignature
          scalar ds_A = "`r(datasignature)'"
          
          restore
          
          timer on 2
          ********************************************************************************
          split store_5am , parse(`""realTime":false,"distance":"') gen(distance)
          drop distance1
          rename (distance#) (distance#), renumber
          
          foreach v of varlist distance* {
          
              tempname s    
              rename `v' `s'
              gen `v' = real(substr(`s', 1, -1 + strpos(`s', ",")))
              drop `s'
          }
          ********************************************************************************
          timer off 2
          
          keep distance*
          datasignature
          assert  ds_A == "`r(datasignature)'"
          
          timer list
          Code:
          . assert  ds_A == "`r(datasignature)'"
          
          .
          . timer list
             1:      7.11 /        1 =       7.1060
             2:      0.19 /        1 =       0.1920
          Last edited by Bjarte Aagnes; 22 Sep 2019, 10:14.

          Comment


          • #6
            I think there is on detail in #2 that should be changed:
            Code:
            replace remainder= subinstr(remainder,dist`i',"",.)
            will replace all occurances of identical distances. To remove only the first use
            Code:
            replace remainder= subinstr(remainder, dist`i',"", 1)
            That's correct. Thanks Bjarte.

            Comment


            • #7
              Late to the party here but consider also moss (SSC):

              Code:
              clear
              input byte id str243 store_5am
              1 `"{"requestParameters":{"date":"04-18-2018","legs":[{"startTime":1524045600000,"endTime":1524046887000," "realTime":false,"distance":1596.257,"}][{startTime":1524047233000,"endTime":1524048303000,"realTime":false,"distance":11730.58983892094,"}]"'
              end
              
              moss store, match(`"distance":([0-9]*\.[0-9]*)"') regex
              
              list _match*
              
                   +------------------------------+
                   |  _match1             _match2 |
                   |------------------------------|
                1. | 1596.257   11730.58983892094 |
                   +------------------------------+

              Comment

              Working...
              X