Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to extract numbers and operation symbols from string variables in Stata

    Hi all,

    I have a question about how to extract numbers and operation symbols from a string variable in Stata.

    My data look as below:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str19 packing
    "20s*2"          
    "10s x 3"        
    ""               
    ""               
    "950gx2"         
    "20s x 2"        
    "20s x 2"        
    "20s x 2"        
    "100s"           
    "10s"            
    ""               
    "20s x 2"        
    "16s"            
    ""               
    ""               
    "18s x 2"        
    "700ml"          
    ""               
    "20s x 2"        
    "500g"           
    ""               
    "16s"            
    "700ml"          
    "700ml+200ml x 2"
    "20ml x 6pcs"    
    "16s"            
    ""               
    "525ml x 2"      
    "700ml"          
    "950gx2"         
    end

    What I want is the total unit of the products, for instance for "20s*2",I want to have 20*2=40; while for "700ml+200ml x 2" I want to get 1100. I don't know how to extract both the numbers and the operation symbols so that I can get the total amount of the units of the product.

    I wonder if anyone has any suggestions with respect to this issue?

    Thank you very much.

  • #2
    I got this far and gave up. What a mess. Someone better at this than me is needed to solve this problem.
    Code:
    foreach mess in s g ml pc " " {
        replace packing = subinstr(packing,"`mess'","",.)
    }
    replace packing = subinstr(packing,"x","*",.)
    g packing2 = packing
    g splitat = ustrpos(packing2,"+")
    g v1 = ""
    replace v1 = substr(packing2,1,splitat-1)
    g op1 = substr(packing2,splitat,1)
    replace packing2 = substr(packing2,1+splitat,.)
    replace splitat = ustrpos(packing2,"*")
    g v2 = ""
    replace v2 = substr(packing2,1,splitat-1)
    g op2 = substr(packing2,splitat,1)
    g v3 = ""
    replace v3 = substr(packing2,splitat+1,.)

    Comment


    • #3
      I generally don't like looping over observations, but other approaches seem less efficient in this instance:

      Code:
      gen wanted = .
      forv x = 1/`=_N'{
          cap replace wanted = `=ustrregexra(usubinstr(packing[`x'],"x","*",.),"[A-Za-z]|\s","")' in `x'
      }
      Edit: Clyde rightly notes in #4 that my regular expression removes only letters. This updated regular expression accounts for that and instead removes everything that is neither a number or an operator:

      Code:
      gen wanted = .
      forv x = 1/`=_N'{
          cap replace wanted = `=ustrregexra(usubinstr(packing[`x'],"x","*",.),"(\+|-|\*|\/)|[^\d]|\s","$1")' in `x'
      }
      Last edited by Ali Atia; 21 Jul 2021, 15:56.

      Comment


      • #4
        Here's a very crude, ugly hack that works in the example data. For one thing, it assumes that the only characters that occur other than numbers and operators are letters. It also assumes that the only x's that occur are multiplication operators. No doubt the regular expression wizards who frequent this forum can do better.

        Bonus points for somebody who can figure out how to get around looping over observations.

        Nevertheless, this produces the desired results in the example data shown.

        Code:
        gen expr = lower(trim(itrim(packing)))
        foreach a in `c(alpha)' {
            if `"`a'"' != `"x"'{
                replace expr = subinstr(expr, `"`a'"', "", .)
            }
            else {
                replace expr = subinstr(expr, "x", "*", .)
            }
        }
        
        gen result = .
        forvalues i = 1/`=_N' {
            local expression = expr[`i']
            if `"`expression'"' != "" {
                replace result = `expression' in `i'
            }
        }
        Added: crossed with #3 which is similar in spirit and much more compact. It does use regular expressions, although it also still assumes that only letters need to be removed from the packing variable.
        Last edited by Clyde Schechter; 21 Jul 2021, 15:50.

        Comment


        • #5
          Originally posted by George Ford View Post
          I got this far and gave up. What a mess. Someone better at this than me is needed to solve this problem.
          Code:
          foreach mess in s g ml pc " " {
          replace packing = subinstr(packing,"`mess'","",.)
          }
          replace packing = subinstr(packing,"x","*",.)
          g packing2 = packing
          g splitat = ustrpos(packing2,"+")
          g v1 = ""
          replace v1 = substr(packing2,1,splitat-1)
          g op1 = substr(packing2,splitat,1)
          replace packing2 = substr(packing2,1+splitat,.)
          replace splitat = ustrpos(packing2,"*")
          g v2 = ""
          replace v2 = substr(packing2,1,splitat-1)
          g op2 = substr(packing2,splitat,1)
          g v3 = ""
          replace v3 = substr(packing2,splitat+1,.)
          Hi George,

          Thank you so much for your comment. The codes have done a lot and simplified the previous string variable to a great extent.

          Comment


          • #6
            Originally posted by Ali Atia View Post
            I generally don't like looping over observations, but other approaches seem less efficient in this instance:

            Code:
            gen wanted = .
            forv x = 1/`=_N'{
            cap replace wanted = `=ustrregexra(usubinstr(packing[`x'],"x","*",.),"[A-Za-z]|\s","")' in `x'
            }
            Edit: Clyde rightly notes in #4 that my regular expression removes only letters. This updated regular expression accounts for that and instead removes everything that is neither a number or an operator:

            Code:
            gen wanted = .
            forv x = 1/`=_N'{
            cap replace wanted = `=ustrregexra(usubinstr(packing[`x'],"x","*",.),"(\+|-|\*|\/)|[^\d]|\s","$1")' in `x'
            }
            Hi Ali,

            Thank you for your help. The code works perfectly for me, I got exactly what I want. Amazing!

            Comment


            • #7
              Originally posted by Clyde Schechter View Post
              Here's a very crude, ugly hack that works in the example data. For one thing, it assumes that the only characters that occur other than numbers and operators are letters. It also assumes that the only x's that occur are multiplication operators. No doubt the regular expression wizards who frequent this forum can do better.

              Bonus points for somebody who can figure out how to get around looping over observations.

              Nevertheless, this produces the desired results in the example data shown.

              Code:
              gen expr = lower(trim(itrim(packing)))
              foreach a in `c(alpha)' {
              if `"`a'"' != `"x"'{
              replace expr = subinstr(expr, `"`a'"', "", .)
              }
              else {
              replace expr = subinstr(expr, "x", "*", .)
              }
              }
              
              gen result = .
              forvalues i = 1/`=_N' {
              local expression = expr[`i']
              if `"`expression'"' != "" {
              replace result = `expression' in `i'
              }
              }
              Added: crossed with #3 which is similar in spirit and much more compact. It does use regular expressions, although it also still assumes that only letters need to be removed from the packing variable.
              Hi Clyde,

              Thank you so much for your help. Your codes work perfectly on my side and solved my problem. Coding seems more interesting to me now

              Comment

              Working...
              X