
  • Extracting numbers of variable length from a messy string variable

    Colleagues,

    I have a string variable resembling the data generated by the code below. I would like to achieve two things:
    1. Create a separate variable containing the figures in brackets
    2. Create a separate variable containing the figures at the end of the string
    The code reflects the nature of the variable rather well, but to summarise:
    • Each observation contains two sets of figures, both of variable lengths from 1 to 3 digits
      • First set of figures is always encapsulated in brackets
      • Second set of figures is always at the end
    • Each observation contains at least one word at the beginning of the variable
    • On some occasions additional words appear between the figures
    • There is no word at the end of the observation
    • There may be blank spaces at the beginning or end of the observation
    Naturally, I will be grateful for any suggestions on how to tame this variable.

    Code:
    /* === SampleString Data === */
    clear
    input str27 problemvar    
    "abcXYZ (90) 135            "
    "def (130) comment 20       "
    "mbnuiegh (1) koj 130       "
    "wshli (786) kojepj (11)    "
    "oujiopwe kojkl we (09) 787 "
    " ecfh (11) comment 90       "
    end
    Last edited by Konrad Zdeb; 27 Jan 2015, 04:46. Reason: Typo.
    Kind regards,
    Konrad
    Version: Stata/IC 13.1

  • #2
    I know many people will think "regular expressions" here, which is one reason to suggest something more prosaic as an alternative.

    Code:
     
    clear
    
    input str27 problemvar     
    "abcXYZ (90) 135            "
    "def (130) comment 20       "
    "mbnuiegh (1) koj 130       "
    "wshli (786) kojepj (11)    "
    "oujiopwe kojkl we (09) 787 "
    " ecfh (11) comment 90       "
    end
    
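    // the locals hold expressions as text; Stata pastes them in where they are used
    local v problemvar 
    local p2 strpos(`v', ")") - 1 
    local p1 strpos(`v', "(") + 1
    
    // second number: the last word, with any brackets removed
    gen swanted2 = word(`v', -1)
    replace swanted2 = subinstr(swanted2, "(", "", .) 
    replace swanted2 = subinstr(swanted2, ")", "", .) 
    gen wanted2 = real(swanted2) 
    
    // first number: the characters between "(" and ")"
    gen wanted1 = real(substr(`v', `p1', `p2' - `p1' - 1)) 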
    
    list wanted1 wanted2, sep(0)  
    
         +-------------------+
         | wanted1   wanted2 |
         |-------------------|
      1. |      90       135 |
      2. |     130        20 |
      3. |       1       130 |
      4. |     786        11 |
      5. |       9       787 |
      6. |      11        90 |
         +-------------------+
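
    A note on the arithmetic, since it is easy to misread: p1 and p2 are pasted in as text, so the length argument expands to strpos(")") - 1 - strpos("(") + 1 - 1, which is the number of characters strictly between the brackets. A minimal sketch of the same step with the positions stored as variables follows; the names p_open, p_close and inside are illustrative and not part of the code above, and the if qualifier is an added guard for observations without a bracket pair.

    ```stata
    * Sketch only: same extraction, with bracket positions held in variables.
    * p_open, p_close and inside are illustrative names.
    gen p_open  = strpos(problemvar, "(")
    gen p_close = strpos(problemvar, ")")

    * Characters strictly between "(" and ")"; missing if no bracket pair is found.
    gen inside = real(substr(problemvar, p_open + 1, p_close - p_open - 1)) ///
        if p_open > 0 & p_close > p_open

    list problemvar p_open p_close inside, sep(0)
    ```

    On the six example rows this should agree with wanted1.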



    • #3
      Thing of beauty!
      Kind regards,
      Konrad
      Version: Stata/IC 13.1



      • #4
        Here's another way to do it. (Regular expression fans are right too.) moss is from SSC.

        Code:
         
        moss problemvar, regex match("([0-9]+)")
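
        For readers who would rather not install anything, here is a minimal sketch using the built-in regexm() and regexs() functions. It assumes the data follow the pattern described in #1 (one bracketed number, and a final number followed only by blanks or a closing bracket). The variable names rx_first and rx_last are illustrative.

        ```stata
        * Sketch only: built-in regular expression functions instead of moss.
        * rx_first and rx_last are illustrative names.

        * First number: digits enclosed directly in brackets.
        gen rx_first = real(regexs(1)) if regexm(problemvar, "[(]([0-9]+)[)]")

        * Last number: digits followed only by ")" or blanks up to the end of the string.
        gen rx_last = real(regexs(1)) if regexm(problemvar, "([0-9]+)[) ]*$")

        list problemvar rx_first rx_last, sep(0)
        ```

        Writing the literal brackets as [(] and [)] avoids any question about escaping, and real() turns "09" into 9, as in #2.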



        • #5
          Definitely more straightforward, but the results are slightly messier:
          Code:
          . /* === Example String Data === */
          . clear
          
          . input str27 problemvar     
          
                                problemvar
            1. "abcXYZ (90) 135            "
            2. "def (130) comment 20       "
            3. "mbnuiegh (1) koj 130       "
            4. "wshli (786) kojepj (11)    "
            5. "oujiopwe kojkl we (09) 787 "
            6. "ecfh (11) comment 90       "
            7. end
          
          . preserve
          
          .
          . // First solution
          . local v problemvar
          
          . local p2 strpos(`v', ")") - 1
          
          . local p1 strpos(`v', "(") + 1
          
          .
          . gen swanted2 = word(`v', -1)
          
          . replace swanted2 = subinstr(swanted2, "(", "", .)
          (1 real change made)
          
          . replace swanted2 = subinstr(swanted2, ")", "", .)
          (1 real change made)
          
          . gen wanted2 = real(swanted2)
          
          .
          . gen wanted1 = real(substr(`v', `p1', `p2' - `p1' - 1))
          
          .
          . list wanted1 wanted2, sep(0)
          
               +-------------------+
               | wanted1   wanted2 |
               |-------------------|
            1. |      90       135 |
            2. |     130        20 |
            3. |       1       130 |
            4. |     786        11 |
            5. |       9       787 |
            6. |      11        90 |
               +-------------------+
          
          .
          . // Second solution
          . restore
          
          . moss problemvar, regex match("([0-9]+)")
          
          . list
          
               +--------------------------------------------------------------------------+
               |                  problemvar   _count   _match1   _pos1   _match2   _pos2 |
               |--------------------------------------------------------------------------|
            1. | abcXYZ (90) 135                    2        90       9       135      13 |
            2. | def (130) comment 20               2       130       6        20      19 |
            3. | mbnuiegh (1) koj 130               2         1      11       130      18 |
            4. | wshli (786) kojepj (11)            2       786       8        11      21 |
            5. | oujiopwe kojkl we (09) 787         2        09      20       787      24 |
               |--------------------------------------------------------------------------|
            6. | ecfh (11) comment 90               2        11       7        90      19 |
               +--------------------------------------------------------------------------+
          
          .
          .
          end of do-file
          Kind regards,
          Konrad
          Version: Stata/IC 13.1



          • #6
            In your problem it seems there are always two wanted numbers. In many other problems things are (much) more complicated. The extra variables, _count and _pos1 onwards, are there for such problems, as when (simple example) one might want to home in on observations for which _count is not 2. That's why moss produces extra variables, which you can always ignore or drop.
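
            A minimal sketch of that tidy-up, assuming only problemvar is in memory before moss is run and that every observation yields at least two matches, as in the example data. The names wanted1 and wanted2 are borrowed from #2.

            ```stata
            * Sketch only: tidy up after moss (install once with: ssc install moss).
            moss problemvar, regex match("([0-9]+)")

            * Home in on anything that does not contain exactly two numbers.
            list problemvar _count if _count != 2

            * Keep numeric copies of the two matches, then drop the bookkeeping variables.
            destring _match1 _match2, generate(wanted1 wanted2)
            drop _count _pos* _match*
            ```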
