  • Identify repeated substrings within a string

    Dear Users,

    I am writing to request your help with string manipulation. I am trying to write code that identifies the repeated substrings within a string and stores the distinct substrings in a variable.

    In my dataset, the variable V_concat contains strings such as the ones below:

    “A10XA8AB1DB1XB2CC10AC1DC4AD4AD5AG4CG4XH4AH4EJ1DJ1 FJ1KL1XL4AM1AN5CR1BR3CV7A”

    “A10XA14AA2BA3FA3GA4AA5BA7EA7HB1CB1DB2CB6CC10AC11A C1BC1CC1DC1FC1XC2AC6AC7AC8AC9CD10AD11AD5BD7AG2XG4D G4EG4XH1CH4BH4CH4XJ1DJ1FJ1XJ2AJ5BJ5CJ7AL1DL1XL2AL2 BL3AL4AM1AM5BM5XN2AN2BN2CN3AN5AN5BN5CN6AN6BN6DN7DN 7XP1GR1BR3CR3DR3FS1XT1AT1FV1AV3DV3GV7AA10X”

    “A10XA10X”

    I am considering solving the problem with the functions regexm() and regexs(). As a first step, I want to identify the substrings within a string that start with a letter and end with a letter (in the first example, A10X, A8A, B1D, etc. would be the substrings). As a second step, I want to store in a new variable the set of distinct substrings that are repeated within the string.

    I am struggling to write the pattern to use within regexm(). Any help will be greatly appreciated.

    Thanks,
    Archita

  • #2
    Code:
     
    * describe, then install, moss from SSC
    ssc desc moss
    ssc inst moss

    * replace "whatever" with the name of your string variable
    moss whatever, regex match("([A-Z].[A-Z])")
    Also search the forum for mentions of this program.



    • #3
      Well Nick beat me to it. Here's my draft. Note that my pattern is a bit more flexible in that it will match substrings like "A10X".
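
      To see the difference between the two patterns, here is a quick check with Stata's built-in regexm(). It is only a sketch: the "^" and "$" anchors are added for illustration so that each test string must match in full, and they are not part of either pattern as posted.

      ```stata
      * "." stands for exactly one character, so two digits between the letters fail
      display regexm("A10X", "^[A-Z].[A-Z]$")          // 0
      display regexm("A8A",  "^[A-Z].[A-Z]$")          // 1

      * "[^ A-Z]+" accepts one or more characters that are not a space or uppercase letter
      display regexm("A10X", "^[A-Z][^ A-Z]+[A-Z]$")   // 1
      ```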

      You can use moss (from SSC) to extract the substrings. To install moss, type the following in Stata's Command window:

      Code:
      ssc install moss
      Here's a quick example using the strings you posted:

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input byte id str245 s
      1 "A10XA8AB1DB1XB2CC10AC1DC4AD4AD5AG4CG4XH4AH4EJ1DJ1 FJ1KL1XL4AM1AN5CR1BR3CV7A"                                                                                                                                                                          
      2 "A10XA14AA2BA3FA3GA4AA5BA7EA7HB1CB1DB2CB6CC10AC11A C1BC1CC1DC1FC1XC2AC6AC7AC8AC9CD10AD11AD5BD7AG2XG4D G4EG4XH1CH4BH4CH4XJ1DJ1FJ1XJ2AJ5BJ5CJ7AL1DL1XL2AL2 BL3AL4AM1AM5BM5XN2AN2BN2CN3AN5AN5BN5CN6AN6BN6DN7DN 7XP1GR1BR3CR3DR3FS1XT1AT1FV1AV3DV3GV7AA10X"
      3 "A10XA10X"                                                                                                                                                                                                                                             
      end
      
      moss s, match("([A-Z][^ A-Z]+[A-Z])") regex
      drop _pos* _count s
      
      * convert to long form to remove duplicates
      reshape long _match, i(id) j(n)
      drop if mi(_match)
      bysort id _match: keep if _n == 1
      The regex pattern breaks down to:
      • "[A-Z]" a single uppercase letter, followed by
      • "[^ A-Z]+" one or more of any character except a space or uppercase letter, followed by
      • "[A-Z]" a single uppercase letter
      • the parentheses mark the part of the match that moss returns, here the whole substring
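
      If "repeated" is meant strictly, that is, only substrings that occur more than once within the same string, the final step could be adapted along the lines of the sketch below. This reading of the question is an assumption, and the two short strings are made-up toy data, not taken from the original post. The only change from the example above is the condition on _N, the number of observations in each by-group.

      ```stata
      * toy data (hypothetical): A10X occurs twice in id 1, nothing repeats in id 2
      clear
      input byte id str20 s
      1 "A10XB1DA10X"
      2 "A8AB2C"
      end

      * extract the substrings with moss, as in the example above
      moss s, match("([A-Z][^ A-Z]+[A-Z])") regex
      drop _pos* _count s

      * one observation per extracted substring
      reshape long _match, i(id) j(n)
      drop if mi(_match)

      * keep a single copy of each substring that occurs more than once within an id
      bysort id _match: keep if _N > 1 & _n == 1
      ```

      With the toy data, I would expect only A10X for id 1 to remain. Note that a string with no repeated substrings drops out of the data entirely.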



      • #4
        "Fast is fine, but accuracy is everything"
        "Speed is fine, but accuracy is final"
        ..

        -- attributed to Wyatt Earp



        • #5
          Dear Nick, Dear Robert,

          Many thanks for your help. The code works perfectly for my case.

          Rgds,
          Archita
