Identify repeated substrings within a string

Archita Sarmah

Join Date: May 2015

Posts: 6
#1

10 Dec 2015, 14:52

Dear Users,

I am writing to request your help with string manipulation. I am trying to write code that identifies the repeated substrings within a string and stores the distinct substrings in a variable.

In my dataset, the variable V_concat holds strings such as the ones below:

“A10XA8AB1DB1XB2CC10AC1DC4AD4AD5AG4CG4XH4AH4EJ1DJ1 FJ1KL1XL4AM1AN5CR1BR3CV7A”

“A10XA14AA2BA3FA3GA4AA5BA7EA7HB1CB1DB2CB6CC10AC11A C1BC1CC1DC1FC1XC2AC6AC7AC8AC9CD10AD11AD5BD7AG2XG4D G4EG4XH1CH4BH4CH4XJ1DJ1FJ1XJ2AJ5BJ5CJ7AL1DL1XL2AL2 BL3AL4AM1AM5BM5XN2AN2BN2CN3AN5AN5BN5CN6AN6BN6DN7DN 7XP1GR1BR3CR3DR3FS1XT1AT1FV1AV3DV3GV7AA10X”

“A10XA10X”

I am considering solving the problem with the functions regexm() and regexs(). As a first step, I want to identify the substrings within a string that start with a letter and end with a letter (in the first example, A10X, A8A, B1D, etc. would be the substrings). As a second step, I want to store in a new variable the set of distinct substrings that are repeated within the string.

I am struggling to write the pattern to pass to regexm(). Any help will be greatly appreciated.

Thanks,
Archita
Nick Cox

Join Date: Mar 2014

Posts: 35754
#2

10 Dec 2015, 15:37

Code:

ssc desc moss
ssc inst moss
moss whatever, regex match("([A-Z].[A-Z])")

Also search the forum for mentions of this program.

Robert Picard

Join Date: Mar 2014

Posts: 1536
#3

10 Dec 2015, 15:42

Well, Nick beat me to it. Here's my draft. Note that my pattern is a bit more flexible: it allows more than one character between the two letters, so it will also match substrings like "A10X".
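A minimal sketch of that difference, using only the built-in regexm() and regexs() functions on two made-up strings. The variable names toy, m_dot and m_plus are just for illustration. Note that regexm() only reports the first match in each observation, so this shows how the two patterns behave; it is not a way to collect every substring.

```stata
clear
input str12 toy
"A10XA8A"
"B1DB1X"
end

* pattern from #2: exactly one character between the two letters
generate m_dot = regexs(1) if regexm(toy, "([A-Z].[A-Z])")

* pattern used below: one or more characters that are neither a space
* nor an uppercase letter between the two letters
generate m_plus = regexs(1) if regexm(toy, "([A-Z][^ A-Z]+[A-Z])")

list
```

On the first string the one-character pattern should skip "A10X" and pick up "A8A", while the second pattern returns "A10X"; on the second string both return "B1D".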

You can use moss (from SSC) to extract the substrings. To install moss, type in Stata's Command window:

Code:

ssc install moss

Here's a quick example using the strings you posted:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte id str245 s
1 "A10XA8AB1DB1XB2CC10AC1DC4AD4AD5AG4CG4XH4AH4EJ1DJ1 FJ1KL1XL4AM1AN5CR1BR3CV7A"                                                                                                                                                                          
2 "A10XA14AA2BA3FA3GA4AA5BA7EA7HB1CB1DB2CB6CC10AC11A C1BC1CC1DC1FC1XC2AC6AC7AC8AC9CD10AD11AD5BD7AG2XG4D G4EG4XH1CH4BH4CH4XJ1DJ1FJ1XJ2AJ5BJ5CJ7AL1DL1XL2AL2 BL3AL4AM1AM5BM5XN2AN2BN2CN3AN5AN5BN5CN6AN6BN6DN7DN 7XP1GR1BR3CR3DR3FS1XT1AT1FV1AV3DV3GV7AA10X"
3 "A10XA10X"                                                                                                                                                                                                                                             
end

moss s, match("([A-Z][^ A-Z]+[A-Z])") regex
drop _pos* _count s

* convert to long form to remove duplicates
reshape long _match, i(id) j(n)
drop if mi(_match)
bysort id _match: keep if _n == 1

The regex pattern breaks down to:

"[A-Z]" a single uppercase letter, followed by
"[^ A-Z]+" one or more of any character except a space or uppercase letter, followed by
"[A-Z]" a single uppercase letter
the parentheses indicate that everything that matches is returned
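
The code above keeps one copy of every distinct substring. If only the substrings that occur more than once within a string are wanted, a small variation on the last step might do it. This is a sketch under assumptions: moss must be installed from SSC, the three strings are made up, and the variable name times is hypothetical.

```stata
clear
input byte id str20 s
1 "A10XA8AB1DA10X"
2 "A10XA10X"
3 "A10XB1D"
end

moss s, match("([A-Z][^ A-Z]+[A-Z])") regex
drop _pos* _count s

* convert to long form, one row per matched substring
reshape long _match, i(id) j(n)
drop if mi(_match)

* count how often each substring occurs within id,
* then keep a single copy of those seen more than once
bysort id _match: generate times = _N
bysort id _match: keep if _n == 1 & times > 1

list id _match times
```

With these toy strings, id 1 and id 2 should each keep "A10X" with times equal to 2, and id 3 should drop out because nothing in it repeats.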


Nick Cox

Join Date: Mar 2014

Posts: 35754
#4

10 Dec 2015, 15:47

"Fast is fine, but accuracy is everything"
"Speed is fine, but accuracy is final" ..

-- attributed to Wyatt Earp
Archita Sarmah

Join Date: May 2015

Posts: 6
#5

11 Dec 2015, 01:32

Dear Nick, Dear Robert,

Many thanks for your help. The code works perfectly for my case.

Rgds,
Archita