Dear Users,
I am writing to request your help with string manipulation. I am trying to write code that identifies the repeated substrings within a string and stores those distinct substrings in a variable.
In my dataset I have a variable V_concat whose observations are strings such as the following:
“A10XA8AB1DB1XB2CC10AC1DC4AD4AD5AG4CG4XH4AH4EJ1DJ1FJ1KL1XL4AM1AN5CR1BR3CV7A”
“A10XA14AA2BA3FA3GA4AA5BA7EA7HB1CB1DB2CB6CC10AC11AC1BC1CC1DC1FC1XC2AC6AC7AC8AC9CD10AD11AD5BD7AG2XG4DG4EG4XH1CH4BH4CH4XJ1DJ1FJ1XJ2AJ5BJ5CJ7AL1DL1XL2AL2BL3AL4AM1AM5BM5XN2AN2BN2CN3AN5AN5BN5CN6AN6BN6DN7DN7XP1GR1BR3CR3DR3FS1XT1AT1FV1AV3DV3GV7AA10X”
“A10XA10X”
I am considering solving the problem with the functions regexm() and regexs(). As a first step, I want to identify the substrings within each string that start with a letter and end with a letter (in the first example, A10X, A8A, B1D, and so on). As a second step, I want to store in a new variable the set of distinct substrings that are repeated within the string.
I am struggling to write the pattern to pass to regexm(). Any help will be greatly appreciated.
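To make the goal concrete, here is a rough sketch of the logic I have in mind, written in Python rather than Stata because I have not worked out the regexm() version. It assumes each substring is one uppercase letter, one or more digits, then one uppercase letter, which matches the examples above but is my guess at the general rule. The names repeated_codes and CODE_PATTERN are made up for illustration.

```python
import re
from collections import Counter

# Assumed shape of one substring: a letter, one or more digits, then a letter.
CODE_PATTERN = re.compile(r"[A-Z][0-9]+[A-Z]")

def repeated_codes(v_concat):
    """Return the distinct substrings that occur more than once,
    in order of first appearance."""
    # Strip any stray spaces before matching, then split into codes.
    codes = CODE_PATTERN.findall(v_concat.replace(" ", ""))
    counts = Counter(codes)
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return [code for code in dict.fromkeys(codes) if counts[code] > 1]

print(repeated_codes("A10XA10X"))       # ['A10X']
print(repeated_codes("A10XA8AB1DB1X"))  # []
```

The character classes [A-Z] and [0-9] and the + quantifier are ordinary regular expression syntax, so I would expect a similar pattern to work in regexm(), but I have not confirmed that.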
Thanks,
Archita