String Cleaning - identify part of string and insert text

Stephanie Pierce

Join Date: May 2018

Posts: 3
#1

11 May 2018, 11:14

I am cleaning a massive string variable scraped from a court website. The variable contains the text of court case dockets. I want to split the variable by docket entry, as there are multiple docket entries within the variable. Each docket entry starts with a date. So I would like to have Stata find each date within the string and split before the date. The dates are formatted mm/dd/yyyy. I can use regexm to locate a date:

gen docketdate = regexm(docket,"^../../....$")

but I want to use the date to split. I tried nesting the regexm function within strpos, so that I could generate a variable holding the position of the date and then use substr to split at that point, but my code doesn't work:

gen docketdatepos = strpos(docket, regexm("^../../....$"))

I get an invalid syntax error when I enter this.

If I try: gen datepos=strpos(docket,"../../...."), I just get a variable populated only with 0s, as Stata isn't recognizing the wildcards.

So then I tried to use regular expressions to find each date and add a delimiter in front of it, with the idea that I could then parse on the delimiter. But

gen newdocket=regexr(docket,"../../....","docketdate:../../....")

just turns all of my actual dates into literal dots and slashes; the original dates are not preserved.

I also tried combining regexm and regexr within a subinstr command, but that didn't work either:

gen newdocket2=subinstr(docket,regexm("../../...."),regexr("date:"+"?"),.)

I'm out of ideas, and would appreciate any help anyone can offer.

William Lisowski

Join Date: Dec 2014
Posts: 10150

11 May 2018, 14:20

Welcome to Statalist.

What an interesting problem. I'm not sure I've understood it correctly, but perhaps the following demonstrates some useful applications of regular expressions and other string functionality in Stata. Note that I use the newer Unicode-capable regular expression functions, because the engine behind them is much more capable than that behind the older regular expression functions. The reference I use is

http://userguide.icu-project.org/strings/regexp

With that said, here's my code, followed by selected output when it is run on the single sample observation I created.

Added in edit: The key to this is choosing a character to split on that does not appear in the docket text. The "pipe" character is often used for this sort of purpose in CSV files to avoid having to quote fields containing commas. You may need to choose a different character.
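A quick way to check whether a candidate delimiter is safe is to look for it in the data before inserting it. This is only a sketch: it assumes the data are already loaded and that the variable is named text, as in the example below; substitute your own variable name and candidate character.

```stata
* Count observations whose text already contains the candidate delimiter.
* "text" is the variable name used in the example; yours may differ.
count if strpos(text, "|") > 0

* Stop with an error if the delimiter is already present anywhere,
* so the later split cannot break an entry in the wrong place.
assert strpos(text, "|") == 0
```

If the assert fails, pick another character and repeat the check.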

Code:

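* Build a one-observation example containing three docket entries
clear
input str200 text
"11/11/1111 this is the first docket 22/22/2222 this is the second docket 33/33/3333 this is the third docket"
end
* Insert a pipe before every date-shaped match; $1 is the captured match
replace text = ustrregexra(text,"(../../....)","|$1")
* Remove the pipe that was inserted before the first date
replace text = substr(text,2,.) if substr(text,1,1)=="|"
list, clean noobs
* Split on the pipe into text1, text2, ...
split text, parse(|)
drop text
describe
* Reshape to one observation per docket entry
generate id = _n
reshape long text, i(id) j(docket)
replace text = trim(text)
list, clean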

Output:

. replace text = ustrregexra(text,"(../../....)","|$1")
(1 real change made)

. replace text = substr(text,2,.) if substr(text,1,1)=="|"
(1 real change made)

. list, clean noobs

                                                                                                
>               text  
    11/11/1111 this is the first docket |22/22/2222 this is the second docket |33/33/3333 this i
> s the third docket  

. split text, parse(|)
variables created as string:
text1  text2  text3

. drop text

. describe

Contains data
  obs:             1                          
 vars:             3                          
 size:           108                          
------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------------------------
text1           str36   %36s                  
text2           str37   %37s                  
text3           str35   %35s                  
------------------------------------------------------------------------------------------------
Sorted by:
     Note: Dataset has changed since last saved.

. generate id = _n

. reshape long text, i(id) j(docket)
(note: j = 1 2 3)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                        1   ->       3
Number of variables                   4   ->       3
j variable (3 values)                     ->   docket
xij variables:
                      text1 text2 text3   ->   text
-----------------------------------------------------------------------------

. replace text = trim(text)
(2 real changes made)

. list, clean

       id   docket                                   text  
  1.    1        1    11/11/1111 this is the first docket  
  2.    1        2   22/22/2222 this is the second docket  
  3.    1        3    33/33/3333 this is the third docket
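
One caveat about the pattern above: each "." matches any character, so text that merely has slashes in the right places would also be treated as a date. A possible refinement, offered as a sketch rather than something tested on real docket data, is to require digits in each position. The sample text here is made up for illustration; "and/or/both" is included because the all-dots pattern would match part of it.

```stata
* Hypothetical sample text, not taken from any real docket
clear
input str200 text
"01/05/2018 motion by plaintiff and/or/both defendants 02/14/2018 answer filed"
end

* Same approach as above, but each "." is replaced by a digit class:
* two digits, slash, two digits, slash, four digits
replace text = ustrregexra(text, "([0-9]{2}/[0-9]{2}/[0-9]{4})", "|$1")

* Remove the pipe that was inserted before the first date
replace text = substr(text, 2, .) if substr(text, 1, 1) == "|"
list, clean noobs
```

The split and reshape steps would then follow unchanged. This still does not check that the digits form a valid calendar date, only that they have the mm/dd/yyyy shape.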

Last edited by William Lisowski; 11 May 2018, 14:25.
