I am cleaning a massive string variable scraped from a court website. The variable contains the text of court case dockets. I want to split the variable by docket entry, as there are multiple docket entries within the variable. Each docket entry starts with a date. So I would like to have Stata find each date within the string and split before the date. The dates are formatted mm/dd/yyyy. I can use regexm to locate a date:
gen docketdate = regexm(docket,"^../../....$")
but I want to use the date to split. I tried nesting the regexm function within strpos, so that I could generate a variable holding the position of the date and then use substr to split at that point, but my code doesn't work:
gen docketdatepos = strpos(docket, regexm("^../../....$"))
I get an invalid syntax error when I enter this.
If I try: gen datepos=strpos(docket,"../../...."), I just get a variable populated only with 0s, as Stata isn't recognizing the wildcards.
So then I tried to use regular expressions to find each date and add a delimiter in front of it, with the idea that I could then parse on the delimiter. But
gen newdocket=regexr(docket,"../../....","docketdate:../../....")
just turns all of my actual dates into literal dots and slashes; the original dates are not preserved.
I also tried concatenating, using regexm within a subinstr call, but that didn't work either:
gen newdocket2=subinstr(docket,regexm("../../...."),regexr("date:"+"?"),.)
I'm out of ideas, and would appreciate any help anyone can offer.