  • String Cleaning - identify part of string and insert text

    I am cleaning a massive string variable scraped from a court website. The variable contains the text of court case dockets. I want to split the variable by docket entry, as there are multiple docket entries within the variable. Each docket entry starts with a date. So I would like to have Stata find each date within the string and split before the date. The dates are formatted mm/dd/yyyy. I can use regexm to locate a date:

    gen docketdate = regexm(docket,"^../../....$")

    but I want to use the date to split. I tried nesting the regexm function within strpos, so I could generate a variable with the position of the date and use the substr function to split at that point, but my code doesn't work:

    gen docketdatepos = strpos(docket, regexm("^../../....$"))

    I get an invalid syntax error when I enter this.

    If I try: gen datepos=strpos(docket,"../../...."), I just get a variable populated only with 0s, as Stata isn't recognizing the wildcards.

    So then I tried to use regular expressions to find each date and add a delimiter in front of the date, with the idea that I could then parse on the delimiter. But

    gen newdocket=regexr(docket,"../../....","docketdate:../../....")

    just turns all of my actual dates into literal dots and slashes; the original dates are not preserved.

    I also tried concatenating using regexm within a subinstr call, but that didn't work either.

    gen newdocket2=subinstr(docket,regexm("../../...."),regexr("date:"+"?"),.)

    I'm out of ideas, and would appreciate any help anyone can offer.

  • #2
    Welcome to Statalist.

    What an interesting problem. I'm not sure I've understood it correctly, but perhaps the following demonstrates some useful applications of regular expressions and other string functionality in Stata. Note that I use the newer Unicode-capable regular expression functions, because the engine behind them is much more capable than that behind the older regular expression functions. The reference I use is

    http://userguide.icu-project.org/strings/regexp

    With that said, here's my code, followed by selected output when it is run on the single sample observation I created.

    Added in edit: The key to this is choosing a character to split on that does not appear in the docket text. The "pipe" character is often used for this sort of purpose in CSV files to avoid having to quote fields containing commas. You may need to choose a different character.
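    One way to check that assumption before splitting is to look for observations that already contain the candidate delimiter. This is only a sketch, assuming the raw text is held in a variable named docket, as in the question. A count of zero suggests the pipe is safe to use.
    Code:
    * Assumption: the scraped text is in a string variable named docket
    * Count observations that already contain the candidate delimiter
    count if strpos(docket, "|") > 0
    * Stop with an error if any observation contains it
    assert strpos(docket, "|") == 0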
    Code:
    clear
    input str200 text
    "11/11/1111 this is the first docket 22/22/2222 this is the second docket 33/33/3333 this is the third docket"
    end
    replace text = ustrregexra(text,"(../../....)","|$1")
    replace text = substr(text,2,.) if substr(text,1,1)=="|"
    list, clean noobs
    split text, parse(|)
    drop text
    describe
    generate id = _n
    reshape long text, i(id) j(docket)
    replace text = trim(text)
    list, clean
    Code:
    . replace text = ustrregexra(text,"(../../....)","|$1")
    (1 real change made)
    
    . replace text = substr(text,2,.) if substr(text,1,1)=="|"
    (1 real change made)
    
    . list, clean noobs
    
                                                                                                    
    >               text  
        11/11/1111 this is the first docket |22/22/2222 this is the second docket |33/33/3333 this i
    > s the third docket  
    
    . split text, parse(|)
    variables created as string:
    text1  text2  text3
    
    . drop text
    
    . describe
    
    Contains data
      obs:             1                          
     vars:             3                          
     size:           108                          
    ------------------------------------------------------------------------------------------------
                  storage   display    value
    variable name   type    format     label      variable label
    ------------------------------------------------------------------------------------------------
    text1           str36   %36s                  
    text2           str37   %37s                  
    text3           str35   %35s                  
    ------------------------------------------------------------------------------------------------
    Sorted by:
         Note: Dataset has changed since last saved.
    
    . generate id = _n
    
    . reshape long text, i(id) j(docket)
    (note: j = 1 2 3)
    
    Data                               wide   ->   long
    -----------------------------------------------------------------------------
    Number of obs.                        1   ->       3
    Number of variables                   4   ->       3
    j variable (3 values)                     ->   docket
    xij variables:
                          text1 text2 text3   ->   text
    -----------------------------------------------------------------------------
    
    . replace text = trim(text)
    (2 real changes made)
    
    . list, clean
    
           id   docket                                   text  
      1.    1        1    11/11/1111 this is the first docket  
      2.    1        2   22/22/2222 this is the second docket  
      3.    1        3    33/33/3333 this is the third docket
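
    Because each "." matches any character, the pattern "(../../....)" will also match text that is not a date, such as the "nd/or/both" inside "and/or/both". A tighter variant, offered as a sketch that has not been run against real dockets, uses \d to require digits. The variable name docket is taken from the question; marked, firstdate, and firstpos are hypothetical names. The last two lines address what the strpos() attempt in the question was aiming for: regexm() and ustrregexm() return only 0 or 1, so the matched text has to be retrieved with ustrregexs(0) before strpos() can locate it.
    Code:
    * Insert the delimiter only before digit-based dates (mm/dd/yyyy)
    generate marked = ustrregexra(docket, "(\d{2}/\d{2}/\d{4})", "|$1")
    * Extract the first date found in the original text
    generate firstdate = ustrregexs(0) if ustrregexm(docket, "\d{2}/\d{2}/\d{4}")
    * Record where that first date starts (missing if no date is found)
    generate firstpos = strpos(docket, ustrregexs(0)) if ustrregexm(docket, "\d{2}/\d{2}/\d{4}")
    Note that strpos() reports a byte position; if the dockets contain non-ASCII characters, ustrpos() reports the position in Unicode characters instead.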
    Last edited by William Lisowski; 11 May 2018, 14:25.
