  • Read in text data, character per character

    Dear Statalisters,

    For further manipulations, I'm looking for a native Stata solution to read in text data so that every character (including spaces) is read in as a separate observation (one character per line).

    What I would basically need, then, is to insert spaces between the characters of my input file.
    With a stream editor (e.g. GNU sed), this can easily be done by replacing every character with itself followed by a space.
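
    For illustration, that sed substitution could even be run from within Stata via the shell escape (just a sketch, assuming sed is available on the system; test_spaced.raw is a hypothetical output name):

    Code:
    !sed 's/./& /g' test.raw > test_spaced.raw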

    The following solution using the filefilter command works, but it is very crude; I am looking for something more elegant and, especially, faster (when it comes to bigger text files).

    ********************* START *********************
    /* generate some test data */
    clear
    set obs 1
    gen str word="Hello world. This is a test! `"
    outfile using test.raw, noquote replace

    /* filefilter the test data:
       replace every character with itself preceded by a space
       [backslash and left quote are handled in an additional step] */
    qui {
        forvalues i=33(1)255 {
            if !(`i'==92|`i'==96) {
                noisily di "`=char(`i')'"
                filefilter test.raw test_1.raw, ///
                    from(`"`=char(`i')'"') to(`" `=char(`i')'"') replace
                erase test.raw
                copy test_1.raw test.raw, replace
            }
        }
        filefilter test.raw test_1.raw, ///
            from(\BS) to(`" \BS"') replace
        erase test.raw
        copy test_1.raw test.raw, replace

        filefilter test.raw test_1.raw, ///
            from(\LQ) to(`" \LQ"') replace
        erase test.raw
        copy test_1.raw test.raw, replace

        /* replace spaces by a placeholder */
        filefilter test.raw test_1.raw, ///
            from(`" "') to(`" *SPACE* "') replace
        erase test.raw
        copy test_1.raw test.raw, replace
    }

    /* infile the data */
    infile str10 char using test, clear

    /* replace the placeholder */
    replace char=" " if char=="*SPACE*"
    compress

    /* erase the test data */
    erase test.raw
    erase test_1.raw
    ********************* END *********************

    So if anyone has any ideas (e.g. using the file command or regular expressions), please let me know; I would be very grateful.
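
    To sketch what I mean by the -file- route, a byte-by-byte read might look something like this (untested; as with the filefilter version above, the quote characters would still need special handling, because they break the macro quoting):

    Code:
    tempname in ph
    tempfile chars
    postfile `ph' str1 char using `chars'
    file open `in' using test.raw, read binary
    file read `in' %1s c
    while r(eof)==0 {
        post `ph' (`"`c'"')    // one observation per byte read
        file read `in' %1s c
    }
    file close `in'
    postclose `ph'
    use `chars', clear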

    Many thanks

    Ali

    P.S.: I'm using Stata 12.1

  • #2
    How about this: using -infix-, you can tell it to read one character at a time. With anything reasonably wide, that could get really tedious, so we use Stata to create the .do file for us! If you're going wider than 300 characters, you might need to modify it in various ways, but I hope this basic approach is attractive and simpler than what you were attempting.

    Code:
    cd c:\data\text
    
    clear
    set obs 100
    *====creating do file
    gen strL var1="cd c:\data\text" if _n==1
    replace var1="clear" if _n==2
    replace var1="infix " if _n==3
    
    *===meat of it.  Assumes 300 characters per line
    forvalues i=1/300 {
        replace var1=var1+" str1 var" + "`i'" + " " + "`i'" + "-" + "`i'" + " " if _n==3
    }
    replace var1=var1+ " using C:\data\text\rawtext.txt" if _n==3
    outfile using "c:\data\text\readraw.do", replace noq
    
    *=====now do the do-file we created!
    clear
    do readraw.do
    gen obs=_n
    reshape long var, i(obs) j(j)
    Last edited by ben earnhart; 09 Feb 2015, 16:25.

    • #3
      BTW -- I was expecting to bump up against a limit on the length of a command, but at least in Stata 13.1/IC, I can go up to the 2,047-variable limit in a single command. 2,047 characters is a pretty good-sized chunk of text, so you should probably be good to go; and if you have SE or MP, I guess you can go beyond that, though I dunno how far.

      • #4
        Duh. You have Stata 12.1, so no long strings, gsem, or unicorns for you. Here's a version that will work with Stata 12. Its 244-character string limit is moot, since we're only reading one character at a time.

        Code:
        cd c:\data\text
        set more off
        clear
        set obs 2000
        *====creating do file
        gen str50 var1="cd c:\data\text" if _n==1
        replace var1="clear" if _n==2
        replace var1="#delim ;" if _n==3
        replace var1="infix " if _n==4
        *===meat of it.  Assumes 300 characters per line
        forvalues i=1/300 {
            replace var1=var1+" str1 var" + "`i'" + " " + "`i'" + "-" + "`i'" + " " if _n==`i' +4
            replace var1=var1+ " using C:\data\text\rawtext.txt;" if `i'==300 & _n==305
        }
        
        outfile using "c:\data\text\readraw.do", replace noq
        
        *=====now do the do-file we created!
        clear
        do readraw.do
        gen obs=_n
        reshape long var, i(obs) j(j)
        Last edited by ben earnhart; 09 Feb 2015, 22:34.

        • #5
          Thank you Ben,

          This was just the elegant solution I was looking for. It works fine AND it is very quick!

          Best

          Ali

          PS: yeah, at the moment I "only" have Stata 12 MP with plenty of available RAM.
          I am thinking of waiting to buy a new version until Stata finally supports Unicode (which is crucial for corpus linguistics).

          • #6
            Hi again,

            I slightly modified Ben's approach: instead of assuming 300 characters per line, the length of the longest word in the text document is calculated first. This value is then used to automatically create Ben's readraw do-file; a sketch of that first step follows.
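
            Something along these lines (my reconstruction rather than the exact code; rawtext.txt as in Ben's example, and "longest line" equals "longest word" if the file holds one word per line):

            Code:
            tempname in
            local maxlen = 0
            file open `in' using rawtext.txt, read text
            file read `in' line
            while r(eof)==0 {
                * the extended macro function avoids the 244-character expression limit
                local maxlen = max(`maxlen', `: length local line')
                file read `in' line
            }
            file close `in'
            di "longest line: `maxlen' characters"
            * ... then use `maxlen' in place of the hard-coded 300: forvalues i=1/`maxlen' { ... }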

            Just compared this approach and my original filefilter version with a bigger data set (around 1 million words).

            Surprisingly, the filefilter version is much faster. Using the timer command, Ben's version took 5.11 min, while the filefilter version only needed 0.67 min to finish.
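
            (For reference, a -timer- skeleton for such a comparison; the comments are placeholders for the two approaches:)

            Code:
            timer clear
            timer on 1
            * ... Ben's infix/reshape approach ...
            timer off 1
            timer on 2
            * ... the filefilter approach ...
            timer off 2
            timer list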

            It seems that the reshape part of Ben's approach is the most time-consuming, I would say more than 95% of it. So instead of using reshape, one could first outfile the raw data, filefilter the superfluous spaces, and then infile it again (see the sketch below). However, at around one minute, this still takes longer than the filefilter approach (and results in some errors).
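
            A rough sketch of that reshape-free variant (untested; wide.raw and long.raw are hypothetical file names, and var* refers to the str1 variables created by readraw.do). Note that real space characters become indistinguishable from -outfile-'s column padding, which is presumably one source of the errors just mentioned:

            Code:
            * after running the generated readraw.do (data in memory: var1, var2, ...)
            outfile var* using wide.raw, noquote replace
            * collapse the runs of spaces left by -outfile-'s column padding
            forvalues k=1/5 {
                filefilter wide.raw long.raw, from("  ") to(" ") replace
                copy long.raw wide.raw, replace
            }
            * with a single free-format variable, -infile- reads one token per observation
            infile str1 char using wide.raw, clear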

            Ali

            • #7
              Originally posted by Alexander Koplenig
              Just compared this approach and my original filefilter version with a bigger data set (around 1 million words).

              Surprisingly, the filefilter version is much faster. Using the timer command, Ben's version took 5.11 min, while the filefilter version only needed 0.67 min to finish.

              Ali
              It's a good thing you tested. From -help filefilter-:

              Because of the buffering design of filefilter, arbitrarily large files can be converted quickly.
