  • Command infile with string data which includes blanks

    Hi all,

    I am trying to read data into Stata 13 from a text-file which has a format as in the attached file (test.txt). Essentially the file is one long string and one observation is 125 characters, sometimes including blank spaces. After 125 characters a new observations starts and there is no delimiter.

    I tried reading the data using:

    infile str125 v1 using "test.txt"

    However, this gives me the following .dta file (attachment: Test.png). So the infile command apparently can’t handle blanks in the 125-character string and starts a new observation. Is there any way to tell infile that it should include blanks in the string variable?

    Best regards

    Carlo
    Last edited by Carlo Wix; 24 Apr 2014, 05:26.

  • #2
    Some additional info:

    I have also tried reading the data using:

    infix str v1 1-125 using "test.txt"

    But Stata then tells me: "(0 observations read)".

    However, if I manually format the data so that it looks like test2.txt, then I get the following .dta file (attachment: Test2.png), which is exactly what I am looking for. But since the actual .txt file is quite large, I cannot manually reformat all the data.



    • #3
      Given your first structure, with blocks of 4 lines, you can go something like

      Code:
      egen block = seq(), block(4)
      sort block, stable
      by block : gen everything = v1 if _n == 1
      by block : replace everything = everything[_n-1] + " " + v1 if _n > 1
      by block : keep if _n == 4
      You still have to parse the data.
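As a rough sketch of that parsing step, assuming the fields sit at fixed positions within each record, substr() can cut the long string into separate variables. The layout below (8 characters of date, 10 of name, free text from column 19 on) and the toy record are invented for illustration and are not taken from test.txt. trim() is used rather than strtrim() because the question concerns Stata 13.

```stata
* Toy record standing in for one assembled observation.
* Assumed (hypothetical) layout: cols 1-8 date, 9-18 name, 19+ text.
clear
input str29 everything
"20140424Carlo Wix test record"
end

gen str8  date_raw = substr(everything, 1, 8)
gen str10 name     = trim(substr(everything, 9, 10))
gen str20 rest     = trim(substr(everything, 19, .))  // . = to end of string
list
```

With the real data, the start positions and widths would have to come from the file's documentation.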



      • #4
        I was going to suggest that for your basic problem -- that your source file is a single line -- you would need to use something other than Stata, such as a text processing utility (awk, sed) or a general programming language like Python, to pre-process it into something that Stata can read.

        Then I remembered that Mata is quite a general programming language, and has the raw input/output facilities needed:
        Code:
        mata:
          file = fopen("test.txt", "r")
          outfile = fopen("testout.dat","w")
        
          while ((line=fread(file,126))!=J(0,0,"")) {
            fwrite(outfile, line)
            fwrite(outfile,char(10))
          }
          fclose(file)
          fclose(outfile)
        end
        This reads 126 bytes at a time from your source file and writes each chunk to another file (which must not already exist), followed by a line-feed character. On Windows you may need a carriage-return/line-feed combination: in that case, put "fwrite(outfile,char(13))" before the "char(10)" line.

        Note that the data you provided are in 126-byte, not 125-byte, chunks.
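Once the output file has one record per line, a fixed-column read like the infix attempt in #2 should work, for example with infix str v1 1-125 using "testout.dat". The sketch below shows the same idea on a made-up file with two 10-character records. The file name toy_records.dat, the handle name toy, and the contents are assumptions for illustration only.

```stata
* Write a toy stand-in for testout.dat: one fixed-width record per line.
* File name and contents are hypothetical.
file open toy using "toy_records.dat", write text replace
file write toy "ab cd  efg" _n
file write toy "hi jk lmno" _n
file close toy

* Fixed-column read: blanks inside columns 1-10 stay in the string
infix str v1 1-10 using "toy_records.dat", clear
list
```

Because infix reads by column position rather than by whitespace-separated token, embedded blanks do not start a new observation.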



        • #5
          Carlo,

          Here's yet another solution which doesn't require any trickery. Create a dictionary (say, test.dct) containing the following:

          Code:
          dictionary using test.txt {
          _lrecl(126)
          str125 v1 %125s
          }
          and deploy it with infile using test.dct.

          Of course, you can also read in individual variables instead of one long string, according to your specifications.
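A dictionary that reads individual variables might look like the sketch below. The three fields, their names, and their column positions are hypothetical (the thread does not describe the real layout). Only _lrecl(126) and the file name come from the dictionary above.

```stata
dictionary using test.txt {
    _lrecl(126)
    * Hypothetical layout: adjust names, columns, and widths to the real file
    _column(1)   str8    date_raw  %8s
    _column(9)   str10   name      %10s
    _column(19)  str107  rest      %107s
}
```

It would be used the same way, with infile using followed by the dictionary's file name.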

          Regards,
          Joe



          • #6
            I haven't tried it, but Brendan's code

            Code:
             
            fwrite(outfile, line)
            fwrite(outfile, char(10))
            could I imagine be tweaked to

            Code:
            fwrite(outfile, line + char(10))



            • #7
              Thanks a lot everyone for your help!

              Joe's code solved the problem and was exactly what I was looking for.

              @Nick: Your code worked perfectly for my provided example. However, in my real dataset I had the additional issue that the blocks were of varying size depending on the observation.

              Best regards

              Carlo



              • #8
                OK. The technique could easily be extended, e.g. suppose each block has the same starter text.

                Code:
                 
                gen block = (v1 == "starter text") 
                replace block = sum(block) 
                sort block, stable 
                by block : gen everything = v1 if _n == 1
                and so forth with the last condition if _n == _N
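Spelled out in full, the varying-block version could look like the sketch below. The toy data and the literal "starter text" are placeholders. The real marker would be whatever line opens each block in the actual file.

```stata
* Toy data: two blocks of different lengths, each opened by the same line
clear
input str20 v1
"starter text"
"first a"
"first b"
"starter text"
"second a"
end

gen block = sum(v1 == "starter text")   // running count of block starts
sort block, stable
by block: gen everything = v1 if _n == 1
by block: replace everything = everything[_n-1] + " " + v1 if _n > 1
by block: keep if _n == _N              // last row holds the full string
list block everything
```

This leaves one observation per block, however many lines each block had.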



                • #9
                  Joe's succinct solution has made me realise that infile via dictionary is rather more powerful and general than I thought.



                  • #10
                    I don't deserve much credit for this solution. I haven't had to do this before, but the computer scientist in me couldn't believe that Stata didn't have some sort of built-in (albeit little used) solution for this sort of problem. The term lrecl (and its meaning) is a leftover from days gone by when text files were frequently stored without carriage returns.



                    • #11
                      Yes, something about lrecl made me recall FORTRAN and punchcards. Or VAX/VMS at any rate.



                      • #12
                        Hi all
                        I would like to know how to eliminate observations for missing data: these data do not exist in my database, yet they are included in my model and treated as existing observations. Thanks.

