  • import delimited and accidental carriage returns

    Hi Statalist,

    I am importing about 75 CSVs and appending them together. I got these CSVs from a Google Sheet where our surveyors input information about each case. One of the variables is a notes column where manual comments have been typed about each case. The vast majority of the data is fine, but occasionally a surveyor used a carriage return (ALT + ENTER) to separate longer comments onto individual lines within the notes field. When the forms were exported to CSV, the result was that some lines are parsed incorrectly and break before they should.

    An example of what each csv looks like when there is no problem:

    cleancsv.csv:

    RESPID,TYPE,NOTES,VAR1,VAR2,VAR3
    ID1,TYPEA,some notes for the first case, 1,0,1
    ID2,TYPEA,some notes for the second case, 1,1,0
    ID3,TYPEB,some notes for the third case, 0,0,1
    ID4,TYPEA,some notes for the fourth case, 1,1,1
    ID5,TYPEB,some notes for the fifth case, 1,1,0
    ID6,TYPEB,some notes for the sixth case, 0,0,1
    ...

    An example with a manually entered carriage return in the third observation's notes looks like:

    carriagereturncsv.csv:

    RESPID,TYPE,NOTES,VAR1,VAR2,VAR3
    ID1,TYPEA,some notes for the first case, 1,0,1
    ID2,TYPEA,some notes for the second case, 1,1,0
    ID3,TYPEB,some notes for the third case
    that ended up containing a carriage return, 0,0,1
    ID4,TYPEA,some notes for the fourth case, 1,1,1
    ID5,TYPEB,some notes for the fifth case, 1,1,0
    ID6,TYPEB,some notes for the sixth case, 0,0,1

    When I "import delimited using "cleancsv.csv", I get a clean looking dataset that looks like:
    Code:
    clear all
    
    input str50(RESPID TYPE NOTES VAR1 VAR2 VAR3)
    "ID1" "TYPEA" "some notes for the first case" " 1" "0" "1"
    "ID2" "TYPEA" "some notes for the second case" " 1" "1" "0"
    "ID3" "TYPEB" "some notes for the third case" " 0" "0" "1"
    "ID4" "TYPEA" "some notes for the fourth case" " 1" "1" "1"
    "ID5" "TYPEB" "some notes for the fifth case" " 1" "1" "0"
    "ID6" "TYPEB" "some notes for the sixth case" " 0" "0" "1"
    end
    But when I "import delimited using "carriagereturncsv.csv", I get a warped dataset that looks like:

    Code:
    clear all
    
    input str50(RESPID TYPE NOTES VAR1 VAR2 VAR3)
    "ID1" "TYPEA" "some notes for the first case" " 1" "0" "1"
    "ID2" "TYPEA" "some notes for the second case" " 1" "1" "0"
    "ID3" "TYPEB" "notes for the third case"
    "that ended up containing a carriage return " "0" "0" "1"
    "ID4" "TYPEA" "some notes for the fourth case" " 1" "1" "1"
    "ID5" "TYPEB" "some notes for the fifth case" " 1" "1" "0"
    "ID6" "TYPEB" "some notes for the sixth case" " 0" "0" "1"
    The result is that I have an extra observation with RESPID = "that ended up containing a carriage return", and VAR1, VAR2, and VAR3 missing for the third observation with RESPID=ID3.

    The question: is there any systematic way to identify and repair these accidental carriage returns from within Stata?

    Thanks much.
    Last edited by Julian Duggan; 17 Aug 2019, 10:19. Reason: caught a typo

  • #2
    The exact code necessary, and how well my approach might work, would depend on how dependably regular these errors are and on the actual data types (string, float, etc.) in these data sets. However, if the errors are quite regular, an illustrative clean-up technique would be (see comments below):
    Code:
    // Set up the problem variables and their replacements.
    local varlist = "VAR1 VAR2 VAR3"  // presumably longer in reality
    local replist = "TYPE NOTES VAR1" // replacement variables
    //
    // Identify bad observations.
    foreach v of varlist `varlist' {
      replace `v' = strtrim(`v') // stray blanks are likely
    }
    egen goodvars = rownonmiss(`varlist'), strok  // 0 good means a bad observation
    //
    // Replace values in problem observations using variables from the next observation.
    local pos = 1
    foreach v of varlist `varlist' {
       local vrep : word `pos' of `replist'
       replace `v' = `vrep'[_n+1] if (goodvars == 0)
       local ++pos
    }
    Note that what I've shown is vulnerable to serious error if all the variables are missing for some reason other than the problem you mention. It also depends on there being just one carriage return inserted, so that the correct values are in the next observation. I'd also note that, with the limited example you showed here, the code I suggest didn't get much of a test.
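
    If the continuation text itself matters, note that it lands in the next observation's RESPID, so a finishing step -- a sketch under the same assumptions -- could reattach it and then drop the stray rows:
    Code:
    // sketch: reattach the note fragment carried in the stray row's RESPID,
    // then drop the stray rows themselves
    replace NOTES = NOTES + " " + RESPID[_n+1] if (goodvars == 0)
    drop if goodvars[_n-1] == 0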

    Another possibility: if there is something distinctive about the ID variable, perhaps you could identify bad observations by the next observation having an anomalous ID value. Or, at least in your example, an observation is bad if the next observation's NOTES variable is anomalously short. Or, in the actual data, problem observations might be identifiable by non-numeric content in a numeric variable. I might try flagging bad observations using several of these criteria; a sketch of the first idea follows.
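
    For instance, if the real IDs all follow a pattern like "ID" plus digits -- an assumption based only on the example shown here -- the flagging could look like:
    Code:
    // sketch: stray continuation rows have a RESPID that breaks the ID pattern;
    // the observation just before each stray row is the broken one
    gen byte stray = !regexm(RESPID, "^ID[0-9]+$")
    gen byte bad   = (stray[_n+1] == 1)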

    • #3
      The problem is that the export of your Google docs to csv files was done incorrectly. The NOTES values should have been surrounded by quotation marks to indicate that their contents may include special characters (such as commas or line breaks), which Stata can then be instructed to treat as part of the value rather than as delimiters.
      Code:
      . type file2.csv
      RESPID,TYPE,NOTES,VAR1,VAR2,VAR3
      ID1,TYPEA,"some notes for the first case", 1,0,1
      ID2,TYPEA,"some notes for the second case", 1,1,0
      ID3,TYPEB,"some notes for the third case
      that ended up containing a carriage return", 0,0,1
      ID4,TYPEA,"some notes for the fourth case", 1,1,1
      ID5,TYPEB,"some notes for the fifth case", 1,1,0
      ID6,TYPEB,"some notes for the sixth case", 0,0,1
      
      . clear
      
      . import delimited using file2.csv, bindquote(strict)
      (6 vars, 6 obs)
      
      . list notes, sep(1)
      
           +--------------------------------------------------------------------------+
           |                                                                    notes |
           |--------------------------------------------------------------------------|
        1. |                                            some notes for the first case |
           |--------------------------------------------------------------------------|
        2. |                                           some notes for the second case |
           |--------------------------------------------------------------------------|
        3. | some notes for the third case                                            |
           | that ended up containing a carriage return                               |
           |--------------------------------------------------------------------------|
        4. |                                           some notes for the fourth case |
           |--------------------------------------------------------------------------|
        5. |                                            some notes for the fifth case |
           |--------------------------------------------------------------------------|
        6. |                                            some notes for the sixth case |
           +--------------------------------------------------------------------------+
      
      .
      The straightforward way of correcting this problem is to export the forms to CSV files again, taking care to ensure that the non-numeric columns are quoted correctly.

      Added in edit: Crossed with post #2 from Mike Lacy, who provides a good technique for correcting detected errors. The advantage of the approach in this post is that whoever exported the CSV files will learn how to robustly export files that include columns containing text.
      Last edited by William Lisowski; 17 Aug 2019, 11:36.

      • #4
        Hi Mike and William,

        Mike: That's some nifty code, thanks very much. I do happen to already have a case of multiple carriage returns in a single notes field, though. So I may have to adapt a bit further if I go this route.

        William: Thanks for the comment on exporting. How would I go about making sure that non-numeric columns are correctly quoted when exporting?

        I should explain further, too. The Google Sheet that I am exporting has about 75 sheets -- these are the csvs that I need to append together. So I actually download the entire Google Sheet as a .xlsx, then use a Python command-line tool called xlsx2csv, which I can execute from my terminal, to convert each sheet to an independent csv. Then I append the csvs. The code:

        Code:
        global xlsxpath "/Path/to/xlsx/with/75/sheets"
        global csvpath "/Path/to/where/i/save/the/individual/csvs"

        // `sheetname' holds the 75 sheet names, built earlier
        loc i = 1
        foreach s of local sheetname {
            loc terminal `"`terminal' xlsx2csv -n "`s'" "$xlsxpath/file.xlsx" "$csvpath/`s'.csv""'
            if `i' != `: word count `sheetname'' loc terminal "`terminal' & " // no trailing & after the last sheet
            loc ++i
        }
        !`terminal' // this produces all 75 csvs at once

        // append down here (see the sketch below)
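
        A minimal sketch of that append step, assuming `sheetname' still holds the sheet names and the csvs end up with text fields quoted (so that -bindquote(strict)- from #3 applies):

        Code:
        // sketch: import each csv and build up one appended dataset
        tempfile building
        loc first = 1
        foreach s of local sheetname {
            import delimited using "$csvpath/`s'.csv", clear bindquote(strict)
            gen sheet = "`s'" // tag each observation with its source sheet
            if `first' {
                save `building'
                loc first = 0
            }
            else {
                append using `building'
                save `building', replace
            }
        }
        use `building', clear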

        • #5
          Well, if you originally downloaded these as .xlsx files anyway, why go through .csv at all? Why not import directly from .xlsx using -import excel-? I think the problem you are encountering will be completely avoided if you go directly from .xlsx to Stata.

          • #6
            Hi Clyde,

            I thought that, but I was unable to import the .xlsx file directly using -import excel-. Upon typing

            import excel using "/my/file/path/ERD20190817.xlsx", sheet(lists)

            I receive the error:

            file /my/file/path/ERD20190817.xlsx could not be loaded
            r(603);

            I do not actually understand the reason for this error. It may have something to do with the number of rows and columns. But I did not pursue it, because I prefer the csv route anyway: in my experience, -import excel- is a lot slower than -import delimited-, and generating the csvs is fast. As the original Google Sheet I am working from will be updated over and over, I will need to rerun this import process repeatedly, so speed is a somewhat important consideration.

            Thanks,

            Julian

            • #7
              How would I go about making sure that non-numeric columns are correctly quoted when exporting?
              TL;DR: read the fine manual page for the xlsx2csv command.

              I am not a python user, but a quick DuckDuckGo search (not a member of the Google borg, either) turned up as its top hit https://pypi.org/project/xlsx2csv/, a source of one such Python command. The manual page included in the package tells us that
              Code:
              -q nonnumeric
              or
              Code:
              --quoting=nonnumeric
              should do what you need, if the xlsx2csv Python command you are using is the same one I found.
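
              If so -- untested, and assuming the xlsx2csv you are using is the one I found -- the flag would slot into the command line built in #4 like this:
              Code:
              // untested sketch: ask xlsx2csv to quote every non-numeric cell
              loc terminal `"`terminal' xlsx2csv -q nonnumeric -n "`s'" "$xlsxpath/file.xlsx" "$csvpath/`s'.csv""'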

              • #8
                Hi William,

                Thanks! Apologies for farming out more basic google searches to you -- I didn't understand the concept well enough to know that this is the sort of thing xlsx2csv could solve itself.

                Thanks,

                Julian
