Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • import delimited csv with comma as part of data value

    Hello all,

    I have a csv data file (comma separated), however in some string columns they also contain commas as part of the value/text.

    For example: field_of_study could be one word such as "economics" or two or more words also separated by commas such as "Data, Analytics". Note that I put double quote here for the content, in the data there is no double quote around the values.

    Similar example is address where comma is often part of the text.

    I tried to use import delimited but looking for some similar options such as some bindquote() or stripquote() but for commas.

    Is there a way to import such data?

    Many thanks,
    Minh

  • #2
    I don't think there is anything you can do with this. If you can contact the people who created the data set you can ask them to regenerate it with some other delimiter that doesn't actually appear in any of the content. (A pipe "|" delimiter is often perfect for this.) Or you can try to find another version of the file in some other format, like a spreadsheet or some statistical package's data format.

    The dim ray of hope is that perhaps there are only a small number of observations so affected. If that is so, you may be able to edit them. It's relatively easy to spot those instances: after importation there will be one (or sometimes more) extra variable on the end of the data set, and it will be non-missing only in those observations where there is an offending embedded comma. So if you just count how many non-missing values there are in the first "extra" variable you will have a sense of how many of these problems there are.

    But if the problem is pervasive, I think the file is just unreadable.

    Comment


    • #3
      Perhaps I don't understand the problem correctly, but I have no issue with values that contain an embedded comma in some string variables.

      Code:
      * Create toy data set.
      clear
      input byte var1 str4 (var2 var3 var4)
        1 "abc" "d,ef" "ghi"
        2 "jk,l" "mno" "pqr"
      end
      
      list
      
           +---------------------------+
           | var1   var2   var3   var4 |
           |---------------------------|
        1. |    1    abc   d,ef    ghi |
        2. |    2   jk,l    mno    pqr |
           +---------------------------+
      
      * Export csv file with a comma embedded in some string values.
      export delimited commatest.csv
      
      * Import data from csv file.
      import delimited commatest.csv, varnames(1) clear
      
      * List vars (including those with embedded an comma in some string values.
      list
      
           +---------------------------+
           | var1   var2   var3   var4 |
           |---------------------------|
        1. |    1    abc   d,ef    ghi |
        2. |    2   jk,l    mno    pqr |
           +---------------------------+
      Red Owl

      Comment


      • #4
        Red,

        The reason you're not encountering the problem is that when you -export delimited commatest.csv-, export delimited is smart enough to bind in quotes those values that contain embedded commas:

        Code:
        . type commatest.csv
        var1,var2,var3,var4
        1,abc,"d,ef",ghi
        2,"jk,l",mno,pqr
        Minh Nguyen's data set does not have quotes binding those variables. So -import delimited- is unable to distinguish the comma within d,ef from the comma that delimits it from abc or ghi. -import delimited- thinks that d,ef is in fact two separate variables with values d and ef.

        Minh doesn't say where his data set comes from. But I have encountered this kind of data set before, typically generated by a SQL query from a mainframe database system. In my case, the only workable solution has been to ask the data provider to regenerate the data set using a pipe (|) delimiter.


        Comment


        • #5
          Clyde,

          I should have written in my post that I had created the csv file manually with Excel 2016 (on Windows 10) without quotes when I tested it.

          I only created the csv file with quotes in my post because I couldn't link to my original csv file.

          Have you tried to reproduce the issue by manually creating the csv file without quotes directly with Excel? It's a puzzle because it works fine for me even without the quotes in the csv file.

          I wonder if this is related to the version of Excel with which Minh's csv file was created.

          Red Owl

          Comment


          • #6
            Many thanks Clyde and Red.
            That is correct. In the data, I dont have the quotes around the data value, so import delimited can guess the correct columns to import. As a result, the data imported into Stata will have more variables as Clyde described above, and the contents of the variables will be shifted or separated based on the separator (,).

            The problem is extensive. I was planning to do one or two columns at a time, and fix, still there are some values with many commas in the text.

            Comment


            • #7
              Red Owl So, I took your commatest.csv and edited out the quotes. Now watch:

              Code:
              . clear
              
              . type commatest.csv
              var1,var2,var3,var4
              1,abc,d,ef,ghi
              2,jk,l,mno,pqr
              
              . 
              
              
              . import delimited using commatest.csv, varnames(1)
              (5 vars, 2 obs)
              
              . 
              . list
              
                   +---------------------------------+
                   | var1   var2   var3   var4    v5 |
                   |---------------------------------|
                1. |    1    abc      d     ef   ghi |
                2. |    2     jk      l    mno   pqr |
                   +---------------------------------+
              Notice that there are now 5 variables instead of 4. d,ef has been split into two variables, and so has jkl.

              Comment


              • #8
                Minh Nguyen,

                In addition to the good advice given in post #2 above:

                If the data have a single comma as field separator (not followed by any space) and any comma in a string field is followed by a space e.g.

                Data, Analytics
                Post Office Box 123, Shelton, Connecticut 06484


                a record may look like

                0,1,Data, Analytics,3,Post Office Box 203, Shelton, Connecticut 06484,5.5

                and you may replace the ", " with a string not found in your data using -filefilter- before importing:
                Code:
                clear
                tempfile A
                tempfile B
                
                file open A using `A', write text replace
                file write A "0,1,Data, Analytics,3,Post Office Box 203, Shelton, Connecticut 06484,5.5"
                file close A
                
                filefilter `A' `B' , from(", ") to("##")
                
                import delimited using `B' , varnames(nonames) clear
                
                foreach v of varlist * {
                
                    capture replace `v' = subinstr(`v',"##",", ",.)
                }
                
                list
                Last edited by Bjarte Aagnes; 24 Sep 2017, 05:05.

                Comment

                Working...
                X