
  • Import CSV File - Text Data with uneven quotation marks and delimiter symbols

    Dear Statalist Users,

    I want to import a ".csv" dataset which contains both numeric and string variables.
    First I tried using
    Code:
    import delimited mydata.csv, delimiter(";") varnames(1)
    The command works, but for some observations all variable values end up stored in a single variable. I browsed through the raw data with a text editor, and I think it has to do with the quotation marks ( " ) and the delimiter symbol ( ; ), both of which also occur inside the values of a certain string variable. Some string values look like this
    "bet"yes".
    Hence, I think that due to the uneven number of quotation marks, Stata searches further along the line for a closing quotation mark, which results in my problem.
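
    To make the problem concrete, here is a small made-up example that should reproduce the behaviour (the file, variable names, and values are only for illustration):
    Code:
    * write a tiny csv whose second data field contains an unbalanced quote
    tempfile demo
    file open fh using "`demo'", write text replace
    file write fh "id;text;score" _n
    file write fh `"1;"bet"yes";2"' _n
    file close fh
    import delimited using "`demo'", delimiter(";") varnames(1) clear
    list
    * because of the stray quotation mark, Stata keeps scanning for a
    * closing quote, so the text and score values merge or shift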


    Next, I tried
    Code:
    import delimited mydata.csv, delimiter(";") varnames(1) stripquotes(yes)
    as I thought this might solve the issue, but it does not, and the same problem occurs. Is this because I have not specified the -bindquote- option?


    My next try was
    Code:
    import delimited mydata.csv, delimiter(";")  varnames(1) bindquote(nobind) stripquotes(yes)
    This solves the first problem but creates another. As previously mentioned, I also have string values that contain the delimiter symbol itself, for instance "a;b".
    Stata now treats a semicolon that is meant to be part of a string value as a delimiter. As a result, for observations with a semicolon inside a string, the variable values are shifted to the right, and one extra variable is created for each semicolon that was meant to be a string value.


    Is there a way to tackle both problems at once? I haven't found a solution yet, and my next move would be to stick with the last command and shift the affected variable values back after the import, roughly as sketched below.
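
    This is what I have in mind, although everything here is guesswork until I inspect the imported data: the variable names are placeholders, and the sketch assumes the file should have five columns, that the semicolon-containing string is the third one, and that at most one stray semicolon occurs per row.
    Code:
    * import everything as strings so the rejoin/shift below is type-safe
    import delimited using "mydata.csv", delimiter(";") varnames(1) ///
        bindquote(nobind) stripquotes(yes) stringcols(_all) clear
    * rows that spilled over get an extra column, which import delimited
    * names v6 when the header row supplies only five names
    capture confirm variable v6
    if !_rc {
        replace var3 = var3 + ";" + var4 if !missing(v6)  // rejoin the split field
        replace var4 = var5 if !missing(v6)               // shift values back left
        replace var5 = v6   if !missing(v6)
        drop v6
    }
    destring, replace   // convert numeric columns back from string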


    Best regards,
    Ali

  • #2
    I think you might be able to solve this using regex (in theory, at least). Importing messy data is difficult, and trying to fix it at the point of import is very difficult. Fixing it afterwards (your other thread) is also very difficult and sometimes not even feasible.

    The alternative is to try to fix the data before you import it. Of course, this is not always possible either. However, imagine you know that most of your issues are caused by data that looks like
    Code:
    ...; "okay string"; 12421; "not ok"ay string"; ...
    Then you could use regex to find all cases that follow the pattern semicolon-quote-text-quote-text-quote-semicolon, where you may need to define text as "anything that isn't a semicolon". Then you could use regex to remove the outside quotes so that you get semicolon-text-quote-text-semicolon. You may need to adapt this for cases that contain multiple quotes inside the string.

    Then you look for all cases that follow the pattern semicolon-quote-text-quote-semicolon and remove those quotes as well.

    Now you have data where the only remaining quotes are actual data content, with none functioning as string delimiters. Then you can use your first import delimited code and hopefully get a nice dataset. Also, go yell at the person who constructed the dataset in such a shabby way.

    Oh, I almost forgot: to actually do the regex operations, install a powerful text editor, e.g. notepad++ (free and amazing). Don't ask me to write the actual regex code, though; I am absolutely horrible at it.
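
    That said, just to make the idea concrete, here is a very rough, untested sketch of what the two passes could look like if you run them inside Stata with its ICU regex functions instead of a text editor. Treat every detail as an assumption: the file names are illustrative, the patterns assume text never contains quotes or semicolons and that quotes sit directly next to the delimiters (allow for spaces, e.g. with \s*, if your file has them), and cases with multiple stray quotes in one field will need extra rules.
    Code:
    * copy mydata.csv to mydata_clean.csv line by line, cleaning quotes
    * (assumes each line fits into a Stata macro)
    tempname in out
    file open `in' using "mydata.csv", read text
    file open `out' using "mydata_clean.csv", write text replace
    file read `in' line
    while r(eof) == 0 {
        * pass 1: ;"text"text"; -> ;text"text;  (drop the outer quotes,
        * keep the stray inner quote as data)
        local line = ustrregexra(`"`macval(line)'"', `"(^|;)"([^;"]*)"([^;"]*)"(?=;|$)"', `"$1$2"$3"')
        * pass 2: ;"text"; -> ;text;  (unwrap cleanly quoted fields; fields
        * with a semicolon inside, like "a;b", keep their quotes on purpose)
        local line = ustrregexra(`"`macval(line)'"', `"(^|;)"([^;"]*)"(?=;|$)"', "$1$2")
        file write `out' `"`macval(line)'"' _n
        file read `in' line
    }
    file close `in'
    file close `out'
    * the cleaned file should now work with your first import command
    import delimited using "mydata_clean.csv", delimiter(";") varnames(1) clear
    Since pass 2 deliberately leaves quotes around any field that still contains a semicolon, the default quote binding of that last import delimited call should keep values like "a;b" in one piece.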



    • #3
      Dear Jesse,

      Thanks again for your advice. I will first try to fix the data after the import, since I know that approach can work, or at least most likely can. The logic you present seems clear, but I think fixing things after import is easier in my case because, for instance, I also have data that looks like
      Code:
      ...; "okay string"; 12421; "not ok" ;ay string"; ...
      so I guess I would need even more rules.

      However, I will definitely give regex a try to get familiar with it.


      Best regards,
      Ali



      • #4
        Best of luck.
