Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Problems with missing when exporting to .cvs-file

    Good evening everyone.
    Hope someone can help me with my problem, and that I have managed to get this post approximatey right.


    I have problems with missing values in my dataset when I try to export to .cvs-file.

    My file is on this format:
    AGE SEX HEIGHT WEIGHT sample_key alcohol_weekly audit_binary
    71 Male 173.5 102.7 9384872_RO1 0.3234 1
    66 Female 163.2 . 8234732_RO2 -0.983 0


    etc.. The file contains some missing values for some of the observations.

    I would like to export as a cvs-file, and I have been using this code:

    Code:
    . export delimited using "/Users/sigridbo_adm/Documents/Migrene/Datafiler + variabler/Datafiler/Tromsøstudien/Alkohol/TromsoHarvest_Alkoho
    > l.csv", delimiter(tab) nolabel replace
    The problem with the resulting cvs-file, is that if a value is missing, the variable after is shown in that column.
    An example is posted below; the 4th individual is missing "alkohol_weekly_log", so the audit_binary has been moved to that column, i.e. in one colum there is a mixture of different variables.

    Click image for larger version

Name:	Skjermbilde 2017-01-13 17.46.07.png
Views:	1
Size:	20.9 KB
ID:	1370255



    I have been looking at the manual, this forum and googling without finding an explanation to my problem.
    I have tried multple ways of exporting, using code, GUI, different export options, but the problem remain.


    I am sorry if there is an obvious solution to my problem that I have not seen, but I hope someone can help me.


    Regards
    Sigrid Børte




  • #2
    Sigrid:
    you may want to try the following work-around:
    Code:
    use "C:\Program Files (x86)\Stata14\ado\base\a\auto.dta", clear
    foreach var of varlist price-foreign {
     replace `var' =9999 if `var'==.
      }
    Then you can export your file remembering to replace 9999 with an empty cell thereafter.
    Kind regards,
    Carlo
    (Stata 19.0)

    Comment


    • #3
      See also

      Code:
      help mvencode 
      which replaces Carlo's loop with one command.

      Comment


      • #4
        Nick is right.
        Old habits are difficult to give up!
        Kind regards,
        Carlo
        (Stata 19.0)

        Comment


        • #5
          An important detail is that mvencode checks whether your proposed new value already exists!

          Comment


          • #6
            ...avoiding that way inadvertent catastrophic mixing-ups of values with quite different meanings!
            I often complain I have too little time for studying.
            Nick's reply highlights one research field I should devote more (of the limited) time to.
            Kind regards,
            Carlo
            (Stata 19.0)

            Comment


            • #7
              Hi!

              Thank you very much for your help!
              It worked substituting the "." with something else (I just used -99 since it is not in my data).
              Just weird Stata is not doing something like that automatically when exporting, as this must be a known problem.. But as long as there is a work-around, it should be fine.
              Thanks again!


              Regards
              Sigrid

              Comment


              • #8
                Sigrid Borte ​​​​​​ has asked why columns of values are misaligned in a tab-delimited output in the presence of missing values.
                Carlo Lazzaro and Nick Cox have provided advice on replacing the missing value with a non-missing value as a placeholder.
                Sigrid subsequently asked why Stata doesn't do something like this automatically.

                The problem is that the format specifies a tab character (ASCII code 09) to be used as a delimiter, and this character is invisible. Moreover, while specifying to move to the next column, no instruction is given how wide should each column be. All these decisions are left to the text editor/word processor/etc which is used to subsequently open the file.

                Importantly, spreadsheets, database applications, statistical packages, and other such programs that expect tabular data, will be looking for the separators, and properly react to missing content.
                Stata does produce a correct file, as shown in this example:
                Click image for larger version

Name:	StataTabDelimitedOutput.png
Views:	1
Size:	51.4 KB
ID:	1370598



                Note the double separators where the missing values occur in the file.
                This is correct and other software (including Stata itself) should properly insert missing values:
                Click image for larger version

Name:	StataTabDelimitedOutput2.png
Views:	1
Size:	10.9 KB
ID:	1370599




                So the question goes back to Sigrid: what makes you think that the file is malformed and any special adjustments are necessary? Which software are you using to subsequently process the tab-delimited data? Can you reproduce the problem with a public dataset?

                Best, Sergiy Radyakin

                PS:
                CVS is a pharmacy in the USA, and possibly in other countries.
                CSV is a Comma-Separated-Values file format.
                Do not confuse the two. CVS may be is using the CSV file internally, but we don't know this.
                In Sigrid's case it is actually a TSV - tab-separated-values file format, since she specified TAB to be used as a delimiter.

                Comment


                • #9
                  Sergiy has well articulated my overall puzzlement with this thread. (When it started I was confined to bed and so I didn't study the original post carefully. My contributions are essentially local, not global, i.e. reacting to the previous post, usually.

                  I would strongly agree that it's a bad convention to call tab-separated files .csv at any level. (.cvs although it's repeated is I assume just a careless typo.)

                  As I see it, Stata leaves missings implicit between separators. which are not typically going to be misunderstood by other software, so there is no real problem so long as that is understood.

                  Comment


                  • #10
                    Hi!

                    For some reason I have not seen the last answers here..

                    I cannot understand how your file displays the "." while mine are not. I am working with the exported dataset in Unix.
                    I have tried to convert both to comma-separated file and tab-separated file, but the "." is converted to nothing; not even an empty string. To me it results in trouble when I try to set missing value to a particular value. It might be me, but I still think it is strange that Stata gives me nothing in exchange for the ".".

                    The only solution I have found, without replacing the missing value in Stata, is to use sed to replace the ",," with ex. "NA", but it doesn´t really work either..

                    Anyway, thank you so much for your feedback! Using the for loop in Stata is what is working best!

                    (And yes, the CVS was a typo, I don´t know why I keep doing it)


                    Sigrid

                    Comment


                    • #11
                      Dear Sigrid,

                      you wrote:
                      The problem with the resulting cvs-file, is that if a value is missing, the variable after is shown in that column.
                      How is this established?

                      Best, Sergiy

                      Comment


                      • #12
                        Sigrid Borte I think you are misunderstanding Sergiy.

                        In tab separated values format, successive data items are separated by tabs. He has shown in post #8 in the window titled "Lister" containing the hex dump of the output file that Stata missing values are "missing"- not represented by any character(s) - in the output file, but there are two tabs (hex 09 characters) - one that follows the value before the missing value, and one that precedes the value after the missing value.

                        His question to you, both post 8 and post 10, is "what did you use to display the listing in the screen shot in your post #1?" Because he, and I, think that the problem is with the way the data was displayed by that process, not with the tab separated value output file. If you were to open your exported file in Excel, the data would line up in the proper columns, as long as you do not check the option to "treat consecutive delimiters as one" in which case the two adjacent tabs will be treated as one, and results like those you showed in post #1 will indeed happen.

                        Here is a small reproducible example that can be used to confirm that the columns line up in a program - Excel - that correctly handles tab separated data. Note that I use .txt as the suffix for the exported file - a .csv suffix for tab separated values is incorrect and will confuse programs, like Excel, that are asked to read the file, since it incorrectly implies that columns are separated by commas. Stata will let you create a tab delimited file with .csv as the suffix, as you demonstrate in post #1, but this is not something you should do.

                        Code:
                        // create a small file with one missing value
                        sysuse auto, clear
                        keep in 1/5
                        drop weight-foreign
                        list, clean
                        // save as tab sepaerated values
                        export delimited using "Demo.txt", delimiter(tab) replace
                        And now use Excel to open Demo.txt.

                        Comment


                        • #13
                          Dear all

                          Yes, I guess I am misunderstanding.

                          There was a typo in my question, as I have already pointed out; I was exporting a tab-separated file (.txt), not a comma-separated one (.csv), even though I tried both.
                          However, I am not using excel to read my files, but unix, so the extension does not really matter.

                          The screen shot from above was from using the "head" command in my terminal.

                          My problem is that when I am using a tool working with fields, the fields are not read properly. Sergiy asked was how that is established, and that was my question too. All I know is that when I am working with my files in unix, the fields/columns are disturbed.


                          However - I have used the very useful tips from the first answers, simply replacing all values with "-99 (mvencode _all, mv(-99)) ", and after exporting, replacing all "-99" with "NA" or whatever.


                          Thank you so much for the help!

                          Sigrid

                          Comment

                          Working...
                          X