Problems with missing when exporting to .cvs-file

Sigrid Borte

Join Date: Jan 2017

Posts: 5
#1

Problems with missing when exporting to .cvs-file

13 Jan 2017, 09:57

Good evening everyone.
Hope someone can help me with my problem, and that I have managed to get this post approximatey right.

I have problems with missing values in my dataset when I try to export to .cvs-file.

My file is on this format:
AGE SEX HEIGHT WEIGHT sample_key alcohol_weekly audit_binary
71 Male 173.5 102.7 9384872_RO1 0.3234 1
66 Female 163.2 . 8234732_RO2 -0.983 0

etc.. The file contains some missing values for some of the observations.

I would like to export as a cvs-file, and I have been using this code:

Code:

. export delimited using "/Users/sigridbo_adm/Documents/Migrene/Datafiler + variabler/Datafiler/Tromsøstudien/Alkohol/TromsoHarvest_Alkoho > l.csv", delimiter(tab) nolabel replace

The problem with the resulting cvs-file, is that if a value is missing, the variable after is shown in that column.
An example is posted below; the 4th individual is missing "alkohol_weekly_log", so the audit_binary has been moved to that column, i.e. in one colum there is a mixture of different variables.

I have been looking at the manual, this forum and googling without finding an explanation to my problem.
I have tried multple ways of exporting, using code, GUI, different export options, but the problem remain.

I am sorry if there is an obvious solution to my problem that I have not seen, but I hope someone can help me.

Regards
Sigrid Børte
Tags: None
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17707
#2

13 Jan 2017, 10:14

Sigrid:
you may want to try the following work-around:

Code:

use "C:\Program Files (x86)\Stata14\ado\base\a\auto.dta", clear foreach var of varlist price-foreign { replace `var' =9999 if `var'==. }

Then you can export your file remembering to replace 9999 with an empty cell thereafter.

Kind regards,
Carlo
(Stata 19.0)
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35696
#3

13 Jan 2017, 10:16

See also

Code:

help mvencode

which replaces Carlo's loop with one command.
Comment
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17707
#4

13 Jan 2017, 10:24

Nick is right.
Old habits are difficult to give up!

Kind regards,
Carlo
(Stata 19.0)
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35696
#5

13 Jan 2017, 10:34

An important detail is that mvencode checks whether your proposed new value already exists!
Comment
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17707
#6

13 Jan 2017, 10:41

...avoiding that way inadvertent catastrophic mixing-ups of values with quite different meanings!
I often complain I have too little time for studying.
Nick's reply highlights one research field I should devote more (of the limited) time to.

Kind regards,
Carlo
(Stata 19.0)
Comment
Sigrid Borte

Join Date: Jan 2017

Posts: 5
#7

17 Jan 2017, 05:58

Hi!

Thank you very much for your help!
It worked substituting the "." with something else (I just used -99 since it is not in my data).
Just weird Stata is not doing something like that automatically when exporting, as this must be a known problem.. But as long as there is a work-around, it should be fine.
Thanks again!

Regards
Sigrid
Comment
Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#8

17 Jan 2017, 08:54

Sigrid Borte has asked why columns of values are misaligned in a tab-delimited output in the presence of missing values.
Carlo Lazzaro and Nick Cox have provided advice on replacing the missing value with a non-missing value as a placeholder.
Sigrid subsequently asked why Stata doesn't do something like this automatically.

The problem is that the format specifies a tab character (ASCII code 09) to be used as a delimiter, and this character is invisible. Moreover, while specifying to move to the next column, no instruction is given how wide should each column be. All these decisions are left to the text editor/word processor/etc which is used to subsequently open the file.

Importantly, spreadsheets, database applications, statistical packages, and other such programs that expect tabular data, will be looking for the separators, and properly react to missing content.
Stata does produce a correct file, as shown in this example:

Note the double separators where the missing values occur in the file.
This is correct and other software (including Stata itself) should properly insert missing values:

So the question goes back to Sigrid: what makes you think that the file is malformed and any special adjustments are necessary? Which software are you using to subsequently process the tab-delimited data? Can you reproduce the problem with a public dataset?

Best, Sergiy Radyakin

PS:
CVS is a pharmacy in the USA, and possibly in other countries.
CSV is a Comma-Separated-Values file format.
Do not confuse the two. CVS may be is using the CSV file internally, but we don't know this.
In Sigrid's case it is actually a TSV - tab-separated-values file format, since she specified TAB to be used as a delimiter.
3 likes
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35696
#9

17 Jan 2017, 09:51

Sergiy has well articulated my overall puzzlement with this thread. (When it started I was confined to bed and so I didn't study the original post carefully. My contributions are essentially local, not global, i.e. reacting to the previous post, usually.

I would strongly agree that it's a bad convention to call tab-separated files .csv at any level. (.cvs although it's repeated is I assume just a careless typo.)

As I see it, Stata leaves missings implicit between separators. which are not typically going to be misunderstood by other software, so there is no real problem so long as that is understood.
1 like
Comment
Sigrid Borte

Join Date: Jan 2017

Posts: 5
#10

21 Dec 2017, 06:22

Hi!

For some reason I have not seen the last answers here..

I cannot understand how your file displays the "." while mine are not. I am working with the exported dataset in Unix.
I have tried to convert both to comma-separated file and tab-separated file, but the "." is converted to nothing; not even an empty string. To me it results in trouble when I try to set missing value to a particular value. It might be me, but I still think it is strange that Stata gives me nothing in exchange for the ".".

The only solution I have found, without replacing the missing value in Stata, is to use sed to replace the ",," with ex. "NA", but it doesn´t really work either..

Anyway, thank you so much for your feedback! Using the for loop in Stata is what is working best!

(And yes, the CVS was a typo, I don´t know why I keep doing it)

Sigrid
Comment
Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#11

21 Dec 2017, 12:34

Dear Sigrid,

you wrote:

The problem with the resulting cvs-file, is that if a value is missing, the variable after is shown in that column.

How is this established?

Best, Sergiy
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#12

21 Dec 2017, 14:54

Sigrid Borte I think you are misunderstanding Sergiy.

In tab separated values format, successive data items are separated by tabs. He has shown in post #8 in the window titled "Lister" containing the hex dump of the output file that Stata missing values are "missing"- not represented by any character(s) - in the output file, but there are two tabs (hex 09 characters) - one that follows the value before the missing value, and one that precedes the value after the missing value.

His question to you, both post 8 and post 10, is "what did you use to display the listing in the screen shot in your post #1?" Because he, and I, think that the problem is with the way the data was displayed by that process, not with the tab separated value output file. If you were to open your exported file in Excel, the data would line up in the proper columns, as long as you do not check the option to "treat consecutive delimiters as one" in which case the two adjacent tabs will be treated as one, and results like those you showed in post #1 will indeed happen.

Here is a small reproducible example that can be used to confirm that the columns line up in a program - Excel - that correctly handles tab separated data. Note that I use .txt as the suffix for the exported file - a .csv suffix for tab separated values is incorrect and will confuse programs, like Excel, that are asked to read the file, since it incorrectly implies that columns are separated by commas. Stata will let you create a tab delimited file with .csv as the suffix, as you demonstrate in post #1, but this is not something you should do.

Code:

// create a small file with one missing value sysuse auto, clear keep in 1/5 drop weight-foreign list, clean // save as tab sepaerated values export delimited using "Demo.txt", delimiter(tab) replace

And now use Excel to open Demo.txt.
1 like
Comment
Sigrid Borte

Join Date: Jan 2017

Posts: 5
#13

13 Mar 2018, 08:21

Dear all

Yes, I guess I am misunderstanding.

There was a typo in my question, as I have already pointed out; I was exporting a tab-separated file (.txt), not a comma-separated one (.csv), even though I tried both.
However, I am not using excel to read my files, but unix, so the extension does not really matter.

The screen shot from above was from using the "head" command in my terminal.

My problem is that when I am using a tool working with fields, the fields are not read properly. Sergiy asked was how that is established, and that was my question too. All I know is that when I am working with my files in unix, the fields/columns are disturbed.

However - I have used the very useful tips from the first answers, simply replacing all values with "-99 (mvencode _all, mv(-99)) ", and after exporting, replacing all "-99" with "NA" or whatever.

Thank you so much for the help!

Sigrid
Comment

Announcement

Problems with missing when exporting to .cvs-file

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment