  • Dropped Observations after Importing Excel Sheet as a CSV File

    I have an Excel sheet I downloaded as a CSV file. The number of observations is over 150,000. I imported the CSV file and noticed some things were off. After checking the number of observations against the original file, the imported file had only around 141,000 observations. I was wondering if anyone knows why a fair number of observations were dropped after importing the file. (I find this so strange. Also, as a small FYI, I am running Stata 14.)

    This does not happen when I import the data as an Excel file (.xlsx). However, importing the .xlsx file runs slower and my variables come in as string variables. I have tried to destring them with destring, replace, but it does not destring them and reports the variables as non-numeric. I am a little bit at a loss. Thanks for any help and/or leads!

  • #2
    I assume you used import excel to import your .xlsx file. The problem Stata has is that it can be difficult for import excel to determine where the data start and end on the worksheet, and sometimes it doesn't get the right answer. You should try the cellrange() option on import excel to specify exactly where your data are, and use the firstrow option if the first row of your cellrange contains variable names rather than data values.

    If that doesn't solve your problem, we can better help you if we know exactly what commands you have tried and what Stata told you to indicate that there was a problem. Please review the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post. See especially sections 9-12 on how to best pose your question. It's particularly helpful to copy commands and output from your Stata Results window and paste them into your Statalist post using code delimiters [CODE] and [/CODE], as described in section 12 of the FAQ.



    • #3
      One other thing to consider, along with William's helpful advice: if you imported your dataset as a CSV file using -import delimited-, it's possible that around line 141,000 there was a malformed line or something odd in the data, which could have caused the import process to stop there. Analogous to the -cellrange()- option for -import excel-, there is also a -rowrange()- option for -import delimited- that specifies the range of rows to import.
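      As an aside, a quick check outside Stata can show whether the CSV itself holds fewer logical records than physical lines. This is only a sketch in Python (not Stata), using made-up data: a quoted field containing an embedded newline, which Excel produces when a cell holds a line break, makes a CSV parser see fewer records than a text editor sees lines.

```python
import csv
import io

# Made-up data: the second record's quoted field contains an embedded
# newline, as Excel writes when a cell holds a line break.
raw = 'id,comment\n1,"all good"\n2,"line one\nline two"\n3,"fine"\n'

physical_lines = raw.count("\n")               # what a text editor shows
records = list(csv.reader(io.StringIO(raw)))   # what a CSV parser sees

print(physical_lines)    # 5 physical lines (header + 4)
print(len(records) - 1)  # but only 3 data records
```

      If the numbers disagree for your real file, a quoting or embedded-newline problem is a likely cause of the "dropped" rows.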



      • #4
        Leonardo makes a good point, which reminds me that I meant to write that, in general, effort is best expended on importing data directly from the .xlsx file, rather than having Excel translate the .xlsx to CSV and then having Stata import the CSV and translate it to a Stata dataset. That's why my advice in post #2 wasn't about why the CSV didn't work, but instead about how to solve the original problem, which is to read the data that are in the .xlsx file.



        • #5
          That's a good point, William Lisowski. I made my comment mainly for two reasons. One is that Excel automatically associates itself with the .csv file extension, so it may well be that the raw data exist only in that format, and people can mistakenly think of CSV files as Excel files, though that doesn't seem to be the case here. The other is that a similar issue may arise in the future with long CSV files, so it seemed worth a mention. That said, I wonder which is expected to be faster at importing?



          • #6
            Here's a thought about what might cause this oddity:
            I wonder if there are some apparently blank lines at the end of the Excel file that contain nonvisible characters (e.g., just a "space" in some cell). That would cause an -import excel- of the file to come in with string values. *Perhaps* the CSV exported from Excel didn't have those blank rows, and therefore showed a smaller number of observations and also avoided the string values.
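            To illustrate that point concretely, here is a small sketch in Python rather than Stata, with made-up data: a trailing row that looks blank but contains a stray space still counts as a record, and its whitespace fields are nonnumeric, which is exactly the kind of thing that forces an entire column to import as string.

```python
import csv
import io

# Made-up data: the last "blank" row actually holds a stray space in
# each cell -- invisible in a spreadsheet, but still a record to a parser.
raw = "x,y\n1,2\n3,4\n , \n"

records = list(csv.reader(io.StringIO(raw)))
print(len(records) - 1)  # 3 data rows, not the 2 you'd expect
print(records[-1])       # [' ', ' '] -- whitespace fields, nonnumeric
```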



            • #7
              Originally posted by William Lisowski View Post
              All, thank you...as well as Mike Lacy and everyone else in this post. I think(?) I may have figured out what is going on, though I am still at a loss. I can't give too many details of the dataset out. However, some missing values in both datasets appear as "***". For these two files, the "***" are found within one particular year. With regard to the .xlsx file, once I dropped that year, the
              Code:
              destring, replace
              command worked. Though, I am still confused. I spent a good deal of time trying to analyze these Excel files. The .csv file still gives me N=141,709 out of over 150,000 observations. I tried analyzing this, and I can't find the issue. The CSV file in its original format has "***" for the said year, as the .xlsx file does, but (I assume because it's a CSV file) in the Stata data browser/editor the *** turn into "."

              This is how the csv file was imported using
              Code:
              import delimited
              Code:
              import delimited "/#######.csv", case(preserve) rowrange(1:151759) colrange(1:14) encoding(ISO-8859-1)
              The variable case was originally imported as "Lower" and had the same issue. Text encoding: Latin1. Floating-point precision: use as default. Quote binding: loose. Treat sequential delimiters as one. Everything else is automatic.

              Thank you all.
              Last edited by Eli Aguado; 14 Jan 2020, 05:59. Reason: grammar/typos



              • #8
                Originally posted by Leonardo Guizzetti View Post
                Yes, I tried to track down the issue and cannot see what it is. The cellrange set is okay, but something is throwing everything off.



                • #9
                  You do not indicate that you know how to use destring when your data have non-numeric characters. The example below shows three approaches.

                  The first one is the best, because it requires you to examine your data to find and correct all the non-numeric data.

                  The second one assumes that the only place an asterisk occurs is in a value that should be missing.

                  The third one just ignores all non-numeric characters.

                  Neither the second nor the third technique is advisable, because you risk overlooking important problems in your data.
                  Code:
                  cls
                  * Example generated by -dataex-. To install: ssc install dataex
                  clear
                  input str2 var1
                  "1" 
                  "**"
                  "3" 
                  "42"
                  end
                  generate var2 = var1
                  generate var3 = var1
                  
                  tab var1 if missing(real(var1))
                  // after looking at the results of the tab
                  replace var1 = "" if var1=="**"
                  destring var1, replace
                  
                  destring var2, replace ignore("*")
                  
                  destring var3, replace force
                  
                  list, clean
                  Code:
                  . tab var1 if missing(real(var1))
                  
                         var1 |      Freq.     Percent        Cum.
                  ------------+-----------------------------------
                           ** |          1      100.00      100.00
                  ------------+-----------------------------------
                        Total |          1      100.00
                  
                  . // after looking at the results of the tab
                  . replace var1 = "" if var1=="**"
                  (1 real change made)
                  
                  . destring var1, replace
                  var1: all characters numeric; replaced as byte
                  (1 missing value generated)
                  
                  . 
                  . destring var2, replace ignore("*")
                  var2: character * removed; replaced as byte
                  (1 missing value generated)
                  
                  . 
                  . destring var3, replace force
                  var3: contains nonnumeric characters; replaced as byte
                  (1 missing value generated)
                  
                  . 
                  . list, clean
                  
                         var1   var2   var3  
                    1.      1      1      1  
                    2.      .      .      .  
                    3.      3      3      3  
                    4.     42     42     42



                  • #10
                    Originally posted by William Lisowski View Post
                    Hi, William. Thanks for your reply, but I am afraid to say that I am a little lost. Is this for the .xlsx file or the .csv file? The .csv file when it's uploaded to Stata does not have the asterisks (***); though, it is in the actual file. I do not think I have to forcibly destring the variables in the file that got rid of the observations after uploading. Also, it's looking like some of these dropped observations (seemingly) look to be "random."



                    • #11
                      I would
                      • ignore the csv file
                      • import the xlsx file as it was to begin with - without having manually deleted rows from the spreadsheet
                      • follow one of the procedures above to destring the variables that are imported as string



                      • #12
                        Okay, now I am a little worried and even more perplexed--it happened again with another CSV file from a completely different source. The original CSV file has over 90,000 observations, and it dropped down to around 85,000 observations after importing with import delimited. My settings are pretty much the same as I mentioned in a previous post in this thread. I can't ignore all CSV files. I am assuming there is something incorrect in my settings.



                        • #13
                          I would suggest the following.

                          1) add a new column to your Excel worksheet and fill it with the line number. Assuming the first row holds variable names, name this column Obs.
                          2) export this as CSV, import it, and confirm that you again lost observations.
                          3) find missing observations
                          Code:
                          generate skip = Obs - Obs[_n-1] - 1
                          4) look at the imported data and see which observations have skipped observations before them
                          Code:
                          list Obs if skip>0
                          5) now open your CSV file in a text editor and find those observations that were skipped. What's different about them?



                          • #14
                            Originally posted by William Lisowski View Post
                            William, thanks for taking the time to help me. I am not sure if I did this correctly, so here is what I did. I don't see anything different, but it listed all the observations after I ran the code in step 4. I looked at the file in a text editor; it looks the same.
                            Code:
                            gen Obs=""
                            destring, replace
                            generate skip = Obs - Obs[_n-1]+1
                            list Obs if skip>0
                            I received a type mismatch error, so I tried to destring. Not sure if that was the best approach.

                            Addendum: no observations were dropped when I saved the csv file as an excel worksheet file (xlsx)
                            Last edited by Eli Aguado; 18 Jan 2020, 16:05.



                            • #15
                              I'm sorry, but your description of what you did makes no sense to me given my understanding of your data.

                              I have an excel sheet I downloaded as a csv file. The number of observations is over 150,000. I imported the csv file, and noticed some things were off. After checking the number of observations with the original file, the number of observations in the imported file was at around 141,000. ...

                              This does not happen when I import the data as an excel file (.xlsx).
                              So do I understand correctly that you download a CSV and read it into Stata using import delimited, after which you discover that not all the rows were imported? You then open the CSV in Excel, save it as .xlsx, and read the .xlsx into Stata using import excel, and find that all the rows are present?

                              If this is not correct, tell me instead what it is you are doing.

                              If this is correct, then do the following. If this isn't clear to you, don't do something else. Ask for clarification.

                              1) Open the CSV in Excel.
                              2) In Excel, add a new column to the worksheet and fill it with the line number.
                              3) Save it as a new CSV.

                              The point is that you now have a CSV with line numbers, so you can tell which observations are missing after it is read into Stata. Generating a (missing!) observation number in Stata after you have read the CSV into Stata, as you described above, is too late, because the missing rows are already missing!

                              4) Read the new CSV with the line numbers into Stata using import delimited.
                              5) In Stata, do
                              Code:
                              generate skip = Obs - Obs[_n-1] - 1
                              list Obs skip if skip>0
                              where Obs is the name of the Stata variable that holds the line number from the CSV.

                              Each line of the output will tell you the observation number *after* one or more dropped observations, and the value of skip will tell you how many observations were skipped. So if you get Obs=666 and skip=4, you will know that observations 662-665 were dropped.

                              Look at the new CSV and find those observations. How are they different from observations that were not dropped?
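                              For anyone who prefers to run the gap check outside Stata, the same logic can be sketched in Python (made-up line numbers here; the real ones come from the Obs column of the imported data):

```python
# Made-up example: suppose only these line numbers survived the import,
# i.e., lines 662-665 were dropped.
obs = [660, 661, 666, 667]

# For each pair of consecutive surviving rows, count the line numbers
# missing between them and report any gap.
for prev, cur in zip(obs, obs[1:]):
    skip = cur - prev - 1
    if skip > 0:
        print(f"{skip} rows dropped: lines {prev + 1}-{cur - 1}, before Obs={cur}")
# prints "4 rows dropped: lines 662-665, before Obs=666"
```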
