Dropped Observations after Importing Excel Sheet as a CSV File

Eli Aguado

Join Date: Jan 2017

Posts: 19
#16

18 Jan 2020, 18:42

Originally posted by William Lisowski View Post

I'm sorry, but your description of what you did makes no sense to me given my understanding of your data.

So do I understand correctly that you download a CSV and read it into Stata using import delimited after which you discover that not all the rows were imported? You then open the CSV in Excel, save it as XLSX and read the XLSX into Stata using import Excel and find that all the rows are present?

If this is not correct, tell me instead what it is you are doing.

If this is correct, then do the following. If this isn't clear to you, don't do something else. Ask for clarification.

1) Open the CSV in Excel.
2) In Excel, add a new column to the worksheet and fill it with the line number.
3) Save it as as a new CSV.

The point is you now have a CSV which has line numbers, so that you can tell what observations are missing after it is read into Stata. Generating a (missing!) observation number in Stata after you have read the CSV into Stata as you described above is too late the because the missing rows are already missing!

4) Read the new CSV with the line numbers into Stata using import delimited.
5) In Stata, do

Code:

generate skip = Obs - Obs[_n-1]+1 list Obs skip if skip>0

where Obs is the name of the Stata variable that holds the line number from the CSV.

Each line of the output will tell you the observation number *after* one or more dropped observations. And the value of skip will tell you how many observations were skipped. So if you get Obs=666 and skip=4 you will know that observations 662-665 were dropped.

Look at the new CSV and find those observations. How are they different from observations that were not dropped?

I imported two completely different csv files from two different sources when I use import delimited. I noticed many, in one case thousands, of observations dropped after importing the file. When I downloaded the csv files from the data sources, I saved them as a xlsx file. I imported these files into Stata as an Excel worksheet and none of the observations were dropped in either file. It's perplexing.
Comment
Eli Aguado

Join Date: Jan 2017

Posts: 19
#17

18 Jan 2020, 19:15

Originally posted by William Lisowski View Post

I'm sorry, but your description of what you did makes no sense to me given my understanding of your data.

So do I understand correctly that you download a CSV and read it into Stata using import delimited after which you discover that not all the rows were imported? You then open the CSV in Excel, save it as XLSX and read the XLSX into Stata using import Excel and find that all the rows are present?

If this is not correct, tell me instead what it is you are doing.

If this is correct, then do the following. If this isn't clear to you, don't do something else. Ask for clarification.

1) Open the CSV in Excel.
2) In Excel, add a new column to the worksheet and fill it with the line number.
3) Save it as as a new CSV.

The point is you now have a CSV which has line numbers, so that you can tell what observations are missing after it is read into Stata. Generating a (missing!) observation number in Stata after you have read the CSV into Stata as you described above is too late the because the missing rows are already missing!

4) Read the new CSV with the line numbers into Stata using import delimited.
5) In Stata, do

Code:

generate skip = Obs - Obs[_n-1]+1 list Obs skip if skip>0

where Obs is the name of the Stata variable that holds the line number from the CSV.

Each line of the output will tell you the observation number *after* one or more dropped observations. And the value of skip will tell you how many observations were skipped. So if you get Obs=666 and skip=4 you will know that observations 662-665 were dropped.

Look at the new CSV and find those observations. How are they different from observations that were not dropped?

For every observation, with the exception of the first one, skip is coded as "2."

Code:

2020 2 20 1920 .75 1 . 2020 2 20 1921 .75 2 2 2020 2 20 1922 .75 3 2 2020 2 20 1923 .75 4 2 2020 2 20 1924 .75 5 2 2020 2 20 1925 .75 6 2 2020 2 20 1926 .75 7 2 2020 2 20 1927 .75 8 2 2020 2 20 1928 .75 9 2 2020 2 20 1929 .75 10 2 2020 2 20 1930 .75 11 2 2020 2 20 1931 .75 12 2 2020 2 20 1932 .75 13 2 2020 2 20 1933 .75 14 2 2020 2 20 1934 .75 15 2 2020 2 20 1935 .75 16 2 2020 2 20 1936 .75 17 2 2020 2 20 1937 .75 18 2 2020 2 20 1938 .75 19 2 2020 2 20 1939 .75 20 2 2020 2 20 1940 .75 21 2 2020 2 20 1941 .75 22 2 2020 2 20 1942 .75 23 2 2020 2 20 1943 .75 24 2
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#18

18 Jan 2020, 19:59

Sorry, typo, and I thought of some improvements.

In post #16, change step 5 to

Code:

replace skip = Obs - (Obs[_n-1]+1) tab skip, missing list Obs skip if skip>0
Comment

Eli Aguado

Join Date: Jan 2017
Posts: 19

#19

18 Jan 2020, 20:34

Originally posted by William Lisowski View Post

Sorry, typo, and I thought of some improvements.

In post #16, change step 5 to

Code:

replace skip = Obs - (Obs[_n-1]+1)
tab skip, missing
list Obs skip if skip>0

Hi, William. Thank you. The following are the results:

Code:

 tab skip, missing

       skip |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |     86,107      100.00      100.00
       4406 |          1        0.00      100.00
          . |          1        0.00      100.00
------------+-----------------------------------
      Total |     86,109      100.00

. list Obs skip if skip>0

       +--------------+
       |   Obs   skip |
       |--------------|
    1. |     1      . |
42550. | 46956   4406 |
       +--------------+

Edit: I don't have the slightest clue why all those values were skipped. Nothing sticks out in the csv file.

Last edited by Eli Aguado; 18 Jan 2020, 20:52.

Comment

William Lisowski

Join Date: Dec 2014

Posts: 10150
#20

19 Jan 2020, 08:53

So apparently not every observation had skip = 2 as you reported in post #17.

We see that 4405 observations starting at 42551 and ending at 42955 were omitted. My guess is that something is strange in your CSV file at or around observation 42551.

In post #16 I wrote the very detailed descriptrion of what I thought you had done

So do I understand correctly that you download a CSV and read it into Stata using import delimited after which you discover that not all the rows were imported? You then open the CSV in Excel, save it as XLSX and read the XLSX into Stata using import Excel and find that all the rows are present?

If this is not correct, tell me instead what it is you are doing.

to which you replied

When I downloaded the csv files from the data sources, I saved them as a xlsx file.

which is not nearly so detailed. Were the data stored at the source as XLSX files or as CSV files? What does it mean to download a CSV file a the data source and save it as an XLSX file? Download is the verb that describes having a web site give you a copy of a file. Generally, my web browser is displaying the data source, I click on a link, and the data source sends the file in whatever format it has it in. If it's a CSV there's no way for the download process to change it to an XLSX file.

Write a recipe that describes step by step how you got the CSV and XLSX files. Extra points if the data source is publicly accessible and you can give a URL to download a copy.
Comment
Eli Aguado

Join Date: Jan 2017

Posts: 19
#21

20 Jan 2020, 16:33

Originally posted by William Lisowski View Post

So apparently not every observation had skip = 2 as you reported in post #17.

We see that 4405 observations starting at 42551 and ending at 42955 were omitted. My guess is that something is strange in your CSV file at or around observation 42551.

In post #16 I wrote the very detailed descriptrion of what I thought you had done

to which you replied

which is not nearly so detailed. Were the data stored at the source as XLSX files or as CSV files? What does it mean to download a CSV file a the data source and save it as an XLSX file? Download is the verb that describes having a web site give you a copy of a file. Generally, my web browser is displaying the data source, I click on a link, and the data source sends the file in whatever format it has it in. If it's a CSV there's no way for the download process to change it to an XLSX file.

Write a recipe that describes step by step how you got the CSV and XLSX files. Extra points if the data source is publicly accessible and you can give a URL to download a copy.

With one data source, the data source provides the following: 1) the data saved as a csv file and 2) the data saved as an excel worksheet (.xlsx file). When I import the csv data with import delimited, observations are dropped. When I import the .xlsx file as an Excel worksheet into Stata no observations are dropped.

With the data used in the latest posts, it is only provided by the authors as a csv file. When I download the file, and import it the csv file into stata using import delimited, I lose observations. When I download the csv file, and then save as an xlsx file, import it into Stata as an Excel worksheet, no observations are dropped. I can provide you the URL for this data file, but I would need to message you as I already had two separate cases of my research being stolen; so, needless to say, I am a bit paranoid at this stage.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#22

21 Jan 2020, 10:08

Eli Aguado and I have been in correspondence and I have downloaded his data from their original sources. I have a subsequent post coming but want to start with this very pointed one.

Eli - It appears that there were a number of updates to Stata 14 across its life that addressed problems with import delimited and in particular a problem that caused "portions of the data" to be skipped. This is a fairly niche problem involving line-end characters in the input dataset, perhaps more likely encountered on a Mac.

If your copy of Stata 14 is not fully updated, you should update it now. That may be sufficient to resolve your problems. Assuming you have administrator privileges on your system, type update query in Stata's Command window. On my system, running Stata 16, I see

Code:

. update query (contacting http://www.stata.com) Update status Last check for updates: 21 Jan 2020 New update available: 08 Jan 2020 (what's new) Current update level: 11 Dec 2019 (what's new) Possible actions Install available updates (or type -update all-)

which tells me updates are available and gives me a link to click or a command to type in order to update my copy of Stata.
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

#23

21 Jan 2020, 11:34

The first example dataset has 151,762 lines in total. Lines 1-4 are heading information, and nothing in them is suitable for use as variable names. Lines 5-151,758 contain 14 columns of comma-separate values. Columns 2, 3, and 4 contain text; the other columns apparently contain numbers. So my attempt to import the data follows.

Code:

. import delimited Example1.csv, rowrange(5) varnames(nonames)
(14 vars, 151,758 obs)

. describe

Contains data
  obs:       151,758                          
 vars:            14                          
--------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
--------------------------------------------------------------------------------
v1              int     %8.0g                
v2              str41   %41s                  
v3              str41   %41s                  
v4              str6    %9s                  
v5              str6    %9s                  
v6              str6    %9s                  
v7              str6    %9s                  
v8              str6    %9s                  
v9              str6    %9s                  
v10             str6    %9s                  
v11             str6    %9s                  
v12             str6    %9s                  
v13             str6    %9s                  
v14             str6    %9s                  
--------------------------------------------------------------------------------
Sorted by:
     Note: Dataset has changed since last saved.

Now, the first thing I notice is that import delimited in my installation of Stata 16 did not reproduce the problem Eli reported. I hope that my advice in post #22 resolves his problem. I'll continue this exploration to its conclusion, however.

The next thing I notice is that Stata has decided that columns 5-15 contain non-numeric data and has imported them as string variables rather than numeric variables.

Here's a standard approach to figuring out what's wrong.

Code:

. foreach v of varlist v5-v14 {
  2.     tab `v' if missing(real(`v')), missing
  3.         }

         v5 |      Freq.     Percent        Cum.
------------+-----------------------------------
            |     56,460       95.21       95.21
          * |      2,842        4.79      100.00
------------+-----------------------------------
      Total |     59,302      100.00

         v6 |      Freq.     Percent        Cum.
------------+-----------------------------------
            |    113,216       98.77       98.77
          * |      1,406        1.23      100.00
------------+-----------------------------------
      Total |    114,622      100.00

[remaining tabs omitted]

It turns out that 7,000+ lines in the CSV have a single asterisk in one or more of columns 5-14. Looking at the CSV in a text editor shows me where to look for examples.

Code:

. list v5-v14 in 141710/141714, clean

          v5   v6   v7   v8   v9   v10   v11   v12   v13   v14  
141710.    *    0    0    0    0     0     0           *     0  
141711.    *    *   36    0    0     0    36    36     *     *  
141712.   14   14    6    0    0     0     0          20    20  
141713.    9    7    0    0    0     0     *     *     8     7  
141714.    *    *    5    0    0     *     0     *     *     *

We see that there are both standard text missing values (e.g v12 in observation 141710) and asterisks (v6 and v13 in observation 141710). I don't know what to make of the asterisks, so I convert them to the text representation of one of Stata's special numeric missing values. I can then destring each of the variables and have a workable dataset that fully represents the input data, including the anomalous asterisks for me to deal with once I understand their meaning.

Code:

. foreach v of varlist v5-v14 {
  2.     replace `v' = ".s" if `v'=="*"
  3.         destring `v', replace
  4.         }
(2,842 real changes made)
v5: all characters numeric; replaced as long
...
v14: all characters numeric; replaced as long
(112607 missing values generated)

. describe

Contains data
  obs:       151,758                          
 vars:            14                          
--------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
--------------------------------------------------------------------------------
v1              int     %8.0g                
v2              str41   %41s                  
v3              str41   %41s                  
v4              str6    %9s                  
v5              long    %10.0g                
v6              long    %10.0g                
v7              long    %10.0g                
v8              long    %10.0g                
v9              long    %10.0g                
v10             long    %10.0g                
v11             long    %10.0g                
v12             long    %10.0g                
v13             long    %10.0g                
v14             long    %10.0g                
--------------------------------------------------------------------------------
Sorted by:
     Note: Dataset has changed since last saved.

. list v5-v14 in 141710/141714, clean

          v5   v6   v7   v8   v9   v10   v11   v12   v13   v14  
141710.   .s    0    0    0    0     0     0     .    .s     0  
141711.   .s   .s   36    0    0     0    36    36    .s    .s  
141712.   14   14    6    0    0     0     0     .    20    20  
141713.    9    7    0    0    0     0    .s    .s     8     7  
141714.   .s   .s    5    0    0    .s     0    .s    .s    .s

One might argue that I could have used the numericcols() option on import delimited to force the storage of v5-v14 as numeric variables, replacing all anomalous content with missing values. But there are two arguments against that. First, the tabs I did are required to understand your data. Wholesale replacement of non-numeric data with missing values risks overlooking important problems in your data. And second, perhaps the asterisks are meaningfully different than the missing values that also appear in other variables on the same observation, and that meaning should be taken into account in the analysis. By replacing the asterisks with special missing values, I can distinguish them in subsequent analyses.

Added in Edit: The second example dataset imported perfect and had no missing or non-numeric values. This reinforces my believe that the problem lies within Eli's copy of Stata 14 and may be remedied by updating.

Last edited by William Lisowski; 21 Jan 2020, 11:40.

Comment

Eli Aguado

Join Date: Jan 2017

Posts: 19
#24

30 Jan 2020, 05:52

Originally posted by William Lisowski View Post

The first example dataset has 151,762 lines in total. Lines 1-4 are heading information, and nothing in them is suitable for use as variable names. Lines 5-151,758 contain 14 columns of comma-separate values. Columns 2, 3, and 4 contain text; the other columns apparently contain numbers. So my attempt to import the data follows.

Code:

. import delimited Example1.csv, rowrange(5) varnames(nonames) (14 vars, 151,758 obs) . describe Contains data obs: 151,758 vars: 14 -------------------------------------------------------------------------------- storage display value variable name type format label variable label -------------------------------------------------------------------------------- v1 int %8.0g v2 str41 %41s v3 str41 %41s v4 str6 %9s v5 str6 %9s v6 str6 %9s v7 str6 %9s v8 str6 %9s v9 str6 %9s v10 str6 %9s v11 str6 %9s v12 str6 %9s v13 str6 %9s v14 str6 %9s -------------------------------------------------------------------------------- Sorted by: Note: Dataset has changed since last saved.

Now, the first thing I notice is that import delimited in my installation of Stata 16 did not reproduce the problem Eli reported. I hope that my advice in post #22 resolves his problem. I'll continue this exploration to its conclusion, however.

The next thing I notice is that Stata has decided that columns 5-15 contain non-numeric data and has imported them as string variables rather than numeric variables.

Here's a standard approach to figuring out what's wrong.

Code:

. foreach v of varlist v5-v14 { 2. tab `v' if missing(real(`v')), missing 3. } v5 | Freq. Percent Cum. ------------+----------------------------------- | 56,460 95.21 95.21 * | 2,842 4.79 100.00 ------------+----------------------------------- Total | 59,302 100.00 v6 | Freq. Percent Cum. ------------+----------------------------------- | 113,216 98.77 98.77 * | 1,406 1.23 100.00 ------------+----------------------------------- Total | 114,622 100.00 [remaining tabs omitted]

It turns out that 7,000+ lines in the CSV have a single asterisk in one or more of columns 5-14. Looking at the CSV in a text editor shows me where to look for examples.

Code:

. list v5-v14 in 141710/141714, clean v5 v6 v7 v8 v9 v10 v11 v12 v13 v14 141710. * 0 0 0 0 0 0 * 0 141711. * * 36 0 0 0 36 36 * * 141712. 14 14 6 0 0 0 0 20 20 141713. 9 7 0 0 0 0 * * 8 7 141714. * * 5 0 0 * 0 * * *

We see that there are both standard text missing values (e.g v12 in observation 141710) and asterisks (v6 and v13 in observation 141710). I don't know what to make of the asterisks, so I convert them to the text representation of one of Stata's special numeric missing values. I can then destring each of the variables and have a workable dataset that fully represents the input data, including the anomalous asterisks for me to deal with once I understand their meaning.

Code:

. foreach v of varlist v5-v14 { 2. replace `v' = ".s" if `v'=="*" 3. destring `v', replace 4. } (2,842 real changes made) v5: all characters numeric; replaced as long ... v14: all characters numeric; replaced as long (112607 missing values generated) . describe Contains data obs: 151,758 vars: 14 -------------------------------------------------------------------------------- storage display value variable name type format label variable label -------------------------------------------------------------------------------- v1 int %8.0g v2 str41 %41s v3 str41 %41s v4 str6 %9s v5 long %10.0g v6 long %10.0g v7 long %10.0g v8 long %10.0g v9 long %10.0g v10 long %10.0g v11 long %10.0g v12 long %10.0g v13 long %10.0g v14 long %10.0g -------------------------------------------------------------------------------- Sorted by: Note: Dataset has changed since last saved. . list v5-v14 in 141710/141714, clean v5 v6 v7 v8 v9 v10 v11 v12 v13 v14 141710. .s 0 0 0 0 0 0 . .s 0 141711. .s .s 36 0 0 0 36 36 .s .s 141712. 14 14 6 0 0 0 0 . 20 20 141713. 9 7 0 0 0 0 .s .s 8 7 141714. .s .s 5 0 0 .s 0 .s .s .s

One might argue that I could have used the numericcols() option on import delimited to force the storage of v5-v14 as numeric variables, replacing all anomalous content with missing values. But there are two arguments against that. First, the tabs I did are required to understand your data. Wholesale replacement of non-numeric data with missing values risks overlooking important problems in your data. And second, perhaps the asterisks are meaningfully different than the missing values that also appear in other variables on the same observation, and that meaning should be taken into account in the analysis. By replacing the asterisks with special missing values, I can distinguish them in subsequent analyses.

Added in Edit: The second example dataset imported perfect and had no missing or non-numeric values. This reinforces my believe that the problem lies within Eli's copy of Stata 14 and may be remedied by updating.

Dear William,
As usual, thank you for taking the time to look into this situation.

I updated Stata, and the number of observations were no longer altered. I feel a bit bad with the back-and-forth for a simple fix, but I guess I learned an important lesson. Hopefully this will help others too.

More so, thank you for providing useful advice regarding special missing values, as opposed to wholesale replacement. This is something that I always wondered about--so thank you!
Comment

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment