  • Dropped Observations after Importing Excel Sheet as a CSV File

    I have an Excel sheet I downloaded as a CSV file. The number of observations is over 150,000. I imported the CSV file and noticed some things were off. After checking the number of observations against the original file, the imported file had only around 141,000 observations. I was wondering if anyone knows why a fair number of observations were dropped after importing the file. (I find this so strange. Also, as a small FYI, I am running Stata 14.)

    This does not happen when I import the data as an Excel file (.xlsx). However, importing the .xlsx file runs slower and my variables come in as string variables. I have tried to destring them with destring, replace, but it does not destring them and reports the variables as non-numeric. I am a little bit at a loss. Thanks for any help and/or leads!

  • #2
    I assume you used import excel to import your .xlsx file. The problem Stata has is that it can be difficult for import excel to determine where the data start and end on the worksheet, and sometimes it doesn't get the right answer. You should try the cellrange() option on import excel to specify exactly where your data are, and use the firstrow option if the first row of your cellrange contains variable names rather than data values.

    If that doesn't solve your problem, we can better help you if we know exactly what commands you have tried and what Stata told you to indicate that there was a problem. Please review the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post. See especially sections 9-12 on how to best pose your question. It's particularly helpful to copy commands and output from your Stata Results window and paste them into your Statalist post using code delimiters [CODE] and [/CODE], as described in section 12 of the FAQ.



    • #3
      One other thing to consider, along with William's helpful advice: if you imported your dataset as a CSV file using -import delimited-, it's possible that around line 141,000 there was a malformed line or something odd in the data, which could have caused the import process to stop there. Analogous to the -cellrange()- option for -import excel-, there is also a -rowrange()- option for -import delimited- that specifies the range of rows to import.
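      As an aside, a quick check outside Stata can show whether the CSV itself holds fewer logical records than physical lines. This is only a sketch in Python (not Stata), using made-up data: a quoted field containing an embedded newline, which Excel produces when a cell holds a line break, makes a CSV parser see fewer records than a text editor sees lines.

```python
import csv
import io

# Made-up data: the second record's quoted field contains an embedded
# newline, as Excel writes when a cell holds a line break.
raw = 'id,comment\n1,"all good"\n2,"line one\nline two"\n3,"fine"\n'

physical_lines = raw.count("\n")               # what a text editor shows
records = list(csv.reader(io.StringIO(raw)))   # what a CSV parser sees

print(physical_lines)    # 5 physical lines (header + 4)
print(len(records) - 1)  # but only 3 data records
```

      If the numbers disagree for your real file, a quoting or embedded-newline problem is a likely cause of the "dropped" rows.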



      • #4
        Leonardo makes a good point, which reminds me that I meant to write that, in general, effort is best expended on importing data directly from the .xlsx file, rather than having Excel translate the .xlsx to CSV and then having Stata import the CSV and translate it to a Stata dataset. That's why my advice in post #2 wasn't about why the CSV didn't work, but instead about how to solve the original problem, which is to read the data that are in the .xlsx file.



        • #5
          That's a good point, William Lisowski. I made my comment mainly for two reasons. One is that Excel automatically associates itself with the .csv file extension, so it may well be that the raw data exist only in that format, and people can mistakenly think of CSV files as Excel files, though that doesn't seem to be the case here. The other is that a similar issue may arise in the future with long CSV files, so it seemed worth a mention. That said, I wonder which is expected to be faster at importing?



          • #6
            Here's a thought about what might cause this oddity:
            I wonder if there are some apparently blank lines at the end of the Excel file that contain nonvisible characters (e.g., just a "space" in some cell). That would cause an -import excel- of the file to come in with string values. *Perhaps* the CSV exported from Excel didn't have those blank rows, and therefore showed a smaller number of observations and also avoided the string values.
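            To illustrate that point concretely, here is a small sketch in Python rather than Stata, with made-up data: a trailing row that looks blank but contains a stray space still counts as a record, and its whitespace fields are nonnumeric, which is exactly the kind of thing that forces an entire column to import as string.

```python
import csv
import io

# Made-up data: the last "blank" row actually holds a stray space in
# each cell -- invisible in a spreadsheet, but still a record to a parser.
raw = "x,y\n1,2\n3,4\n , \n"

records = list(csv.reader(io.StringIO(raw)))
print(len(records) - 1)  # 3 data rows, not the 2 you'd expect
print(records[-1])       # [' ', ' '] -- whitespace fields, nonnumeric
```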



            • #7
              Originally posted by William Lisowski View Post
              All, thank you...as well as Mike Lacy and everyone else in this post. I think(?) I may have figured out what is going on, though I am still at a loss. I can't give too many details of the dataset out. However, some missing values in both datasets appear as "***". For these two files, the "***" are found within one particular year. With regard to the .xlsx file, once I dropped that year, the
              Code:
              destring, replace
              command worked. Though, I am still confused. I spent a good deal of time trying to analyze these Excel files. The .csv file still gives me N=141,709 out of over 150,000 observations. I tried analyzing this, and I can't find the issue. The CSV file in its original format has "***" for the said year, as the .xlsx file does, but (I assume because it's a CSV file) in the Stata data browser/editor the *** turn into "."

              This is how the csv file was imported using
              Code:
              import delimited
              Code:
              import delimited "/#######.csv", case(preserve) rowrange(1:151759) colrange(1:14) encoding(ISO-8859-1)
              The variable case was originally imported as "Lower" and had the same issue. Text encoding: Latin1. Floating-point precision: use as default. Quote binding: loose. Treat sequential delimiters as one. Everything else is automatic.

              Thank you all.
              Last edited by Eli Aguado; 14 Jan 2020, 05:59. Reason: grammar/typos



              • #8
                Originally posted by Leonardo Guizzetti View Post
                Yes, I tried to track down the issue and cannot see what it is. The cellrange set is okay, but something is throwing everything off.



                • #9
                  You do not indicate that you know how to use destring when your data have non-numeric characters. The example below shows three approaches.

                  The first one is the best, because it requires you to examine your data to find and correct all the non-numeric data.

                  The second one assumes that the only place an asterisk occurs is in a value that should be missing.

                  The third one just ignores all non-numeric characters.

                  Neither the second nor the third technique is advisable, because you risk overlooking important problems in your data.
                  Code:
                  cls
                  * Example generated by -dataex-. To install: ssc install dataex
                  clear
                  input str2 var1
                  "1" 
                  "**"
                  "3" 
                  "42"
                  end
                  generate var2 = var1
                  generate var3 = var1
                  
                  tab var1 if missing(real(var1))
                  // after looking at the results of the tab
                  replace var1 = "" if var1=="**"
                  destring var1, replace
                  
                  destring var2, replace ignore("*")
                  
                  destring var3, replace force
                  
                  list, clean
                  Code:
                  . tab var1 if missing(real(var1))
                  
                         var1 |      Freq.     Percent        Cum.
                  ------------+-----------------------------------
                           ** |          1      100.00      100.00
                  ------------+-----------------------------------
                        Total |          1      100.00
                  
                  . // after looking at the results of the tab
                  . replace var1 = "" if var1=="**"
                  (1 real change made)
                  
                  . destring var1, replace
                  var1: all characters numeric; replaced as byte
                  (1 missing value generated)
                  
                  . 
                  . destring var2, replace ignore("*")
                  var2: character * removed; replaced as byte
                  (1 missing value generated)
                  
                  . 
                  . destring var3, replace force
                  var3: contains nonnumeric characters; replaced as byte
                  (1 missing value generated)
                  
                  . 
                  . list, clean
                  
                         var1   var2   var3  
                    1.      1      1      1  
                    2.      .      .      .  
                    3.      3      3      3  
                    4.     42     42     42



                  • #10
                    Originally posted by William Lisowski View Post
                    Hi, William. Thanks for your reply, but I am afraid to say that I am a little lost. Is this for the .xlsx file or the .csv file? The .csv file when it's uploaded to Stata does not have the asterisks (***); though, it is in the actual file. I do not think I have to forcibly destring the variables in the file that got rid of the observations after uploading. Also, it's looking like some of these dropped observations (seemingly) look to be "random."



                    • #11
                      I would
                      • ignore the csv file
                      • import the xlsx file as it was to begin with - without having manually deleted rows from the spreadsheet
                      • follow one of the procedures above to destring the variables that are imported as string



                      • #12
                        Okay, now I am a little worried and even more perplexed--it happened again with another CSV file from a completely different source. The original CSV file has over 90,000 observations, and it dropped down to around 85,000 observations after importing with import delimited. My settings are pretty much the same as I mentioned in a previous post in this thread. I can't ignore all CSV files. I am assuming there is something incorrect in my settings.



                        • #13
                          I would suggest the following.

                          1) add a new column to your Excel worksheet and fill it with the line number. Assuming the first row holds variable names, name this column Obs.
                          2) export this as CSV, import it, and confirm that you again lost observations.
                          3) find missing observations
                          Code:
                          generate skip = Obs - Obs[_n-1] - 1
                          4) look at the imported data and see which observations have skipped observations before them
                          Code:
                          list Obs if skip>0
                          5) now open your CSV file in a text editor and find those observations that were skipped. What's different about them?



                          • #14
                            Originally posted by William Lisowski View Post
                            William, thanks for taking the time to help me. I am not sure if I did this correctly, so here is what I did. I don't see anything different, but it listed all the observations after I ran the code in step 4. I looked at the file in a text editor; it looks the same.
                            Code:
                            gen Obs=""
                            destring, replace
                            generate skip = Obs - Obs[_n-1]+1
                            list Obs if skip>0
                            I received a type mismatch error, so I tried to destring. Not sure if that was the best approach.

                            Addendum: no observations were dropped when I saved the csv file as an excel worksheet file (xlsx)
                            Last edited by Eli Aguado; 18 Jan 2020, 16:05.



                            • #15
                              I'm sorry, but your description of what you did makes no sense to me given my understanding of your data.

                              I have an excel sheet I downloaded as a csv file. The number of observations is over 150,000. I imported the csv file, and noticed some things were off. After checking the number of observations with the original file, the number of observations in the imported file was at around 141,000. ...

                              This does not happen when I import the data as an excel file (.xlsx).
                              So do I understand correctly that you download a CSV and read it into Stata using import delimited, after which you discover that not all the rows were imported? You then open the CSV in Excel, save it as .xlsx, and read the .xlsx into Stata using import excel, and find that all the rows are present?

                              If this is not correct, tell me instead what it is you are doing.

                              If this is correct, then do the following. If this isn't clear to you, don't do something else. Ask for clarification.

                              1) Open the CSV in Excel.
                              2) In Excel, add a new column to the worksheet and fill it with the line number.
                              3) Save it as a new CSV.

                              The point is that you now have a CSV with line numbers, so you can tell which observations are missing after it is read into Stata. Generating a (missing!) observation number in Stata after you have read the CSV into Stata, as you described above, is too late, because the missing rows are already missing!

                              4) Read the new CSV with the line numbers into Stata using import delimited.
                              5) In Stata, do
                              Code:
                              generate skip = Obs - Obs[_n-1] - 1
                              list Obs skip if skip>0
                              where Obs is the name of the Stata variable that holds the line number from the CSV.

                              Each line of the output will tell you the observation number *after* one or more dropped observations, and the value of skip will tell you how many observations were skipped. So if you get Obs=666 and skip=4, you will know that observations 662-665 were dropped.

                              Look at the new CSV and find those observations. How are they different from observations that were not dropped?
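                              For anyone who prefers to run the gap check outside Stata, the same logic can be sketched in Python (made-up line numbers here; the real ones come from the Obs column of the imported data):

```python
# Made-up example: suppose only these line numbers survived the import,
# i.e., lines 662-665 were dropped.
obs = [660, 661, 666, 667]

# For each pair of consecutive surviving rows, count the line numbers
# missing between them and report any gap.
for prev, cur in zip(obs, obs[1:]):
    skip = cur - prev - 1
    if skip > 0:
        print(f"{skip} rows dropped: lines {prev + 1}-{cur - 1}, before Obs={cur}")
# prints "4 rows dropped: lines 662-665, before Obs=666"
```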
