Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • *import excel* is slow with a specific 73KB xlsx file [SOLVED]

    Solution: use the cellrange() option.

    This is an informational post. I figured out the answer while typing it out, with the help of William Lisowski's post at at https://www.statalist.org/forums/for...-as-a-csv-file.

    I am trying to import an Excel (.xslx) file using the import excel command. With many Excel files, the command runs to completion and works as expected. However, with one specific 73 KB file, the command runs indefinitely (so far I have waited 9 minutes). The little circle loading icon in the bottom right continues to display while I wait. I can open and view the file in Excel.

    I have tried running import excel with no options as well as with sheet(). I have tried saving the file in a non-OneDrive directory (usually I work in OneDrive).

    When I break (with the red X icon), it looks as though all of my data is loaded (after 9 minutes). Stata has created many variables for the empty columns to the right of my data. Stata has also created many observations (>1M) below the bottom of my data (112 real observations). So the problem seems to be that Stata cannot tell where the data stops.

    Version information
    StataNow/MP 18.5 for Windows (64-bit)
    Revision 16 Jul 2024

  • #2
    Thank you for posting the solution to your problem so others can learn from it.

    I have, on a few occasions, been bitten by the same problem when importing from Excel. In my instances, when I have taken the time to explore it, I have found that there is some cell far outside the range of actual data that has something in it. Because it's way outside, you don't see it when you open and visually inspect the Excel file unless you specifically look for it by scrolling far to the right or far down (or both). But Stata knows it's there and thinks it's part of the data grid.

    To prevent getting tripped up by this again, the first time I try to import from an Excel file I run -import excel using filename, describe-. The output shows me the name of each sheet and the cellrange that Stata thinks constitute the data space within them. If the cellrange seems excessively large compared to my expectations, I then explore in Excel to see what is going on. If I find there are extraneous cells, I specify the -cellrange()- option when actually importing that sheet so as to only bring in the real data.

    Of course, it is possible that in your instance it is not a matter of some far-flung cell with extraneous content--perhaps there is something else preventing Stata from recognizing the extent of the data space. It could be a bug in -import excel-, or some non-bug attribute of how Stata tries to identify the data space, or a corruption of the Excel spreadsheet not severe enough to make it unopenable. After all, another common problem when importing from Excel is ending up with a bunch of empty observations at the end of the Stata data set because Stata thought the data space included some rows that have nothing in them. And maybe there are other possibilities.

    Comment


    • #3
      I run into to this a lot. In some spreadsheets, Stata thinks there's data either vertically or horizontally that does not appear in the spreadsheet. Just wiping the data in column or rows does not always resolve the problem. Deleting the rows and columns always works.

      Comment


      • #4
        The problem with either wiping or deleting is that you are modifying the source of the data in a way that leaves no audit trail. I prefer to avoid doing that. In fact, I usually make Excel files sent to me by others read-only, and do all data management in Stata after importation so that every step of the way is fully documented.

        Comment


        • #5
          missings from the Stata Journal has options dropobs and dropvars geared to this problem.

          dropobs will drop observations that are entirely missings and dropvars will drop variables that are entirely missing.

          Beyond the situation in this thread. many spreadsheet users have habits of leaving blank columns or rows in worksheet for tidiness and such habits make sense in that context, but not so much when worksheets are imported for statistical purposes.

          The command name missings was in one strong sense a bad idea because it's hard to find among many other commands or documents concerning missing values. That bad idea can be circumvented with the otherwise unpredictable detail that dm0085 indexes the original paper and later updates.

          Code:
          . search dm0085, entry
          
          Search of official help files, FAQs, Examples, and Stata Journals
          
          SJ-23-2 dm0085_3  . . . . . . . . . . . . . . . . Software update for missings
                  (help missings if installed)  . . . . . . . . . . . . . . .  N. J. Cox
                  Q2/23   SJ 23(2):595--596
                  most important change is the addition of subcommand breakdown
          
          SJ-20-4 dm0085_2  . . . . . . . . . . . . . . . . Software update for missings
                  (help missings if installed)  . . . . . . . . . . . . . . .  N. J. Cox
                  Q4/20   SJ 20(4):1028--1030
                  sorting has been extended for missings report
          
          SJ-17-3 dm0085_1  . . . . . . . . . . . . . . . . Software update for missings
                  (help missings if installed)  . . . . . . . . . . . . . . .  N. J. Cox
                  Q3/17   SJ 17(3):779
                  identify() and sort options have been added
          
          SJ-15-4 dm0085  Speaking Stata: A set of utilities for managing missing values
                  (help missings if installed)  . . . . . . . . . . . . . . .  N. J. Cox
                  Q4/15   SJ 15(4):1174--1185
                  provides command, missings, as a replacement for, and extension
                  of, previous commands nmissing and dropmiss

          Comment


          • #6
            On #4, I've made a habit of copying excel sheets within the spreadsheet and naming them STATA (or something like that). That way, you have the trail.

            Comment

            Working...
            X