Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • odd result when importing excel date where display format differs from underlying value

    I have dataset that was provided to me as an excel file. It includes a date variable. if I open the file in Excel and visually inspect it, the date variable appears to be a variable of the format mm/dd/yy. However, if I click on one of the cells in the date column it becomes clear that the variable was actually entered into Excel as date and time (i.e. mm/dd/yy hh:mm:ss am/pm).

    When I use import excel to import the file I get a date variable that is stored as a double and formatted %td. If I list the data, everything looks fine. A visual comparison of the data and the excel spreadsheet show that everything matches and it appears that I have a daily date. When I tabulate the date variable, though, it becomes clear that something is wrong, as shown below. (Note that the date variable addmission was misspelled on arrival)

    Code:
     import excel "patients V4 7-1 to 7-8.xls", sheet("Sheet1") cellrange(A4:Z115) firstrow clear case(lower)
    
    . desc addmission
    
                  storage   display    value
    variable name   type    format     label      variable label
    ----------------------------------------------------------------
    addmission      double  %td                   Addmission
    
    . sort addmission
    
    . list addmission in 1/5
    
         +-----------+
         | addmiss~n |
         |-----------|
      1. | 01jul2014 |
      2. | 01jul2014 |
      3. | 01jul2014 |
      4. | 01jul2014 |
      5. | 01jul2014 |
         +-----------+
    
    .
    . tab addmission in 1/5
    
     Addmission |      Freq.     Percent        Cum.
    ------------+-----------------------------------
      01jul2014 |          1       20.00       20.00
      01jul2014 |          1       20.00       40.00
      01jul2014 |          1       20.00       60.00
      01jul2014 |          1       20.00       80.00
      01jul2014 |          1       20.00      100.00
    ------------+-----------------------------------
          Total |          5      100.00
    
    .
    . /*some investigating reveals that the underlying variable that Stata has imported is not an integer as I would expect */
    .
    . format addmission %25.0g
    
    .
    . list addmission in 1/5
    
         +--------------------+
         |         addmission |
         |--------------------|
      1. | 19905.013888888891 |
      2. | 19905.050694444442 |
      3. | 19905.055555555555 |
      4. | 19905.386111111111 |
      5. | 19905.435416666667 |
         +--------------------+
    .

    I'm only interested in the date portion of the date so it's easy enough to solve this problem with
    gen admitdate=int(addmission)


    This seems like odd behavior to me, though. By default I would expect Stata to ignore the Excel formatting and import the variable as date and time (integer formatted as %tc). If it didn't do that, I would expect Stata to respect the excel formatting and just import the date portion (integer formatted as %td). The fact that the variables is imported as some sort of weird mix of daily date and fractional time doesn't seem to be desirable to me.

    I'm reluctant to declare this a bug, but I thought it worth noting publicly since it struck me as unexpected behavior.


  • #2
    I have occasionally encountered similar behavior with date imported from Excel that were entered in the way you describe: date and time were entered but the Excel display format is set as date only.

    What is most amazing to me is that the "hybrid" that Stata creates is a double that is not the Stata date-time representation of the date/time combination itself. In your example, if you were to create a variable -gen double x = clock("01jul2014", "DMY") the resulting Stata date-time x is orders of magnitude bigger than 19905 (which is the correct Stata date representation of 1jul2014.) Some playing around with these situations reveals that the decimal part of this weird import is actually the fraction of a day corresponding to the time of day that was entered in Excel. (Which is the way that Excel stores date-times: numerical date plus fraction of day.) So, if you wanted to recover the date and the time, it is easy enough to do that.

    But I agree that it is weird (or at least undocumented) that Stata does this. In fact it appears to contradict what the manual says:
    If an all-numerical column contains at least one cell formatted as a date or time, the entire
    column is imported as a Stata date or datetime variable. import excel imports the column as
    a Stata date if all date cells in Excel are dates only; otherwise, a datetime is used.

    Comment

    Working...
    X