Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Importing dates of different formats from excel to stata

    I have a dataset created by merging different excel files that all have different date formats (sometimes mixed within files, too). My date variable has the following four formats:

    "1/24/2015"
    "21-Oct"
    "41988"
    "no date"

    41988 is an example of excel's date format. How can I convert these all to the same date format, either when I'm merging the files or cleaning the .dta? I'm using Stata 13.

  • #2
    Yes, importing dates from Excel can be a huge mess. Does each spreadsheet contain only a single uniform format for the dates, or are the various formats interspersed within spreadsheets?

    Also, it will not be possible to convert "21-Oct" to a date, because the year is unknown--unless you have some basis for imputing a year in these circumstances.

    Also, the correct translation of Excel date 41988 to Stata internal dates depends on which version of Excel you are dealing with here.

    Comment


    • #3
      With difficulty. How could it be otherwise? Working upwards,

      "no date" is clearly hopeless.

      "21-Oct" is hopeless unless you know a year.

      "41988" is difficult unless you are prepared to impute day too.

      Code:
       
      gen ddate = daily(date, "MDY")
      will work for dates like "1/24/2015" and leave the others missing.

      Code:
       
      replace ddate = mdy(real(substr(date, 1, 1)), 15, real(substr(date, 2, 4))
      should turn dates like "41988" into the daily date for 4/15/1988 (imputing day 15).

      Comment


      • #4
        Interesting! I didn't notice that 41988 might be a representation of April 1988: I took it as a numerical code internal to Excel that got imported as an integer. For example, for those version of Excel that use 1 Jan 1898 as their base, 41988 would correspond to 17 Dec 2012.

        Comment


        • #5
          So, the problems could even be greater than mentioned if 41988 could mean two quite different things.

          Comment


          • #6
            Thank you for your responses. To answer some of the questions--

            1. Some of the excel files do contain different date formats within them--one has the 21-Oct format and the 41988 format.
            2. All dates with missing years are 2014
            3. 41988 is excel's internal format for the number of days since 1900; so I could do change it into SIF for all dates using something like mydates + td(30dec1899) and then format mydates %td if they were all in this format, but since they're not, I can't apply this to the entire column. This is consistent across all files with the date in this format
            4. I don't have a strong preference for how I code "no date"-- could be 0 or remain a string, if that helps

            As of now I've been trying to separate them into different columns based on the format, but since they're all in string it's a bit of guesswork as to how to tell Stata to differentiate between them

            Comment


            • #7
              2. answers itself.

              Otherwise 4. is easiest to answer.

              "no date" can only defensibly be a missing numeric date in Stata; it can remain a string value in a string variable but those are the limiting choices.

              0 is emphatically not a solution as it means 1 January 1960!

              Comment


              • #8
                Thanks again. I didn't catch that the first time around and had a column of mostly January 1960!

                Anyway, I worked out a solution based on string matching and the date solution proposed above. Comments/criticisms welcome!

                Code:
                *** variable with mixed dates is called calldate***
                
                gen calldate2 = calldate 
                
                
                * Create a variable for dates of the format DD-Month
                
                replace calldate2 = calldate2 + "-2014" if regexm(calldate2, "^[0-9]") == 1 & regexm(calldate2, "[a-z]$") == 1
                gen date_ddmm = calldate2 if regexm(calldate2, "^[0-9]") == 1 & regexm(calldate2, "[a-z]") == 1
                gen date_ddmm2 = date(date_ddmm, "DMY")
                drop date_ddmm
                ren date_ddmm2 date_ddmm
                
                
                * Create a variable for dates using internal excel format
                
                 gen date_excel = calldate2 if  regexm(calldate2, "^4") == 1 & regexm(calldate2, "[0-9][0-9][0-9][0-9)[0-9]") == 1
                 destring(date_excel), replace
                 replace date_excel = date_excel + td(30dec1899)
                
                
                 * Remove the observations that I created separate variables for above
                
                 replace calldate2 = "" if calldate2 =="no call" | regexm(calldate2, "^4") == 1 & regexm(calldate2, "[0-9][0-9][0-9][0-9)[0-9]") == 1 | regexm(calldate2, "^[0-9]") == 1 & regexm(calldate2, "[a-z]") == 1
                
                
                 * Format the remaining observations of the form MM/DD/YYYY
                
                gen calldate3 = date(calldate2, "MDY")
                
                
                egen dates = rowtotal(date_ddmm date_excel calldate3)
                replace dates = . if dates == 0
                format dates %td
                
                drop calldate2 date_ddmm date_excel calldate3)

                Comment

                Working...
                X