Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Generate a unified time based on separate variables for date and time

    Hi all statalists!
    In my data, I have the dates of exams (test_date) and the time they started (start_hour).

    test_date is in DMY format. While start_hour is in a weird format (integer), for example, start_hour == 900 says that the exam started at 9:00 a.m. or 09:00 in 24-hour format, and start_hour == 1430 indicates that the exam started at 2:30 p.m. or 14:30 in 24-hour format.

    I attach a sample of my data below:
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input int(test_date start_hour)
    16560  900
    16579 1415
    16589 1445
    16589 1700
    16602 1100
    16615  900
    16618  900
    15465 1430
    15467  900
    15467 1100
    end
    format %td test_date

    I would like to generate two new variables:
    1. time_hour, which takes the values of start_hour and converts them into 24-hour format. For example, when start_hour == 1430, time_hour == 14:30.
    2. date_time, which consists of the date and the time of the exam. For instance, for the first row (test_date == 04may2005 and start_hour == 900), I want to see something like 04 May 2005, 09:00. I need date_time to be in a format that enables me to calculate the difference between two dates. For example, the difference in days between the first observation and the second observation is 19 and in minutes is 27,675.
    Many thanks in advance!

  • #2
    Code:
    tostring start_hour, gen(time_hour) format(%04.0f)
    replace time_hour = substr(time_hour, 1, 2) + ":" + substr(time_hour, 3, 4)
    
    gen int h = floor(start_hour/100)
    gen int m = mod(start_hour, 100)
    gen double date_time = dhms(test_date, h, m, 0)
    format date_time %tc

    Comment


    • #3
      Thanks, Mr. Schechter!

      But what if I want to include time_hour in a regression? For instance, I want to estimate the effect of the time the exam starts on the score
      Code:
      reg score time_hour

      Comment


      • #4
        You can't use that variable for that purpose. You have to use a proper Stata internal format clock variable for that. We can get that by extracting the date from date_time.

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input int(test_date start_hour)
        16560  900
        16579 1415
        16589 1445
        16589 1700
        16602 1100
        16615  900
        16618  900
        15465 1430
        15467  900
        15467 1100
        end
        format %td test_date
        
        gen int h = floor(start_hour/100)
        gen int m = mod(start_hour, 100)
        gen double date_time = dhms(test_date, h, m, 0)
        format date_time %tc
        
        gen double test_time = date_time - cofd(test_date)
        format test_time %tcHH:MM
        The variable pure_time reflects the time of day and is suitable for use in regression analyses or other calculations. That said, -reg score test_time- doesn't seem like a plausible model for most tests. That model implies that as the day progresses, the scores increase linearly. Then when we hit midnight and start the new day, they drop down to the "wee hours of the morning" values and begin their rise over the day again. I can't think of anything that really works that way off the top of my head. Usually these things involve the use of smooth periodic functions like sin() and cos() with appropriate parameters.

        Comment

        Working...
        X