Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to round date variables to the nearest Monday?

    Hi everyone,

    I have a dataset with student test scores and test dates for the years 2008-2019. I have all student test scores but I don't have their test dates for the years 2008 to 2014. The data looks something like this:
    student id test_score test_date year school
    1 90 . 2010 100
    1 98 . 2011 100
    1 95 apr 30 2015 2015 100
    1 94 may 1 2016 2016 100
    2 88 . 2013 200
    2 88 . 2014 200
    2 89 may 3 2015 2015 200
    2 90 apr 22 2017 2017 200
    3 70 . 2011 100
    3 75 apr 30 2015 2015 100

    I am imputing these missing test dates by taking the average test date of the non-missing test dates by school but some of these average test dates end up being on a Saturday or Sunday. I know that most students take their tests on a Monday. I would like to round the imputed test dates to the nearest Monday. Does anyone know how to do that? Here is what my code looks like so far:

    gen day = day(test_date)
    gen month = month(test_date)

    egen day_mean = mean(day), by(school)
    egen month_mean = mean(month), by(school)

    gen date_imp = mdy(month_mean, day_mean, year)

    ***insert code to round to nearest monday here!!!

    replace test_date = date_imp if test_date == .



  • #2
    Code:
    clear
    input byte(studentid test_score) int(year school) float test_date
    1 90 2010 100     .
    1 98 2011 100     .
    1 95 2015 100 20208
    1 94 2016 100 20575
    2 88 2013 200     .
    2 88 2014 200     .
    2 89 2015 200 20211
    2 90 2017 200 20931
    3 70 2011 100     .
    3 75 2015 100 20208
    end
    format %td test_date
    
    
    //  IMPUTE SCHOOL'S MEAN TEST DATE FOR MISSING VALUES
    by school, sort: egen school_mean_test_day_of_year = mean(doy(test_date))
    replace school_mean_test_day_of_year = round(school_mean_test_day_of_year)
    gen  school_mean_test_date = mdy(1, 1, year) + school_mean_test_day_of_year
    
    gen imputed_test_date = test_date
    replace imputed_test_date = school_mean_test_date if missing(test_date)
    format imputed_test_date %td
    
    //  NOW SWITCH TO NEAREST MONDAY
    gen weekday = dow(imputed_test_date)
    replace imputed_test_date = imputed_test_date - weekday  ///
        if weekday <= 4
    replace imputed_test_date = imputed_test_date + 7 - weekday ///
        if weekday > 4 & !missing(weekday)
    replace imputed_test_date = imputed_test_date + 1
    assert dow(imputed_test_date) == 1
    format imputed_test_date %td
    The detour through Sunday makes the code slightly easier to write because Stata's dow() function has Sunday = 0.

    By the way, your code for calculating the mean date does not actually calculate the mean date. It calculates the mean day of the month and the mean month of the year and then puts those together, but that is not quite the same thing. For practical purposes it may not make much difference, but I've shown code that actually gets the mean date.

    Next, there is the question of whether you should do this. I don't know what you are going to use these dates for, but for many purposes, imputing a mean value to replace missing values is not a good idea. This is particularly true if you will be using these dates as a variable in a regression model or the equivalent. Unfortunately, when it comes to dealing with missing data, we are usually in a position of having to choose the least bad solution, rather than a good one. But, again, especially for regression, multiple imputation would generally be better than single mean imputation.

    Finally, in the future, when showing data examples, please use the -dataex- command to do so. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
    Last edited by Clyde Schechter; 18 Dec 2020, 12:44. Reason: For some reason, this post posted itself before I was finished writing it!

    Comment


    • #3
      I just noticed an error in the code I posted in #2.

      Code:
      gen school_mean_test_date = mdy(1,1,year) + school_mean_test_day_of_year
      
      // SHOULD BE
      
      gen school_mean_test_date = mdy(12,31,year-1) + school_mean_test_day_of_year
      
      // BECAUSE THE doy() FUNCTION STARTS WITH 1, NOT ZERO

      Comment


      • #4
        Thank you Clyde! Sorry, this was my first post on the stata forum. I'll be sure to use the dataex command next time.

        Comment

        Working...
        X