
  • 53 Weeks and tsset seem to clash

    Hey,

    I'm currently working on a weekly time series project and cleaning my dataset. As it spans 2020 to 2023, the 53rd week of 2020 is a constant pain.
    I've already done this to create a numerical date variable in the format YYYYwWW and a week variable with ISO 8601 week numbers:

    Code:
    * Convert the string date to a Stata date variable
    gen date = date(date_str, "YMD")
    format date %td
    gen year=year(date)
    
    * Get ISO8601 dates
    gen ISOweek =int((doy(7*int((date-mdy(1,1,1900))/7)+ mdy(1,1,1900) + 3)+6)/7)
    gen week = ISOweek
    drop ISOweek date_str
    When I use
    Code:
    tesset date
    Stata puts my id 44 with the week 2020w53 at the end. I also tried this solution from Nick in an older post:

    Code:
    egen numweek = group(date), label
    replace numweek = 3297 in 169
    format numweek %tw
    replace date = numweek if id ==44
    sort id
    It didn't help, because when I use
    Code:
    tesset date
    afterwards, it just deletes the 2020w53 and sorts it at the end. If I use Nick's code after the tsset and try to generate first differences in a foreach loop, I get the error "not sorted":

    Code:
    local variables "wai tavg_Berlin tavg_Bremen tavg_Hamburg airbnb_Berlin airbnb_Bremen airbnb_Hamburg booking_Berlin booking_Bremen booking_Hamburg urlaub_topic_Berlin urlaub_topic_Bremen urlaub_topic_Hamburg anzfallvortag_Berlin anzfallvortag_Bremen anzfallvortag_Hamburg kumfall_Berlin kumfall_Bremen kumfall_Hamburg"
    
    sort id 
    
    * Generate FD of LOGs
    foreach v of local variables {
        gen dln_`v' = d.ln_`v'
        label variable dln_`v' "f. diff. log `v'"
    }
    
    not sorted
    What am I doing wrong here, and is there a solution (and what would it be)?
    I'm sorry if a solution has already been posted in the forum; I haven't found it yet, so please point me to it.

    Best
    Philipp

    PS: Maybe unrelated, but when I first-difference the variables and ignore the problem with the missing date in 2020w53, I lose both the first observation and the last one. Is this related to the 53-week problem? I would expect to lose only one observation from first differencing.

  • #2
    I found this really hard to follow despite having wrestled with weekly dates from time to time over several years.

    There is no data example here. We never see what date_str looks like and dropping it at an early stage prevents you from checking that your manipulations were correct.

    You calculate an ISO week variable but then ignore it, preferring to push a daily date through egen, group(). That could help but isn't essential.

    I have some certainties for you.

    1. You'll never get to see 2020w53 displayed as a weekly format as according to Stata's rules it never happened. So format anything %tw is never going to help you reliably, as some formats will be correct and some will not. Weekly display formats are only fully consistent with Stata's idea of weeks and can't be subverted to match any other criteria.

    2. If you're declaring daily dates to tsset (not tesset) that are standing in for weekly dates then you need to specify delta(7). In my experience using daily dates to stand in for weekly dates and using a daily date display format is by far the simplest way to handle weeks and matches all statistical and most non-statistical desiderata.
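
    As a quick check of point 1, here is a minimal sketch using only built-in date functions. Under Stata's own rule, week 1 starts on 1 January and week 52 absorbs whatever days remain, so the last two Wednesdays of 2020 both land in 2020w52.

    ```stata
    * Stata's week() runs from 1 to 52; it never returns 53
    display week(mdy(12,30,2020))
    * displays 52

    * Two Wednesdays a week apart map to the same Stata week
    display %tw wofd(mdy(12,23,2020))
    display %tw wofd(mdy(12,30,2020))
    * both display 2020w52
    ```

    So a variable formatted %tw has no distinct value that could stand for 2020w53.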

    The rest depends on what defines a week in your terms and how it shows up in your data. There were 53 Wednesdays and 53 Thursdays in 2020, so the idea of 53 weeks in 2020 marches with the idea that either Wednesday or Thursday is the key day of the week.
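
    The count of weekdays in 2020 is easy to confirm. This is a small sketch on an invented dataset of all 366 daily dates in 2020; the variable names are just for illustration.

    ```stata
    clear
    set obs 366
    gen date = mdy(1,1,2020) + _n - 1
    format date %td

    * dow() returns 0 for Sunday through 6 for Saturday
    gen dow = dow(date)
    tabulate dow

    * Wednesdays (3) and Thursdays (4) each occur 53 times; other days 52 times
    count if dow == 3
    count if dow == 4
    ```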

    If it's important to you to know what week number a week was, then, as said, weekly date formats can't do that reliably. You would need to create value labels to show the desired format. Or you could work with ISO weeks consistently, with the consequence that weeks often wrap around at year end.
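
    To see the wrap-around, here is a sketch that applies the same arithmetic as the formula in #1 to ten invented daily dates around New Year 2021. The variable names thursday, isoyear and isoweek are illustrative. It relies on 1 January 1900 being a Monday and on the ISO rule that a week belongs to the year containing its Thursday.

    ```stata
    clear
    set obs 10
    gen date = mdy(12,26,2020) + _n - 1
    format date %td

    * Thursday of the Monday-to-Sunday week containing each date
    gen thursday = 7 * floor((date - mdy(1,1,1900)) / 7) + mdy(1,1,1900) + 3
    format thursday %td

    * ISO year and week are read off that Thursday
    gen isoyear = year(thursday)
    gen isoweek = ceil(doy(thursday) / 7)

    list date isoyear isoweek, sep(0)
    ```

    Here 1 to 3 January 2021 come out as week 53 of ISO year 2020, and week 1 of 2021 starts on Monday 4 January, so the ISO year needs to be carried alongside the ISO week.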

    I guess Nick is me, but I can't follow how the code you attribute to me (from where?) follows from any of my posts or papers, as I've consistently emphasised, as above, that weekly display formats should never be used unless you happen to know that weeks as defined in your data source match Stata's rules, which I have never yet observed in practice.



    • #3
      Here is some technique. I note that 1 January 2020 was a Wednesday, so a suitable sandbox is an invented dataset with 53 Wednesdays in 2020 and the first 7 Wednesdays in 2021. It's not possible for users like you and me to define their own display formats, but users can define their own value labels and arrange to show them when wished.

      labmask is from the Stata Journal. It is in essence a wrapper for a loop over distinct values.

      Code:
      clear 
      set obs 60 
      gen date = mdy(1,1,2020) + 7 * (_n - 1) 
      format date %td 
      gen year = year(date)
      bysort year (date) : gen weekno = _n 
      
      gen label = strofreal(year) + "w" + strofreal(weekno)
      clonevar wdate = date 
      labmask wdate, values(label)
      
      l if inrange(_n, 1, 10) | inrange(_n, _N - 9, _N)
      
          +-----------------------------------------------+
           |      date   year   weekno     label     wdate |
           |-----------------------------------------------|
        1. | 01jan2020   2020        1    2020w1    2020w1 |
        2. | 08jan2020   2020        2    2020w2    2020w2 |
        3. | 15jan2020   2020        3    2020w3    2020w3 |
        4. | 22jan2020   2020        4    2020w4    2020w4 |
        5. | 29jan2020   2020        5    2020w5    2020w5 |
           |-----------------------------------------------|
        6. | 05feb2020   2020        6    2020w6    2020w6 |
        7. | 12feb2020   2020        7    2020w7    2020w7 |
        8. | 19feb2020   2020        8    2020w8    2020w8 |
        9. | 26feb2020   2020        9    2020w9    2020w9 |
       10. | 04mar2020   2020       10   2020w10   2020w10 |
           |-----------------------------------------------|
       51. | 16dec2020   2020       51   2020w51   2020w51 |
       52. | 23dec2020   2020       52   2020w52   2020w52 |
       53. | 30dec2020   2020       53   2020w53   2020w53 |
       54. | 06jan2021   2021        1    2021w1    2021w1 |
       55. | 13jan2021   2021        2    2021w2    2021w2 |
           |-----------------------------------------------|
       56. | 20jan2021   2021        3    2021w3    2021w3 |
       57. | 27jan2021   2021        4    2021w4    2021w4 |
       58. | 03feb2021   2021        5    2021w5    2021w5 |
       59. | 10feb2021   2021        6    2021w6    2021w6 |
       60. | 17feb2021   2021        7    2021w7    2021w7 |
           +-----------------------------------------------+
      Your real dataset may well be more complicated than this, and as you say includes later dates, but you can just tweak the sandbox code to cover your entire period and merge with your main dataset. If you want something different, then advice would surely follow your making clear what that is.
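
      If labmask is not installed, the loop it wraps can be written out directly. This is a rough sketch on the same invented sandbox, not labmask's actual code; it defines one value label per observation, which works here because each date occurs once.

      ```stata
      clear
      set obs 60
      gen date = mdy(1,1,2020) + 7 * (_n - 1)
      format date %td
      gen year = year(date)
      bysort year (date) : gen weekno = _n
      clonevar wdate = date

      * Attach a label such as "2020w53" to each daily date value
      forvalues i = 1/`=_N' {
          label define wdate `=wdate[`i']' "`=year[`i']'w`=weekno[`i']'", modify
      }
      label values wdate wdate

      list date wdate if inrange(_n, 51, 55)
      ```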

      As said, something like

      Code:
      tsset wdate, delta(7) 
      is the key to later modelling.
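
      On the first differences asked about in #1, here is a sketch with an invented series y on the same weekly grid. With delta(7), D. compares each week with the week 7 days earlier, so only the first observation is lost. The "not sorted" message is probably a consequence of sorting on id after tsset; as far as I know, typing tsset with no arguments sorts the data back on the time variable.

      ```stata
      clear
      set obs 60
      gen date = mdy(1,1,2020) + 7 * (_n - 1)
      format date %td
      set seed 2020
      gen y = rnormal()

      * Daily dates standing in for weeks: one period is 7 days
      tsset date, delta(7)
      gen dy = D.y

      * Only the first observation has no predecessor
      count if missing(dy)

      * If the sort order has been changed since, restore it before using D.
      tsset
      ```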
