  • intraday data

    Dear Stata users,

    I am using Stata 14.1. I have intraday data at 15-minute intervals. It looks like the following:
    2014-01-01 00:00:00+01:00
    2014-01-01 00:15:00+01:00
    2014-01-01 00:30:00+01:00
    2014-01-01 00:45:00+01:00
    ....

    From this previous Statalist post:
    http://www.stata.com/statalist/archi.../msg00042.html

    I did the following:
    Code:
    gen str date = substr(time, 1, 10) 
    assert substr(time,11,1)==" "
    gen hour = real(substr(time,12,2))
    assert substr(time,14,1)==":"
    gen min  = real(substr(time,15,2))
    assert substr(time,17,1)==":"
    gen sec  = real(substr(time,18,2))
    
    gen edate = date(date, "ymd") 
    gen double secs = edate*24*60*60 + hour*60*60 + min*60 + sec
    as well as this alternative code:
    Code:
    split time, p(" " :) destring 
    gen edate = date(time1, "ymd")
    gen double secs = edate*24*60*60 + time2*60*60 + time3*60 + time5
    Sadly, I am getting missing values using both approaches. Any help in solving this issue is highly appreciated. Thank you.

    Regards,
    Syed Basher

  • #2
    This is reinventing stuff long since provided in Stata. I like split; indeed I wrote it; but you don't need it here. First, trailing strings like

    Code:
    +01:00
    in

    Code:
    2014-01-01 00:00:00+01:00
    need a decision: do you want to ignore them, or what? I am going to ignore them. So I just need to read in my strings and apply what I learned from reading

    Code:
    help dates
    Here are the results:

    Code:
    clear 
    input str42 whatever 
    "2014-01-01 00:00:00+01:00"
    "2014-01-01 00:15:00+01:00"
    "2014-01-01 00:30:00+01:00"
    "2014-01-01 00:45:00+01:00" 
    end 
    gen double datetime = clock(substr(whatever, 1, strpos(whatever, "+")-1), "YMD hms") 
    format datetime %tc 
    
    list 
    
         +------------------------------------------------+
         |                  whatever             datetime |
         |------------------------------------------------|
      1. | 2014-01-01 00:00:00+01:00   01jan2014 00:00:00 |
      2. | 2014-01-01 00:15:00+01:00   01jan2014 00:15:00 |
      3. | 2014-01-01 00:30:00+01:00   01jan2014 00:30:00 |
      4. | 2014-01-01 00:45:00+01:00   01jan2014 00:45:00 |
         +------------------------------------------------+
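    A problem with your code is illustrated here: the mask "ymd" is incorrect. The date elements of the mask must be uppercase ("YMD"); with the lowercase mask the function returns missing.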

    Code:
    . di daily("2014-01-01", "ymd")
    .
    
    . di daily("2014-01-01", "YMD")
    19724
    
    . di %td daily("2014-01-01", "YMD")
    01jan2014
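
    As a minimal sketch of how the built-in datetime functions can stand in for the manual substr() arithmetic in #1: the variable names edate, hour, minute and second are illustrative, the second input row is made up for the example, and the "+01:00" offset is simply ignored, as above.

    ```stata
    clear
    input str42 whatever
    "2014-01-01 00:15:00+01:00"
    "2014-01-01 13:45:00+01:00"
    end
    * strip the trailing "+01:00" (ignored here) and convert in one step
    gen double datetime = clock(substr(whatever, 1, strpos(whatever, "+") - 1), "YMD hms")
    format datetime %tc
    * dofc(), hh(), mm() and ss() extract the components from a %tc value
    gen edate = dofc(datetime)
    format edate %td
    gen hour = hh(datetime)
    gen minute = mm(datetime)
    gen second = ss(datetime)
    list
    ```

    Note that a %tc value counts milliseconds, not seconds, since 01jan1960 00:00:00, which is why it must be stored as a double.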



    • #3
      Thank you very much, Nick. Now datetime is a numeric variable. I want to tsset datetime, including the "hms" part, so that Stata understands that my data are at 15-minute intervals, but I can't figure out how. Also, how can I generate four time dummy variables, one for each 15-minute mark (00, 15, 30, 45)?



      • #4
        I continue the toy example in #2.

        Precisely your case is documented in

        Code:
        help tsset
        namely

        Code:
        tsset datetime, delta(15 minutes)
        together with any panel identifier.

        The following shows some technique:

        Code:
        . gen minutes = mod(datetime, 60*60*1000)
        
        . tab minutes
        
            minutes |      Freq.     Percent        Cum.
        ------------+-----------------------------------
                  0 |          1       25.00       25.00
             900000 |          1       25.00       50.00
            1800000 |          1       25.00       75.00
            2700000 |          1       25.00      100.00
        ------------+-----------------------------------
              Total |          4      100.00
        
        . replace minutes = mod(datetime, 60*60*1000)/900000
        (3 real changes made)
        
        . tab minutes
        
            minutes |      Freq.     Percent        Cum.
        ------------+-----------------------------------
                  0 |          1       25.00       25.00
                  1 |          1       25.00       50.00
                  2 |          1       25.00       75.00
                  3 |          1       25.00      100.00
        ------------+-----------------------------------
              Total |          4      100.00
        
        . tab minutes, gen(minutes)
        
            minutes |      Freq.     Percent        Cum.
        ------------+-----------------------------------
                  0 |          1       25.00       25.00
                  1 |          1       25.00       50.00
                  2 |          1       25.00       75.00
                  3 |          1       25.00      100.00
        ------------+-----------------------------------
              Total |          4      100.00
        
        . d minutes?
        
                      storage   display    value
        variable name   type    format     label      variable label
        --------------------------------------------------------------------------------------------
        minutes1        byte    %8.0g                 minutes== 0.0000
        minutes2        byte    %8.0g                 minutes== 1.0000
        minutes3        byte    %8.0g                 minutes== 2.0000
        minutes4        byte    %8.0g                 minutes== 3.0000
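
        A possible alternative for the dummies, sketched on the same toy data: mm() returns the minute of the hour directly, so the millisecond arithmetic can be avoided. The variable name qhour is illustrative only, and floor() is used on the assumption that some observations might not fall exactly on a quarter-hour mark.

        ```stata
        clear
        input str42 whatever
        "2014-01-01 00:00:00+01:00"
        "2014-01-01 00:15:00+01:00"
        "2014-01-01 00:30:00+01:00"
        "2014-01-01 00:45:00+01:00"
        end
        gen double datetime = clock(substr(whatever, 1, strpos(whatever, "+") - 1), "YMD hms")
        format datetime %tc
        * declare the 15-minute spacing; the variable must carry a %tc format
        tsset datetime, delta(15 minutes)
        * mm() gives 0-59, so floor(mm()/15) is 0, 1, 2 or 3 within each hour
        gen byte qhour = floor(mm(datetime) / 15)
        * one indicator per quarter-hour: qhour1, qhour2, qhour3, qhour4
        tab qhour, gen(qhour)
        ```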



        • #5
          The code for generating the minute dummies works perfectly, thank you again. But when I run
          Code:
           
           tsset datetime, delta(15 minutes)
          I get the following error: repeated time values in sample.
          What panel identifier am I missing?



          • #6
            Whatever defines distinct blocks of observations other than time. Stocks??? You haven't told us, but you should know.



            • #7
              They are in fact electricity price data. I generated an identifier using
              Code:
              bysort datetime: g id = _n
              and it seems to be working now. By the way, my data are a time series. Thank you so much, Nick, for your valuable help.



              • #8
                That's legal. I am not clear that that is guaranteed to be meaningful. What differentiates different prices at the same time?



                • #9
                  The other variables I have are spot price and load. Price changes every 15 minutes, while load changes every hour.



                  • #10
                    I don't think that answers my question. Consider these data:

                    Code:
                    . clear 
                    
                    . set seed 2803 
                    
                    . input time price 
                    
                              time      price
                      1. 1   12
                      2. 1   23
                      3. 2   34
                      4. 2   45
                      5. 3   56
                      6. 3   67 
                      7. 4   78
                      8. 4   89
                      9. 5   90
                     10. 5    1
                     11. 6   12
                     12. 6   23 
                     13. end
                    
                    . bysort time : gen id1 = _n 
                    
                    . gen foo = runiform()
                    
                    . sort foo 
                    
                    . bysort time : gen id2 = _n 
                    
                    . list, sepby(time)  
                    
                         +-------------------------------------+
                         | time   price   id1        foo   id2 |
                         |-------------------------------------|
                      1. |    1      12     1   .9243789     1 |
                      2. |    1      23     2   .3326341     2 |
                         |-------------------------------------|
                      3. |    2      45     2   .1040797     1 |
                      4. |    2      34     1   .7739685     2 |
                         |-------------------------------------|
                      5. |    3      67     2   .0200225     1 |
                      6. |    3      56     1   .3383934     2 |
                         |-------------------------------------|
                      7. |    4      78     1   .1795591     1 |
                      8. |    4      89     2   .6264514     2 |
                         |-------------------------------------|
                      9. |    5       1     2   .3870576     1 |
                     10. |    5      90     1   .3980427     2 |
                         |-------------------------------------|
                     11. |    6      12     1   .7935746     1 |
                     12. |    6      23     2   .6305373     2 |
                         +-------------------------------------+
                    
                    . assert id1 == id2 
                    6 contradictions in 12 observations
                    assertion is false
                    r(9);
                    The panel identifiers are not even reproducible under similar conditions. They are thus arbitrary, indeed meaningless.
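
                    For contrast, a minimal sketch of an identifier that is at least reproducible: sorting within time on a further variable, here price, fixes the order whenever that variable has no ties within a time. Using price as the sort key is purely an assumption for illustration; it makes the identifier reproducible, but it does not by itself make it meaningful.

                    ```stata
                    clear
                    set seed 2803
                    input time price
                    1 12
                    1 23
                    2 34
                    2 45
                    3 56
                    3 67
                    4 78
                    4 89
                    5 90
                    5 1
                    6 12
                    6 23
                    end
                    * sort within time on price, so _n no longer depends on the incoming order
                    bysort time (price) : gen id1 = _n
                    * shuffle the data, then rebuild the identifier the same way
                    gen foo = runiform()
                    sort foo
                    bysort time (price) : gen id2 = _n
                    * no contradictions: both identifiers agree after the shuffle
                    assert id1 == id2
                    ```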



                    • #11
                      Damn! I hesitate to keep asking for your help, but how can I get around this problem? Some estimators such as "newey" will not run without sorting, and I am having exactly this problem now!



                      • #12
                        I'd turn it around. What is the rationale for Newey here? I don't know, but others should have better advice. You may need to start a new thread.



                        • #13
                          The original problem seemed to be "I get the following error: repeated time values in sample." Sometimes missing data can result in this error - I suspect missing time for time for more than one observation could generate this error. So, first check that you don't have missing data on your time tsset variable.

                          If you don't have missing data on time, you should back up and try to see where the duplicate times are. Use the duplicates procedure to find out how many of them there are and where they are. Look at the duplicates to see what is really going on. If you have duplicate observations when logically you should not have them, then you need to do something about them. If they are truly just duplicates, you can delete them.
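
                          The checks above might be sketched as follows. The toy data, the variable names time and price, and the tag variable dup are all made up for illustration.

                          ```stata
                          clear
                          input double time price
                          1 12
                          2 34
                          2 34
                          3 56
                          3 57
                          . 60
                          end
                          * 1. any missing values on the time variable?
                          count if missing(time)
                          * 2. how many repeated times are there, and where?
                          duplicates report time
                          duplicates tag time, generate(dup)
                          list if dup > 0, sepby(time)
                          * 3. drop only observations that are identical on all variables
                          duplicates drop
                          ```

                          In this toy example one of the two identical observations at time 2 is dropped, while the two observations at time 3 carry different prices and still need a decision before tsset will accept the data.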



                          • #14
                            Correction:
                            "missing time for time" should be "missing values for time"
