intraday data

syed.basher

Join Date: Jun 2014

Posts: 20
#1

05 May 2016, 00:29

Dear Stata users,

I am using Stata 14.1. I have intraday data at 15-minute intervals. It looks like the following:
2014-01-01 00:00:00+01:00
2014-01-01 00:15:00+01:00
2014-01-01 00:30:00+01:00
2014-01-01 00:45:00+01:00
....

From this previous Statalist post:
http://www.stata.com/statalist/archi.../msg00042.html

I did the following:

Code:

gen str date = substr(time, 1, 10)
assert substr(time,11,1)==" "
gen hour = real(substr(time,12,2))
assert substr(time,14,1)==":"
gen min = real(substr(time,15,2))
assert substr(time,17,1)==":"
gen sec = real(substr(time,18,2))
gen edate = date(date, "ymd")
gen double secs = edate*24*60*60 + hour*60*60 + min*60 + sec

as well as this alternative code:

Code:

split time, p(" " :) destring
gen edate = date(time1, "ymd")
gen double secs = edate*24*60*60 + time2*60*60 + time3*60 + time5

Sadly, I am getting missing values using both approaches. Any help in solving this issue is highly appreciated. Thank you.

Regards,
Syed Basher

Nick Cox

Join Date: Mar 2014
Posts: 35573

05 May 2016, 03:01

This is reinventing stuff long since provided in Stata. I like split; indeed I wrote it; but you don't need it here. First, trailing strings like +01:00 in 2014-01-01 00:00:00+01:00 need a decision: do you want to ignore them, or what? I am going to ignore them. So I just need to read in my strings and apply what I learned from reading

Code:

help dates

Here are the results:

Code:

clear 
input str42 whatever 
"2014-01-01 00:00:00+01:00"
"2014-01-01 00:15:00+01:00"
"2014-01-01 00:30:00+01:00"
"2014-01-01 00:45:00+01:00" 
end 
gen double datetime = clock(substr(whatever, 1, strpos(whatever, "+")-1), "YMD hms") 
format datetime %tc 

list 

     +------------------------------------------------+
     |                  whatever             datetime |
     |------------------------------------------------|
  1. | 2014-01-01 00:00:00+01:00   01jan2014 00:00:00 |
  2. | 2014-01-01 00:15:00+01:00   01jan2014 00:15:00 |
  3. | 2014-01-01 00:30:00+01:00   01jan2014 00:30:00 |
  4. | 2014-01-01 00:45:00+01:00   01jan2014 00:45:00 |
     +------------------------------------------------+

A problem with your code is illustrated here: "ymd" is not a valid mask. The date functions expect the uppercase "YMD", and the lowercase form returns missing.

Code:

. di daily("2014-01-01", "ymd")
.

. di daily("2014-01-01", "YMD")
19724

. di %td daily("2014-01-01", "YMD")
01jan2014
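
A side note on the strpos() approach used above: it relies on every string containing a "+". The following is a minimal sketch, assuming the date-time part always occupies the first 19 characters (an assumption about the data, not something stated in this thread), so that a negative offset such as -05:00 is dropped as well. The second input line is made up for illustration. The assert lines check that clock values count milliseconds since 01jan1960, which is why double storage is needed.

```stata
clear
input str42 whatever
"2014-01-01 00:00:00+01:00"
"2014-01-01 00:15:00-05:00"
end

* Assumption: the date-time part is always the first 19 characters,
* so the offset is ignored whether it starts with + or -
gen double datetime = clock(substr(whatever, 1, 19), "YMD hms")
format datetime %tc

* clock values are milliseconds since 01jan1960 00:00:00
assert datetime[1] == daily("2014-01-01", "YMD") * 24 * 60 * 60 * 1000
assert datetime[2] - datetime[1] == 15 * 60 * 1000
assert !missing(datetime)

list
```

As in the code above, the offsets are simply ignored here; whether that is appropriate depends on the data.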


syed.basher

Join Date: Jun 2014

Posts: 20
#3

05 May 2016, 04:12

Thank you very much, Nick. Now datetime is a numeric variable. I want to tsset datetime, including the "hms" part, so that Stata understands that my data are at 15-minute intervals. I can't figure this out! Also, how can I generate four time dummy variables, one for each 15-minute mark (00, 15, 30, 45)?

Nick Cox

Join Date: Mar 2014
Posts: 35573

05 May 2016, 04:21

I continue the toy example in #2.

Precisely your case is documented in

Code:

help tsset

namely

Code:

tsset datetime, delta(15 minutes)

together with any panel identifier.
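
As a minimal sketch of what this looks like with a panel identifier, assume a hypothetical variable market that distinguishes the series. The variable names and all values below are made up for illustration and are not from the poster's data.

```stata
clear
input byte market str19 stamp double price
1 "2014-01-01 00:00:00" 30.5
1 "2014-01-01 00:15:00" 31.0
1 "2014-01-01 00:30:00" 29.5
2 "2014-01-01 00:00:00" 40.0
2 "2014-01-01 00:15:00" 41.5
2 "2014-01-01 00:30:00" 39.5
end

gen double datetime = clock(stamp, "YMD hms")
format datetime %tc

* market is a hypothetical panel identifier
tsset market datetime, delta(15 minutes)

* with delta() set, L. refers to the value 15 minutes earlier
* within the same panel
gen double dprice = price - L.price

list market datetime price dprice, sepby(market)
```

The first observation of each panel has no predecessor, so dprice is missing there.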

The following shows some technique:

Code:

. gen minutes = mod(datetime, 60*60*1000)

. tab minutes

    minutes |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |          1       25.00       25.00
     900000 |          1       25.00       50.00
    1800000 |          1       25.00       75.00
    2700000 |          1       25.00      100.00
------------+-----------------------------------
      Total |          4      100.00

. replace minutes = mod(datetime, 60*60*1000)/900000
(3 real changes made)

. tab minutes

    minutes |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |          1       25.00       25.00
          1 |          1       25.00       50.00
          2 |          1       25.00       75.00
          3 |          1       25.00      100.00
------------+-----------------------------------
      Total |          4      100.00

. tab minutes, gen(minutes)

    minutes |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |          1       25.00       25.00
          1 |          1       25.00       50.00
          2 |          1       25.00       75.00
          3 |          1       25.00      100.00
------------+-----------------------------------
      Total |          4      100.00

. d minutes?

              storage   display    value
variable name   type    format     label      variable label
--------------------------------------------------------------------------------------------
minutes1        byte    %8.0g                 minutes== 0.0000
minutes2        byte    %8.0g                 minutes== 1.0000
minutes3        byte    %8.0g                 minutes== 2.0000
minutes4        byte    %8.0g                 minutes== 3.0000
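
An alternative sketch, not part of the session above: the datetime function mm() returns the minute of the hour directly, which avoids the millisecond arithmetic. The variable names minute and q are made up for illustration.

```stata
clear
input str19 stamp
"2014-01-01 00:00:00"
"2014-01-01 00:15:00"
"2014-01-01 00:30:00"
"2014-01-01 00:45:00"
"2014-01-01 01:00:00"
end

gen double datetime = clock(stamp, "YMD hms")
format datetime %tc

* mm() extracts the minute of the hour (0 to 59) from a %tc value
gen byte minute = mm(datetime)

* one indicator per distinct value: q1 (00), q2 (15), q3 (30), q4 (45)
tab minute, gen(q)

assert inlist(minute, 0, 15, 30, 45)
assert q1 == (minute == 0)
assert q4 == (minute == 45)
```

In estimation commands that accept factor variables, i.minute would give the same indicators without creating new variables.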


syed.basher

Join Date: Jun 2014

Posts: 20
#5

05 May 2016, 04:35

The code for generating the minute dummies works perfectly, thank you again. But when I run

Code:

tsset datetime, delta(15 minutes)

I get the following error: repeated time values in sample.
What panel identifier am I missing?

Nick Cox

Join Date: Mar 2014

Posts: 35573
#6

05 May 2016, 04:38

Whatever defines distinct blocks of observations other than time. Stocks??? You haven't told us, but you should know.

syed.basher

Join Date: Jun 2014

Posts: 20
#7

05 May 2016, 05:03

They are in fact electricity price data. I generated an identifier using

Code:

bysort datetime: g id = _n

and it seems to be working now. By the way, my data are a time series. Thank you so much, Nick, for your valuable help.

Nick Cox

Join Date: Mar 2014

Posts: 35573
#8

05 May 2016, 05:28

That's legal. I am not clear that that is guaranteed to be meaningful. What differentiates different prices at the same time?

syed.basher

Join Date: Jun 2014

Posts: 20
#9

05 May 2016, 07:52

The other variables I have are spot price and load. Price changes every 15 minutes, while load changes every hour.

Nick Cox

Join Date: Mar 2014
Posts: 35573

#10

05 May 2016, 08:06

I don't think that answers my question. Consider these data:

Code:

. clear 

. set seed 2803 

. input time price 

          time      price
  1. 1   12
  2. 1   23
  3. 2   34
  4. 2   45
  5. 3   56
  6. 3   67 
  7. 4   78
  8. 4   89
  9. 5   90
 10. 5    1
 11. 6   12
 12. 6   23 
 13. end

. bysort time : gen id1 = _n 

. gen foo = runiform()

. sort foo 

. bysort time : gen id2 = _n 

. list, sepby(time)  

     +-------------------------------------+
     | time   price   id1        foo   id2 |
     |-------------------------------------|
  1. |    1      12     1   .9243789     1 |
  2. |    1      23     2   .3326341     2 |
     |-------------------------------------|
  3. |    2      45     2   .1040797     1 |
  4. |    2      34     1   .7739685     2 |
     |-------------------------------------|
  5. |    3      67     2   .0200225     1 |
  6. |    3      56     1   .3383934     2 |
     |-------------------------------------|
  7. |    4      78     1   .1795591     1 |
  8. |    4      89     2   .6264514     2 |
     |-------------------------------------|
  9. |    5       1     2   .3870576     1 |
 10. |    5      90     1   .3980427     2 |
     |-------------------------------------|
 11. |    6      12     1   .7935746     1 |
 12. |    6      23     2   .6305373     2 |
     +-------------------------------------+

. assert id1 == id2 
6 contradictions in 12 observations
assertion is false
r(9);

The panel identifiers are not even reproducible under similar conditions. They are thus arbitrary, indeed meaningless.


syed.basher

Join Date: Jun 2014

Posts: 20
#11

05 May 2016, 08:20

Damn! I hesitate to ask for your help yet again, but how can I get around this problem? Some estimators such as "newey" will not run unless the data are tsset, and I am having exactly this problem now!

Nick Cox

Join Date: Mar 2014

Posts: 35573
#12

05 May 2016, 08:34

I'd turn it around. What is the rationale for Newey here? I don't know, but others should have better advice. You may need to start a new thread.

Phil Bromiley

Join Date: Apr 2014

Posts: 4348
#13

06 May 2016, 11:19

The original problem seemed to be "I get the following error: repeated time values in sample." Sometimes missing data can result in this error - I suspect missing time for time for more than one observation could generate this error. So, first check that you don't have missing data on your time tsset variable.

If you don't have missing data on time, you should back up and try to see where the duplicate times are. Use the duplicates procedure to find out how many of them there are and where they are. Look at the duplicates to see what is really going on. If you have duplicate observations when logically you should not have them, then you need to do something about them. If they are truly just duplicates, you can delete them.
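
The checks described above can be sketched as follows. This is only an illustration on made-up data (the variable names stamp, price and dup and all values are hypothetical), assuming the time variable is called datetime as earlier in the thread.

```stata
clear
input str19 stamp double price
"2014-01-01 00:00:00" 30.5
"2014-01-01 00:15:00" 31.0
"2014-01-01 00:15:00" 31.0
"2014-01-01 00:30:00" 29.5
"" 28.0
end

gen double datetime = clock(stamp, "YMD hms")
format datetime %tc

* 1. any missing values on the time variable?
count if missing(datetime)

* 2. how many repeated time values are there, and where?
duplicates report datetime
duplicates tag datetime, gen(dup)
list if dup > 0

* 3. drop only observations that are exact copies on every variable
duplicates drop
drop if missing(datetime)

* isid exits with an error unless datetime uniquely identifies observations
isid datetime
tsset datetime, delta(15 minutes)
```

Whether dropping is justified depends on what the listing in step 2 shows. If repeated times carry different prices, something other than time distinguishes the observations, and that is the question raised in #8.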

Phil Bromiley

Join Date: Apr 2014

Posts: 4348
#14

06 May 2016, 11:20

Correction:
"missing time for time" should be "missing values for time"