Pick particular observations from a certain interval (without a loop)

Rolf Miller

Join Date: Feb 2019
Posts: 11

Pick particular observations from a certain interval (without a loop)

13 Feb 2019, 04:17

Hi there,

I have a dataset including time-series data with a binary variable (X) which is 0 most of the time and 1 else. I need to generate a new variable (X_new) that is based on X and picks the first observation that is 1 in an interval of [-10,+10] around that observation (so, all potential other observations being 1 in that interval shall be neglected). Also, I need another variable (T) that assigns the “distance” [-10,+10] to the event (i.e., where X_new is 1). Those intervals in T are required not to overlap (i.e., there need to be at least 20 observations between the 1s in X_new).

Example of the data (the last two columns are needed):

date	X	X_new	T
01.01.2019	0	0	.
02.01.2019	0	0	.
03.01.2019	0	0	.
04.01.2019	0	0	-10
05.01.2019	0	0	-9
06.01.2019	0	0	-8
07.01.2019	0	0	-7
08.01.2019	0	0	-6
09.01.2019	0	0	-5
10.01.2019	0	0	-4
11.01.2019	0	0	-3
12.01.2019	0	0	-2
13.01.2019	0	0	-1
14.01.2019	1	1	0
15.01.2019	0	0	1
16.01.2019	0	0	2
17.01.2019	0	0	3
18.01.2019	1	0	4
19.01.2019	0	0	5
20.01.2019	0	0	6
21.01.2019	1	0	7
22.01.2019	0	0	8
23.01.2019	0	0	9
24.01.2019	0	0	10
25.01.2019	0	0	.
26.01.2019	0	0	-10
27.01.2019	1	0	-9
28.01.2019	0	0	-8
29.01.2019	0	0	-7
30.01.2019	0	0	-6
31.01.2019	0	0	-5
01.02.2019	0	0	-4
02.02.2019	0	0	-3
03.02.2019	0	0	-2
04.02.2019	0	0	-1
05.02.2019	1	1	0
06.02.2019	0	0	1
07.02.2019	0	0	2
08.02.2019	0	0	3
09.02.2019	0	0	4
10.02.2019	0	0	5
11.02.2019	0	0	6
12.02.2019	0	0	7
13.02.2019	0	0	8
14.02.2019	0	0	9
15.02.2019	0	0	10

Anybody has an idea how to solve this with in a fast way?
I managed to do this in a loop, but since I need to run this on several GB of data, this will take weeks. :/

Thanks a lot!

Last edited by Rolf Miller; 13 Feb 2019, 04:19.

Tags: None

Andrew Musau

Join Date: Oct 2014
Posts: 10285

13 Feb 2019, 17:54

In future, please use dataex to present data examples. Your date variable needs to be converted to a Stata date prior to attempting any solution. You could reduce the complexity of the problem by working only with the sample of ones. See if the following helps.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str10 date byte(x x_new t)
"01.01.2019" 0 0   .
"02.01.2019" 0 0   .
"03.01.2019" 0 0   .
"04.01.2019" 0 0 -10
"05.01.2019" 0 0  -9
"06.01.2019" 0 0  -8
"07.01.2019" 0 0  -7
"08.01.2019" 0 0  -6
"09.01.2019" 0 0  -5
"10.01.2019" 0 0  -4
"11.01.2019" 0 0  -3
"12.01.2019" 0 0  -2
"13.01.2019" 0 0  -1
"14.01.2019" 1 1   0
"15.01.2019" 0 0   1
"16.01.2019" 0 0   2
"17.01.2019" 0 0   3
"18.01.2019" 1 0   4
"19.01.2019" 0 0   5
"20.01.2019" 0 0   6
"21.01.2019" 1 0   7
"22.01.2019" 0 0   8
"23.01.2019" 0 0   9
"24.01.2019" 0 0  10
"25.01.2019" 0 0   .
"26.01.2019" 0 0 -10
"27.01.2019" 1 0  -9
"28.01.2019" 0 0  -8
"29.01.2019" 0 0  -7
"30.01.2019" 0 0  -6
"31.01.2019" 0 0  -5
"01.02.2019" 0 0  -4
"02.02.2019" 0 0  -3
"03.02.2019" 0 0  -2
"04.02.2019" 0 0  -1
"05.02.2019" 1 1   0
"06.02.2019" 0 0   1
"07.02.2019" 0 0   2
"08.02.2019" 0 0   3
"09.02.2019" 0 0   4
"10.02.2019" 0 0   5
"11.02.2019" 0 0   6
"12.02.2019" 0 0   7
"13.02.2019" 0 0   8
"14.02.2019" 0 0   9
"15.02.2019" 0 0  10
end


gen date2 = date(date, "DMY")
format date2 %td
sort date2
gen xn= x==1 &_n>10
gen max=date2+20
format max %td
preserve
keep if xn==1
qui sum xn
forval i=1/`r(N)'{
drop if date2< max[`i'] &_n>`i'
}
tempfile tomerge
save `tomerge'
restore
merge 1:1 * using `tomerge'
gen xnew2=_merge==3
drop _merge

Announcement

Pick particular observations from a certain interval (without a loop)

Comment