  • creating new variable with varied proportions of current variables

    Hi All,

    In my data, the three variables w_id, w_date, and value come from one dataset, and I want to use them to create new_value for another dataset that contains n_id and n_date. The problem is that the weeks in the two datasets don't line up, so I want to build the new value variable from proportions of the weeks in the first dataset (w_id, w_date, and value).

    For example, the first two weeks in w_id and w_date run from 18993 to 19000 with a value of 1.23, and from 19000 to 19007 with a value of 1.46. The first week in n_id and n_date runs from 18996 to 19003, which means it takes 4 days from the first week and 3 days from the second week of the w_date data. So I want new_value = (4/7)*1.23 + (3/7)*1.46. These proportions are not always 4/7 and 3/7; they change throughout the data.
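    To make the target concrete, that first overlapping week can be checked with plain arithmetic (shown here as a Stata -display-; nothing from the datasets is needed):
    Code:
    display (4/7)*1.23 + (3/7)*1.46   // about 1.329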


    Here is a sample of my data:
    Code:
    clear
    input int w_id int w_date int n_id int n_date float value
    1  18993  1  18996  1.23
    2  19000  2  19003  1.46
    3  19007  3  19010  1.69
    4  19014  4  19017  1.92
    5  19021  5  19024  2.15
    6  19028  6  19031  2.38
    7  19035  7  19038  2.61
    8  19042  8  19045  2.84
    9  19049  9  19052  3.07
    10  19056  10  19059  3.8
    11  19063  11  19066  3.03
    12  19070  12  19073  3.26
    13  19077  13  19080  3.49
    14  19084  14  19087  3.72
    15  19091  15  19094  3.95
    16  19098  16  19101  4.18
    17  19105  17  19108  4.41
    18  19112  18  19115  4.64
    19  19119  19  19122  4.87
    20  19126  20  19129  5.1
    21  19133  21  19136  4.83
    22  19140  22  19143  5.06
    23  19147  23  19150  5.29
    24  19154  24  19157  5.52
    25  19161  25  19164  5.75
    26  19168  26  19171  5.98
    27  19175  27  19185  6.21
    28  19182  28  19192  6.44
    29  19189  29  19199  6.67
    30  19196  30  19206  6.9
    31  19203  31  19213  7.13
    32  19210  32  19220  7.36
    33  19217  33  19227  7.59
    34  19224  34  19234  7.82
    35  19231  35  19241  8.05
    36  19238  36  19248  8.28
    37  19245  37  19255  8.51
    38  19252  38  19262  8.74
    39  19259  39  19269  8.97
    40  19266  40  19276  9.2
    41  19273  41  19283  9.43
    42  19280  42  19290  9.66
    43  19287  43  19297  9.89
    44  19294  44  19304  10.12
    45  19301  45  19311  10.35
    46  19308  46  19325  10.58
    end
    order value, before (n_id)

    I really appreciate your help.

  • #2
    Congratulations on providing an easy-to-use data example!

    I'm a bit confused about what you are trying to do. In your example, there are two instances where n_date skips a week, which means that your observations fall out of phase in the sense that w_date[_n+1] < n_date. Do w_id and n_id have to match when generating new_value, or should the values from the closest week be used? The following illustrates the issue:

    Code:
    clear
    input int w_id int w_date int n_id int n_date float value
    1  18993  1  18996  1.23
    2  19000  2  19003  1.46
    3  19007  3  19010  1.69
    4  19014  4  19017  1.92
    5  19021  5  19024  2.15
    6  19028  6  19031  2.38
    7  19035  7  19038  2.61
    8  19042  8  19045  2.84
    9  19049  9  19052  3.07
    10  19056  10  19059  3.8
    11  19063  11  19066  3.03
    12  19070  12  19073  3.26
    13  19077  13  19080  3.49
    14  19084  14  19087  3.72
    15  19091  15  19094  3.95
    16  19098  16  19101  4.18
    17  19105  17  19108  4.41
    18  19112  18  19115  4.64
    19  19119  19  19122  4.87
    20  19126  20  19129  5.1
    21  19133  21  19136  4.83
    22  19140  22  19143  5.06
    23  19147  23  19150  5.29
    24  19154  24  19157  5.52
    25  19161  25  19164  5.75
    26  19168  26  19171  5.98
    27  19175  27  19185  6.21
    28  19182  28  19192  6.44
    29  19189  29  19199  6.67
    30  19196  30  19206  6.9
    31  19203  31  19213  7.13
    32  19210  32  19220  7.36
    33  19217  33  19227  7.59
    34  19224  34  19234  7.82
    35  19231  35  19241  8.05
    36  19238  36  19248  8.28
    37  19245  37  19255  8.51
    38  19252  38  19262  8.74
    39  19259  39  19269  8.97
    40  19266  40  19276  9.2
    41  19273  41  19283  9.43
    42  19280  42  19290  9.66
    43  19287  43  19297  9.89
    44  19294  44  19304  10.12
    45  19301  45  19311  10.35
    46  19308  46  19325  10.58
    end
    format %td *_date
    order value, before (n_id)
    
    
    gen w1 = w_date[_n-1] - w_date      // days back to the previous w_date (a constant -7 here)
    gen w2 = n_date[_n-1] - n_date      // days back to the previous n_date (-14 where a week is skipped)
    
    gen dbefore = n_date - w_date       // days from the current w_date to the current n_date
    gen dafter = w_date - n_date[_n-1]  // days from the previous n_date to the current w_date
    gen range = dafter + dbefore        // total length of the current n interval
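    For instance, after running the block above, a quick listing (my addition, just to eyeball the generated variables) shows the 3-day/4-day offset between the two weekly series in the first few weeks:
    Code:
    list w_id w_date n_id n_date dbefore dafter range in 1/5, noobs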

    • #3
      Thanks, Robert, for your kind response.
      Let me try to explain with an example. Take the n_date interval from 27 June 2012 to 11 July 2012: I need to work out which w_date weeks the days in that range belong to. In this case three weeks of the w_date data are involved:

      24 June to 1 July: 4 days with value = 5.98  => (4/7)*5.98
      1 July to 8 July:  7 days with value = 6.21  => (7/7)*6.21
      8 July to 15 July: 3 days with value = 6.44  => (3/7)*6.44
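      Putting numbers on it (this assumes the three contributions are weighted by their share of the full 14-day span between the two n_dates, so that the weights sum to one):
      Code:
      display (4*5.98 + 7*6.21 + 3*6.44)/14   // about 6.19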

      I hope this makes more sense. Thanks for your help.

      • #4
        The following solution ignores the identifiers and calculates a new value based on the daily values around n_date. To get daily values, I expand the original weekly data and carry each value forward until a new week starts. I then merge the daily values with n_date and calculate a new value from a +/- 3 day window around n_date.

        Code:
        clear
        input int w_id int w_date int n_id int n_date float value
        1  18993  1  18996  1.23
        2  19000  2  19003  1.46
        3  19007  3  19010  1.69
        4  19014  4  19017  1.92
        5  19021  5  19024  2.15
        6  19028  6  19031  2.38
        7  19035  7  19038  2.61
        8  19042  8  19045  2.84
        9  19049  9  19052  3.07
        10  19056  10  19059  3.8
        11  19063  11  19066  3.03
        12  19070  12  19073  3.26
        13  19077  13  19080  3.49
        14  19084  14  19087  3.72
        15  19091  15  19094  3.95
        16  19098  16  19101  4.18
        17  19105  17  19108  4.41
        18  19112  18  19115  4.64
        19  19119  19  19122  4.87
        20  19126  20  19129  5.1
        21  19133  21  19136  4.83
        22  19140  22  19143  5.06
        23  19147  23  19150  5.29
        24  19154  24  19157  5.52
        25  19161  25  19164  5.75
        26  19168  26  19171  5.98
        27  19175  27  19185  6.21
        28  19182  28  19192  6.44
        29  19189  29  19199  6.67
        30  19196  30  19206  6.9
        31  19203  31  19213  7.13
        32  19210  32  19220  7.36
        33  19217  33  19227  7.59
        34  19224  34  19234  7.82
        35  19231  35  19241  8.05
        36  19238  36  19248  8.28
        37  19245  37  19255  8.51
        38  19252  38  19262  8.74
        39  19259  39  19269  8.97
        40  19266  40  19276  9.2
        41  19273  41  19283  9.43
        42  19280  42  19290  9.66
        43  19287  43  19297  9.89
        44  19294  44  19304  10.12
        45  19301  45  19311  10.35
        46  19308  46  19325  10.58
        end
        format %td *_date
        order value, before (n_id)
        tempfile main
        save "`main'"
        
        list, clean
        
        * verify assumptions about the data
        isid w_date
        isid n_date
        
        * special case to reach the last n_date
        local lastdate = n_date[_N]
        
        * expand to daily data for these variables
        keep w_id w_date value
        gen days = w_date[_n+1] - w_date
        replace days = `lastdate' - w_date + 1 in l // to reach the last n_date
        expand days
        bysort w_date : gen n_date = w_date + _n - 1
        format %td *_date
        
        * merge back using n_date
        merge m:1 n_date using "`main'", assert(master match) nogen
        
        * calculate new value based on value on the days around n_date
        * -tsegen- is from SSC.
        tsset n_date
        tsegen new_values2 = rowmean(L(0/3).value F(0/3).value) if !mi(n_id)
        
        * note that the same can be done this way except that there are
        * no forward values for the last observation
        gen new_values = (L1.value + L2.value + L3.value + value + ///
                        F1.value + F2.value + F3.value) / 7 if !mi(n_id)
                        
        * reduce to the original data
        keep if !mi(n_id)
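        To see what the window average produces, one could add a short listing after the code above (a check I added, not part of the original solution):
        Code:
        list n_id n_date value new_values new_values2 in 1/5, noobs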

        • #5
          Hi Robert,

          Thanks so much. It's way above my head. I ran it line by line to understand what happens. One thing I changed is that instead of
          Code:
          tempfile main
          save "`main'"
          I used:
          Code:
          save main, replace
          Honestly, I've never worked with macros, so I don't know whether this change hurts the code.

          Then I used this part:
          Code:
          list, clean
          
          * verify assumptions about the data
          isid w_date
          isid n_date
          
          * special case to reach the last n_date
          local lastdate = n_date[_N]
          
          * expand to daily data for these variables
          keep w_id w_date value
          gen days = w_date[_n+1] - w_date
          replace days = `lastdate' - w_date + 1 in l
          expand days
          bysort w_date : gen n_date = w_date + _n - 1
          format %td *_date
          
          * merge back using n_date
          merge m:1 n_date using main, assert(master match) nogen
          And then this one:

          Code:
          tsset n_date
          tsegen new_values2 = rowmean(L(0/3).value F(0/3).value) if !mi(n_id)
          but I received the following errors:

          tsset n_date
          repeated time values in sample
          r(451);


          tsegen new_values2 = rowmean(L(0/3).value F(0/3).value) if !mi(n_id)
          unrecognized command: tsegen
          r(199);


          I don't know why this happens. Is it because I saved main to my local folder?
          Thanks so much for your help.

          • #6
            No problem, I can break it down a bit. There's no harm in switching to

            Code:
            save main, replace
            Are you sure that the following part does not produce an error? If there are multiple observations per w_date or n_date, this approach will not work.

            Code:
            * verify assumptions about the data
            isid w_date
            isid n_date
            If there's no error, then the rest should work. The first step is to convert w_id, w_date, and value into daily observations. The only difficulty is that you need to extend these daily observations so that there's a value on the last n_date. So

            Code:
            * special case to reach the last n_date
            local lastdate = n_date[_N]
            does just that: it places the last value of n_date into the local macro lastdate. Note that, going only by the example you posted, I assume that the last n_date is later than the last w_date. If that's not the case, the code would have to be adjusted.
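            If macros are unfamiliar, here is a tiny standalone illustration of what that line stores (19325 is simply the last n_date from your example, typed in by hand):
            Code:
            local lastdate = 19325          // what -local lastdate = n_date[_N]- stores here
            display `lastdate'              // 19325
            display %td `lastdate'          // the same number displayed as a date, 28nov2012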

            Then the following code reduces the data to just the w_id, w_date, and value variables and creates daily observations. Rather than trying to explain each part, just put a list statement afterwards and see what is created. Essentially, the value on a given w_date is carried forward to each new day until a new value is observed.

            Code:
            * expand to daily data for these variables
            keep w_id w_date value
            gen days = w_date[_n+1] - w_date
            replace days = `lastdate' - w_date + 1 in l // to reach the last n_date
            expand days
            bysort w_date : gen n_date = w_date + _n - 1
            format %td *_date
            list, sepby(w_id) noobs
            At this point, you should have one observation per day. You can check that there are no duplicates using

            Code:
            isid n_date
            This daily data can now be merged back with the original data. Because the daily data covers all days, you are sure to match all the original n_date values. You can change the merge to a 1:1 merge since you have unique days in memory and you know there are no duplicate n_date values in the original data.

            Code:
            * merge back using n_date
            merge 1:1 n_date using "`main'", assert(master match) nogen
            In order to use tsegen, you need to install it from SSC. You only need to do this once; to do so, type in the command window

            Code:
            ssc install tsegen
            It's hard to explain how you got multiple values of n_date; based on the data example, you should not have any. That's what the following message is indicating:

            Code:
            tsset n_date
            repeated time values in sample
            r(451);
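            Should that message come back, the built-in -duplicates- command is a quick way to track down the repeated dates (nothing extra to install):
            Code:
            duplicates report n_date         // how many n_date values occur more than once
            duplicates tag n_date, gen(dup)  // flag the offending observations
            list n_date if dup, noobs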
            I suggest that you run this updated example. Note that since I used a local macro, it's best to run the whole thing from a do-file. Just save everything in a file called "test.do" in the current directory and run it using

            Code:
            do test.do
            The updated code follows

            Code:
            clear
            input int w_id int w_date int n_id int n_date float value
            1  18993  1  18996  1.23
            2  19000  2  19003  1.46
            3  19007  3  19010  1.69
            4  19014  4  19017  1.92
            5  19021  5  19024  2.15
            6  19028  6  19031  2.38
            7  19035  7  19038  2.61
            8  19042  8  19045  2.84
            9  19049  9  19052  3.07
            10  19056  10  19059  3.8
            11  19063  11  19066  3.03
            12  19070  12  19073  3.26
            13  19077  13  19080  3.49
            14  19084  14  19087  3.72
            15  19091  15  19094  3.95
            16  19098  16  19101  4.18
            17  19105  17  19108  4.41
            18  19112  18  19115  4.64
            19  19119  19  19122  4.87
            20  19126  20  19129  5.1
            21  19133  21  19136  4.83
            22  19140  22  19143  5.06
            23  19147  23  19150  5.29
            24  19154  24  19157  5.52
            25  19161  25  19164  5.75
            26  19168  26  19171  5.98
            27  19175  27  19185  6.21
            28  19182  28  19192  6.44
            29  19189  29  19199  6.67
            30  19196  30  19206  6.9
            31  19203  31  19213  7.13
            32  19210  32  19220  7.36
            33  19217  33  19227  7.59
            34  19224  34  19234  7.82
            35  19231  35  19241  8.05
            36  19238  36  19248  8.28
            37  19245  37  19255  8.51
            38  19252  38  19262  8.74
            39  19259  39  19269  8.97
            40  19266  40  19276  9.2
            41  19273  41  19283  9.43
            42  19280  42  19290  9.66
            43  19287  43  19297  9.89
            44  19294  44  19304  10.12
            45  19301  45  19311  10.35
            46  19308  46  19325  10.58
            end
            format %td *_date
            order value, before (n_id)
            save main, replace
            list, clean
            
            * verify assumptions about the data
            isid w_date
            isid n_date
            
            * special case to reach the last n_date
            local lastdate = n_date[_N]
            
            * expand to daily data for these variables
            keep w_id w_date value
            gen days = w_date[_n+1] - w_date
            replace days = `lastdate' - w_date + 1 in l // to reach the last n_date
            expand days
            bysort w_date : gen n_date = w_date + _n - 1
            format %td *_date
            
            list, sepby(w_id) noobs
            
            * merge back using n_date
            merge 1:1 n_date using "main", assert(master match)
            tab _merge
            
            * calculate new value based on value on the days around n_date
            * -tsegen- is from SSC.
            tsset n_date
            tsegen new_values2 = rowmean(L(0/3).value F(0/3).value) if !mi(n_id)
            
            * note that the same can be done this way except that there are
            * no forward values for the last observation
            gen new_values = (L1.value + L2.value + L3.value + value + ///
                            F1.value + F2.value + F3.value) / 7 if !mi(n_id)
                            
            * reduce to the original data
            keep if !mi(n_id)

            • #7
              Hi Robert,

              Thanks for the explanation. It's clearer now. The problem was the installation. Now I don't receive any errors, but the new_values are identical to the old values. For example, value and new_value in the first row are both 1.23, so the proportions were not applied to the values. Did you get the correct values? I am copy-pasting exactly what you have here, so our data are the same.

              Do you know where I am making a mistake?
              Thanks a lot.

              • #8
                My bad, I did not look at the new values and didn't realize that I was not calculating them the way you wanted. The setup can remain the same; I think all you need is to average the daily values over periods running from one n_date to the next. Something like:

                Code:
                clear
                input int w_id int w_date int n_id int n_date float value
                1  18993  1  18996  1.23
                2  19000  2  19003  1.46
                3  19007  3  19010  1.69
                4  19014  4  19017  1.92
                5  19021  5  19024  2.15
                6  19028  6  19031  2.38
                7  19035  7  19038  2.61
                8  19042  8  19045  2.84
                9  19049  9  19052  3.07
                10  19056  10  19059  3.8
                11  19063  11  19066  3.03
                12  19070  12  19073  3.26
                13  19077  13  19080  3.49
                14  19084  14  19087  3.72
                15  19091  15  19094  3.95
                16  19098  16  19101  4.18
                17  19105  17  19108  4.41
                18  19112  18  19115  4.64
                19  19119  19  19122  4.87
                20  19126  20  19129  5.1
                21  19133  21  19136  4.83
                22  19140  22  19143  5.06
                23  19147  23  19150  5.29
                24  19154  24  19157  5.52
                25  19161  25  19164  5.75
                26  19168  26  19171  5.98
                27  19175  27  19185  6.21
                28  19182  28  19192  6.44
                29  19189  29  19199  6.67
                30  19196  30  19206  6.9
                31  19203  31  19213  7.13
                32  19210  32  19220  7.36
                33  19217  33  19227  7.59
                34  19224  34  19234  7.82
                35  19231  35  19241  8.05
                36  19238  36  19248  8.28
                37  19245  37  19255  8.51
                38  19252  38  19262  8.74
                39  19259  39  19269  8.97
                40  19266  40  19276  9.2
                41  19273  41  19283  9.43
                42  19280  42  19290  9.66
                43  19287  43  19297  9.89
                44  19294  44  19304  10.12
                45  19301  45  19311  10.35
                46  19308  46  19325  10.58
                end
                format %td *_date
                order value, before (n_id)
                save main, replace
                list, clean
                
                * verify assumptions about the data
                isid w_date
                isid n_date
                
                * special case to reach the last n_date
                local lastdate = n_date[_N]
                
                * expand to daily data for these variables
                keep w_id w_date value
                gen days = w_date[_n+1] - w_date
                replace days = `lastdate' - w_date + 1 in l // to reach the last n_date
                expand days
                bysort w_date : gen n_date = w_date + _n - 1
                format %td *_date
                
                list, sepby(w_id) noobs
                
                * merge back using n_date
                merge 1:1 n_date using "main", assert(master match) nogen
                
                * calculate new value based on value on the days around n_date
                gen n_group = sum(!mi(n_id))
                bysort n_group (n_date): egen new_values = mean(value)
                
                * show a few examples
                list if inlist(n_group,1,2,26), sepby(n_group) noobs
                
                * reduce to original observations
                keep if !mi(n_id)
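                As a quick sanity check (my addition, assuming the code above runs as posted), the first new value should now reproduce the hand calculation from #1:
                Code:
                display (4/7)*1.23 + (3/7)*1.46                  // 1.3285714
                list n_id n_date value new_values in 1/2, noobs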

                • #9
                  No worries at all. It helped me even more, because I spent some time thinking it through and came up with a solution, a very basic one of course. I learned so much with your help. I really appreciate it. Have a wonderful day.
