Generate variable containing weighed rowmean

Nadine Bast

Join Date: Aug 2018

Posts: 23
#1

Generate variable containing weighed rowmean

20 Feb 2019, 09:25

Dear all,

it might be a wood for the trees kind of problem, but I cannot wrap my head around it.
I have two datasets, one where the observations are seconds and variables are a score for each second for a person:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input double(Sec PersonA) str7 PersonB double PersonC str7 PersonD 12 2 "4" 2 "4" 13 4 "4" 4 "4" 14 4 "2" 4 "2" 15 4 "4" 4 "4" 16 4 "3" 4 "3" 17 3 "3" 3 "3" 18 3 "2" 3 "2" 19 3 "1" 3 "1" 20 3 "2" 3 "2" 21 5 "3" 5 "3" 22 5 "4" 5 "4" 23 5 "5" 5 "5" 24 5 "5" 5 "5" 25 4 "5" 4 "5" end

(All values are numeric, I got this wrong in my example).

What I did before was this: egen mean=rowmean( PersonA- PersonD ). But now it got trickier:

Another data set contains information about each of these persons, on the basis of which the mean for each second of all persons is to be weighted. In my example data, my goal would be to eaven out values based on kind of car and pet. Unfortunately, every person is a case here, so I can't just match the sets and create a variable for weighting.

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str7 PersonID double(car Pet) "PersonA" 1 2 "PersonB" 2 4 "PersonC" 1 4 "PersonD" 1 4 end

Can anyone think of a smart way to work my way around?The dataset is one of many and contains data for thousands of seconds and many persons.
As always, thankful for every suggestion and many thanks in advance!

Best
N.
Tags: None

Jorrit Gosens

Join Date: Jan 2015
Posts: 1019

20 Feb 2019, 11:13

I think you're going to have to reshape your dataset with seconds in order to make these calculations. And perhaps reshape back into long after, if there other calculations to be made.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input double Sec PersonA PersonB PersonC PersonD
12 2 4 2 4
13 4 4 4 4
14 4 2 4 2
15 4 4 4 4
16 4 3 4 3
17 3 3 3 3
18 3 2 3 2
19 3 1 3 1
20 3 2 3 2
21 5 3 5 3
22 5 4 5 4
23 5 5 5 5
24 5 5 5 5
25 4 5 4 5
end

*reshape
reshape long Person, i(Sec) j(PersonID) string
reshape wide Person, i(PersonID) j(Sec)
ren Pers* Sec*
ren SeconID PersonID
replace PersonID ="Person"+PersonID
* merge with weighting vars
merge 1:1 PersonID using dataset2.dta
*drop _merge

*weighting
* 12 and 25 in this loop are the first and last second in dataset
* wieghting here as original value*value of car, as example
forvalues i = 12(1)25 {
replace Secon`i' = Secon`i'*car
}

Comment

Nadine Bast

Join Date: Aug 2018

Posts: 23
#3

21 Feb 2019, 01:29

Thank you, Jorrit, I will try that! I will have to rename my variables first as they are unfortunately named with random ids without a uniform stem.
I will report how it went.
Thanks again!

Edit: I just did this with my example data to comprehend each step. But I struggle with reshaping the data back to the original set afterwards.
I dropped the car- and pet-variable first since I see no way keeping them in the second-by-second-set. Then I did:

reshape long Secon, i(PersonID) j(Sec)
reshape wide Secon, i(Sec) j(PersonID) string

The data seems fine, but I want to make sure I didn't make a mistake a can't trace later in the huge dataset... Would that be the correct 'way back'?

Thank you!

Last edited by Nadine Bast; 21 Feb 2019, 02:18.
Comment
Nadine Bast

Join Date: Aug 2018

Posts: 23
#4

21 Feb 2019, 03:24

I think my data might be to much for Stata... I got this error:

Data long -> wide
-----------------------------------------------------------------------------
Number of obs. 807438 -> 138
Number of variables 3 -> 5852
j variable (5851 values) Sekunde -> (dropped)
xij variables:
Person_ -> Person_1 Person_2 ... Person_5851
characteristic contents too long
The maximum value of the contents is 67,784.
r(1004);

Is there any way to work around this?
Comment
Jorrit Gosens

Join Date: Jan 2015

Posts: 1019
#5

21 Feb 2019, 04:34

Your code in post #3 is fine.

As for the error in #4:
Where Stata complains about the length of the statement "Person_1 Person_2 ... Person_5851" in the reshape command: when expanded for all personID's, this statement exceeds 67,784 characters.
In some cases this limitation would mean you'd need to cut your dataset in pieces and run the code on each individual piece.
A simple fix in this specific case is to make var names shorter. You can keep a list of 5851 variable names shorter than those 67,784 characters by doing e.g. below changes.
Adjust to your variable names and PersonID

Code:

*Some fake data clear set maxvar 10000 set obs 138 forvalues i = 1(1)5851 { gen Person`i' = runiform() } gen Sec= _n *reshape reshape long Person, i(Sec) j(PersonID) string reshape wide Person, i(PersonID) j(Sec) ren Pers* Sec* ren SeconID PersonID replace PersonID ="Person"+PersonID * merge with weighting vars merge 1:1 PersonID using dataset2.dta *weighting forvalues i = 12(1)25 { replace Secon`i' = Secon`i'*car } *Reshape back drop car Pet _merge reshape long Secon, i(PersonID) j(Sec) ren Secon S replace PersonID = subinstr(PersonID, "Person", "P", .) reshape wide S, i(Sec) j(PersonID) string ren SP* Person_*
Comment
Nadine Bast

Join Date: Aug 2018

Posts: 23
#6

21 Feb 2019, 05:02

Thank you, Jorrit!
I already started to split my data in sets of 450 observations each, which worked but produced 14 sets.

The error actually occurs already in the first part:
reshape long Person, i(Sec) j(PersonID) string works but reshape wide Person, i(PersonID) j(Sec) produces the error So I will try to put your code in earlier. Your help is greatly appreciated!
Comment
Jorrit Gosens

Join Date: Jan 2015

Posts: 1019
#7

21 Feb 2019, 05:07

I didnt have an issue running the exact code posted in #5. Might be some slight changes to variable names in your dataset?
Logic remains the same though: make sure that the list of variables that reshape generates does not exceed 67,784 characters. So with 5800 vars, stick to short variable names.
Comment
Nadine Bast

Join Date: Aug 2018

Posts: 23
#8

21 Feb 2019, 05:31

Sorry, I have expressed myself unclearly - the code is fine. What I meant is that the error concerning excession of characters occured already with the first code (in
my long dataset, not the example). It works fine with a smaller part of the data.
I thought stata complained due to the enormous amount of observations in the long set (80.7000),
but will try if it works with your suggestion and shorten the names. As each person got a random id of 12 letters each, which I can't change because I need them to match the data later,
I'm nut sure if shortening the prefix is enough.
Comment
Jorrit Gosens

Join Date: Jan 2015

Posts: 1019
#9

21 Feb 2019, 05:36

Ah, okay. I would agree it would be easiest/safest to retain original ID numbers. So in that case splitting the dataset would be a sensible workaround.
There might also be alternative routes to what you are trying to accomplish, but I cant think of them.
Comment
Nadine Bast

Join Date: Aug 2018

Posts: 23
#10

21 Feb 2019, 07:32

Nevermind, thanks for all your help!
Comment

Announcement

Generate variable containing weighed rowmean

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment