How do I loop sequentially through observations within a variable?

Anthony Voyage

Join Date: Sep 2014

Posts: 14
#1

02 Sep 2014, 17:47

Hi there,
I am quite new to STATA and am still getting to grips with the syntax. I want to loop through my variables using something like:

local j
j=2
while (var(j)==var(j-1)) { /*Here, var(j) means the j'th observation of the variable 'var'*/
...do something...
j=j+1
}

My question is how should I be writing the var(j) part? Also, should I be using _n instead of declaring some j?

Thanks
Sarah Edgington

Join Date: Apr 2014

Posts: 284
#2

02 Sep 2014, 18:33

You probably want to tell us exactly what you're trying to do. It's pretty rare that you actually need to loop through observations within a variable since Stata processes commands by doing just that.

Likely you want something like:
some command if var==var[_n-1]

However all advice at this point is just speculation without more information about what you're trying to do.
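
A minimal sketch of that idea on a made-up six-observation dataset. The names x and same_as_prev are illustrative assumptions, not variables from the original data. An explicit subscript such as x[_n-1] refers to the previous observation, and the comparison is evaluated for every observation without writing a loop:

```stata
* Hypothetical toy data (run as a do-file)
clear
input x
1
1
2
2
2
3
end

* x[_n-1] is the value of x in the previous observation;
* for the first observation it is missing, so the comparison is false there
generate byte same_as_prev = (x == x[_n-1])

list
```

The same expression can be used as a qualifier on other commands, in the form "some command if x == x[_n-1]".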
Anthony Voyage

Join Date: Sep 2014

Posts: 14
#3

03 Sep 2014, 00:23

Right, sorry. I've attached a sample of my data with the relevant variables. Firmno is an arbitrary counter for firms, fpedats is an end-date for relevant forecasts and revdt is the date of these forecasts. I would like to create a do-file that will have some nested loops which will run through in a manner like:

local i 1
local j 1
for (i = 1, firmno[i]==firmno[i-1],++i) {
while (fpedats[j]==fpedats[j-1]) {
    ...do something with revdt...
}
}

The 'do something' part is no problem for me. I realise that this is a very basic question but my problem is with STATA's language. I know this isn't right but I don't know how to structure this using STATA's loops (foreach and forvalues) or whether I'm calling the observations in the correct manner.

EDIT: Actually, sorry, for completeness I'm also generating a new variable to store results in. So my loop looks something like this:

local i 1
local j 1
gen var C_w

for (i = 1, firmno[i]==firmno[i-1],++i) {
j=1
while (fpedats[j]==fpedats[j-1]) {
if j>2 {
C_w = 0.5*(fpedat[j-1] + fpedat[j-2])
}
}
}

Thanks again for any help
Attached Files

sample.dta (14.5 KB, 1 view)

Last edited by Anthony Voyage; 03 Sep 2014, 00:49.
Nick Cox

Join Date: Mar 2014

Posts: 35711
#4

03 Sep 2014, 01:52

Sarah is spot on here: in Stata [NB; not STATA] a loop over observations is rarely needed and indeed the tool of last resort.

Your central calculation is averaging two previous values. If you set up your data as panel data using tsset or xtset then the calculation is quite possibly as little as one line using time series operators. See help tsset and help tsvarlist
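
A minimal sketch of what such a one-line calculation might look like, assuming a panel with no repeated time values within a firm. The time variable t and the toy values are hypothetical; the real data would need a suitable time variable of its own:

```stata
* Hypothetical toy panel: two firms, consecutive time points (run as a do-file)
clear
input firmno t value
1 1 10
1 2 20
1 3 30
1 4 40
2 1 5
2 2 15
2 3 25
end

* Declare the panel structure; this fails if (firmno, t) pairs are repeated
xtset firmno t

* L1. and L2. are the first and second lags within each firm
generate double cw = 0.5 * (L1.value + L2.value)

list, sepby(firmno)
```

Here cw is missing for the first two observations of each firm, because one or both lags do not exist there.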
Comment
Anthony Voyage

Join Date: Sep 2014

Posts: 14
#5

03 Sep 2014, 02:43

Thanks Nick,

I have tried using the xtset command with firmno as the panelvar and revdt as the timevar. The problem is that there are duplicate observations (i.e. for a given firm a forecast may occur at the same revdt). I cannot simply remove the duplicates though because even if the observation is duplicated by firmno and revdt, it likely isn't by the actual forecast value (note: the above sample data does not have the full variables list, I culled it because I thought it would focus on the problem (I have about 20 variables)). I have thought about slightly altering the affected revdt values as I am more interested in their order rather than their actual date and time but I know of no easy way to do this (and the full sample contains a bit over 2 million lines so I can't do it by hand). It was actually this problem that brought me here.

In the end, I am trying to generate the new variable to then use it as a control variable when I run a regression. The regression itself will not be a time-series regression so I was hoping that, by generating a new variable via code, I could avoid the tsset/xtset problem.

I have attached another sample that contains all of the variables in my data. I'm sorry that it's in Excel, it wasn't letting me upload the Stata file for some reason.
Attached Files

sample2.xlsx (43.1 KB, 1 view)
Nick Cox

Join Date: Mar 2014

Posts: 35711
#6

03 Sep 2014, 03:25

In the FAQ Advice under my name there is advice generally against attaching Excel files. I practise what I preach by not even trying to read them.

That aside, we're struggling to find a common basis here. We [meaning, more experienced users interested in trying to help] don't particularly want or need to see your entire dataset. We want to see a fragment of your dataset small enough to copy and paste into our own Statas with enough detail to see exactly what you are trying to calculate. That's best shown by copying and pasting enough observations as CODE mark-up.

With your extra explanation that you don't have panel data in a strict sense, given duplicates, it is still highly likely that most if not all of your problem doesn't need looping. The machinery of by:, _n and _N is still likely to apply.
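
As a rough sketch of that machinery, assuming only the order of forecasts within a firm matters: under by:, _n is the observation number within the group and _N is the group size. The toy values and the names seq and n_in_firm are hypothetical.

```stata
* Hypothetical toy data with a duplicated (firmno, revdt) pair (run as a do-file)
clear
input firmno revdt value
1 100 10
1 100 12
1 101 14
2 100 7
2 102 9
end

* Sort by firm, then by revdt within firm, and number the observations.
* Note: the order of observations tied on revdt is arbitrary here.
bysort firmno (revdt): generate seq = _n

* Number of observations for each firm
by firmno: generate n_in_firm = _N

list, sepby(firmno)
```

A counter like seq is unique within firm even when revdt is duplicated, so it is one possible stand-in for an order variable. How ties should really be ordered is a substantive decision that this sketch does not settle.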
Anthony Voyage

Join Date: Sep 2014

Posts: 14
#7

03 Sep 2014, 05:24

Thank you both. I'll give it a shot.
Anthony Voyage

Join Date: Sep 2014

Posts: 14
#8

04 Sep 2014, 03:00

Hello again! Thanks again for the advice, I have now successfully generated that variable. I used this:

gen yyyy = year(fpedats)
bysort firmno yyyy: gen cw = 0.5*(value[_n-1] + value[_n-2]) if yyyy[_n]==yyyy[_n-1] & yyyy[_n-1]==yyyy[_n-2]

Unfortunately I have to generate another more complicated average as well. Rather than a simple average of the previous 2 observations given that those observations meet the contingent statement (above), I now need to generate a simple average of all observations, going back to a maximum 10, given those observations meet the same contingent statement.

As this is a more complicated variable I was thinking I should break it into 2 steps:
1. Create a variable, j, that counts how many relevant previous observations should be used to calculate the average for a given observation
Can I do this using a lot of contingency statements?
i.e. bysort firmno yyyy: gen j = 0 if yyyy[_n]!=yyyy[_n-1] else j=1 if yyyy[_n]==yyyy[_n-1] & yyyy[_n-1]!=yyyy[_n-2] ........ else j =10

2. Create the average variable, cb, that I think will run something like:
bysort firmno yyyy: gen cb = -99 if j==0 else cb = (1/j[_n])*(value[_n-1] + value[_n-2] + ... + value[_n-j])

Is there an easier way to tell Stata to go back as far as j?

Thanks again for any help.
Nick Cox

Join Date: Mar 2014

Posts: 35711
#9

04 Sep 2014, 03:59

Good to hear you're making progress.

The short answer is "Don't do it that way".

I am not clear how you think the restriction

Code:

if yyyy[_n]==yyyy[_n-1] & yyyy[_n-1]==yyyy[_n-2]

works, as within groups of observations defined by distinct firmno yyyy, yyyy is necessarily identical. So the restriction is redundant, except that it does insist that there are at least 2 previous observations in the same group, as otherwise one or both of the previous yyyy will be missing.
But if there aren't two previous observations in the same group, your average of 2 previous values will be returned as missing anyway, so the restriction is not needed.

I will set that on one side.
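
For anyone who wants to check that claim, here is a small sketch on made-up data (the values and the names cw_if and cw_plain are hypothetical). It computes the average with and without the restriction and asserts that the two results agree:

```stata
* Hypothetical toy data (run as a do-file)
clear
input firmno yyyy value
1 2013 10
1 2013 20
1 2013 30
1 2014 40
1 2014 50
2 2013 5
end

sort firmno yyyy, stable

* Version with the restriction from the earlier post
by firmno yyyy: generate cw_if = 0.5*(value[_n-1] + value[_n-2]) ///
    if yyyy[_n]==yyyy[_n-1] & yyyy[_n-1]==yyyy[_n-2]

* Version without it: subscripts outside the group already return missing
by firmno yyyy: generate cw_plain = 0.5*(value[_n-1] + value[_n-2])

* Passes: both are missing or both are equal in every observation
assert cw_if == cw_plain
```

In this toy dataset only the third observation has two predecessors in its group, so only there is the result non-missing (15).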

With your set-up the number of previous observations for the same firmno yyyy is

Code:

bysort firmno yyyy: gen n_previous = _n - 1

Check: if it is observation 42, the number previous is 41. If 1, then 0.

The cumulative sum of previous values within the same group is

Code:

by firmno yyyy: gen double sum_previous = sum(value[_n-1])

Watch out: the sum() function returns the cumulative or running sum. Missings are ignored, that is, treated as zero.

So you are interested in this average

Code:

by firmno yyyy : gen mean_previous = (sum_previous - sum_previous[_n - min(n_previous, 10)]) / min(n_previous, 10)

The average is, like any average, of the form (sum of pertinent values) / (number of pertinent values). Taking that more slowly:

1. Number of pertinent values. Your restriction to at most 10 previous works out as

Code:

min(10, n_previous)

Check: If there were only 7 previous, that's all you could use. So it's min(10, 7) in that case, not max(10, 7). (I find it easy to get it wrong first time with these problems.) Warning: this number of previous values does not ignore missings on anything. How could it? If you have missings on value, you need something more complicated, so sing out.

2. Sum of pertinent values. That is most easily got here as a difference between two cumulative sums, which is why we calculated those. (I learned this trick some while back from Michael Blasnik on this forum. It has the flavour of "Yes, of course; why didn't I see that?".)
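
The two steps above can be traced on a toy dataset. This is only a sketch: the values are made up, and a window of 3 stands in for 10 so that the capping is visible with a handful of observations.

```stata
* Hypothetical toy data: one firm-year with six values, another with two
* (run as a do-file)
clear
input firmno yyyy value
1 2013 1
1 2013 2
1 2013 4
1 2013 8
1 2013 16
1 2013 32
2 2013 100
2 2013 200
end

sort firmno yyyy, stable   // stable keeps the entered order within groups

* Number of previous observations in the same firmno-yyyy group
by firmno yyyy: generate n_previous = _n - 1

* Running sum of the values strictly before the current observation
by firmno yyyy: generate double sum_previous = sum(value[_n-1])

* Mean of at most 3 previous values, as a difference of two running sums
by firmno yyyy: generate double mean_previous = ///
    (sum_previous - sum_previous[_n - min(n_previous, 3)]) / min(n_previous, 3)

list, sepby(firmno yyyy)
```

For the first firm the running sums are 0, 1, 3, 7, 15, 31. The fifth observation therefore gets (15 - 1)/3 = 14/3, the mean of 2, 4 and 8, and the first observation of each group gets missing because the divisor is zero.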

It's a good idea to play with a very small dopey example dataset when doing this kind of thing.
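
A toy dataset is also a convenient place to try the case with missings on value mentioned in point 1. One possible extension, offered as an assumption rather than something posted in this thread, is to keep a second running sum that counts the non-missing previous values and to divide by a difference of counts. The names cnt_previous and mean_nonmiss are hypothetical, and a window of 3 again stands in for 10.

```stata
* Hypothetical toy data with missings on value (run as a do-file)
clear
input firmno yyyy value
1 2013 2
1 2013 .
1 2013 4
1 2013 6
1 2013 .
1 2013 8
end

sort firmno yyyy, stable

by firmno yyyy: generate n_previous = _n - 1

* Running sum of previous values; sum() treats missings as zero
by firmno yyyy: generate double sum_previous = sum(value[_n-1])

* Running count of non-missing previous values
by firmno yyyy: generate double cnt_previous = sum(!missing(value[_n-1]))

* Mean of the non-missing values among the previous 3 observations
by firmno yyyy: generate double mean_nonmiss = ///
    (sum_previous - sum_previous[_n - min(n_previous, 3)]) / ///
    (cnt_previous - cnt_previous[_n - min(n_previous, 3)])

list
```

Here the window still covers the previous 3 observations, missing or not. If instead the previous 3 non-missing values are wanted, a different approach is needed.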

The power of by: is easy to underestimate. Some resources that may help are http://www.stata-journal.com/sjpdf.h...iclenum=pr0004 and the data management FAQs on the Stata website, many of which use this prefix command.

EDIT: Looking at the code again, it can be seen that the variable n_previous is redundant, as _n - 1 could always be used instead. Nevertheless I suspect many of us like to calculate a variable like that as a way to solve part of the problem and set it on one side. I'll leave the code as it is to show the way I thought through to a solution.

Last edited by Nick Cox; 04 Sep 2014, 04:12.
Anthony Voyage

Join Date: Sep 2014

Posts: 14
#10

04 Sep 2014, 17:24

Thanks very much again for the help and the additional pointers