Dfuller loop, result interpretation or loop problem

Manuel Rodriguez

Join Date: Mar 2015

Posts: 37
#1

13 Jun 2016, 02:57

Hello statalisters:

I would be very grateful for any help, because I get some strange results when running my dfuller loop. The loop is

Code:

forval c = 1/26 {
    foreach k of varlist Y Yx D Px PX e M m X x PM R Kp KP {
        local i = optimal_lag
        dfuller `k' if Cou==`c', lags(`i') trend
    }
}

It runs through 26 countries and 14 variables. My objective is not to treat them as a panel but as separate time series.

I have a variable called optimal_lag that is meant to supply the optimal lag in each case, through my local `i'.

The results show:

Each dfuller (26*14 = 364 tests) shows a different test statistic and a different p-value, but they all show the same critical values.

E.g.:

Code:

Augmented Dickey-Fuller test for unit root         Number of obs   =        39

                               ---------- Interpolated Dickey-Fuller ---------
                  Test         1% Critical       5% Critical      10% Critical
               Statistic           Value             Value             Value
------------------------------------------------------------------------------
 Z(t)             -2.856            -4.251            -3.544            -3.206
------------------------------------------------------------------------------
MacKinnon approximate p-value for Z(t) = 0.1770

This is why I think there is a problem somewhere.

Thank you in advance.

Manuel

*********************************************

Update:

I've changed my code to add a return list after every dfuller, and I see that my local `i' is not picking up optimal_lag: the lag is 1 every time.

Thanks for any help.

Manuel.

Last edited by Manuel Rodriguez; 13 Jun 2016, 03:04.
Tags: loop, syntax, Time Series
Nick Cox

Join Date: Mar 2014

Posts: 35696
#2

13 Jun 2016, 05:59

What's optimal_lag precisely?

Whether it's a variable or a scalar, dfuller will use precisely the same value of optimal_lag each time around the loop.

If it's a variable the code will always use the first value optimal_lag[1]

What in your code do you think will lead Stata to do anything else? Where do you expect Stata to look?
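
To see this for yourself, here is a minimal sketch with made-up data (the values are an assumption, not your data). A variable name used in an expression without a subscript is read from the first observation:

```stata
* Made-up data: the lag differs by country
clear
input byte(Cou optimal_lag)
1 2
1 2
2 4
2 4
end

* No subscript and no if qualifier, so Stata reads observation 1
local i = optimal_lag
display "bare reference: `i'"
display "optimal_lag[1]: " optimal_lag[1]
display "optimal_lag[3]: " optimal_lag[3]
```

Nothing in the assignment mentions the country, so the loop index `c' has no influence on what ends up in `i'.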
Manuel Rodriguez

Join Date: Mar 2015

Posts: 37
#3

13 Jun 2016, 06:17

Thank you for your answer.

optimal_lag is a variable that assigns an optimal lag per country and per variable (each of the 14 series); it ranges from 1 to 4.

The optimal_lag variable is assigned by country, and an auxiliary variable idV, ranging from 1 to 14, identifies the variable that each optimal_lag value belongs to. So in effect I have a matrix, parallel to my data, made up of Country, idV and optimal_lag.

As my local `i' is set inside the loop, I expected Stata to take the corresponding lag per country and variable. I see now that I was wrong.

I was thinking of using this code structure that I see you recommended in the past,

Code:

sum optimal_lag , mean
local lags `r(mean)'

but I haven't been able to make it work yet.

Thank you.
Clyde Schechter

Join Date: Apr 2014

Posts: 30095
#4

13 Jun 2016, 06:28

Well, first, to get the country-specific value of optimal_lag into your loop you need to do it this way (inside your loop):

Code:

summ optimal_lag if Cou == `c', meanonly
local i = r(mean)
...
dfuller...., lags(`i') //etc.

But it seems your situation is more complicated than that because you need not just a country-specific optimal lag but an optimal lag that is specific to both country and the dependent variable (`k'). I don't understand your description of how your variable idV works to get you from the `i' calculated above to the actual optimal lag you want. So if you can finish the job from here, fine. If not, you need to provide much greater clarity on how optimal_lag and idV are used to calculate the correct value of `i'. That almost certainly means posting some example data and showing a worked (hand-worked if need be) example of the calculation. Please remember to use -dataex- (-ssc install dataex-, -help dataex-) to post example data.

Manuel Rodriguez

Join Date: Mar 2015
Posts: 37
#5

13 Jun 2016, 09:47

Thank you for your response. I will try to explain what I've done, in order to clarify and get some help.

I have an original data set that ranges from 1970 to 2010, for 26 countries and 14 variables (1066 observations per variable).

I ran some code to obtain my optimal lag variable (I know it has some flaws, but it works):

Code:

levelsof Cou, local(levels)
foreach l of local levels {
    foreach var of varlist Y Yx D Px PX e M m X x PM R Kp KP {
        varsoc `var' if Cou == `l'
        matrix A = r(stats)
        matrix C = nullmat(C) \ A
    }
}

svmat C, name(col)

egen id = fill(1 1 1 1 1 2 2 2 2 2)

egen idV= fill( 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 5 5 5 5 5 6 6 6 6 6 7 7 7 7 7 8 8 8 8 8 9 9 9 9 9 10 10 10 10 10 11 11 11 11 11 12 12 12 12 12 13 13 13 13 13 14 14 14 14 14 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 5 5 5 5 5 6 6 6 6 6 7 7 7 7 7 8 8 8 8 8 9 9 9 9 9 10 10 10 10 10 11 11 11 11 11 12 12 12 12 12 13 13 13 13 13 14 14 14 14 14 )

egen minAIC = min(AIC), by(id)

gen optimal_lag = lag if minAIC == AIC

drop lag LL LR df p FPE AIC HQIC SBIC id minAIC

After this I got the data below.

As you can see, I generated idV to identify which variable each optimal lag belongs to, as well as an idCou to identify its country (these variables have 1820 observations; the original ones have 1066).

As you can see, the problem with my program is that it generates one observation per lag (varsoc tests 5 lags by default each time and solves for the optimal one), leaving a blank in the positions where there is no match and hence no optimal lag.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int t double(Y Yx) long Cou float(idCou idV optimal_lag)
2010  370263402233.4621  760976820378092.1  1  1  9 .
2010 1244882071794.8757  847126851349784.4  2  2  3 .
2010  377679835152.1713 289468277209045.06  3  2 11 .
2010 2143035260499.0022  936165830683456.9  4  3  5 .
2010  1574051234262.832 1192762615143165.2  5  3 13 4
2010  217556231203.6262  881580949415636.8  6  4  8 .
2010  287018179193.9199    874180897268004  7  5  2 .
2010 312949596155.81226 1255853547288881.5  8  5 10 .
2010 236706436522.74374 288539516215534.25  9  6  4 .
2010 294223177785.40125 249979838051483.16 10  6 12 .
2010 12564705028.970507 1214323231855005.7 11  7  7 .
2010 1702345926490.1643    907256307500898 12  7  1 1
2010    709190822690.74  631935189357317.5 13  8  9 .
2010  218435338936.2444  453555197072954.2 14  9  3 .
2010  5495387182996.103  794682307524535.6 15  9 11 .
2010 1094499350178.4564  796564859819759.9 16 10  6 .
2010 1049924758675.7711 1143689698500420.5 17 10 14 1
end

To solve the problem of missing observations I split the data, saving idCou, idV and optimal_lag separately. I then dropped the observations with missing optimal_lag, renamed idCou to Cou to allow the merge, and merged back into the original data set by Cou (used merge m:m).

(Note: a problem I found after merging is that it duplicated some observations (because of m:m); each country now has a repeated optimal lag for one variable. I don't think it will affect the loop, which is why I haven't fixed it.)

Now I have:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int t double(Y Yx Cou) float(idV optimal_lag)
2010  370263402233.4621  760976820378092.1  1 14 2
2010 1244882071794.8757  847126851349784.4  2 14 3
2010  377679835152.1713 289468277209045.06  3  1 1
2010 2143035260499.0022  936165830683456.9  4  1 1
2010  1574051234262.832 1192762615143165.2  5  1 1
2010  217556231203.6262  881580949415636.8  6  1 1
2010  287018179193.9199    874180897268004  7  1 1
2010 312949596155.81226 1255853547288881.5  8  1 3
2010 236706436522.74374 288539516215534.25  9  1 3
2010 294223177785.40125 249979838051483.16 10  2 1
end

This is the data I am using to run the dfuller loop that aims to use optimal lag that is specific to both country and the dependent variable.

Hope I explained myself correctly.


Clyde Schechter

Join Date: Apr 2014

Posts: 30095
#6

13 Jun 2016, 11:37

"(used merge m:m)"

Really, there is no need to read further at this point. -merge m:m- is a recipe for garbage except under the most unusual circumstances (and, these are not your circumstances.) You have already noticed that the process produced some surplus observations; you probably didn't notice that many of the observations are also mismatches of sorts. There is no point trying to fix up anything that came after that until this gets fixed.

When you are tempted to use -merge m:m-, one of the following is nearly always true:

1. You really need to use -joinby- or -cross-.
2. You really need to use -merge 1:m- or -merge m:1-, or even -merge 1:1-, but to do that you need to specify one or more additional variables from a data set that will uniquely identify the observations in that data set.
3. There are data errors in one or both data sets that result in multiple observations on the merge key when only one should exist.

The bottom line is that -merge m:m- is almost never appropriate. It only produces useful results when there are the same number of observations at each level of the merge key in both data sets, and both data sets have been pre-sorted to assure that the corresponding observations appear in the same order, or, if there are excess observations in one data set at some level, the excess observations all correspond to the final observation at that level in the other data set.
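
As an illustration of option 2, here is a minimal sketch with made-up data; the layout of both data sets is an assumption on my part, not your actual files. The lag lookup is reshaped so that Cou alone identifies its rows, after which -merge m:1- is legitimate:

```stata
* Hypothetical lookup: one row per (Cou, idV) pair
clear
input byte(Cou idV optimal_lag)
1 1 2
1 2 1
2 1 4
2 2 3
end

* One row per country, one column per variable index
reshape wide optimal_lag, i(Cou) j(idV)
tempfile lags
save `lags'

* Hypothetical panel: several years per country
clear
input int t byte Cou double Y
2009 1 10
2010 1 11
2009 2 20
2010 2 21
end

* Cou now uniquely identifies rows in the using file
merge m:1 Cou using `lags', assert(match) nogenerate
list, sepby(Cou)
```

The assert(match) option makes the merge stop with an error if any country lacks a lookup row, so a silent mismatch cannot slip through.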

But let's look ahead to when you have fixed that up and actually have the right optimal lag associated with each combination of country and outcome variable. Then I would proceed with something along these lines:

Code:

forvalues c = 1/26 {
    forvalues n = 1/14 {
        local k: word `n' of Y Yx D Px PX e M m X x PM R Kp KP
        summ optimal_lag if idCou == `c' & idV == `n'
        local i = r(mean)
        dfuller `k' if Cou==`c', lags(`i') trend
    }
}

This assumes that in your data set at this point there are variables idCou and idV which range from 1 through 26 and 1 through 14 respectively, each combination appearing once, and in the same observation as the correct value of optimal_lag for that combination of country and outcome variable.

All of that said, I wouldn't use this approach in the first place. You are really mutilating your data set by throwing in variables idCou and idV that contain values that are not associated with the values of most of the other variables in the same observations. That is a data structure that one sometimes sees in spreadsheets intended for human eyes, but is awkward to work with in a statistical package (as you are seeing.) I think I would handle the identification of the appropriate optimal_lag by reference to the appropriate element of matrix C.
Manuel Rodriguez

Join Date: Mar 2015

Posts: 37
#7

14 Jun 2016, 09:22

Clyde, thank you very much.

I have taken some time to think over your answer. It is very useful.

First, I used merge m:m because the other forms of merge didn't work. My optimal_lag data (identified by idV and idCou) has repeated values and is shorter (364 observations) than the original data set (1066 observations).

. merge 1:m Cou using "D:\Users\VA07\Desktop\All Stata 12\Tesis\DO Tesis\Do 2016\opt_lag.dta"
variable Cou does not uniquely identify observations in the master data
r(459);

. merge m:1 Cou using "D:\Users\VA07\Desktop\All Stata 12\Tesis\DO Tesis\Do 2016\opt_lag.dta"
Cou was long now double
variable Cou does not uniquely identify observations in the using data
r(459);

. merge 1:1 Cou using "D:\Users\VA07\Desktop\All Stata 12\Tesis\DO Tesis\Do 2016\opt_lag.dta"
variable Cou does not uniquely identify observations in the master data
r(459);

I also tried joinby and cross; both generated duplicates. joinby repeated the lags per idV, meaning that for one country I had all the lags repeated at least 2 and even 3 times.

cross also produced repeats, in the strangest way: it ended up repeating the optimal lag 662480 times.

I am working on this; I am just trying to explain why I used m:m. It seemed the least strange.

With regard to your code, a thousand thanks. I think it will work, or at least goes in the right direction. I ran it over the original data set, the one that had blanks:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int t double(Y Yx) long Cou float(idCou idV optimal_lag)
2010 370263402233.4621 760976820378092.1 1 1 9 .
2010 1244882071794.8757 847126851349784.4 2 2 3 .
2010 377679835152.1713 289468277209045.06 3 2 11 .
2010 2143035260499.0022 936165830683456.9 4 3 5 .
2010 1574051234262.832 1192762615143165.2 5 3 13 4
2010 217556231203.6262 881580949415636.8 6 4 8 .
2010 287018179193.9199 874180897268004 7 5 2 .
2010 312949596155.81226 1255853547288881.5 8 5 10 .
end

It worked for around 200 of the 364 dfuller runs it has to do, and then it stopped with:

Code:

option lags() incorrectly specified
r(198);

end of do-file

Still, it is perfect. I didn't know about the word extended function, and now I am studying it.

Last edited by Manuel Rodriguez; 14 Jun 2016, 09:25.
Clyde Schechter

Join Date: Apr 2014

Posts: 30095
#8

14 Jun 2016, 10:13

Since I still don't understand your whole data setup with idCou and idV and optimal_lag, I'm making blind guesses here. But perhaps there is some combination of idCou and idV for which there is no corresponding value of optimal_lag: perhaps optimal_lag is missing for that. If so, -summarize- finds no observations, r(mean) is missing, local macro i will contain a missing value (.), and -lags(`i')- becomes -lags(.)-, which is not legal syntax for the dfuller command.
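
A minimal sketch with made-up data of what the macro holds when no observation matches. As far as I can tell it is a missing value (.) rather than an empty string, so testing for missingness, or testing r(N), is the safer check:

```stata
* Made-up lookup in which one combination has no lag recorded
clear
input byte(idCou idV optimal_lag)
1 1 2
1 2 .
end

* No non-missing observation satisfies this condition
summarize optimal_lag if idCou == 1 & idV == 2, meanonly
local i = r(mean)
display "i = [`i']   r(N) = " r(N)
```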

You can test whether that is happening by adding a -display `i'- command in the code from #6 just before the -dfuller- command. If I am right, it will show a number every time it works, and then when it hits the breaking point -display- will show a missing value (.) instead. At that point, there are two possibilities.

1. There is a good reason that there is no optimal_lag value for this combination of idCou and idV. In that case we need to revise the loop to skip over such cases:

Code:

forvalues c = 1/26 {
    forvalues n = 1/14 {
        local k: word `n' of Y Yx D Px PX e M m X x PM R Kp KP
        summ optimal_lag if idCou == `c' & idV == `n'
        local i = r(mean)
        // r(mean) is missing when no observation matches
        if missing(`i') {
            display `"No optimal lag defined for idCou == `c' & idV == `n'"'
        }
        else {
            dfuller `k' if Cou==`c', lags(`i') trend
        }
    }
}

2. There should be an optimal_lag value but the previous code failed to locate it properly in the data. (Given that you messed around with -merge m:m-, that wouldn't surprise me at all. -merge m:m- almost always ends in tears.) If this is the case then you have to fix that to make sure you have an optimal_lag specified for each combination. If you are indeed going to have to revisit the code that created that correspondence, I strongly urge you not to try to revise it, but to scrap it altogether and use a different approach. It is difficult for me to advise you on specifics because I know nothing about the -varsoc- command, and when I tried to read the help file and manual section it felt like reading a foreign language. I just don't work with time series and don't even know the basics.

That said, here is what I could understand. When you run -varsoc-, it returns a matrix r(stats) which contains a bunch of statistics. Somehow you are able to extract the optimal lag from that matrix of statistics. So what I would do is something like this:

Code:

levelsof Cou, local(levels)
local n_countries: word count `levels'
matrix OL = J(14, `n_countries', .)
forvalues m = 1/`n_countries' {
    local l: word `m' of `levels'
    forvalues n = 1/14 {
        local var: word `n' of Y Yx D Px PX e M m X x PM R Kp KP
        varsoc `var' if Cou == `l'
        matrix A = r(stats)
        // HERE INSERT CODE TO EXTRACT THE OPTIMAL LAG FOR
        // Country `l' AND VARIABLE `var'
        // PLACING THAT OPTIMAL LAG IN A LOCAL MACRO opt_lag
        matrix OL[`n', `m'] = `opt_lag'
    }
}

Then in the loop from #6, replace

Code:

summ optimal_lag if idCou == `c' & idV == `n'
local i = r(mean)

with

Code:

local i = OL[`n', `c']

and I think things will go much more smoothly.
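
For the extraction step left open above, a sketch along these lines might do. It assumes (I have not verified this against -varsoc-) that r(stats) carries columns named lag and AIC, as the -svmat C, name(col)- in #5 suggests, and that the smallest AIC is the criterion you want. The matrix below is a made-up stand-in for r(stats), reduced to those two columns:

```stata
* Hypothetical stand-in for r(stats): one row per lag tried
matrix A = (0, 5.2 \ 1, 4.1 \ 2, 3.9 \ 3, 4.0 \ 4, 4.3)
matrix colnames A = lag AIC

* Scan the rows and keep the lag with the smallest AIC
local best_aic = .
local opt_lag = .
forvalues r = 1/`=rowsof(A)' {
    local aic = A[`r', colnumb(A, "AIC")]
    * Missing sorts above every number, so the first row always wins initially
    if `aic' < `best_aic' {
        local best_aic = `aic'
        local opt_lag = A[`r', colnumb(A, "lag")]
    }
}
display "optimal lag by AIC: `opt_lag'"
```

Looking columns up by name with colnumb() avoids hard-coding their positions. Note that lag 0 is a candidate in this scan, so decide whether that is acceptable for your tests.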
Manuel Rodriguez

Join Date: Mar 2015

Posts: 37
#9

14 Jun 2016, 10:26

Wow, many, many thanks Clyde.

Great help!!!

I'll try to explain myself clearly.

1. I still have my original data from before messing with m:m. However, as you suggest, maybe my way of obtaining the optimal lag is too messy.

What I did with varsoc was run a loop and stack all the results of r(stats) in a matrix. These stats give different criteria for choosing an optimal lag, and I chose only one of them.

I then extracted that information and created a new variable, optimal_lag, which had the drawback of containing blanks and having more observations than my original data set.

To identify the observations of this variable, which looked something like the listing below, I created two auxiliary variables that tie each block of 5 observations (varsoc stores r(stats) for 5 lags per variable and country) to a country and a variable. These auxiliary variables run in parallel to optimal_lag.

Code:

optimal_lag
1
.
.
.
.
.
.
2
.
.
.

2. With regard to the dfuller loop I will try it as soon as I solve my previous problems.

I really thank you for your help and will let you know how things go.

Thank you.

Last edited by Manuel Rodriguez; 14 Jun 2016, 10:28.

Manuel Rodriguez

Join Date: Mar 2015
Posts: 37

#10

15 Jun 2016, 05:38

Hello again,

I've been working on this, solved my merging problems via a matrix, and have now started to run the dfuller loop.

I fixed some things and it works all round.

I used this code, with various display commands to check whether things go wrong.

Code:

forvalues c = 1/26 {
    forvalues n = 1/14 {
        local k: word `n' of Y Yx D Px PX e M m X x PM R Kp KP
        summ optimal_lag if idCou == `c' & idV == `n'
        local i = r(mean)
        if missing(`i') {
            display `"No optimal lag defined for idCou == `c' & idV == `n'"'
        }
        else {
            display `i'
            display `k'
            display `c'
            dfuller `k' if Cou==`c', lags(`i') trend
            return list
        }
    }
}

I have only one problem, and it is with the local k.

When it displays k, I get only one value of each variable instead of all the values of the time series. So it seems my dfuller runs over the first year only.

This is what I get.

Code:

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
 optimal_lag |         1           1           .          1          1
1
2.374e+10
1

Augmented Dickey-Fuller test for unit root         Number of obs   =        39

                               ---------- Interpolated Dickey-Fuller ---------
                  Test         1% Critical       5% Critical      10% Critical
               Statistic           Value             Value             Value
------------------------------------------------------------------------------
 Z(t)             -2.620            -4.251            -3.544            -3.206
------------------------------------------------------------------------------
MacKinnon approximate p-value for Z(t) = 0.2709

scalars:
               r(lags) =  1
                  r(N) =  39
                 r(Zt) =  -2.619610584487132
                  r(p) =  .2709487996569067

As I am not very good with macros, especially the "word" extended function, I guess the problem is there. I am working on it, but I just wanted to let you know how things stand and see if there is an easy fix that an amateur like me is not seeing.

Thank you again!


Clyde Schechter

Join Date: Apr 2014

Posts: 30095
#11

15 Jun 2016, 07:56

While I don't know anything about -dfuller-, just knowing how Stata programs generally provide output, it seems to me that the example of -dfuller- output you show has run using r(N) = 39 observations.

I think what is confusing you is a misunderstanding of how -display- works. The local macro k in this program contains the name of a variable. When you run -display- with the name of a variable, Stata interprets that as asking it to display the value of that variable in the first observation. If you want to see all the values of that variable, then you need the -list- command, not -display-.

If the purpose of your -display `k'- statement is just to enable you to track the progress of your loops, I would change it to -display "`k'"-. That will cause Stata to display the actual name of the variable being Dickey-Fullered in the current iteration. So it will respond with things like Y, or Yx, etc. If you actually want to see the full list of values of `k' that will be used by -dfuller- then you can't do that with -display-. The command would, instead, be -list `k' if Cou == `c'-
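
A minimal sketch of the three forms side by side, using made-up data (the values are an assumption, not your data):

```stata
clear
input byte Cou double Y
1 10
1 11
2 20
end

local k Y
* The macro expands to the variable name, so this shows Y in observation 1
display `k'
* In quotes, the contents of the macro are shown as text
display "`k'"
* All the values of the variable for country 1
list `k' if Cou == 1
```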

In any case, I'm pretty sure that you are actually getting -dfuller- to run on the right sample in your code.
Manuel Rodriguez

Join Date: Mar 2015

Posts: 37
#12

15 Jun 2016, 08:53

Wonderful, you've opened my eyes.

Thank you very much for everything.

As soon as I check everything I will post the full code so that it remains here for future reference.

Thank you again.

Manuel Rodriguez

Join Date: Mar 2015
Posts: 37

#13

17 Jun 2016, 03:22

Dear all, thank you for your help.

Here is the final working code.

The last part generates a variable with the p-values.

Code:

forvalues c = 1/26 {
    forvalues n = 1/14 {
        local k: word `n' of Y Yx D Px e M m PM X x PX R Kp KP
        summ optimal_lag if idCou == `c' & idV == `n'
        local i = r(mean)
        if missing(`i') {
            display `"No optimal lag defined for idCou == `c' & idV == `n'"'
        }
        else {
            display `i'
            display "`k'"
            display `c'
            dfuller `k' if Cou==`c', lags(`i') trend
            return list

            matrix B = r(p)
            matrix D = nullmat(D) \ B
        }
    }
}


svmat D, name(col)
rename c1 Pdf
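
A possible variant, sketched here on a simulated series rather than my data (the seed, the random walk and the column names are assumptions for illustration): storing the country and variable indices next to each p-value keeps the rows identifiable even when a combination is skipped.

```stata
* Simulated random walk standing in for one country's series
clear
set seed 12345
set obs 40
generate t = 1969 + _n
generate Y = sum(rnormal())
tsset t

* Start from a clean matrix so that reruns do not stack old rows
capture matrix drop D
local c = 1
local n = 1
dfuller Y, lags(1) trend

* One row per test: country index, variable index, lags used, p-value
matrix D = nullmat(D) \ (`c', `n', r(lags), r(p))
matrix colnames D = Cou idV lags Pdf
matrix list D
```

In the full loop, the matrix D line would go where matrix B and matrix D are built now, and the colnames line would run once after the loops.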

Thanks for your help, Clyde.
