choosing x in " rangestat (reg) y x " as the x specified in a variable

charlie wong

Join Date: Jan 2016
Posts: 154
#1

04 Dec 2017, 07:50

hello

I wonder whether rangestat (reg) allows the x in " rangestat (reg) y x " to be chosen, observation by observation, as the variable named in another variable. I hope the following example is self-explanatory.

Code:

clear *
set obs 100
gen day=_n
gen y=runiform(1,10)
gen x1=runiform(1,10)
gen x2=runiform(1,10)
gen xvar="x1" if mod(day,2)==0
replace xvar="x2" if mod(day,2)>0

rangestat (reg) y x1 , interval(day -30 -1)
rename (reg_* b_* se_*) (x1reg_* x1b_* x1se_*)

rangestat (reg) y x2 , interval(day -30 -1)
rename (reg_* b_* se_*) (x2reg_* x2b_* x2se_*)

gen reg_r2=.
replace reg_r2=x1reg_r2 if xvar=="x1"
replace reg_r2=x2reg_r2 if xvar=="x2"  //can reg_r2 like this be obtained with one single rangestat? or any better solution?

Thanks in advance for your advice.

Nick Cox

Join Date: Mar 2014

Posts: 35715
#2

04 Dec 2017, 08:28

I understand the code but I am not clear that I understand the desire here.

At its simplest, there is no need to manipulate variable names here. You just calculate what you want to use in advance of calling rangestat (SSC, as you are asked to explain).

Alternatively, two different regressions are -- two different regressions.
charlie wong

Join Date: Jan 2016

Posts: 154
#3

04 Dec 2017, 08:46

In my actual case there are many x's, and if I run a separate rangestat for each x, I need to run many of them. So I wonder if there is a way to do it with one rangestat. I guess cases similar to my example may arise in, for example, international trade: country Y's largest trading partner (X) varies over the years, and one is interested in the impact of the largest trading partner on Y.
Clyde Schechter

Join Date: Apr 2014

Posts: 30116
#4

04 Dec 2017, 08:58

Well, there is a discrepancy between what you say you want in words and what your code does. To do what you said you want in words you can run:

Code:

clear *
set obs 100
gen day=_n
gen y=runiform(1,10)
gen x1=runiform(1,10)
gen x2=runiform(1,10)
gen xvar="x1" if mod(day,2)==0
replace xvar="x2" if mod(day,2)>0

levelsof xvar, local(xx)
gen real_x = .
foreach x of local xx {
    replace real_x = `x' if xvar == "`x'"
}
rangestat (reg) y real_x, interval(day -30 -1)

This creates a new variable, real_x, which is equal to the value of the variable designated by xvar and then runs the rolling regressions with a 30 day window using that.

The code you show in #1 is different: it regresses the entire data set against x1, and then the entire data set against x2, and then selects the regression coefficient corresponding to the regression using the variable designated by xvar. That is very different.

Only you know which one you really want.
charlie wong

Join Date: Jan 2016

Posts: 154
#5

04 Dec 2017, 09:02

Clyde Schechter, thanks for looking into this. What I really want is what the code in #1 does.
Nick Cox

Join Date: Mar 2014

Posts: 35715
#6

04 Dec 2017, 09:04

Perhaps like Clyde, I remain fuzzy on what you want and what you are failing to achieve with your real problem.

You can fire up different regressions with rangestat (SSC). You just need to spell out different variable names for the results.

Conversely, if the code in #1 does what you want, I really don't know what the question is.
charlie wong

Join Date: Jan 2016

Posts: 154
#7

04 Dec 2017, 09:21

Nick Cox, my question is whether there is a more efficient way to achieve what the code in #1 does. My understanding is that in rangestat, for the iteration for each observation, the data in memory is cleared and replaced with the set of observations in range for the current observation. I wonder if at this point, it is possible to instruct rangestat to look for the x of interest (the varname of which being specified in another variable). If this is possible, it seems to me more efficient than running rangestat multiple times and then combining the results to get what is wanted (as the code in #1 does).

Robert Picard

Join Date: Mar 2014
Posts: 1536
#8

04 Dec 2017, 09:41

The (reg) statistic can only be used once per rangestat call because of variable renaming issues. That means that it is limited to one regression per call. It seems that you want to be able to perform different regressions depending on a characteristic of the current observation. You could program your own Mata routine to do this with rangestat. Otherwise, you can do this with rangerun (also from SSC):

Code:

clear all
set seed 1234
set obs 100
gen day=_n
gen y=runiform(1,10)
gen x1=runiform(1,10)
gen x2=runiform(1,10)
gen nxvar=1 if mod(day,2)==0
replace nxvar=2 if mod(day,2)>0

program switch_reg
    local n = nxvar[_N]
    local last = _N - 1
    reg y x`n' in 1/`last'
    gen reg_nobs = e(N)
    gen reg_r2 = e(r2)
end
rangerun switch_reg, interval(day -30 0)

* spot check with obs 50 and 51
list in 50/51
reg y x1 if inrange(day, day[50]-30, day[50]-1)
dis e(r2)
reg y x2 if inrange(day, day[51]-30, day[51]-1)
dis e(r2)

and the spot check results:

Code:

. * spot check with obs 50 and 51
. list in 50/51

     +--------------------------------------------------------------------+
     | day          y         x1         x2   nxvar   reg_nobs     reg_r2 |
     |--------------------------------------------------------------------|
 50. |  50    8.99813   4.776833   5.852485       1         30   .0010042 |
 51. |  51   9.214911    2.01923   9.878716       2         30   .0001987 |
     +--------------------------------------------------------------------+

. reg y x1 if inrange(day, day[50]-30, day[50]-1)

      Source |       SS           df       MS      Number of obs   =        30
-------------+----------------------------------   F(1, 28)        =      0.03
       Model |  .206867688         1  .206867688   Prob > F        =    0.8680
    Residual |  205.794637        28  7.34980847   R-squared       =    0.0010
-------------+----------------------------------   Adj R-squared   =   -0.0347
       Total |  206.001505        29  7.10350017   Root MSE        =    2.7111

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   .0345631   .2060177     0.17   0.868     -.387445    .4565712
       _cons |   5.462059   1.224887     4.46   0.000     2.952991    7.971127
------------------------------------------------------------------------------

. dis e(r2)
.0010042

. reg y x2 if inrange(day, day[51]-30, day[51]-1)

      Source |       SS           df       MS      Number of obs   =        30
-------------+----------------------------------   F(1, 28)        =      0.01
       Model |    .0427904         1    .0427904   Prob > F        =    0.9411
    Residual |  215.327863        28  7.69028082   R-squared       =    0.0002
-------------+----------------------------------   Adj R-squared   =   -0.0355
       Total |  215.370653        29  7.42657425   Root MSE        =    2.7731

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x2 |   .0158605   .2126251     0.07   0.941    -.4196823    .4514032
       _cons |   5.633254   1.245702     4.52   0.000      3.08155    8.184958
------------------------------------------------------------------------------

. dis e(r2)
.00019868


Clyde Schechter

Join Date: Apr 2014

Posts: 30116
#9

04 Dec 2017, 09:47

Quoting #7: "I wonder if at this point, it is possible to instruct rangestat to look for the x of interest (the varname of which being specified in another variable)."

This does not make sense in light of your example. For a given range of values (day from -30 to -1), there is no such thing as "the" x of interest--different x's are designated for different observations within that range.
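A quick way to see this on the example data from #1 is sketched below. It is only a minimal illustration: it rebuilds just the day and xvar variables and tabulates xvar inside the window that belongs to observation 50.

```stata
clear
set obs 100
gen day = _n
gen xvar = "x1" if mod(day,2)==0
replace xvar = "x2" if mod(day,2)>0

* the 30-day window for observation 50 covers days 20 through 49
tabulate xvar if inrange(day, day[50]-30, day[50]-1)
```

Both x1 and x2 are designated by observations inside that one window, so the window by itself cannot pick a regressor; only the current observation's xvar can.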

Be that as it may, I don't see any way to get a single run of -rangestat- to use different variables for the regression within a single range of data.

You can avoid having to code each regression separately by putting them in a loop.

Code:

levelsof xvar, local(xx)
gen nobs = .
gen r2 = .
gen adj_r2 = .
gen b = .
gen se = .
foreach x of local xx {
    rangestat (reg) y `x', interval(day -30 -1)
    replace nobs = reg_nobs if xvar == "`x'"
    replace r2 = reg_r2 if xvar == "`x'"
    replace adj_r2 = reg_adj_r2 if xvar == "`x'"
    replace b = b_`x' if xvar == "`x'"
    replace se = se_`x' if xvar == "`x'"
    * drop this pass's results so the next rangestat call can create them again
    drop reg_* b_* se_*
}

That makes the coding more efficient, but doesn't materially alter the execution.

Added: Crossed with #8.

Last edited by Clyde Schechter; 04 Dec 2017, 09:49.
charlie wong

Join Date: Jan 2016

Posts: 154
#10

04 Dec 2017, 10:00

Robert Picard, thank you very much. This is exactly what I am looking for.

You mention this could also be programmed as a Mata routine (which unfortunately I am not familiar with). May I ask if it would be faster than the rangerun approach you showed?

Thanks again.
charlie wong

Join Date: Jan 2016

Posts: 154
#11

04 Dec 2017, 10:03

Clyde Schechter, thank you for looking into this!
Robert Picard

Join Date: Mar 2014

Posts: 1536
#12

04 Dec 2017, 10:10

It would definitely be faster to do it in a custom Mata function because the Mata regression code does not have to perform a lot of the overhead that comes with the regress command. But you have to balance the gain in execution speed against the time it would take you to learn Mata so that you can code your special case.
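For readers who have not seen one, here is a minimal sketch of what such a custom Mata function might look like. It assumes that rangestat (SSC) accepts a user-written Mata function that takes a real matrix (one column per variable listed, restricted to the observations in the interval) and returns a real rowvector. The function name myr2 is made up for illustration, and the names of the result variables should be checked with describe rather than taken from this sketch. It fits one fixed regressor only; choosing the regressor per observation would still need something extra, such as a loop over the x's.

```stata
clear all
set seed 1234
set obs 100
gen day = _n
gen y  = runiform(1,10)
gen x1 = runiform(1,10)

mata:
// hypothetical helper: number of obs and R-squared from regressing
// the first column of Xall on the remaining columns plus a constant
real rowvector myr2(real matrix Xall)
{
    real colvector y, b, e
    real matrix X
    real scalar tss

    y   = Xall[., 1]
    X   = Xall[., 2..cols(Xall)], J(rows(Xall), 1, 1)
    b   = invsym(quadcross(X, X)) * quadcross(X, y)
    e   = y - X*b
    tss = quadcross(y :- mean(y), y :- mean(y))
    return((rows(X), 1 - quadcross(e, e)/tss))
}
end

* assumed syntax for calling a user-written Mata function from rangestat
rangestat (myr2) y x1, interval(day -30 -1) casewise
describe myr2*
```

If this runs on your setup, the second result variable for observation 50 should agree with e(r2) from regress y x1 if inrange(day, day[50]-30, day[50]-1).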
charlie wong

Join Date: Jan 2016

Posts: 154
#13

04 Dec 2017, 10:13

Thanks for the reply, Robert!

Robert Picard

Join Date: Mar 2014
Posts: 1536

#14

04 Dec 2017, 20:03

It just dawned on me that you can speed up the whole task by looping over trading partners, as suggested by Clyde in #9, and using an invalid interval bound for all observations with different partners. You still have to call rangestat as many times as there are partners but it will only perform regressions for the observations with the current partner.

The following example has 100,000 observations (arranged as a panel) and 50 partners. I first do a dry run to see how long it takes to do a single call (100K regressions). Then I use a loop over each partner using a valid upper bound only for the current partner. Finally, I repeat by adapting Clyde's code in #9:

Code:

* demonstration dataset
clear all
set seed 1234
set obs 100
gen long id = _n
expand 1000
bysort id: gen day = mdy(1,1,1980) + _n
format %td day
gen y=runiform()
gen partner = runiformint(1,50)
forvalues i=1/50 {
    gen x`i' = runiform()
}

* a single call that performs all regressions
timer on 1
rangestat (reg) y x1 , interval(day -30 -1) by(id)
drop reg_* b_* se_*
timer off 1

* multiple calls, one for each partner
* use an invalid interval bound to limit regression for target partner
timer on 2
gen rs_nobs = .
gen rs_r2   = .
gen rs_b    = .

qui forvalues i=1/50 {
    gen high = cond(partner == `i', day-1,-999)
    rangestat (reg) y x`i' , interval(day -30 high) by(id)
    replace rs_nobs = reg_nobs if partner == `i'
    replace rs_r2   = reg_r2 if partner == `i'
    replace rs_b    = b_x`i' if partner == `i'
    drop reg_* b_* se_* high
}
timer off 2

* compare with repeating calls by adapting Clyde's code
levelsof partner, local(xx)
timer on 3
gen nobs = .
gen r2 = .
gen b = .
qui foreach x of local xx {
    rangestat (reg) y x`x', interval(day -30 -1) by(id)
    replace nobs = reg_nobs if partner == `x'
    replace r2   = reg_r2 if partner == `x'
    replace b    = b_x`x' if partner == `x'
    drop reg_* b_* se_*
}
timer off 3

timer list

* show that results match
assert nobs  == rs_nobs
assert rs_r2 == r2
assert rs_b  == b

The timing results on my computer are:

Code:

. timer list
   1:      2.04 /        1 =       2.0450
   2:     15.20 /        1 =      15.1960
   3:     98.89 /        1 =      98.8890

I should add a disclaimer that I only do data management and I offer solutions to the questions asked but I make no claims or representations regarding the statistical issues involved.


charlie wong

Join Date: Jan 2016

Posts: 154
#15

04 Dec 2017, 20:36

Robert Picard, thank you very much again. And your disclaimer is duly noted!

May I ask whether there is any particular reason you added the panel structure to the data, apart from making the data bigger?