Multiple linear regression

Freddy Panakkal

Join Date: Jul 2018

Posts: 20
#16

13 Jul 2018, 13:14

I tried to run the code, but it was running for hours so I had to stop it. In total there are around 600 regressions to be done. Can I do only one regression, so I can check if it is correct? How do I have to change the code?

Also, if six variables are created for each peer of each regression, wouldn't it be easier to save the regression table in another file instead of in the same dataset?
I estimate 15 peers per regression, so more than 50,000 new variables will be created.
Comment
Freddy Panakkal

Join Date: Jul 2018

Posts: 20
#17

13 Jul 2018, 13:34

Or is it possible to run the regressions without having the two datasets (list of peers and shareprice returns) merged together? That is, a program which takes all the peers from the first dataset and then takes the corresponding five years of shareprice returns from the second dataset?
Since a company can be a peer for several focal firms and for several fiscal years, the same peer (and its shareprice returns) will be in the dataset several times.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#18

13 Jul 2018, 13:36

I tried to run the code, but it was running for hours so I had to stop it. In total there are around 600 regressions to be done. Can I do only one regression, so I can check if it is correct? How do I have to change the code?

If you add the -status- option to the -runby- command you will get a periodic progress report as Stata proceeds through the iterations, so you can get some assurance that progress is actually being made. To test it with just one regression, just drop from the data set all of the observations except those corresponding to a single cik Startfiscalyear combination.
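A minimal sketch of that single-regression test, assuming the regressions are launched with the -rangerun- call quoted later in this thread. The helper variable grp is hypothetical, and -preserve-/-restore- is used so the full data set is not lost.

```stata
* Sketch only: test the program on a single cik-Startfiscalyear group
preserve
egen long grp = group(cik Startfiscalyear)   // grp is a hypothetical helper variable
keep if grp == 1                             // keep only the first group
drop grp
rangerun one_regression, by(cik Startfiscalyear) interval(mdate earliest latest)
describe                                     // inspect the variables the program created
restore                                      // return to the full data set
```

Anything you want to keep from the test run has to be saved or inspected before the -restore- line, since -restore- discards the results.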

Also, if six variables are created for each peer of each regression, wouldn't it be easier to save the regression table in another file instead of in the same dataset?
I estimate 15 peers per regression, so more than 50,000 new variables will be created.

No, it wouldn't be easier; it would be a bit more work. If you want a more compact data set, then before running the regression code, you can drop all the variables that aren't actually needed. That way you will end up with the same thing you would have if you put the results in a different data set.
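As a rough sketch of that trimming step, using only variable names that appear in this thread (your data set may need others kept, such as any variable your later code refers to):

```stata
* Sketch only: keep the identifiers, the date window, and the return series
keep cik peerCIK Startfiscalyear mdate earliest latest spr_*
compress    // shrink storage types where possible
```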

But now I have to ask you what on earth you are going to do with these 50,000 variables. I can't even imagine it.
Comment
Freddy Panakkal

Join Date: Jul 2018

Posts: 20
#19

13 Jul 2018, 14:18

But now I have to ask you what on earth you are going to do with these 50,000 variables. I can't even imagine it.

For each regression I have to calculate a measure for which I need the beta coefficients of each peer.

Another question: is it possible to have the peers numbered? So instead of variables like b_0000040704 there would be variables b_peer1, b_peer2...
That way the number of variables would be dramatically reduced.
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30100

#20

13 Jul 2018, 14:40

Another question: is it possible to have the peers numbered? So instead of variables like b_0000040704 there would be variables b_peer1, b_peer2...
That way the number of variables would be dramatically reduced.

Sure. Just change program one_regression as shown below.

Code:

capture program drop one_regression
program define one_regression
    keep cik peerCIK mdate spr_*
    reshape wide spr_peer, i(cik mdate) j(peerCIK)
    regress spr_focal spr_peer*
    matrix M = r(table)
    local peer_ciks: colvarlist M
    local peer_ciks: subinstr local peer_ciks "spr_peer" "", all
    local i = 1
    foreach p of local peer_ciks {
        if substr("`p'", 1, 2) != "o." {
            gen b_peer`i' = M[1, `i']
            gen se_peer`i' = M[2, `i']
            gen t_peer`i' = M[3, `i']
            gen p_peer`i' = M[4, `i']
            gen lb_peer`i' = M[5, `i']
            gen ub_peer`i' = M[6, `i']
        }
        local ++i
    }
    gen rsq = e(r2)
    gen n_obs = e(N)
    exit
end

Comment

Freddy Panakkal

Join Date: Jul 2018

Posts: 20
#21

13 Jul 2018, 15:18

If you add the -status- option to the -runby- command you will get a periodic progress report as Stata proceeds through the iterations, so you can get some assurance that progress is actually being made. To test it with just one regression, just drop from the data set all of the observations except those corresponding to a single cik Startfiscalyear combination.

Sorry for the question, but what do I have to write to add status to the rangerun command?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#22

13 Jul 2018, 15:41

Just insert the word status at the end of that line (separated from what precedes it by a blank space).

Some unsolicited advice. Based on this question, I infer that you have not yet learned the basics of Stata's syntax. The kind of project you are undertaking here is, frankly, too advanced for a beginner. I'm happy to help you through working out this code. But once you have this behind you, before you undertake any new projects (or, if this project does not have an urgent deadline, take a break and do this now) you should read the Getting Started [GS] and User's Guide [U] volumes of the PDF documentation that comes with your Stata installation. It covers the fundamental commands that are used all the time in data management and analysis, with clear explanations and worked examples. It also will give you a sense of the overall approach to data management and analysis that Stata embodies. (You need to think about things differently from how you do in other statistical packages or spreadsheets.) It is, admittedly, a lot of reading. And you won't remember all the details. But it will enable you to approach most situations with the ability to identify the commands that are likely to be needed. Then, whatever details you don't remember, you can look up in the help files or the PDF documentation. I assure you this investment of your time will be amply repaid in short order.
Comment
Freddy Panakkal

Join Date: Jul 2018

Posts: 20
#23

13 Jul 2018, 15:45

Hmm, I tried:

rangerun one_regression, by(cik Startfiscalyear) interval(mdate earliest latest) status

and the following error appeared:

option status not allowed
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#24

13 Jul 2018, 17:16

Sorry, my error. -rangerun- does not have a status option. I was thinking of -runby-, which is not in play here.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#25

13 Jul 2018, 17:33

I should add that I think something is probably going wrong if you are even concerned about the amount of time it's taking. You indicated that you have about 600 of these regressions to run. Now, there is a little bit of overhead associated with -rangerun- itself, but the bulk of the time should be due to -regress- and -reshape-. When I make up a toy data set with about 900 observations (15 peers with 60 months of data each: your typical regression) and do a regression and a reshape, and iterate that process 600 times, it takes about 3 minutes. Obviously different machines will take different amounts of time, but it still should be something that runs in a very tolerable amount of time.

Did you incorporate in your code the lines

Code:

// CREATE AN EMPTY MDATE RANGE FOR ALL BUT ONE OBSERVATION PER CIK-Startfiscalyear
// TO PREVENT UNNECESSARY REPETITIONS
by cik Startfiscalyear (earliest), sort: replace earliest = mdate+1 if _n > 1
by cik Startfiscalyear (latest), sort: replace latest = mdate-1 if _n > 1

If you left those out, then you will be doing the regression for every single observation in the data set, rather than just once for each cik Startfiscalyear combination. That could blow up the run time by a couple of orders of magnitude!
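For anyone who wants to reproduce the kind of timing check described above, here is a rough sketch. It is not the code actually used for the 3-minute figure; the simulated returns, the seed, and the timer number are all assumptions made for illustration.

```stata
* Sketch only: toy data with 15 peers x 60 months, one focal firm
clear
set seed 2018
set obs 900
gen long cik = 1
gen long peerCIK = ceil(_n/60)                  // 15 hypothetical peers
bysort peerCIK: gen int mdate = _n              // 60 months per peer
gen double spr_peer = rnormal()                 // simulated peer returns
gen double spr_focal = rnormal()                // simulated focal returns
* the focal return must be constant within month for -reshape wide-
bysort mdate (peerCIK): replace spr_focal = spr_focal[1]

* Time 600 repetitions of one reshape plus one regression
timer clear 1
timer on 1
forvalues k = 1/600 {
    preserve
    quietly reshape wide spr_peer, i(cik mdate) j(peerCIK)
    quietly regress spr_focal spr_peer*
    restore
}
timer off 1
timer list 1
```

The elapsed time reported by -timer list- will vary by machine, so treat it only as an order-of-magnitude check.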
Comment
Freddy Panakkal

Join Date: Jul 2018

Posts: 20
#26

16 Jul 2018, 03:10

Yes I did run those lines.
I think the total number of observations is too big (almost 1.5 million observations). So I divided the data set into three parts (one for each fiscal year) and could run all the regressions in about 3 hours.

Thank you very much for your help!
Comment
Freddy Panakkal

Join Date: Jul 2018

Posts: 20
#27

16 Jul 2018, 13:01

Regarding the regressions with a low number of observations:
Do you know if it is possible to force Stata to run the regressions with a minimum number of observations (e.g. at least 30) by dropping the independent variable with the most missing data until that number is reached? Or do I have to do this manually?

I am currently thinking about how to deal with the missing data, since more than 1/3 of the regressions have too few observations.
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30100

#28

16 Jul 2018, 14:09

Well, with 1.5 million observations, doing this by hand sounds like a task you are unlikely to complete in your lifetime. Even if you do, I would give long odds that you will make mistakes along the way. So I wouldn't even consider doing this by hand.

Yes, this can be programmed. But it's a bit complicated to program and it will really slow the execution of the model appreciably. Are you really sure you want to do it this way? I'm not convinced that it's even a valid approach to use, although not knowing exactly how you plan to use the regression coefficients I can't draw sharp conclusions either way.

Nevertheless, try this revised -program one_regression-. It is partially tested. If it doesn't work in your data, post back with an example of data that is ready to go into -rangerun- and contains both cases where 30 usable observations are present and some where they are not.

Code:

capture program drop one_regression
program define one_regression
    keep cik peerCIK mdate spr_*
    reshape wide spr_peer, i(cik mdate) j(peerCIK)

    //    ELIMINATE VARIABLES WITH THE MOST MISSING DATA
    //    UNTIL WE GET AN ESTIMATION SAMPLE OF AT LEAST 30
    ds spr_peer*
    local dvs `r(varlist)'
    local mcounts
    foreach v of local dvs {
        count if missing(`v')
        local mcounts `mcounts' `:display %010.0f r(N)'#`v'
    }
    local mcounts: list sort mcounts
    local mcounts: subinstr local mcounts "#" " ", all
    tokenize `mcounts'
    local dvs_c: subinstr local dvs " " ", "
    count if !missing(spr_focal, `dvs_c')
    while `r(N)' < 30 & "`2'" != "" {
        local dvs: subinstr local dvs " `2'" ""
        local dvs: subinstr local dvs "`2'" ""
        macro shift 2
        local dvs_c: subinstr local dvs " " ", "
        count if !missing(spr_focal, `dvs_c')
    }

    //    NOW DO THE REGRESSION (WHICH MAY BE JUST ON A CONSTANT)
    regress spr_focal `dvs'
    matrix M = r(table)
    local peer_ciks: colvarlist M
    local peer_ciks: subinstr local peer_ciks "spr_peer" "", all
    local i = 1
    foreach p of local peer_ciks {
        if substr("`p'", 1, 2) != "o." {
            gen b_peer`i' = M[1, `i']
            gen se_peer`i' = M[2, `i']
            gen t_peer`i' = M[3, `i']
            gen p_peer`i' = M[4, `i']
            gen lb_peer`i' = M[5, `i']
            gen ub_peer`i' = M[6, `i']
        }
        local ++i
    }
    gen rsq = e(r2)
    gen n_obs = e(N)
    display _newline(5)
    exit
end

Comment

Freddy Panakkal

Join Date: Jul 2018

Posts: 20
#29

16 Jul 2018, 15:32

Unfortunately the code doesn't seem to work. Nothing happens when I run the program.

Attached are two examples, one with more than 30 valid observations and one with fewer than 30.

Attached Files

dataex.do (106.7 KB, 1 view)
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30100

#30

16 Jul 2018, 16:11

This works with your example:

Code:

capture program drop one_regression
program define one_regression
    keep cik peerCIK mdate spr_*
    reshape wide spr_peer, i(cik mdate) j(peerCIK)
    //    ELIMINATE VARIABLES WITH MISSING DATA
    ds spr_peer*
    local dvs `r(varlist)'
    local mcounts
    foreach v of local dvs {
        count if missing(`v')
        local mcounts `mcounts' `:display %010.0f r(N)'#`v'
    }
    local mcounts: list sort mcounts
    local mcounts: subinstr local mcounts "#" " ", all
    tokenize `mcounts'
    local dvs_c: subinstr local dvs " " ", ", all
    count if !missing(spr_focal, `dvs_c')
    while `r(N)' < 30 & "`2'" != "" {
        local dvs: subinstr local dvs "`2'" ""
        local dvs = trim(itrim(`"`dvs'"'))
        macro shift 2
        local dvs_c: subinstr local dvs " " ", ", all
        if `"`dvs_c'"' == "" {
            continue, break
        }
        count if !missing(spr_focal, `dvs_c')
    }
    regress spr_focal `dvs'
    matrix M = r(table)
    local peer_ciks: colvarlist M
    local peer_ciks: subinstr local peer_ciks "spr_peer" "", all
    local i = 1
    foreach p of local peer_ciks {
        if substr("`p'", 1, 2) != "o." {
            gen b_peer`i' = M[1, `i']
            gen se_peer`i' = M[2, `i']
            gen t_peer`i' = M[3, `i']
            gen p_peer`i' = M[4, `i']
            gen lb_peer`i' = M[5, `i']
            gen ub_peer`i' = M[6, `i']
        }
        local ++i
    }
    gen rsq = e(r2)
    gen n_obs = e(N)
    exit
end

//    CONVERT CIK AND PEER CIK TO NUMERIC VARIABLES
destring cik peerCIK, replace
//    CREATE AN EMPTY MDATE RANGE FOR ALL BUT ONE OBSERVATION PER CIK-Startfiscalyear
//    TO PREVENT UNNECESSARY REPETITIONS
replace earliest = mdate+1 if !flag
replace latest = mdate-1 if !flag

rangerun one_regression, by(cik Startfiscalyear) interval(mdate earliest latest)

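Added or changed code was in italics in the original post. Compared with #28, the changes include the -all- option on the -subinstr- commands that build dvs_c and the trim(itrim()) clean-up of dvs. The problem was primarily in my manipulation of the commas and spaces in the list of independent variables. What I had before worked with only two peers, but failed with larger numbers. This should work generally.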
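To see why the earlier version worked with two peers but not with more, here is a small stand-alone illustration. The names a, b, c are placeholders rather than real variables, and the macro names one and every are made up for this example.

```stata
* Without -all-, subinstr replaces only the first occurrence
local dvs a b c
local one: subinstr local dvs " " ", "
local every: subinstr local dvs " " ", ", all
display "`one'"      // a, b c
display "`every'"    // a, b, c
```

Only the second form yields a valid comma-separated argument list for missing() when there are three or more variables.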
