
  • Loop regression

    Hi, I’m really new to Stata, so I don’t know anything. Forgive me in advance for giving unclear information or asking stupid questions.
    I finally was able to create a good dataset (I think). For 194 months I have the Monthly return, the Size and other variables of 557 equity funds. Now I want to run a regression for every month, so that I get 194 coefficients for every variable, and then take the mean of each coefficient.
    I posted a photo of my data. (I used a reshape long command on my Excel files to get it like this; it was originally in wide form.) Fonds means Fund, R stands for the Return of a fund, logS is the logarithm of the Size of the fund, logF is the logarithm of the Size of the Family fund, A is the age of the fund, E is the monthly expense ratio and H is the number of holdings the fund has in a certain month.
    A friend of mine who is a bit familiar with Stata said I should make a do-file with this code:

    tempfile Regression1
    local j=1
    while `j' <= 194 {
        quietly {
            preserve
            noisily di in green "Jan" _continue
            reg R logS logF A E H
            matrix M = e(M)
            svmat M, name(Coeff_)
            collapse (mean) Coeff*

            if `j' > 1 {
                append using `Regression1'
            }
            save `Regression1', replace
            restore
            local j = `j'+1
        }
    }

    But if I do this, all the regressions come out identical. Can anyone help here?






  • #2
    Well, the loop is doing the same regression over and over again because there is no command inside the loop that makes any reference to the looping parameter `j'. So each iteration of the loop does exactly the same thing: a regression on the full data set.

    If I understand your goal correctly, you want to do 194 regressions, one for each month: each regression will include all (or as many as possible) of the 557 funds.

    There are several other things in the code that can be improved. Although the -while- construction still works, -forvalues- would be simpler. The code refers to a matrix e(M), but -regress- does not in fact create any matrix by that name. Rather than constantly -preserve-ing and -restore-ing a large data set, it is better to put the results in a -postfile- and not change the data in memory over and over again. And finally, there actually is no need for a loop at all, because there is a command, -statsby-, that automates this entire process.

    Code:
    tempfile results
    statsby _b _se, saving(`results') by(month): regress R logS logF A E H
    
    use `results', clear
    // NOW DO WHATEVER YOU WANT WITH THE RESULTS
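
    For reference, here is a minimal sketch of the -forvalues-/-postfile- approach mentioned above, assuming your month identifier is a variable named month running from 1 to 194 (adjust as needed). It is only meant to illustrate the idea; -statsby- already does all of this for you.
    Code:
    capture postclose handle
    tempfile bymonth
    postfile handle month b_logS b_logF b_A b_E b_H using `bymonth'
    forvalues m = 1/194 {
        capture quietly regress R logS logF A E H if month == `m'
        if !_rc {    // skip months where the regression cannot be estimated
            post handle (`m') (_b[logS]) (_b[logF]) (_b[A]) (_b[E]) (_b[H])
        }
    }
    postclose handle
    use `bymonth', clear
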
    Your screen shot is hard to read, especially off towards the right. This is typical of screen shots (in fact many of them are entirely unreadable), which is why they are strongly discouraged on this Forum. For the best ways to post usable data samples (-dataex-) and code/results (code blocks), see FAQ #12, esp. paragraph 7 for advice. Nobody minds beginner questions: we were all beginners at one time, and people here want to be helpful. But do read the FAQ so you know how to best format your posts to help those who would help you.

    One other thought: at least in the sample of your data that is shown in the screen shot, variable E contains only missing values. Any observation with missing values on any of the variables in the regression command will be excluded from analysis automatically. So when you get this running, you may find that your output is an empty data file. Do verify that you have usable data on all your variables before you proceed. Consider:

    Code:
    // COUNT OBSERVATIONS THAT WILL BE EXCLUDED
    // DUE TO MISSING DATA
    egen int mcount = rowmiss(R logS logF A E H)
    count if mcount > 0
    If the number of such observations is more than a handful you might want to explore them to find out why and see what you can do about it.
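
    If it would help to see which variables account for the missing values, one possibility (a suggestion, not something you must do) is the official -misstable- command:
    Code:
    misstable summarize R logS logF A E H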





    Comment


    • #3
      Thanks for the corrections and new information! Really appreciate it! I will also keep the tips about using this forum in mind.

      I do have some additional questions now.
      First of all, the statsby code worked, so I can see the coefficients of my variables for all 194 regressions. But in the first 60 months I don't get coefficients, probably because of missing data. After that, some coefficients in certain months are missing as well. I do indeed have a lot of missing data: when I perform the 'egen int mcount' command you proposed, it says 100,077. Does this mean that out of the 108,058 rows in my dataset, 100,077 are not complete? That could well be, because I do have a lot of missing data in the first months and also a lot of missing data for the variables holdings and expenseratio (for all months). Is it better to lose the first months of my dataset and to lose the control variables holdings and expenseratio then?
      Also, how can I calculate the mean of the coefficients for every variable and its t-stats and confidence intervals?

      Thanks in advance!

      Comment


      • #4
        Does this mean that out of the 108,058 rows in my dataset, 100,077 are not complete?
        Yes.

        Is it better to lose the first months of my dataset and to lose the control variables holdings and expenseratio then?
        Better than what? Better in what sense? It depends on what alternatives you have and what your goals are.

        Also, how can I calculate the mean of the coefficients for every variable and its t-stats and confidence intervals?
        I don't know what you mean here. If you mean you want to average the coefficient of logS with that of logF and those of A, E, and H in each of your regressions, it would be:
        Code:
        egen avg_coefficient = rowmean(_b_logS _b_logF _b_A _b_E _b_H)
        If you mean that for each of those variables, you want the mean coefficient across all regressions:
        Code:
        summ _b_logS _b_logF _b_A _b_E _b_H
        Both of those would be carried out in the data set created by -statsby-. I don't know which you want.

        If you want, in each regression, to calculate the t-statistics and confidence intervals for each coefficient:
        Code:
        local z95 = -invnorm(0.025)
        foreach x in logS logF A E H {
            gen _t_`x' = _b_`x'/_se_`x'
            gen lb_`x' = _b_`x' - `z95'*_se_`x'
            gen ub_`x' = _b_`x' + `z95'*_se_`x'
        }
        Added note: The _b/_se ratio is a t-statistic. For the confidence intervals I used a multiplier based on the normal distribution rather than the t. This was done pragmatically because, the way the -statsby- code was set up, it did not capture the degrees of freedom. But based on the problem you are solving, the number of degrees of freedom is very large, and beyond 30 degrees of freedom the difference between the normal and the t distribution is too small to be of any practical importance.
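
        If you did want a t-based multiplier, a sketch along these lines would work, assuming you had also captured the residual degrees of freedom from each regression in a variable (called df here purely for illustration; the -statsby- call above does not create it):
        Code:
        // df is hypothetical: residual degrees of freedom saved from each regression
        gen t95 = invttail(df, 0.025)
        gen lb_t_logS = _b_logS - t95*_se_logS
        gen ub_t_logS = _b_logS + t95*_se_logS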

        Comment


        • #5
          Again, many thanks for the quick and clear response; this is extremely helpful for me! I did indeed want the mean coefficients across all regressions, and I also did this for the intercept by adding _b_cons to the summ code, which worked. I also got t-statistics and CIs for every single regression, but what I would like to obtain is the t-statistic for the mean coefficients and the intercept across all regressions. Do you know how I can get this?

          Comment


          • #6
            So, instead of -summ _b_logS _b_logF _b_A _b_E _b_H-, do -ci means _b_logS _b_logF _b_A _b_E _b_H- and you will get means, standard errors and CIs for each coefficient across all regressions.
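
            If you also want an explicit t-statistic for each mean coefficient (the mean divided by its standard error), a small sketch along these lines should work in the -statsby- results data set, assuming the coefficient variables are named as in the earlier posts:
            Code:
            foreach x in logS logF A E H cons {
                quietly summarize _b_`x'
                display "`x':  mean = " r(mean) "   t = " r(mean)/(r(sd)/sqrt(r(N)))
            }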

            Comment


            • #7
              I have a question very related to this one with some distinct differences. I am also, unfortunately, new to Stata. To my own aggravation, I do not understand half of the programming within Stata while I usually understand most other programming languages.

              My dataset is not presented in a "long" format to avoid an unnecessary repetition of the same fund.
              My dataset contains a unique identifier for each fund, followed by several columns defining fund characteristics, followed by the weekly returns of each fund from 2000-2017.
              My dataset contains both dead and surviving funds, meaning a fund's first weekly return can start in 2003 but also end in 2008.

              I have been able to butcher my way through Excel to conduct a CAPM analysis for the whole period of 2000-2017 for all my available funds. Using Excel allowed me to circumvent the issue of missing values being omitted from the regressions.

              My issue arises when I wish to conduct a Fama-French (Carhart) 4 factor-model analysis:

              I would prefer to conduct all possible programming in Stata, as the use of the software at some point is mandatory for the paper I am writing. This being said, I need to be able to regress all funds that have some values within the period of 2000-2017.

              My dataset contains data for each of these factors on a weekly basis from 2000-2017.

              The issue I have is twofold:

              1. I wish to conduct this regression for each of my funds, requiring a loop function of some sort (most likely the 'statsby' function). However, I wish to understand exactly what the statsby function does and how it operates, which is, to my great frustration, absolute gibberish to me when I try to make sense of the manual regarding the function.

              In the code provided above:
              Originally posted by Clyde Schechter View Post
              Code:
              tempfile results
              statsby _b _se, saving(`results') by(month): regress R logS logF A E H
              
              use `results', clear
              // NOW DO WHATEVER YOU WANT WITH THE RESULTS
              I assume that exp_list means the list of statistics we wish to gain as output. In that case I assume _b refers to the beta and that _se is the standard error. The use of the tempfile `results' will create a separate .dta file on my computer, hopefully? The use of that file at the end of the code will load the dataset into Stata and therefore remove my other dataset or add it as a second dataset? I realise the simplicity of the question, but I have never worked with multiple datasets in Stata, so I am trying to understand the function in its entirety at once.

              I would also prefer to append these results at the end of my existing dataset, as I will use the factors gained from this initial regression to conduct further explanatory regressions based on fund characteristics. I believe this is easily done with the append function, but this gives me a long list of columns with only 4 values per column. I would actually need the output to be saved in 4 columns, one for each of the factors, and have the output of each regression added to those 4 columns. That way I would instantly have the output of each regression on the same line as each fund.

              The by(month) option: does this refer to the way the results will be saved, or to how the regression will be conducted?

              2. My second issue is the deletion of funds from a regression when there is a missing value. As not all funds have data for the whole period, I need to be able to create a loop that adjusts its input range, for both the dependent and independent variables, to the range of non-empty cells of each fund. The only upside here is that my data has just one starting point and one end point per fund. I have no gaps in my data. Once a fund exists, I have data for that fund until it dies, if it dies.

              I think I have listed all my issues and truly hope this is a clear explanation in a way that can easily be remedied. If any type of sample data is required to answer my question, I can easily provide it.
              Last edited by Ewout Vermeersch; 12 Apr 2018, 11:05.

              Comment


              • #8
                I am also, unfortunately, new to Stata. To my own aggravation, I do not understand half of the programming within Stata while I usually understand most other programming languages.
                Stata is rather different from other statistical programming systems such as SAS or SPSS. If you are used to working in those, it will take some readjustment as you not only have to learn new syntax, but you need to think about certain types of data management differently. If you are referring more broadly to general purpose programming languages, I think Stata is fairly similar to C--which is why I liked Stata so much when I first started using it.

                Be that as it may. I don't understand the economic jargon in the middle of your post. This is a multi-disciplinary forum, and it is generally best to avoid using terminology or abbreviations that would not be understood by every educated person, or, if they must be used, to briefly explain them. In any case, I have a sense that none of that is key to understanding and responding to your questions. (Though if my answers are off base, this may be why!)

                I assume that exp_list means the list of statistics we wish to gain as output. In that case I assume _b refers to the beta and that _se is the standard error. The use of the tempfile `results' will create a separate .dta file on my computer, hopefully? The use of that file at the end of the code will load the dataset into Stata and therefore remove my other dataset or add it as a second dataset?
                All correct.

                I would also prefer to append these results at the end of my existing dataset as I will use the factors gained from this initial regression to conduct further explanatory regression based on fund characteristics.
                This sounds odd. The variables in these results will be coefficients and standard errors; variables which do not occur in your starting data set. Appending data sets with no overlapping variables other than identifiers is usually not useful. Perhaps you mean you would like to -merge- them? That would put the regression coefficients for a given fund and date in the same observations as the original data on that fund for that date. This is much more common and has many subsequent uses.
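
                As a minimal sketch of that kind of -merge- (the file and key names here are hypothetical; use whatever uniquely identifies a fund in your results):
                Code:
                use mainfunds, clear               // hypothetical name for your original data set
                merge m:1 fund_id using results    // hypothetical: results.dta holding one observation per fund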

                2. My second issue is the deletion of missing value funds from a regression. As not all funds have data for the whole period, I need to be able to create a loop that adjusts its input range of both the dependent and independent variables to the range of non-empty cells of each fund.
                I think it would be best if you show an example of your data. I can imagine several ways your data could be organized that are compatible with your description in #1, and the solutions would differ. The most effective way to show a data example is to use the -dataex- command. If you are running version 15.1 or a fully updated version 14.2, it is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

                In showing an example, please be sure your example includes several different funds and several different time periods, and that it also illustrates the missing data problems you are concerned about. It is also best if the data set contains enough observations for each regression that a quick test run can be done.

                Comment


                • #9
                  Thank you for the very quick and thorough answer. I will try to list all additional information clearly and concisely.

                  I am regrettably also unfamiliar with C-- but I have worked in R before and have some very basic knowledge of Matlab. Those applications seem more straightforward to me.

                  I do apologise for adding the economic jargon to the formula. A basic y = c + b1x1 + b2x2 + b3x3 + b4x4 regression serves the same purpose.
                  With y = dependent variable
                  c = constant
                  x1 = independent variable 1
                  b1 = coefficient of x1
                  x2 = independent variable 2
                  b2 = coefficient of x2
                  x3 = independent variable 3
                  b3 = coefficient of x3
                  x4 = independent variable 4
                  b4 = coefficient of x4



                  Originally posted by Clyde Schechter View Post
                  This sounds odd. The variables in these results will be coefficients and standard errors; variables which do not occur in your starting data set. Appending data sets with no overlapping variables other than identifiers is usually not useful. Perhaps you mean you would like to -merge- them? That would put the regression coefficients for a given fund and date in the same observations as the original data on that fund for that date. This is much more common and has many subsequent uses.
                  From looking into the meaning of both of those functions, I understand that I will indeed need to -merge- the two datasets: my initial dataset (which I will call equityfunds to avoid confusion) and the `results' dataset. From running the regressions, the `results' dataset will consist of a 482x9 matrix (from what I have understood) that I want to -merge- onto my `equityfunds' dataset. This will add 9 extra variables (constant + coefficient (coef) of x1 + standard error (se) of x1 + coef of x2 + se of x2 + ... + se of x4) to the end of my dataset, I think.

                  Originally posted by Clyde Schechter View Post
                  I think it would be best if you show an example of your data.
                  I have used -dataex- and cut a few results out, as I only found out afterwards that the standard output is 100 observations. I feel posting 100 observations here would be far too much spam. However, if any extra data or more observations are required, feel free to ask.
                  I chose to include weeks 44-48 specifically to showcase the "birth" of several funds. All missing values presented here are funds that are born at a later date (e.g. week 100, 130 or higher). Nevertheless, I would still want to run a regression for each fund based on the fund returns I do have for that fund. The start date, life and death of a fund vary a lot (most funds do not die during my analysis period). This is also my main concern for running the regressions, as I do not know how to program a dynamic input data range that would adjust to the range of values that I have for a specific fund.

                  Also, at the bottom of my dataset `equityfunds' I have the 4 factors needed for my regression. In my dataset, this means my 483rd fund is my independent variable x1, the 484th fund is x2, the 485th fund is x3 and the 486th fund is x4. Each independent variable has data for each week over the whole period of 2000-2017, without gaps.

                  Code:
                  * Example generated by -dataex-. To install: ssc install dataex
                  clear
                  input double(Week44 Week45 Week46 Week47 Week48)
                                     .                    .                    .                    .                    .
                   -2.7174045524454957   1.3425954475545043  -1.9274045524454955  -2.2674045524454955   2.0125954475545043
                                     .                    .                    .                    .                    .
                   -1.8974045524454957   2.0525954475545043  -1.0074045524454955  -3.3574045524454954   1.9125954475545044
                                     .                    .                    .                    .                    .
                                     .                    .                    .                    .                    .
                                     .                    .                    .                    .                    .
                   -2.0774045524454956    .9925954475545045  -3.5674045524454954  -1.3774045524454956   2.4625954475545044
                                     .                    .                    .                    .                    .
                    1.9425954475545044   -.7074045524454956 -.037404552445495565   -4.527404552445496   1.2025954475545044
                   -1.5374045524454956   1.5625954475545045  -2.1674045524454955  -2.1374045524454957   2.5725954475545043
                   -1.3874045524454957   2.5725954475545043  -3.7074045524454955  -2.9174045524454955   3.4125954475545046
                   -2.3474045524454956   1.6025954475545043  -2.9374045524454955  -3.6174045524454956   3.0525954475545043
                   -2.0374045524454956    .9125954475545044  -3.4174045524454955  -1.5474045524454956   2.3425954475545043
                                     .                    .                    .                    .                    .
                    -.9174045524454956   3.6825954475545046   -4.037404552445496   -5.847404552445495   2.2125954475545044
                                     .                    .                    .                    .                    .
                   -3.7074045524454955    .8625954475545045  -2.7774045524454958   -.5774045524454956   2.4925954475545042
                    -.9874045524454955   3.6825954475545046   -4.037404552445496   -5.847404552445495   2.2125954475545044
                                     .                    .                    .                    .                    .
                                     .                    .                    .                    .                    .
                                     .                    .                    .                    .                    .
                                     .                    .                    .                    .                    .
                                     .                    .  -2.0574045524454956  -1.1774045524454955  -.13740455244549557
                                     .                    .                    .                    .                    .
                   -1.8274045524454956    .7625954475545045  -3.8574045524454954  -1.1674045524454955   2.2725954475545045
                                     .                    .                    .                    .                    .
                  -.037404552445495565 -.037404552445495565 -.037404552445495565 -.037404552445495565  -2.6574045524454957
                   -2.3074045524454956    .4525954475545044  -.07740455244549557  -3.0474045524454954    4.132595447554504
                                     .                    .                    .                    .                    .
                                     .                    .                    .                    .                    .
                    -7.207404552445496   4.2725954475545045  -3.0874045524454954   -6.237404552445495   3.8725954475545046
                                     .                    .                    .                    .                    .
                                     .                    .                    .                    .                    .
                   -3.8074045524454956   1.4525954475545044   2.3425954475545043   -8.847404552445497   3.2525954475545045
                   -1.5874045524454956    .7525954475545045  -3.5474045524454954  -1.2474045524454955   2.0025954475545045
                                     .                    .                    .                    .                    .
                                     .                    .                    .                    .                    .
                                     .                    .                    .                    .                    .
                                     .                    .                    .                    .                    .
                                     .                    .                    .                    .   2.0025954475545045
                  end

                  Originally posted by Clyde Schechter View Post
                  In showing an example, please be sure your example includes several different funds and several different time periods, and that it also illustrates the missing data problems you are concerned about. It is also best if the data set contains enough observations for each regression that a quick test run can be done.
                  I have included below a second example of data for weeks 900-905 to showcase the "death" of a fund. I am unsure how this second data example is sorted, as the fund that dies in this example (last return of the fund on the 7th line = 3.64569955738129) is our 427th fund when ranked alphabetically. I hope this is not relevant, but I figured I should mention as much information as possible. This means the funds listed in the first data example are most likely not the same funds that are listed in this second example. I know that the fund that dies does not occur in the first data example.

                  Code:
                  * Example generated by -dataex-. To install: ssc install dataex
                  clear
                  input double(Week900 Week901 Week902 Week903 Week904 Week905)
                  3.9556995573812896   2.9956995573812897   -.2843004426187101   -.4043004426187101   -.2143004426187101    .6656995573812899
                    4.03569955738129     1.73569955738129    .4856995573812899   -.6143004426187101    .7356995573812899    .9156995573812899
                  3.2956995573812895   3.0456995573812895   -.1543004426187101   -.8843004426187101   .09569955738128991      .55569955738129
                  3.4556995573812896   2.8856995573812902   -.5043004426187101   -.6443004426187101 -.004300442618710099   .27569955738128993
                    3.72569955738129     2.76569955738129   -.3343004426187101   -.1643004426187101    .4656995573812899    .6956995573812899
                    3.59569955738129   3.3256995573812898   -.4843004426187101     -.69430044261871   .09569955738128991    .4956995573812899
                    3.89569955738129     3.64569955738129                    .                    .                    .                    .
                    5.06569955738129     3.11569955738129    .0756995573812899   -.7743004426187101    .3156995573812899    .7256995573812899
                                   .                    .                    .                    .                    .                    .
                  2.8756995573812896     1.60569955738129    .1756995573812899  -1.3143004426187102    .5156995573812899   -.8643004426187101
                  3.1656995573812896   2.7156995573812903   -.4043004426187101  -1.0643004426187102   -.1243004426187101    .0756995573812899
                  3.5756995573812898   2.8256995573812898   -.1843004426187101     -.69430044261871    .3756995573812899     1.05569955738129
                    2.14569955738129     3.27569955738129   -.8743004426187101   -.1343004426187101    .6856995573812899   1.4556995573812899
                    3.11569955738129   3.9156995573812896     1.01569955738129   -.8743004426187101   1.2156995573812899   1.9056995573812898
                    3.98569955738129     3.39569955738129   -.0743004426187101   -.3443004426187101    .2656995573812899    .1856995573812899
                  Last edited by Ewout Vermeersch; 13 Apr 2018, 06:22.

                  Comment


                  • #10
                    I'm not at all sure I understand what you are trying to do. In particular, I don't grasp the relationship between the regression equations you show in #7 and the variables in your example data. How do the variables you show in the example relate to the various terms in the regression equation?

                    It appears to me as if your regression equation involves variables that you are not showing, and that the t subscript refers to time. I also imagine that the t-subscript in the equation corresponds to the numbers following Week in the names of the variables you show. If that is the case, your data is not properly organized to do this kind of regression and needs to be -reshape-d into long layout. But then I still don't grasp what the numbers that are the values of Week correspond to in your equation. And at the very least it seems to me that you need to have a variable that identifies the funds. So I'm afraid we need some more information to figure this out.

                    Comment


                    • #11
                      Thanks again for your answer!

                      Maybe the dataex below will clear things up. I have included the first 17 funds out of a total of 482. For every fund, I have a unique identifier, a name and 7 characteristics to be used later. From the 10th column, I have 935 variables called Week1-Week935. These variables are percentage returns for each fund (or missing if they didn't exist at that date).
                      In addition, below the 482 funds, I have observations for 4 explanatory variables: rows 483 to 486 have observations during the entire period, with no gaps (Week1-Week935).
                      Lastly, as I tried to explain above, concerning the missings in the Week1-Week935 range: some funds have returns during the entire period (i.e. starting from Week1, like fund number 2), while others (like fund number 1) only start generating returns at some later point (fund 1 starts at Week399), while still others start at the beginning but end before Week935. In short: the period in which a fund 'lived' is arbitrary and different for every fund (except for funds that lived during the entire period). However, there are no gaps.

                      Code:
                      * Example generated by -dataex-. To install: ssc install dataex
                      clear
                      input int Nummer str42 Name byte(LargeIs1 V1B2G3) int Age byte Nrmanagers double(FundSize Expense TurnoverRatio Week1 Week2 Week3 Week4)
                        1 "8a+ Eiger R"                               1 1 123 1          .     .    968                   .                   .                   .                   .
                        2 "AB FCP I Eurozone Equity IX Acc"           1 2 226 2  844639916  1.05  45.96   5.597597096256503   .1675970962565026  -.6224029037434974  3.2675970962565026
                        3 "Abante GF European Quality Equity Fund A"  1 1  55 1  157884086     .      .                   .                   .                   .                   .
                        4 "ABN AMRO Euro Smaller Companies Eq C/D"    0 3 237 2   61010707  2.67  23.95   5.217597096256503  3.6875970962565026  -.5424029037434974  2.3075970962565027
                        5 "ABN AMRO Euro Sustainable Equities NC"     1 2 177 1  148014433  1.05 103.97                   .                   .                   .                   .
                        6 "Actions Solidaires M"                      1 3 105 .    6148752     .      .                   .                   .                   .                   .
                        7 "Actys 3 A/I"           0 3 188 1   25418874  1.03  22.01                   .                   .                   .                   .
                        8 "Afer Actions Euro A A/I"                   1 1                    233 . 3259313534   .59   4.92   6.727597096256503  1.4275970962565026 -2.1824029037434975  2.5175970962565026
                        9 "Afer Avenir Senior A/I"                    1 .   3 .   60641000     .      .                   .                   .                   .                   .
                       10 "AG Life Equities Euro"                     1 2 234 .  196055890     .      .  3.5675970962565025  5.6075970962565025  -2.122402903743497   .3675970962565026
                       11 "AIS Mandarine Active P"                    1 2 254 2  233158832   1.5 100.58   5.537597096256503  1.0675970962565027 -1.5924029037434975 -.08240290374349742
                       12 "AL Trust Aktien Europa"                    1 2 225 1   25101081  1.69      .   6.307597096256503  3.3975970962565025 -2.4824029037434974  1.2475970962565026
                       13 "Alfred Berg Aktiivinen Fokus B"            1 3 221 1   57418257   1.8    230   2.087597096256503  2.1975970962565023  -.9124029037434974   .8675970962565026
                       14 "Allianz Actions Euro C"                    1 2 234 2  150353025  1.23  19.64  7.6475970962565025  2.1875970962565026  -2.832402903743497  3.0975970962565027
                       15 "Allianz Best Styles Euroland Eq I EUR"     1 2 149 2  251283335     .  35.68                   .                   .                   .                   .
                       16 "Allianz Euroland Equity Growth W EUR"      1 3 134 2 3368445238     .   10.4   7.307597096256503   5.687597096256503 -2.0724029037434972  1.9175970962565025
                       17 "Allianz Euroland Equity SRI W EUR"         1 2  86 2   60114204     .  53.95                   .                   .
                      What I want to do is the following: run 482 regression analyses of the form: y = c + b1x1 + b2x2 + b3x3 + b4x4, where:

                      First regression analysis:

                      y= fund 1 (row 1: Week399-Week935)
                      x1 = independent variable 1 (row 483: same period as above, evidently)
                      x2 = independent variable 2 (row 484: same period as above, evidently)
                      x3 = independent variable 3 (row 485: same period as above, evidently)
                      x4 = independent variable 4 (row 486: same period as above, evidently)
                      c = constant

                      Second regression analysis:
                      y= fund 2 (row 2: Week1-Week935)
                      x1 = independent variable 1 (row 483: same period as above, evidently)
                      x2 = independent variable 2 (row 484: same period as above, evidently)
                      x3 = independent variable 3 (row 485: same period as above, evidently)
                      x4 = independent variable 4 (row 486: same period as above, evidently)
                      c = constant

                      ... and so on until the 482nd fund. I figure I need some kind of loop to do this? In Excel, it is possible to do regressions using rows as variables (which is what I'm trying to do here), but is this possible in Stata? Even if I had to reorganize my data, I don't think I should reshape it; I think I would just have to transpose it. Still, with a transposed dataset, I would need a loop command.

                      From every regression, I need the constant c in one column, the coefficients of the independent variables in another 4 columns, and their p-values in another 4 columns. This is the 482*9 matrix I was referring to before. I don't know how to store these coefficients once Stata has completed the regressions.


                      A second step in our analysis is to do a second multiple regression analysis, explaining the constants computed above by the 7 fund variables (LargeIs1, V1B2G3, Age, Nrmanagers, FundSize, Expense and TurnoverRatio). This means the regression equation looks like: constants = c + b1LargeIs1 + b2V1B2G3 + b3Age .... This I know how to do, since it doesn't require a loop.

                      I hope this helps! Thanks in advance for considering my problem!



                      Comment


                      • #12
                        This sounds like the kind of data organization that illustrates the worst aspects of spreadsheets.

                        Thank you for the -dataex- showing the first 17 observations (not rows) in your data set. They are what a Stata data set is supposed to look like. Each observation contains numerous variables (not columns).

                        You do not show what happens from "row" 483 on. You say it contains x1 = independent variable 1. But then how do we know which value in that "row" corresponds to which fund? I suppose my first guess would be that the value that occurs in the variable Nummer (which you would call the 1st column) is that of firm number 1, and that the value that occurs in the variable Name (which you would call the 2nd column) is the value for firm number 2, etc. But do you already see a problem? Variable Name is a string variable, so we will have to do some extra work to convert that to a real numeric value.

                        So this organization of the data is just dysfunctional. At a top level the steps you will have to take are as follows:

                        1. Remove everything from "row" 483 on from this data set. Save it separately.
                        2. -reshape- what remains into long layout:
                        Code:
                        reshape long Week, i(Nummer) j(week)
                        rename Week return
                        Save this new data set. It now contains one observation per week per firm and it is suitable for analysis.

                        3. Now we have to deal with the other variables that we removed before. I can't offer you specific code to fix it without seeing an example of it. Most likely it will require extensive "surgery" to make it usable. But ultimately it will need to be arranged like a real Stata data set with each observation containing the four independent variables, and one observation for each firm. Depending on the source of the data, it may be easier to simply scrap what you did to bring it into Stata so far and start over, or it may be easier to work with that piece of the Stata version of the data and try to fix it. Again, I can't tell from a distance.

                        4. Once step 3 is accomplished, it will probably lead to using the -merge- command (it will likely be a -merge m:1-) to combine these variables with the data saved after step 2.

                        5. At that point you will be prepared to do your regressions. The -runby- command will probably be the easiest way to do this and get the results incorporated directly into the data set. -runby- is available from SSC, written by Robert Picard and me. But you have several steps to accomplish before you get to this point, so let's put off discussing how to code the regressions for use with -runby-.
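
                        If -runby- is not yet installed, it can be obtained from SSC in the same way as -dataex-:
                        Code:
                        ssc install runby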

                        The big point here is that you are thinking about data in spreadsheet terms. Stata is not a spreadsheet and if you try to use it as if it were you get nowhere slowly, as you have seen. To work with Stata you have to actively purge spreadsheet-think from your mind. If you follow habits and instincts you have developed from using spreadsheets, it will be, at best, unhelpful, and at worst really destructive. So never again use the words "row" and "column" when thinking about a Stata data set. They are observations and variables, respectively: and calling them that will help you keep straight what has to go where in a Stata data set.

                        Comment


                        • #13
                          There is indeed a large issue with our data management. We have "reshaped" our data in two ways in an effort to properly organise it.

                          dataex1:
                          Code:
                          * Example generated by -dataex-. To install: ssc install dataex
                          clear
                          input float(v481 v482) double(ExcessRm SMB HML MOM)
                                    .    4.827597   -.812430823901708 -.0040000000000000036                  -.496                 1.456
                                    .   1.1475971  2.5640833840320374                  .358     -.7499999999999999                  .632
                                    .   -.1924029  2.5640833840320374    .21799999999999997    -.07400000000000001                  -.45
                                    .    2.697597  2.5640833840320374   -.04799999999999997    -.44000000000000006     .8559999999999999
                                    .   4.6275973  2.5640833840320374   -.07000000000000002                   -.54                  .576
                          end
                          The first -dataex-, named dataex1, represents v481 and v482, respectively fund 481 and fund 482. The observations provided are the weekly returns for those funds. Added to this example are the variables ExcessRm SMB HML and MOM. Each of these variables has 935 observations without missing observations, one for each week in our analysis.
                          This data structuring was achieved by eliminating all fund characteristics from the dataset showcased in #11 (specifically str42 Name, byte(LargeIs1 V1B2G3), int Age, byte Nrmanagers and double(FundSize Expense TurnoverRatio)). The dataset then only retained the variables Nummer, Week1, Week2,...,Week935. This dataset was then transposed (-xpose-) and the variables named ExcessRm, SMB, HML and MOM were -merge-d to form dataex1.

                          This setup would allow us to regress a fund's returns on the variables ExcessRm, SMB, HML and MOM.

                          The second -dataex- represents FundNumber, Week, Return, ExcessRm, SMB, HML and MOM. For each observation of FundNumber, there exist 935 Weeks, where each Week has a corresponding Return (which may be a missing value). Added to this example are the variables ExcessRm, SMB, HML and MOM. These variables have 935 observations per fund without gaps. As we have 482 funds, there is a total of 935 * 482 = 450,670 observations for each variable in this setup. The Return variable is the only variable where missing observations are present.
                          This data structuring was achieved by eliminating all fund characteristics from the dataset showcased in #11 (specifically str42 Name, byte(LargeIs1 V1B2G3), int Age, byte Nrmanagers and double(FundSize Expense TurnoverRatio)). The dataset then only retained the variables Nummer, Week1, Week2,...,Week935. This dataset was then -reshape-d following:

                          Code:
                          reshape long Week, i(Nummer) j(week)
                          rename Week return
                          The variables ExcessRm, SMB, HML and MOM were -merge-d onto this dataset. Each of these variables originally holds only 935 observations, so this sequence of 935 observations was repeated 482 times, giving the variables ExcessRm, SMB, HML and MOM 450,670 observations each as well. It follows logically that the order of the observations for the variables ExcessRm, SMB, HML and MOM was kept intact. Nummer has also been renamed FundNumber.
                          For each unique FundNumber we now have 935 observations, where the variable Return is the only variable with possible missing values.

                          dataex2:
                          Code:
                          * Example generated by -dataex-. To install: ssc install dataex
                          clear
                          input int(FundNumber Week) double(Return ExcessRm SMB HML MOM)
                          1 1 .  -.812430823901708 -.0040000000000000036               -.496             1.456
                          1 2 . 2.5640833840320374                  .358  -.7499999999999999              .632
                          1 3 . 2.5640833840320374    .21799999999999997 -.07400000000000001              -.45
                          1 4 . 2.5640833840320374   -.04799999999999997 -.44000000000000006 .8559999999999999
                          1 5 . 2.5640833840320374   -.07000000000000002                -.54              .576
                          end
                          With this setup of the database, we would most likely need a -runby- program to loop through each group of 935 observations. Intuitively, I do think that this will be more problematic with the missing observations in the Return variable. This setup will allow us to regress each fund's weekly returns on the variables ExcessRm, SMB, HML and MOM, I think.

                          I hope one of these two ways is a correct structuring of the data. Our goal is to run one regression for each fund on the variables ExcessRm, SMB, HML, MOM. We do not wish to regress each fund's weekly observation on the weekly observation of the variables ExcessRm, SMB, HML, MOM.

                          Thank you for reading this lengthy explanation and thank you in advance for any further help. The previous comments have already helped us tremendously, even though more problems keep appearing so far.

                          EDIT [06:10]: Variable explanation:
                          The variables we would like to use in our regressions are ExcessRm, SMB, HML, MOM. These are variables which correct for the increase in risk of investing in different types of equity funds. Some equity funds are riskier than others. If they are, an investor wants a higher return for that fund, no matter its performance. These variables are a way to adjust for the higher demanded returns, labeled "risk premium". These variables are an effort to make a more "apples" to "apples" comparison between different types of equity funds. Each variable is explained in more detail below if anyone is interested or if the information is required.
                          1. ExcessRm represents the remuneration gained from investing in, what we defined as, the "market". The "market" in our model is the equity fund market. When investing in an equity fund, a specific risk is taken on. Investors will only accept this risk if it is rewarded. If it is not rewarded, all investors would just invest in risk-free assets. This reward for the risk taken on by entering the "market" is quantified, to the best of our ability, through the ExcessRm variable.
                          2. SMB stands for Small-Minus-Big and represents an adjustment specific to equity funds. Equity funds come with different strategies. It has been shown that, on average, small capitalisation funds perform better than large capitalisation funds. A very simplified example is that an increase of €100 for a small capitalisation fund is much more significant than €100 for a large capitalisation fund.
                          3. HML stands for High-Minus-Low and represents another adjustment specific to equity funds. It has been shown that, on average, funds with high book-to-market value (their accounting value compared to their stock value) perform better than funds with low book-to-market value.
                          4. MOM stands for Momentum. Equity funds are subject to momentum changes. This means that when a specific fund exhibits positive returns, these positive returns are likely to persist for a certain time. This is often caused by a bandwagon effect among investors: investors invest in the fund, which increases expectations of that fund, which attracts more investors, and so on. It also means that over the longer term there is usually a reversal: funds with good performances will generally go on to have worse performances. The MOM variable quantifies this effect, which is not linked to the actual performance of the fund but merely to the expectations of investors.
                          Last edited by Ewout Vermeersch; 17 Apr 2018, 06:14.

                          Comment


                          • #14
                            OK. The layout in your dataex2 is what you want, with one caveat. In that brief example, the variable Return has missing values in every observation. Clearly you won't get far trying to regress a variable with only missing values. You are aware of the problem and have your doubts. In any case, presumably for some of your firms you will have sufficient non-missing observations of Return to carry out a regression. The code to do this with -runby- is:

                            Code:
                            capture program drop one_fund
                            program define one_fund
                                regress Return ExcessRm SMB HML MOM
                                foreach v of varlist ExcessRm SMB RML MOM {
                                    gen b_`v' = _b[`v']
                                    gen se_`v' = _se[`v']
                                }
                                exit
                            end
                            
                            runby one_fund, by(FundNumber) status
                            (The status option will give you a periodic progress report of how many FundNumbers have already been processed and how much time is estimated to remain.)

                            The code in the one_fund program will put the coefficients and standard errors for all of the predictor variables into your data set (for those funds where enough non-missing values of Return were present).

                            While the approach involves writing a short program, on balance I think it is simpler than using -statsby- and then -merge-ing the data sets. Even if it isn't overall simpler, in a data set this size it will run much faster.

                            Comment


                            • #15
                              I have listed below a new example of the dataset presented in dataex2. This is just to show that the first FundNumber (which appears 935 times, once for each value of Week) only starts having Return observations from Week 399 onwards. This means the variable Return has 935 - 399 = 536 observations for FundNumber 1, from Week 399 to Week 935.
                              All FundNumbers follow this pattern, starting either before our first analysis week (meaning the fund has 935 observations, unless it dies during our analysis period) or at some specific later week. Once a fund exists, our dataset has continuous weekly observations for that fund until it either dies or reaches Week 935. Each FundNumber has enough observations to run a regression (the lowest number of observations we have seen is 20 for one specific fund).

                              dataex3:
                              Code:
                              * Example generated by -dataex-. To install: ssc install dataex
                              clear
                              input int(FundNumber Week) double(Return ExcessRm SMB HML MOM)
                              1 396                   .   1.4682145702389633                 -.038                  -.082    .20800000000000002
                              1 397                   .  -3.0937834297610367                  .268    -.12400000000000003   .028000000000000004
                              1 398                   .    .9606255702389633                 -.458                   -.05    .25999999999999995
                              1 399 -.03505742976103665   3.2714285702389634                 -.442 2.7755575615628915e-18                  .372
                              1 400   .6049425702389634   .26343357023896335   -.09000000000000001     .05400000000000001                   .25
                              1 401   2.054942570238963   2.0958265702389633   .009999999999999999   -.003999999999999998   -.20199999999999996
                              1 402   .5849425702389633    .8325405702389633                  .022    -.11200000000000002                  .322
                              1 403 -1.8050574297610367  -1.6942304297610367                  .244                  -.092                  .396
                              end
                              I simply ran the code you provided through the dataset, of which an example is provided above. Running it produced the following output:

                              . capture program drop one_fund

                              . program define one_fund
                                1. regress Return ExcessRm SMB HML MOM
                                2. foreach v of varlist ExcessRm SMB RML MOM {
                                3. gen b_`v' = _b[`v']
                                4. gen se_`v' = _se[`v']
                                5. }
                                6. exit
                                7. end

                              . runby one_fund, by(FundNumber) status

                                elapsed  ----------- by-groups ----------   ------- observations ------       time
                                   time      count    errors   no-data        processed       saved      remaining
                                -----------------------------------------------------------------------------------
                               00:00:01        100       100         0           93,500           0       00:00:04
                               00:00:02        203       203         0          189,805           0       00:00:03
                               00:00:03        307       307         0          287,045           0       00:00:02
                               00:00:04        403       403         0          376,805           0       00:00:01
                               00:00:05        482       482         0          450,670           0       00:00:00

                                --------------------------------------
                                Number of by-groups     =         482
                                by-groups with errors   =         482
                                by-groups with no data  =           0
                                Observations processed  =     450,670
                                Observations saved      =           0
                                --------------------------------------

                              Due to this error, I have just started to analyse the code to see where the issue might lie. I hope that I understood it properly:
                              - This program starts with capture, which executes the command "program drop one_fund". That command drops the program one_fund from memory. Is that line not unnecessary if the program has not been defined yet? Or is it just to avoid any possible chance that I already have a program in memory called one_fund?

                              - This program is defined as one_fund, which runs a regression of Return on ExcessRm, SMB, HML and MOM as independent variables for each observation (referred to by the iterator v). It then generates the variables b_`v' and se_`v', the coefficient and standard error respectively. Does this not mean that it will create two variables for each regression? By which I am wondering whether it will create b_`1' and se_`1' for FundNumber 1, b_`2' and se_`2' for FundNumber 2, ..., leading to the creation of 450,670 x 2 variables? I doubt this is the case. I think it will rather lead to the creation of 482 x 2 variables, because when the program is run through "runby one_fund, by(FundNumber)", the by(FundNumber) should be enough to have the program run only 482 times. I am uncertain, however.

                              - The loop, which holds a sequence of two generate commands, is then closed with "}". There is then an exit of... Stata? I presume the exit command must refer to something else embedded in the program sequence, although I do not know what. Afterwards the program is -end-ed, finishing the definition of the one_fund program.

                              - The program defined earlier is executed with the -runby- command, where it is specified to run by each FundNumber, which lets me assume that there will only be 482 regressions taking place. The output, despite showing errors for each group, seems to corroborate this. This leaves me baffled as to where the issue might lie. The only cause I can think of is that Stata might refuse to perform an iteration for a FundNumber when it has a single missing observation for the Return variable. But that would still mean there shouldn't be any errors for a FundNumber with no missing observations in the Return variable.

                              Therefore, I frankly have no idea what the error might be. The only information I can still add is that:

                              Originally posted by Clyde Schechter View Post
                              The code in the one_fund program will put the coefficients and standard errors for all of the predictor variables into your data set
                              I interpret this to mean that the one_fund program will overwrite my current dataset and is supposed to put the coefficients and standard errors into that dataset. After running the code, my dataset is "empty", in the sense that the dataset presented partially in dataex2 (in #13) or dataex3 listed above is no longer present. Since the program found errors for each group of observations, it is normal that the Data Editor is now empty. I am just trying to understand how the results would be stored if the program had worked properly on my dataset.

                              Comment
