Advice on Large Difference Between Number of Observations in Descriptive Stats and Obs Used in Regression

Farhan Hasnat

Join Date: Dec 2021

Posts: 91
#1

12 Aug 2022, 09:43

I am a bit worried about why the number of observations in the descriptive statistics (Picture 2) differs so much from the number of observations used in the regression (Picture 1). I am not sure whether this is mechanical or something I need to address.

Code I used for the descriptive stats:

Code:

* Var1 and Var2 are placeholders for the actual variable names
asdoc tabstat Var1 Var2, stat(N mean p50 sd p25 p75) dec(4)

Code I used to run the OLS fixed-effects model:

Code:

xtreg ABS_DA_w POST_REG INBD ROA_w Size MTB_x_w LEV_w LOSS i.Year, fe robust

Picture 1: Regression Results

Picture 2: Descriptive Statistics

Last edited by Farhan Hasnat; 12 Aug 2022, 09:50.
Clyde Schechter

Join Date: Apr 2014

Posts: 30116
#2

12 Aug 2022, 09:49

Bear in mind that in any estimation command, any observation that has a missing value for any variable in the estimation command is omitted from the estimation sample. Looking at your example data, you have large numbers of observations where everything but Ticker and Year are missing. Those observations will not be found in the estimation sample, but they also won't be found in the -tabstat- results. But on top of that there are numerous observations where some, but not all, of the variables (other than Ticker and Year) have missing values. Those observations will count in the -tabstat- results, but not in the regression results.
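
The three groups of observations described above can be counted directly. What follows is only a minimal sketch: it assumes the variable names from the -xtreg- command in #1, and the variable name nmiss is hypothetical.

```stata
* Number of model variables that are missing in each observation
* (nmiss is a hypothetical name; the eight variables come from the xtreg call in #1)
egen nmiss = rowmiss(ABS_DA_w POST_REG INBD ROA_w Size MTB_x_w LEV_w LOSS)

* nmiss == 0       : complete case, eligible for the regression
* nmiss == 8       : everything missing, absent from both tabstat and xtreg
* nmiss in 1 to 7  : partly missing, counted by tabstat for the variables
*                    that are present, but omitted by xtreg
tabulate nmiss
```

The count in the nmiss == 0 row should be close to the N that -xtreg- reports, provided Year is never missing.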
William Lisowski

Join Date: Dec 2014

Posts: 10150
#3

12 Aug 2022, 09:50

Your data contains a large number of missing values. You should review the output of

Code:

help missing

where you will learn from the discussion of estimation commands that observations with a missing value for one or more of the variables needed by an estimation command will be omitted when running the command.
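
A toy illustration of that omission, using the auto dataset that ships with Stata rather than the data from #1 (in that dataset rep78 has some missing values):

```stata
* Toy example on Stata's shipped auto data, not the poster's data
sysuse auto, clear

count                      // all observations in memory
count if missing(rep78)    // observations where rep78 is missing

* Observations with a missing value in any model variable are omitted
regress price mpg rep78

display e(N)               // number of observations actually used
```

The value of e(N) should equal the total count minus the observations with a missing value in any of the three variables.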
William Lisowski

Join Date: Dec 2014

Posts: 10150
#4

12 Aug 2022, 10:05

To what Clyde wrote in post #2 I will add that your example data and your descriptive statistics do not include POST_REG and LOSS. Those both sound like indicator variables, and we often see code on Statalist that creates 1/. rather than 1/0 indicator variables, so that too may be part of the problem.

You need to review

Code:

help misstable

and use the command to better understand the prevalence and effect of missing values in your data.
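
As a starting point only, and assuming the variable names from the -xtreg- command in #1, something along these lines might be run:

```stata
* Per-variable counts of missing values
misstable summarize ABS_DA_w POST_REG INBD ROA_w Size MTB_x_w LEV_w LOSS

* Which variables tend to be missing together, shown as frequencies
misstable patterns ABS_DA_w POST_REG INBD ROA_w Size MTB_x_w LEV_w LOSS, frequency

* Check the indicator coding: a 1/. indicator shows only the values 1 and . here,
* whereas a properly coded one shows 0 and 1 (and . only where truly unknown)
tabulate POST_REG, missing
tabulate LOSS, missing
```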
Farhan Hasnat

Join Date: Dec 2021

Posts: 91
#5

12 Aug 2022, 10:12

Clyde Schechter Thank you for your useful advice. This is exactly what I thought initially, and yes, my data have a large number of missing values. I would appreciate your input on two actions I have in mind:

1. I dropped all observations with missing values on these independent variables, which led to all of the independent variables having the same number of observations in the descriptive statistics. My rationale was to bring the number of observations in the descriptive statistics as close as possible to the number used in the regressions, so that the reviewers of my paper are not surprised by the large difference. Is there any other method you would suggest, if I am going about this the wrong way?

2. In your view, is it wrong to show descriptive statistics whose number of observations differs greatly from the number actually used in the regression?

Thank you in advance.
Farhan Hasnat

Join Date: Dec 2021

Posts: 91
#6

12 Aug 2022, 10:15

William Lisowski Thank you for your useful advice. I will look into the misstable command. Yes, POST_REG and LOSS are indicator variables.
Clyde Schechter

Join Date: Apr 2014

Posts: 30116
#7

12 Aug 2022, 10:41

"1. I dropped all observations with missing values on these independent variables, which led to all of the independent variables having the same number of observations in the descriptive statistics. My rationale was to bring the number of observations in the descriptive statistics as close as possible to the number used in the regressions, so that the reviewers of my paper are not surprised by the large difference. Is there any other method you would suggest, if I am going about this the wrong way?"

Well, in my experience, reviewers understand that missing values on variables lead to a reduction in the sample size when multivariable analyses are done. If all you will be showing are descriptive statistics and regression results, then I think it is reasonable to discard all observations with missing variables and just consider the sample to be the complete cases (i.e. the observations with no missing values.) But then in your methods section you need to disclose that you have done this. If, however, you will also be showing some bivariate analyses, then it is better to use the full sample for the descriptive statistics and the bivariate analyses. If you want to put a sentence in your results section indicating why there is a sharp fall-off in sample size when you then get to the regression analysis, you can do that. (Although, as I indicated, I don't think that is necessary if the reviewers know what they are doing.)
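
One way to restrict the descriptive statistics to the complete cases without actually dropping anything is to flag the estimation sample. This is a sketch under the assumption that the data are already -xtset- and that the variable names are those in #1; in_sample is a hypothetical name.

```stata
* Fit the model from #1 (assumes the data have already been xtset)
xtreg ABS_DA_w POST_REG INBD ROA_w Size MTB_x_w LEV_w LOSS i.Year, fe robust

* e(sample) is 1 for observations used in the estimation just run, 0 otherwise
generate byte in_sample = e(sample)

* Descriptive statistics on exactly the observations used in the regression
tabstat ABS_DA_w POST_REG INBD ROA_w Size MTB_x_w LEV_w LOSS if in_sample, statistics(N mean p50 sd p25 p75) columns(statistics)
```

Because nothing is dropped, the full sample remains available for any bivariate analyses.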

That said, in my view, there are no good solutions to the problem of missing data. Or, rather, the only really good solution is to get the actual missing values and put them in the data set, which, in real life, is seldom feasible. That usually leaves only bad solutions and the trick is to choose the least bad among them. It is well known that using only complete cases will usually bias the results of regression analyses, so good reviewers will expect to see some effort to reduce or estimate bounds for that bias. How to do that, and whether it is even possible, depends on why the missing values are actually missing. The subject of handling missing data is complicated. For a good overview, with Stata examples, I suggest you look at https://statisticalhorizons.com/wp-c...aterials-1.pdf.
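
As a rough first look at why values might be missing, one could check whether missingness on one variable is related to the other observed variables. The sketch below uses ROA_w, Size, and LEV_w from #1 purely as examples, and roa_missing is a hypothetical name. It can only reveal dependence on observed variables; it cannot rule out that missingness depends on the unobserved values themselves.

```stata
* Indicator for whether ROA_w is missing (roa_missing is a hypothetical name)
generate byte roa_missing = missing(ROA_w)

* Do observations with and without ROA_w differ on other observed variables?
tabstat Size LEV_w, by(roa_missing) statistics(N mean sd)

* Is missingness of ROA_w associated with the observed covariates?
logit roa_missing Size LEV_w
```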

"2. In your view, is it wrong to show descriptive statistics whose number of observations differs greatly from the number actually used in the regression?"

I think I've answered this in my response to 1., but if you want further clarification on the matter, do post back.

Last edited by Clyde Schechter; 12 Aug 2022, 10:44.
Farhan Hasnat

Join Date: Dec 2021

Posts: 91
#8

13 Aug 2022, 00:06

Clyde Schechter Thank you for the detailed explanation and for giving me a concrete understanding of this. I appreciate your help.