  • How can I keep original number of observations in linear regression

    Hello,
    I would like to know one thing. I am running a linear regression in Stata 16. When I added an independent variable, the total number of observations drops. Which command should I use to maintain the original number of observations? Thank you.

  • #2
    Wah:
    welcome to this forum.
What you experienced is caused by missing values in your added predictor. To make the calculations feasible (most of what Stata does is translated into matrices), by default Stata omits any observation with a missing value in any of the variables in the model (so-called casewise deletion).
Hence, the only fix you (and anybody else who might face the same issue) have is to impute (or, more generally, deal with) the missing values (see the -mi- entry in the Stata .pdf manual).
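The workflow described above can be sketched as follows; the variable names (y, x1, x2) are hypothetical, and the imputation model assumes the incomplete predictor x2 is binary:

```stata
* inspect how much is missing before deciding how to deal with it
misstable summarize y x1 x2

* set up multiple imputation and impute the incomplete binary predictor
mi set mlong
mi register imputed x2
mi impute logit x2 y x1, add(20) rseed(12345)

* fit the analysis model on the imputed data
mi estimate : regress y x1 i.x2
```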
    Kind regards,
    Carlo
    (Stata 19.0)



    • #3
Thanks so much. I tried as suggested. The variable that was imputed still shows missing values. Moreover, when I run a linear regression with one dependent variable and five independent variables, the number of observations drops again even though the variable was imputed, and I cannot add the imputed variable to the independent-variable list. Any other suggestions?
      Warm regards
      Wah



      • #4
        Wah:
in order to increase your chances of getting helpful replies, you should share what you typed and what Stata gave you back within CODE delimiters (as per the FAQ).
        Thanks.
        Kind regards,
        Carlo
        (Stata 19.0)



        • #5
Just as I was about to send you the code and output, I solved it following your suggestion. Thanks so much again. I will include code next time.



          • #6
If I understood correctly, you can fit the most complex regression model first, then add - if e(sample) - to the remaining models so that they always use the same number of observations.
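A minimal sketch of this approach, with hypothetical variable names (y, x1, x2, x3): fit the fullest model first, flag its estimation sample, and restrict the smaller models to that flag.

```stata
* fit the most complex model first; its estimation sample is the smallest
regress y x1 x2 x3

* save the estimation-sample flag, since each regression overwrites e(sample)
generate byte insample = e(sample)

* all remaining models use exactly the same observations
regress y x1 if insample
regress y x1 x2 if insample
```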
            Best regards,

            Marcos



            • #7
              Wah:
one of the most relevant rewards of being an active part of the Stata forum is benefiting from other listers' solutions.
Your problem today can be somebody else's problem tomorrow; hence, posting the way you solved your problem is most welcome. Thanks.
              Kind regards,
              Carlo
              (Stata 19.0)



              • #8
Thanks Carlo. (I tried to follow the suggestions on CODE delimiters, as per the FAQ.)

Problem: When I ran the linear regression adding an independent variable (here v395), the number of observations dropped. I wanted to keep the original number of observations, which is 49,627. (I used Stata 16.0 SE.)

Solution: My dependent variable, named "total_methods", is a continuous variable, and my independent variables are (1) age groups (here the variable name is v013) and (2) another independent variable named v395. (Both independent variables are dichotomous variables.)
                Below are the codes that I used.

. mi set mlong

. mi register imputed v395
(26656 m=0 obs. now marked as incomplete)

. mi misstable summarize v395
                                                               Obs<.
                                                +------------------------------
               |                                | Unique
      Variable |     Obs=.     Obs>.     Obs<.  | values        Min         Max
  -------------+--------------------------------+------------------------------
          v395 |    26,656              22,971  |      2          0           1
  -----------------------------------------------------------------------------

And then, I ran the logistic imputation model for the imputed variable; the code and output are as follows.

. mi impute logit v395 i.total_methods i.v013, add(20) rseed(1234)

Univariate imputation                           Imputations =       40
Logistic regression                                   added =       20
Imputed: m=21 through m=40                          updated =        0

------------------------------------------------------------------
                   |               Observations per m
                   |----------------------------------------------
          Variable |   Complete   Incomplete   Imputed |     Total
-------------------+-----------------------------------+----------
              v395 |      22971        26656     26656 |     49627
------------------------------------------------------------------
(complete + incomplete = total; imputed is the minimum across m
 of the number of filled-in observations.)

Finally, I ran a linear regression with the dependent variable (total_methods) and two independent variables (v013 and v395).

. mi estimate, eform : regress total_methods i.v013 i.v395, eform(exp(Coef.))

Multiple-imputation estimates                   Imputations     =         40
Linear regression                               Number of obs   =     49,627
                                                Average RVI     =     0.1074
                                                Largest FMI     =     0.4597
                                                Complete DF     =      49619
DF adjustment:   Small sample                   DF:     min     =     187.90
                                                        avg     =  37,992.16
                                                        max     =  47,644.39
Model F test:       Equal FMI                   F(   7,15449.1) =    1063.30
Within VCE type:          OLS                   Prob > F        =     0.0000

------------------------------------------------------------------------------
total_meth~s |     exp(b)   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        v013 |
      20-24  |    4.26237   .1280102    48.28   0.000      4.01871    4.520804
      25-29  |   7.213979    .218951    65.11   0.000     6.797344    7.656151
      30-34  |   7.711721   .2301804    68.44   0.000     7.273505    8.176338
      35-39  |   7.552131   .2214417    68.95   0.000     7.130338    7.998876
      40-44  |   7.354909   .2191597    66.96   0.000     6.937656    7.797257
      45-49  |    6.10961   .1873996    59.01   0.000     5.753128    6.488182
             |
        v395 |
        yes  |   1.369588   .0412582    10.44   0.000     1.290571    1.453444
       _cons |   59.93106    1.22452   200.33   0.000      57.5784    62.37985
------------------------------------------------------------------------------


                Warm regards,
                Wah



                • #9
                  There are a couple of things I fail to follow, and exponentiating coefficients under a linear regression is one of them.
                  Best regards,

                  Marcos



                  • #10
                    Wah:
                    like Marcos, I fail to get why you exponentiated coefficients in your OLS.
                    Kind regards,
                    Carlo
                    (Stata 19.0)



                    • #11
                      Marcos: & Carlo:
I am trying to obtain output with odds ratios, but the default in linear regression is "exponentiated coefficients". Now I realize that I can choose from the drop-down list under the main tab. So, the code and output are as follows.

. mi estimate, eform("Odds Ratio") : regress total_methods v013 v395, eform(exp(Coef.))

Multiple-imputation estimates                   Imputations     =         40
Linear regression                               Number of obs   =     49,627
                                                Average RVI     =     0.2682
                                                Largest FMI     =     0.4438
                                                Complete DF     =         49
DF adjustment:   Small sample                   DF:     min     =     201.48
                                                        avg     =  20,206.63
                                                        max     =  41,265.23
Model F test:       Equal FMI                   F(   2,  873.3) =    1502.31
Within VCE type:          OLS                   Prob > F        =     0.0000

------------------------------------------------------------------------------
total_meth~s | Odds Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        v013 |   1.287934   .0054798    59.47   0.000     1.277238     1.29872
        v395 |   1.652788   .0504437    16.46   0.000     1.556257    1.755308
       _cons |   105.2282   2.015708   243.07   0.000     101.3505    109.2543
------------------------------------------------------------------------------




Marcos: Can you explain with an example how to use - if e(sample) -? I am still having a problem with missing data (in both the dependent variable and the independent variables) when I run logistic regression, so your explanation may help with that.

Problem in logistic regression: After I run the mi code for the missing data in those two variables (here: v307_05 and v395) and try to run the logistic regression, the original number of observations (12,885) does not show up; only about half (6,463) is used (please see below). Please suggest.

. mi estimate, eform("Odds Ratio") : logistic v307_05 v013 v025 v106 v190 v502 v384a v384b v384c v395

Multiple-imputation estimates                   Imputations     =         20
Logistic regression                             Number of obs   =      6,463
                                                Average RVI     =     0.2083
                                                Largest FMI     =     0.2913
DF adjustment:   Large sample                   DF:     min     =     233.51
                                                        avg     =   1,318.80
                                                        max     =   3,453.38
Model F test:       Equal FMI                   F(   9, 5902.1) =       3.46
Within VCE type:          OIM                   Prob > F        =     0.0003

------------------------------------------------------------------------------
     v307_05 | Odds Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        v013 |   1.206609   .1050285     2.16   0.031     1.016885    1.431729
        v025 |   .8413821   .2778571    -0.52   0.601      .440013     1.60887
        v106 |   1.217835   .2305061     1.04   0.298     .8401895    1.765224
        v190 |    1.30328   .1964717     1.76   0.080     .9687322    1.753362
        v502 |   1.594499   .4520771     1.65   0.100     .9137371    2.782448
       v384a |   1.281032   .4562676     0.70   0.487     .6371659    2.575536
       v384b |   .8571055   .2885819    -0.46   0.647     .4428341    1.658928
       v384c |   1.354522   .4679256     0.88   0.380     .6880681    2.666493
        v395 |   2.529506   .8674701     2.71   0.007     1.290403    4.958452
       _cons |   .0008935    .000988    -6.35   0.000     .0001012    .0078928
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.





                      Thanks again.
                      Warm regards,
                      Wah



                      • #12
                        Wah:
                        I really do not understand why you're not using -logistic- after -mi- if you want ORs.
                        As an aside, please share what you typed and what Stata gave you back via CODE delimiters. Thanks.
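For instance (a sketch, with the predictor list abbreviated from the model in post #11; -or- is an eform option accepted by -mi estimate-):

```stata
* odds ratios come directly from the logistic model fitted after mi
mi estimate, or : logit v307_05 v013 v025 v106 v395
```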
                        Kind regards,
                        Carlo
                        (Stata 19.0)



                        • #13
                          Marcos: Can you explain with an example to use - if e(sample)? Because I am still having a problem with missing data (both the dependent variable and independent variables) when I run logistic regression. So, your explanation may work for that.
When we deal with missing data in a regression analysis, we have casewise deletion. For example, if you have 20% missing values for sex and 10% (different) missing values for age, you will lose 30% of your observations. If you have, say, 3 models, the so-called "full" model can define the e(sample). Just start with that regression. Then, for the remaining models, add "if e(sample)" after the predictors.
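The casewise-deletion arithmetic above can be checked directly; the variable names here are hypothetical:

```stata
* count the complete cases the regression will actually use
egen byte nmiss = rowmiss(y sex age)
count if nmiss == 0
```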

I am trying to obtain output with odds ratios, but the default in linear regression is "exponentiated coefficients".
If we exponentiate coefficients under a logistic regression, we get ORs. So far so good. But we are not supposed to exponentiate coefficients under a linear regression, and in any case, doing so will not yield ORs.
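To illustrate the distinction (hypothetical variable names; a binary outcome versus a continuous one):

```stata
* odds ratios: exponentiated coefficients of a LOGISTIC model
logistic outcome_binary predictor

* linear regression: coefficients are differences in means;
* exponentiating them does not produce odds ratios
regress outcome_continuous predictor
```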

In short, these two points are fundamental in regression analysis, yet you wish to handle more sophisticated machinery (multiple imputation). Be aware that it is much safer to grasp the core knowledge before delving into MI commands.
                          Last edited by Marcos Almeida; 06 Jan 2020, 07:52.
                          Best regards,

                          Marcos



                          • #14
                            Originally posted by Carlo Lazzaro View Post
                            Wah:
                            I really do not understand why you're not using -logistic- after -mi- if you want ORs.
                            As an aside, please share what you typed and what Stata gave you back via CODE delimiters. Thanks.
                            Carlo: & Marcos:
Thanks for your comments. You are right, and I see the source of the confusion. I should simply report the coefficients from the linear regression, and, as you said, use logistic if I want ORs.


                            Warm regards,
                            Wah

