  • How can I keep original number of observations in linear regression

    Hello,
    I would like to know one thing. I am running a linear regression in Stata 16. When I added an independent variable, the total number of observations drops. Which command should I use to maintain the original number of observations? Thank you.

  • #2
    Wah:
    welcome to this forum.
What you experienced is caused by missing values in your added predictor. To make the calculations feasible (most of what Stata does is translated into matrices), by default Stata omits any observation with a missing value in any of the variables in the model (so-called casewise deletion).
Hence, the only fix you (and anybody else who might face the same issue) have is to impute (or, more generally, deal with) the missing values (see the -mi- entry in the Stata .pdf manual).
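The workflow described above can be sketched as follows; the variable names (y, x1, x2) are hypothetical, and the imputation model assumes the incomplete predictor x2 is binary:

```stata
* inspect how much is missing before deciding how to deal with it
misstable summarize y x1 x2

* set up multiple imputation and impute the incomplete binary predictor
mi set mlong
mi register imputed x2
mi impute logit x2 y x1, add(20) rseed(12345)

* fit the analysis model on the imputed data
mi estimate : regress y x1 i.x2
```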
    Kind regards,
    Carlo
    (Stata 19.0)



    • #3
Thanks so much. I tried as suggested. The variable that was imputed still shows missing values. Moreover, when I run a linear regression with one dependent variable and five independent variables, the number of observations drops again even though the variable was imputed, and I cannot add the imputed variable to the independent-variable list. Any other suggestions?
      Warm regards
      Wah



      • #4
        Wah:
in order to increase your chances of getting helpful replies, you should share what you typed and what Stata gave you back within CODE delimiters (as per the FAQ).
        Thanks.
        Kind regards,
        Carlo
        (Stata 19.0)



        • #5
Just as I was about to send you the code and output, I solved it following your suggestion. Thanks so much again. I will include code next time.



          • #6
If I understood correctly, you can fit the most complex regression model first, then add - if e(sample) - to the remaining models so that they always use the same number of observations.
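A minimal sketch of this approach, with hypothetical variable names (y, x1, x2, x3): fit the fullest model first, flag its estimation sample, and restrict the smaller models to that flag.

```stata
* fit the most complex model first; its estimation sample is the smallest
regress y x1 x2 x3

* save the estimation-sample flag, since each regression overwrites e(sample)
generate byte insample = e(sample)

* all remaining models use exactly the same observations
regress y x1 if insample
regress y x1 x2 if insample
```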
            Best regards,

            Marcos



            • #7
              Wah:
one of the most relevant rewards of being an active part of the Stata forum is benefiting from other listers' solutions.
Your problem today can be somebody else's problem tomorrow; hence, posting the way you solved your problem is most welcome. Thanks.
              Kind regards,
              Carlo
              (Stata 19.0)



              • #8
Thanks Carlo. (I tried to follow the suggestions on CODE delimiters, as per the FAQ.)

Problem: When I ran the linear regression adding an independent variable (here v395), the number of observations dropped. I wanted to keep the original number of observations, which is 49,627. (I used Stata 16.0 SE.)

Solution: My dependent variable, named "total_methods", is a continuous variable, and my independent variables are (1) age groups (here the variable name is v013) and (2) another independent variable named v395. (Both independent variables are dichotomous variables.)
                Below are the codes that I used.

. mi set mlong

. mi register imputed v395
(26656 m=0 obs. now marked as incomplete)

. mi misstable summarize v395
                                                               Obs<.
                                                +------------------------------
               |                                | Unique
      Variable |     Obs=.     Obs>.     Obs<.  | values        Min         Max
  -------------+--------------------------------+------------------------------
          v395 |    26,656              22,971  |      2          0           1
  -----------------------------------------------------------------------------

And then, I ran the logistic imputation model for the imputed variable; the code and output are as follows.

. mi impute logit v395 i.total_methods i.v013, add(20) rseed(1234)

Univariate imputation                           Imputations =       40
Logistic regression                                   added =       20
Imputed: m=21 through m=40                          updated =        0

------------------------------------------------------------------
                   |               Observations per m
                   |----------------------------------------------
          Variable |   Complete   Incomplete   Imputed |     Total
-------------------+-----------------------------------+----------
              v395 |      22971        26656     26656 |     49627
------------------------------------------------------------------
(complete + incomplete = total; imputed is the minimum across m
 of the number of filled-in observations.)

Finally, I ran a linear regression with the dependent variable (total_methods) and two independent variables (v013 and v395).

. mi estimate, eform : regress total_methods i.v013 i.v395, eform(exp(Coef.))

Multiple-imputation estimates                   Imputations     =         40
Linear regression                               Number of obs   =     49,627
                                                Average RVI     =     0.1074
                                                Largest FMI     =     0.4597
                                                Complete DF     =      49619
DF adjustment:   Small sample                   DF:     min     =     187.90
                                                        avg     =  37,992.16
                                                        max     =  47,644.39
Model F test:       Equal FMI                   F(   7,15449.1) =    1063.30
Within VCE type:          OLS                   Prob > F        =     0.0000

------------------------------------------------------------------------------
total_meth~s |     exp(b)   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        v013 |
      20-24  |    4.26237   .1280102    48.28   0.000      4.01871    4.520804
      25-29  |   7.213979    .218951    65.11   0.000     6.797344    7.656151
      30-34  |   7.711721   .2301804    68.44   0.000     7.273505    8.176338
      35-39  |   7.552131   .2214417    68.95   0.000     7.130338    7.998876
      40-44  |   7.354909   .2191597    66.96   0.000     6.937656    7.797257
      45-49  |    6.10961   .1873996    59.01   0.000     5.753128    6.488182
             |
        v395 |
        yes  |   1.369588   .0412582    10.44   0.000     1.290571    1.453444
       _cons |   59.93106    1.22452   200.33   0.000      57.5784    62.37985
------------------------------------------------------------------------------


                Warm regards,
                Wah



                • #9
                  There are a couple of things I fail to follow, and exponentiating coefficients under a linear regression is one of them.
                  Best regards,

                  Marcos



                  • #10
                    Wah:
                    like Marcos, I fail to get why you exponentiated coefficients in your OLS.
                    Kind regards,
                    Carlo
                    (Stata 19.0)



                    • #11
                      Marcos: & Carlo:
I am trying to obtain output with odds ratios, but the default in linear regression is "exponentiated coefficients". Now I realize that I can choose from the drop-down list under the main tab. So, the code and output are as follows.

. mi estimate, eform("Odds Ratio") : regress total_methods v013 v395, eform(exp(Coef.))

Multiple-imputation estimates                   Imputations     =         40
Linear regression                               Number of obs   =     49,627
                                                Average RVI     =     0.2682
                                                Largest FMI     =     0.4438
                                                Complete DF     =         49
DF adjustment:   Small sample                   DF:     min     =     201.48
                                                        avg     =  20,206.63
                                                        max     =  41,265.23
Model F test:       Equal FMI                   F(   2,  873.3) =    1502.31
Within VCE type:          OLS                   Prob > F        =     0.0000

------------------------------------------------------------------------------
total_meth~s | Odds Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        v013 |   1.287934   .0054798    59.47   0.000     1.277238     1.29872
        v395 |   1.652788   .0504437    16.46   0.000     1.556257    1.755308
       _cons |   105.2282   2.015708   243.07   0.000     101.3505    109.2543
------------------------------------------------------------------------------




Marcos: Can you explain with an example how to use - if e(sample) -? I am still having a problem with missing data (in both the dependent variable and the independent variables) when I run logistic regression, so your explanation may help with that.

Problem in logistic regression: After I run the mi code for the missing data in those two variables (here: v307_05 and v395) and try to run the logistic regression, the original number of observations (12,885) does not show up; only about half (6,463) is used (please see below). Please suggest.

. mi estimate, eform("Odds Ratio") : logistic v307_05 v013 v025 v106 v190 v502 v384a v384b v384c v395

Multiple-imputation estimates                   Imputations     =         20
Logistic regression                             Number of obs   =      6,463
                                                Average RVI     =     0.2083
                                                Largest FMI     =     0.2913
DF adjustment:   Large sample                   DF:     min     =     233.51
                                                        avg     =   1,318.80
                                                        max     =   3,453.38
Model F test:       Equal FMI                   F(   9, 5902.1) =       3.46
Within VCE type:          OIM                   Prob > F        =     0.0003

------------------------------------------------------------------------------
     v307_05 | Odds Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        v013 |   1.206609   .1050285     2.16   0.031     1.016885    1.431729
        v025 |   .8413821   .2778571    -0.52   0.601      .440013     1.60887
        v106 |   1.217835   .2305061     1.04   0.298     .8401895    1.765224
        v190 |    1.30328   .1964717     1.76   0.080     .9687322    1.753362
        v502 |   1.594499   .4520771     1.65   0.100     .9137371    2.782448
       v384a |   1.281032   .4562676     0.70   0.487     .6371659    2.575536
       v384b |   .8571055   .2885819    -0.46   0.647     .4428341    1.658928
       v384c |   1.354522   .4679256     0.88   0.380     .6880681    2.666493
        v395 |   2.529506   .8674701     2.71   0.007     1.290403    4.958452
       _cons |   .0008935    .000988    -6.35   0.000     .0001012    .0078928
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.





                      Thanks again.
                      Warm regards,
                      Wah



                      • #12
                        Wah:
                        I really do not understand why you're not using -logistic- after -mi- if you want ORs.
                        As an aside, please share what you typed and what Stata gave you back via CODE delimiters. Thanks.
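For instance (a sketch, with the predictor list abbreviated from the model in post #11; -or- is an eform option accepted by -mi estimate-):

```stata
* odds ratios come directly from the logistic model fitted after mi
mi estimate, or : logit v307_05 v013 v025 v106 v395
```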
                        Kind regards,
                        Carlo
                        (Stata 19.0)



                        • #13
                          Marcos: Can you explain with an example to use - if e(sample)? Because I am still having a problem with missing data (both the dependent variable and independent variables) when I run logistic regression. So, your explanation may work for that.
When we deal with missing data in a regression analysis, we have casewise deletion. For example, if you have 20% missing values for sex and 10% (different) missing values for age, you will lose 30% of your observations. If you have, say, 3 models, the so-called "full" model can define the e(sample). Just start with that regression. Then, for the remaining models, add "if e(sample)" after the predictors.
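The casewise-deletion arithmetic above can be checked directly; the variable names here are hypothetical:

```stata
* count the complete cases the regression will actually use
egen byte nmiss = rowmiss(y sex age)
count if nmiss == 0
```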

I am trying to obtain output with odds ratios, but the default in linear regression is "exponentiated coefficients".
If we exponentiate coefficients under a logistic regression, we get ORs. So far so good. But we are not supposed to exponentiate coefficients under a linear regression, and in any case, doing so will not yield ORs.
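To illustrate the distinction (hypothetical variable names; a binary outcome versus a continuous one):

```stata
* odds ratios: exponentiated coefficients of a LOGISTIC model
logistic outcome_binary predictor

* linear regression: coefficients are differences in means;
* exponentiating them does not produce odds ratios
regress outcome_continuous predictor
```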

In short, these two points are fundamental in regression analysis, yet you wish to handle more sophisticated machinery (multiple imputation). Be aware that it is much safer to grasp the core knowledge before delving into MI commands.
                          Last edited by Marcos Almeida; 06 Jan 2020, 07:52.
                          Best regards,

                          Marcos



                          • #14
                            Originally posted by Carlo Lazzaro View Post
                            Wah:
                            I really do not understand why you're not using -logistic- after -mi- if you want ORs.
                            As an aside, please share what you typed and what Stata gave you back via CODE delimiters. Thanks.
                            Carlo: & Marcos:
Thanks for your comments. You are right, and I see the source of the confusion. I should simply report the coefficients from the linear regression, and, as you said, use logistic if I want ORs.


                            Warm regards,
                            Wah

