  • Problem with misspecification

    Hello All,

    Using a dataset of 600,000 observations and 60 variables, I ran regress and then tested for misspecification with ovtest. The resulting F test had a p-value of 0.0000.
    This initial regression had a logged dependent variable, 3 continuous independent variables (2 of which were logged), and 2 dummy variables to represent categorical variables.

    Following the ovtest, I added polynomials and interaction terms between the continuous variables and the dummies to the regression. I ran regress again and, although the F statistic was lower this time, the p-value was still 0.0000.

    Would be grateful for any advice as to how to specify this model.

    Best,

    Ciaran

  • #2
    Ciaran:
    with such a large number of observations, even a tiny departure from the null can come out statistically significant.
    That said: have you tried -linktest- after -regress-? What's its outcome?
    As an aside (and per FAQ), please share what you typed and what Stata gave you back via CODE delimiters. Thanks.
    Kind regards,
    Carlo
    (Stata 19.0)
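
    A minimal sketch of the check suggested above, assuming the variable names that appear in the regression output posted later in this thread. With several hundred thousand observations, the size of the _hatsq coefficient is arguably more informative than its p-value alone.

    ```stata
    * Fit the model, then run the link test on it
    regress LogEnergyRequirement LogHLI LogGFA Age Terraced SemiD
    linktest
    * linktest refits the dependent variable on the prediction (_hat) and its
    * square (_hatsq); a _hatsq coefficient clearly different from zero
    * points to misspecification
    ```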


    • #3
      Thanks again Carlo,

      I was concerned that the size of the dataset might be causing issues, so thank you for clarifying. Perhaps I will clean it up, or maybe just take a random sample.

      As for the linktest, the p-values for both _hat and _hatsq are < 0.0001.

      Below is the CODE I used.

      Code:
      reg LogEnergyRequirement LogHLI LogGFA Age Terraced SemiD
       
            Source |       SS           df       MS      Number of obs   =   382,543
      -------------+----------------------------------   F(5, 382537)    >  99999.00
             Model |  66000.3431         5  13200.0686   Prob > F        =    0.0000
          Residual |  15071.3472   382,537  .039398404   R-squared       =    0.8141
      -------------+----------------------------------   Adj R-squared   =    0.8141
             Total |  81071.6903   382,542  .211928861   Root MSE        =    .19849
       
      ------------------------------------------------------------------------------
      LogEnergyR~t |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
            LogHLI |   .8692931   .0012401   701.00   0.000     .8668626    .8717236
            LogGFA |   .8144884   .0009994   815.01   0.000     .8125296    .8164471
               Age |   .0004029   .0000139    29.06   0.000     .0003758    .0004301
          Terraced |  -.0514299   .0011039   -46.59   0.000    -.0535936   -.0492663
             SemiD |  -.0308624    .000786   -39.26   0.000     -.032403   -.0293218
             _cons |   5.430726   .0052307  1038.24   0.000     5.420474    5.440978
      ------------------------------------------------------------------------------
       
      ovtest
       
      Ramsey RESET test using powers of the fitted values of LogEnergyRequirement
             Ho:  model has no omitted variables
                    F(3, 382534) =   1030.78
                        Prob > F =      0.0000

      Very grateful for your advice.

      Kind Regards

      Ciaran
      (Stata 15.1)
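
      The random-sample idea mentioned above could be sketched as follows. The seed and the 5 percent sampling fraction are arbitrary assumptions chosen for illustration. A smaller sample only lowers the power of the test; it does not repair a misspecified model, so this is mainly a way to gauge how much of the rejection is driven by sample size.

      ```stata
      * Keep the full dataset recoverable
      preserve
      * Arbitrary seed, for reproducibility
      set seed 12345
      * Keep a 5% random sample of the observations
      sample 5
      regress LogEnergyRequirement LogHLI LogGFA Age Terraced SemiD
      * Current syntax for the Ramsey RESET test
      estat ovtest
      * Return to the full dataset
      restore
      ```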


      • #4
        Ciaran:
        what if you use -c.Age##c.Age- instead of -Age- only?
        If that does not help, the usual recipe is to start all over again: add one predictor at a time, run -regress- and -linktest- after each addition, and see when the problems start to appear.
        Kind regards,
        Carlo
        (Stata 19.0)
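
        One way to read the suggestion above, as a sketch: factor-variable notation adds the squared Age term, and the model is then rebuilt one predictor at a time with linktest after each fit. The order in which the predictors are added here is an assumption, not something given in the thread.

        ```stata
        * Quadratic in Age (variable names are case-sensitive in Stata)
        regress LogEnergyRequirement LogHLI LogGFA c.Age##c.Age Terraced SemiD
        linktest

        * Rebuild step by step, checking after each addition
        regress LogEnergyRequirement LogHLI
        linktest
        regress LogEnergyRequirement LogHLI LogGFA
        linktest
        regress LogEnergyRequirement LogHLI LogGFA c.Age##c.Age
        linktest
        ```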


        • #5
          Thanks again Carlo,

          I will go back to regress Energyrequirement HLI and use ovtest and linktest to identify which powers and interactions to add, expanding and specifying the model as I go.

          Best,

          Ciaran
