  • firthlogit in Stata

    Hi,

    I am using firthlogit in Stata on a small sample (N = 120).

    Below is the result:


    [Screenshot: firthlogit regression output]

    As you can see, the odds ratio for the information variable is very high, with a huge standard error and wide confidence interval. I have checked for multicollinearity and found none.
    Moreover, when I calculate marginal effects with margins, dydx(*), they come out equal to the coefficients of the main model. Results below:

    [Screenshot: margins, dydx(*) output]


    The sample comes from primary data and is unbalanced. Below is the tabulation of the dependent variable (meeting, binary) and the independent variable (information).

    [Screenshot: tabulations of meeting and information]

    Now, are these estimates good to go? Moreover, given the very small sample size, can I ignore the p-values, since they come out significant in most cases?
    Last edited by Kaibalyapati Mishra; 14 May 2025, 01:06.

  • #2
    On Statalist we don't use screenshots (in your case they are unreadable anyhow). Instead, copy the text output and paste it into the message inside a code block (the # button in the toolbar when you type a message).

    My guess is that you either have very few 0s or very few 1s. In that case your data just does not contain a lot of information. That is not very satisfactory, but the honest thing to do when you don't know something is to say that you don't know. That seems to be your case: your data just does not contain enough information for you to answer your question. To quote John Tukey (1986): "The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data".

    As to your last question: definitely not. First, you got it the wrong way around: in small samples the p-values tend to be large (i.e. "not significant"). Second, research with small samples is actually the case where p-values can be somewhat useful: we humans are very good at seeing "patterns" in random noise (e.g. the Rorschach or inkblot test, https://en.wikipedia.org/wiki/Rorschach_test ). Statistical tests can help prevent us from making such mistakes. Such random noise with apparent patterns is especially common in small datasets. So, in very large datasets statistical tests contain little information, but in small datasets they are a useful first step. This comes back to my first point: if the data just isn't good enough to answer our question, then the right thing to do is to say that and start collecting better data.

    I realize that that is not the answer you are hoping for, but sometimes somebody has to give you bad news, and this time that somebody is me.

    Tukey, J. W. (1986). Sunset Salvo. The American Statistician, 40(1), 72–76.
    ---------------------------------
    Maarten L. Buis
    University of Konstanz
    Department of history and sociology
    box 40
    78457 Konstanz
    Germany
    http://www.maartenbuis.nl
    ---------------------------------



    • #3
      Dear Prof.,

      Thank you so much for your response. Sorry about the screenshots; I have pasted the results into code blocks below. I also wrote incorrectly that the p-values come out significant, when in fact most are insignificant, as you can see below.
      Can you comment on what the problem seems to be now? Or is it still the quality of the data?

      Tabulation results:

      Code:
 tabulate information        // independent variable with the high odds ratio
      
          Receive |
      Information |      Freq.     Percent        Cum.
      ------------+-----------------------------------
                0 |        177       73.75       73.75
                1 |         63       26.25      100.00
      ------------+-----------------------------------
            Total |        240      100.00
      
      
      
      . tab meeting                               // dependent variable
      
      Attended WC |
          Meeting |      Freq.     Percent        Cum.
      ------------+-----------------------------------
                0 |        195       81.25       81.25
                1 |         45       18.75      100.00
      ------------+-----------------------------------
            Total |        240      100.00
      
      . tab meeting information
      
        Attended |  Receive Information
      WC Meeting |         0          1 |     Total
      -----------+----------------------+----------
               0 |       174         21 |       195 
               1 |         3         42 |        45 
      -----------+----------------------+----------
           Total |       177         63 |       240
      OR results:

      Code:
                                                              Number of obs =    120
                                                              Wald chi2(14) =  15.02
      Penalized log likelihood = -10.932615                   Prob > chi2   = 0.3770
      
      -------------------------------------------------------------------------------
            meeting | Odds ratio   Std. err.      z    P>|z|     [90% conf. interval]
      --------------+----------------------------------------------------------------
      _Iinformati_1 |      42.45   50.72137     3.14   0.002      5.94752    302.9839
           _Islum_1 |   .9995229   1.455872    -0.00   1.000     .0910527    10.97217
                age |   1.027918   .0391524     0.72   0.470     .9654939    1.094378
      _Imember_po_1 |   30.62242   50.64336     2.07   0.039     2.016722    464.9787
           _Iclub_1 |   .1385785   .1807705    -1.52   0.130     .0162126    1.184514
         _Igender_1 |   .8898505   1.152278    -0.09   0.928     .1057538    7.487525
      _Icouncilor_1 |   6.128302   7.505467     1.48   0.139     .8174456    45.94322
      _Isociety_c_1 |   8.907647    15.9983     1.22   0.223      .464275    170.9034
          _Icaste_1 |    .924658   1.182606    -0.06   0.951     .1128109    7.578986
      _Iownership_2 |   1.872342   1.900229     0.62   0.537     .3526913    9.939756
             hhsize |   .5333421   .1655405    -2.03   0.043     .3200982    .8886452
          log_per_Y |   .4465741   .5616841    -0.64   0.522      .056417    3.534901
      _Ieducation_2 |   .5035438   .6015231    -0.57   0.566     .0705811    3.592412
      _Ieducation_3 |   .2366295    .392097    -0.87   0.384     .0155019    3.612051
              _cons |   .3215287   1.830108    -0.20   0.842     .0000276    3743.097
      dydx(*) Results:

      Code:
      -------------------------------------------------------------------------------
                    |            Delta-method
                    |      dy/dx   std. err.      z    P>|z|     [95% conf. interval]
      --------------+----------------------------------------------------------------
      _Iinformati_1 |   3.748327    1.19485     3.14   0.002     1.406465    6.090189
           _Islum_1 |  -.0004772   1.456567    -0.00   1.000    -2.855296    2.854341
                age |   .0275354   .0380891     0.72   0.470    -.0471178    .1021887
      _Imember_po_1 |   3.421732     1.6538     2.07   0.039     .1803437    6.663121
           _Iclub_1 |  -1.976318   1.304463    -1.52   0.130    -4.533019    .5803826
         _Igender_1 |  -.1167018   1.294912    -0.09   0.928    -2.654682    2.421278
      _Icouncilor_1 |   1.812918   1.224722     1.48   0.139    -.5874936    4.213329
      _Isociety_c_1 |    2.18691   1.796019     1.22   0.223    -1.333222    5.707043
          _Icaste_1 |  -.0783314   1.278965    -0.06   0.951    -2.585057    2.428395
      _Iownership_2 |   .6271902   1.014894     0.62   0.537    -1.361966    2.616346
             hhsize |  -.6285923   .3103833    -2.03   0.043    -1.236932   -.0202522
          log_per_Y |  -.8061499   1.257762    -0.64   0.522    -3.271319    1.659019
      _Ieducation_2 |  -.6860845   1.194579    -0.57   0.566    -3.027417    1.655248
      _Ieducation_3 |   -1.44126   1.657008    -0.87   0.384    -4.688935    1.806416



      • #4
        I fear that I can't be more optimistic on your behalf than was Maarten Buis. Your detailed results imply that you have many missing values on individual variables, which drop out of the model fit. So you are estimating 15 parameters from 120 observations; 120 or more observations fall by the wayside. Without wanting to live or die by more precise rules, I would call that a stretch, implying a need to think about simpler models. But simpler models wouldn't be more successful; just simpler....

        For social science, though, all this seems about what one would expect. Wouldn't you be surprised if you could predict whether people attend a meeting (or whatever the outcome variable is) really just from some simple personal characteristics? My own attendance or non-attendance at meetings is often driven by purely personal and/or transient circumstances that wouldn't be in a dataset, and my guess is that such noise (in the statistical sense) is typical.



        • #5
          Thank you for your response, Nick Cox.



          • #6
            A few additional comments:

            I'm guessing you used the xi: prefix. That isn't necessary, since firthlogit supports factor variables. Factor-variable notation produces nicer-looking output and is also often necessary to get correct results from margins.
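
            For illustration, a minimal sketch of the two syntaxes. The covariate list is inferred from the _I names in your output, so treat it as an assumption:

            Code:
             * What the _I names suggest was run (xi: creates _I dummies):
             * xi: firthlogit meeting i.information i.slum age ...
             
             * Factor variables, which firthlogit supports directly:
             firthlogit meeting i.information i.slum age i.member_po i.club ///
                 i.gender i.councilor i.society_c i.caste i.ownership       ///
                 hhsize log_per_Y i.education, or
            With factor variables, margins, dydx(*) knows which regressors are 0/1 indicators and treats them as discrete changes rather than derivatives.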

            The default predict option for firthlogit is xb. Therefore, since you don't have any interactions or product terms, the marginal effects will be the same as the unexponentiated coefficients.

            Most of your variables do little or nothing. Especially with a small N, junk variables can increase the standard errors for all the variables and make it harder for any of them to be statistically significant. Rethink whether all these variables are theoretically necessary. Even if they are, you may have to sacrifice something because your sample size is too small to detect effects.

            Why are you losing so much data? Is one variable in particular poorly measured, with a lot of missing data (MD) zapping you? If so, consider dropping it. Or are there alternative measures with less MD that you could use instead?
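
            One way to see where the cases go is official Stata's misstable. A sketch only; the variable list is again inferred from your output:

            Code:
             * How many missing values does each model variable have?
             misstable summarize meeting information slum age member_po club ///
                 gender councilor society_c caste ownership hhsize           ///
                 log_per_Y education
             
             * Which patterns of missingness account for the dropped observations?
             misstable patterns meeting information hhsize log_per_Y, frequency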

            Having said all that, it may just be that you do not have adequate data for testing your ideas. As Maarten says, "if the data just isn't good enough to answer our question, then the right thing to do is to say that and start collecting better data."

            Or, as Nick says, "Wouldn't you be surprised if you could predict whether people attend a meeting (or whatever the outcome variable is) really just from some simple personal characteristics?" Even a sample of 10,000 cases might not show very strong relationships.

            But you can still try to do more with your current data. Simplify the model to use fewer variables and/or see if there are reasonable ways to reduce the number of cases lost to MD.
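
            As a purely illustrative sketch of "simplify the model", keeping only the predictors that showed any signal in your output (whether these are the theoretically right ones is your call):

            Code:
             * A much sparser specification, fitted to the same data:
             firthlogit meeting i.information i.member_po hhsize, or
            Fewer parameters per observation should shrink the standard errors, though as Nick notes, a simpler model wouldn't necessarily be more successful, just simpler.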
            -------------------------------------------
            Richard Williams, Notre Dame Dept of Sociology
            StataNow Version: 19.5 MP (2 processor)

            EMAIL: [email protected]
            WWW: https://www3.nd.edu/~rwilliam



            • #7
              As Maarten indicated, there might be an issue of highly skewed zeros or ones. To see how meeting is distributed in the estimation sample, just run the tabulation and cross-tab after your regression:

              Code:
               firthlogit y x1 x2 ...

              Code:
               tab meeting if e(sample)
               tab meeting information if e(sample)
              Best regards,
              Mukesh

