  • Logit (not concave, pseudo r2 = 1)

    Dear Community,

    I'm really stuck on a problem with a logit regression that I'm running. My dataset has around 32 independent variables. Most are dummy variables, many of them derived from categorical variables (e.g., for a variable with 3 categories, I use one category as the reference and include 2 dummy variables); some of the other variables take values from 0 to 200+. I have around 100 observations, and I suspect this small number of observations might be the reason for the problem I'm facing.

    Dummy variables are: sr_dummy, human, age5, age40, age50 (categorical, reference category not included), freq115, freq115_150, freq150_11000, freq1100000 (categorical, reference category not included), treat_low, treat_high (categorical, reference category not included), inn, devst_mid, devst_late (categorical, reference category not included), org, fund, goal_40k_300k, goal_1m (categorical, reference category not included), qual, eff, phtm vid, res, plat, cure

    Regular variables (possible values, min-max): phd (0-5), wmort (0-100), wleng (0-1706), wcomm (0-303), wtwit (0-2155), wupd (0-28), wback (0-2083).

    When I run the logit regression, some of the variables are omitted, there are no coefficients at all, and the pseudo R2 is 1. I cannot understand the reason for that. The first two images are screenshots of the logit regression.
    [Screenshot: 1.JPG]

    [Screenshot: 2.JPG]



    However, when I run the probit regression, it takes Stata a couple of minutes to process it (16,000 iterations). After 16,000 iterations (all marked "not concave"), it says that convergence was not achieved. The pseudo R2 is also 1.



    [Screenshot: 3.JPG]




    .........


    [Screenshot: 4.JPG]




    [Screenshot: 5.JPG]


    I cannot wrap my mind around why logit and probit take such different numbers of iterations. I also cannot understand why the pseudo R2 is that high, why so many variables are omitted, and why there are coefficients for only some of the variables.

    I would really appreciate it if you could clarify some of the questions above.

    Cheers

  • #2
    you have many examples of perfect prediction here; this means, e.g., that when the variable "human" is not equal to 1, all 12 observations with this condition have the same value for the outcome (look at a crosstab of human by your outcome variable)

    the result is a sample size (64) that is "too small" for the number of predictors in the model; i.e., you need more, and better, data
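
    The crosstab check suggested above can be sketched as follows. This is a minimal illustration on simulated toy data, not the poster's dataset: the variable names human and sr_dummy are borrowed from the thread, but the values, the seed, and the 12/88 split are assumptions made only so the example runs on its own. Run it from a do-file.

```stata
* Simulated toy data (NOT the poster's data): 100 observations in which
* human == 0 always goes together with sr_dummy == 0
clear
set obs 100
set seed 12345
generate byte human = _n > 12
generate byte sr_dummy = cond(human == 0, 0, runiform() < 0.5)

* An empty cell in this table is the signature of perfect prediction
tabulate human sr_dummy

* logit reports that human != 1 predicts failure perfectly, then omits
* human and leaves those 12 observations out of the estimation sample
logit sr_dummy human

* Size of the estimation sample that remains
display e(N)
```

    With the real data, the same -tabulate- call on each dummy against the outcome shows which predictors are responsible for the dropped observations.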



    • #3
      In addition to Rich Goldstein's spot-on advice, I would suggest that:

      1. Even if you had no perfect prediction at all and no observations were dropped, a sample of 100 observations is not really adequate for a regression model (of any kind) with 32 predictors.

      2. The -probit- results you see are completely meaningless because the model did not converge. Whenever you see "convergence not achieved" (in red!) at the end, you must disregard the "results" that are shown. All those results are good for is troubleshooting which variables might be causing the non-convergence. (In the output you show, that would be nearly all of them--which is really just another way of saying your data here are not suitable for the analysis.)

      3. The number of iterations with -probit- is just an artifact here: Stata's default for maximum likelihood estimation is to give up after 16,000 iterations if convergence is not achieved. Had you set that limit differently (you can override it with the -iterate()- option), the number of iterations would have been whatever you specified. The -logit- model, by contrast, converged, although, as you can see, the results it converged to are not useful.
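
      As a rough sketch of the -iterate()- option mentioned in point 3, the lines below use the auto dataset that ships with Stata rather than the poster's data, and the cap of 50 is an arbitrary choice for illustration. This model converges well before the cap; the point is only to show where the option goes and how to check afterwards whether convergence was reached.

```stata
* Example dataset shipped with Stata (not the poster's data)
sysuse auto, clear

* Current ceiling on maximum likelihood iterations in this Stata session
display c(maxiter)

* Cap this particular fit at 50 iterations (50 is an arbitrary choice)
probit foreign mpg weight, iterate(50)

* 1 if the optimizer converged, 0 if it stopped without converging
display e(converged)

* Number of iterations actually performed
display e(ic)
```

      A small cap like this is mainly useful for troubleshooting, since it avoids waiting through thousands of "not concave" iterations before seeing which variables are causing trouble.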



      • #4
        Dear all,

        thank you for your replies. Decided not to proceed with logit regression.

        However, I've recently read about cox proportional hazards model regression, and I'm wondering if it can be a good solution in this case?
        I'm only a stata beginner, and don't know much about cox regressions, but I'm wondering what do you think? Is it feasible in my case?

        Would appreciate any information



        • #5
          It is unlikely that you will have better luck with Cox proportional hazards than you have had with -logit- and -probit-. The problem is that your ratio of observations to variables is too low, not that you chose the wrong regression command. You either need to get a lot more observations, or you need to strip a large number of variables out of your model. With 100 observations, you really should only have 3 or 4 predictors. Maybe stretch that to 10 if you are adventuresome and there aren't many consequences to getting results that fail to generalize. But there is no hope for a sensible model of 32 predictors from 100 observations. Even if you find a model that converges, all you will have done is overfit the noise in the data.
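
          To make the ratios in this advice concrete, here is the arithmetic as a few -display- lines. The counts (100 observations, 64 left after perfect prediction, 32 predictors) come from this thread; nothing here is a formal rule, just the observations-per-predictor figures implied by the numbers above.

```stata
* Observations per predictor in the model as originally specified
display 100/32

* The same ratio once perfect prediction has cut the sample to 64
display 64/32

* Ratio with 4 predictors, the upper end of the "3 or 4" suggested above
display 100/4

* Ratio at the "adventuresome" limit of 10 predictors
display 100/10
```

          The four lines print 3.125, 2, 25 and 10 respectively.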



          • #6
            Thank you, Clyde. I will try to decrease the number of independent variables and see how it goes.

            I'm wondering, if I decide to go with a cox regression afterwards, how should I run it?

            I've read a number of articles about cox regression, but didn't quite understand, how to apply it to my cross-sectional data. I'm wondering if you can help?



            • #7
              I have to be surprised that you regard Cox [the one in question certainly was not me, but we Coxes all deserve our capitals] regression as an alternative to logit. It's about survival in time, not about explaining a binary outcome.



              • #8
                Thank you for your reply Nick, I was trying to figure out options to answer my research question, thus, I was considering different options.

                As I said I'm a stata beginner, and, while I try to grasp as much knowledge as possible, sometimes it's challenging to interpret everything correctly.
                Thus, I might be mistaken assuming that I can use a Cox regression here. However, I'm wondering why it cannot be applied here?

                I understand that it is usually applied for survival analysis, however, I've also seen cases when it was applied to cross-sectional data. This makes me a bit confused.



                • #9
                  There is some scope here for asking for and receiving statistical advice, although naturally discussants can't really provide you on the fly with all the statistical training you need to do whatever you're doing.

                  Despite the wonderful name -- as said, Cox is not me and I'm not even related to Sir David Cox -- I don't use Cox regression and am not well placed to advise on it.

                  To get more help, most likely from others, I think at a minimum you need

                  1. To say what your research questions are.

                  2. To explain your response or outcome variable. sr_dummy is something of interest. It would make matters concrete to tell us what it is.

                  3. To explain the structure of your data. Is it cross-sectional data on people for example? What would define any survival process in time in your data?



                  • #10
                    It is hard to comment on what you have read, as we don't know what you have read.

                    In Cox regression you try to explain how long it takes before a certain event happens. You can collect survival-time data in a cross-section. For example: if you want to know how long it takes before someone has their first child, you only need to ask two questions: when were you born, and when did you have your first child? You don't need panel data for that. These two questions suffice. So the fact that you have seen Cox regression used with cross-sectional data does not mean it was applied to anything other than survival data.
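
                    The two-question example can be sketched in Stata roughly as follows. Everything in it is hypothetical: the five respondents, the variable names (birth_year, child_year, event, time), and the survey year of 2018 used to censor respondents who have no child yet are all assumptions for illustration.

```stata
* Hypothetical cross-sectional survey: one row per respondent,
* child_year is missing for respondents with no child so far
clear
input id birth_year child_year
1 1970 1995
2 1975 .
3 1980 2010
4 1982 2008
5 1985 .
end

* Event indicator: 1 if a first child was reported, 0 if censored
generate byte event = !missing(child_year)

* Time at risk in years; censored cases run to the assumed survey year 2018
generate time = cond(event, child_year, 2018) - birth_year

* Declare the data to be survival-time data and summarize it
stset time, failure(event)
stdescribe
```

                    The data are a single cross-section, yet every row carries a duration and an event indicator, which is what -stset- (and then -stcox-) requires. A binary outcome with no duration attached offers nothing to put in place of time.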

                    ---------------------------------
                    Maarten L. Buis
                    University of Konstanz
                    Department of history and sociology
                    box 40
                    78457 Konstanz
                    Germany
                    http://www.maartenbuis.nl
                    ---------------------------------



                    • #11
                      Thank you all for your input. The (dependent) variable that I'm interested in is sr_dummy, which takes the value 1 if a project was funded, and 0 otherwise. All other variables are either dummy variables or regular variables, which influence sr_dummy. All the data are cross-sectional, observed at one point in time.



                      • #12
                        That settles it: no Cox regression for you (this time).
                        ---------------------------------
                        Maarten L. Buis
                        University of Konstanz
                        Department of history and sociology
                        box 40
                        78457 Konstanz
                        Germany
                        http://www.maartenbuis.nl
                        ---------------------------------
