
  • Multivariate regression for discrete outcomes

    Hi, Experts:

    I am working on a cancer study in which I need to estimate the factors that determine patients' beliefs about how long they will live. I have three discrete outcome variables as dependent variables, say A, B and C. What I need to estimate is A = X'b1 + e1; B = X'b2 + e2; C = X'b3 + e3. The X variables are the same in all three equations.

    I would prefer multivariate regression because I suspect the errors of the three equations are correlated. However, my dependent variables are discrete. In that case, can I still use multivariate regression? Does multivariate regression rely on assumptions such as normally distributed errors?

    Thank you,

    Connie

  • #2
    If you mean "binary" when you say "discrete", try -findit mvprobit-.



    • #3
      Connie,

      I also have trouble understanding what you're asking for.

      Since you wrote that you are trying to "estimate factors determinate their belief of life length", I assume that you are not looking for a model with a binary dependent variable. What exactly do your dependent variables look like? I suppose they contain days, months, or something like that. In that case, I would call your outcomes continuous or quasi-continuous. As long as you have no issues with left- or right-censoring (a common problem with duration data, but one that probably does not apply to your dependent variables), you may use "normal" regressions in general.

      If you want to model the correlation between the error terms of your equations explicitly, you may take a look at structural equation modeling (SEM), which is, however, no easy task. There may be other ways to deal with the correlation between the error terms, but I do not have enough insight into your project to suggest any specific models.
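For outcomes that can reasonably be treated as continuous, one simpler alternative to a full SEM is Stata's mvreg. The sketch below is only an illustration on the auto dataset that ships with Stata, not on the patient data; the choice of variables is arbitrary. With identical regressors in every equation, the coefficients equal those from equation-by-equation OLS; what mvreg adds is the residual correlation matrix and the ability to test hypotheses across equations.

```stata
* Minimal sketch using Stata's shipped auto dataset (an assumption made
* for illustration; these are not the thread's patient data).
sysuse auto, clear

* Three continuous outcomes regressed on the same set of covariates.
* The corr option prints the correlation matrix of the residuals and a
* Breusch-Pagan test of independence across the equations.
mvreg headroom trunk turn = price mpg weight, corr
```

After estimation, a command such as `test mpg` tests whether mpg can be dropped from all three equations jointly.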



      • #4
        Prof. Stephen & Prof. Sebastian: Thank you for your replies, and sorry for the confusion. My outcomes are not binary. They are unordered categorical variables: less than 6 months, 6 months to 1 year, more than 1 year, and so on. Each patient is asked three questions about how long they believe they will live with the current treatment, without treatment, and with the best treatment.

        I think there may be some unobservable variable that influences the answers to all three questions, so I am thinking of using multivariate regression. However, since the outcomes are categorical, I am afraid I cannot use multivariate linear regression (I tried OLS first, but the normality test was rejected). It seems Stata has no command for a multivariate multinomial logit model. I also considered SUR, but the test of whether the off-diagonal elements are zero (that is, whether the error terms are uncorrelated) requires the assumption of normality.

        I am now using a multinomial logit for each equation. I am worried that estimating each equation separately with a multinomial logit will give biased results. Will it?
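One way to keep the separate multinomial logits but still obtain joint inference is to combine the fits with suest. The sketch below uses artificial data; the variable names a, b, c and z and the way the answers are generated are assumptions made only for illustration. Note that suest does not model the correlation between the answers. It only reports a combined robust covariance matrix for the three sets of estimates, so that tests across equations allow for answers coming from the same patient.

```stata
* Artificial data: a, b, c and z are hypothetical variable names.
clear
set seed 20240
set obs 300
generate double z = rnormal()          // observed covariate
generate double u = rnormal()          // unobserved effect shared by all three answers

* Three 3-category answers driven by z, the shared effect u, and noise.
foreach v in a b c {
    generate double lat_`v' = z + u + rnormal()
    generate byte `v' = 1 + (lat_`v' > -0.7) + (lat_`v' > 0.7)
}

* Fit one multinomial logit per question and store each set of results.
foreach v in a b c {
    quietly mlogit `v' z, baseoutcome(1)
    estimates store m_`v'
}

* Combine the three fits under a single robust covariance matrix.
suest m_a m_b m_c
```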



        • #5
          Your responses seem to have a natural ordering, and so you might want to exploit that more than multinomial logistic regression can. Because each patient is asked each of the three questions, it seems that a multilevel / hierarchical / mixed-effects ordered probit or logistic regression might be suitable. Treating patient as a random effect can help accommodate the lack of independence between the responses that you're concerned about.
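As a sketch of what such a random-intercept ordered probit assumes, with i indexing patients, j indexing the three questions, and K response categories (this notation is introduced here for illustration and is not from the thread):

```latex
\[
\begin{aligned}
y^{*}_{ij} &= \mathbf{x}_{ij}'\boldsymbol{\beta} + u_i + \varepsilon_{ij},
  \qquad u_i \sim N(0, \sigma_u^2), \quad \varepsilon_{ij} \sim N(0, 1), \\
y_{ij} &= k \iff \kappa_{k-1} < y^{*}_{ij} \le \kappa_k,
  \qquad \kappa_0 = -\infty,\ \kappa_K = +\infty, \\
\operatorname{Corr}\bigl(y^{*}_{ij}, y^{*}_{ij'} \mid \mathbf{x}\bigr)
  &= \frac{\sigma_u^2}{\sigma_u^2 + 1}, \qquad j \neq j'.
\end{aligned}
\]
```

The shared term u_i is what ties a patient's three answers together; a larger variance of u_i implies a stronger correlation between the latent responses.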

          Consider something like that below. (Begin at the "Begin here" comment. The first part of the do-file just creates the artificial dataset used in the illustration.)

          . version 14.2

          . 
          . clear *

          . set more off

          . set seed 1360445

          . 
          . quietly set obs 250

          . generate int pid = _n

          . generate double u = rnormal()

          . 
          . quietly expand 3

          . quietly drawnorm laty latx, double corr(1 0.5 \ 0.5 1)

          . 
          . foreach var of varlist lat? {
            2.         local manifest = substr("`var'", -1, 1)
            3.         generate byte `manifest' = 1
            4.         foreach cut in "0.333" "0.667" {
            5.                 quietly replace `manifest' = `manifest' + 1 if normal((`var' + u) / sqrt(2)) >= `cut'
            6.         }
            7. }

          . 
          . label define Responses 1 "less than 6 months" 2 "6 months to 1 year" 3 "more than 1 year"

          . label values y Responses

          . label variable y Response

          . 
          . label define Questions 2 "current treatment" 1 "without treatment" 3" with best treatment"

          . label values x Questions

          . label variable x Question

          . 
          . table x y

          ---------------------------------------------------------------------------------
                               |                          Response
                      Question | less than 6 months  6 months to 1 year    more than 1 year
          ---------------------+-----------------------------------------------------------
             without treatment |                174                  61                  23
             current treatment |                 70                 123                  66
           with best treatment |                  7                  66                 160
          ---------------------------------------------------------------------------------

          . 
          . *
          . * Begin here
          . *
          . meoprobit y i.x || pid: , nolog

          Mixed-effects oprobit regression                Number of obs     =        750
          Group variable:             pid                 Number of groups  =        250

                                                          Obs per group:
                                                                        min =          3
                                                                        avg =        3.0
                                                                        max =          3

          Integration method: mvaghermite                 Integration pts.  =          7

                                                          Wald chi2(2)      =     251.11
          Log likelihood = -653.27099                     Prob > chi2       =     0.0000
          ----------------------------------------------------------------------------------------
                               y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
          -----------------------+----------------------------------------------------------------
                                 |
                               x |
              current treatment  |   .9492233   .1154254     8.22   0.000     .7229937    1.175453
            with best treatment  |   2.129588   .1344049    15.84   0.000     1.866159    2.393017
          -----------------------+----------------------------------------------------------------
                           /cut1 |   .3396987   .0921161     3.69   0.000     .1591544     .520243
                           /cut2 |   1.636695   .1067688    15.33   0.000     1.427432    1.845958
          -----------------------+----------------------------------------------------------------
          pid                    |
                       var(_cons)|   .2198141   .0901879                      .0983592    .4912431
          ----------------------------------------------------------------------------------------
          LR test vs. oprobit model: chibar2(01) = 9.76         Prob >= chibar2 = 0.0009

          . 
          . exit

          end of do-file
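The meoprobit call above expects one row per patient-question pair. If the data are instead stored with one row per patient, the answers can be stacked first. The sketch below assumes a hypothetical wide layout with answers in A, B and C and one patient-level covariate, age; all names and values are invented for illustration.

```stata
* Hypothetical wide layout: one row per patient, answers in A, B, C.
clear
input pid A B C age
1 1 2 3 61
2 1 1 2 70
3 2 3 3 55
end

* Give the three answers a common stub, then stack them into long form.
rename (A B C) (y1 y2 y3)
reshape long y, i(pid) j(x)

* Result: 9 rows identified by pid and question number x (1, 2, 3);
* patient-level covariates such as age are repeated on every row.
list, sepby(pid)
```

Which question ends up as x = 1, 2 or 3 depends on the order chosen in the rename step, so the value labels for x should be defined to match.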



          • #6
            It doesn't affect the suggestion, but the discretization step in creating the artificial dataset would have been better as something like
            Code:
            generate byte y = 1
            foreach cut in "0.333" "0.667" {
                quietly replace y = y + 1 if normal((laty + u) / sqrt(2)) >= `cut'
            }
            bysort pid (latx): generate byte x = _n



            • #7
              Connie,

              With categorical data like this, you may consider interval regression, which is a generalization of tobit regression. With interval regression you can specify the start and end of each interval. Stata's command for interval regression is intreg. It takes two dependent variables and uses the following convention:


              Code:
                      intreg depvar1 depvar2 [indepvars] [if] [in] [weight] [, options]
              
                  depvar1 and depvar2 should have the following form:
              
                           Type of data                  depvar1  depvar2
                           ----------------------------------------------
                           point data          a = [a,a]    a        a
                           interval data           [a,b]    a        b
                           left-censored data   (-inf,b]    .        b
                           right-censored data   [a,inf)    a        .
                           ----------------------------------------------
              Your data are "interval data", so the first dependent variable should contain the start of the interval (e.g. 0 months) and the second dependent variable the end of the interval (e.g. 6 months). Your last category is probably right-censored ("more than x months"), so its second dependent variable should contain a missing value (.).

              To illustrate, I use the artificial dataset created by Joseph and modify it for intreg:

              Code:
              * Setting up test dataset
              version 14.2
              
              
              clear *
              
              set more off
              
              set seed 1360445
              
              
              quietly set obs 250
              
              generate int pid = _n
              
              generate double u = rnormal()
              
              
              quietly expand 3
              
              quietly drawnorm laty latx, double corr(1 0.5 \ 0.5 1)
              
              
              generate byte y = 1
              foreach cut in "0.333" "0.667" {
                  quietly replace y = y + 1 if normal((laty + u) / sqrt(2)) >= `cut'
              }
              bysort pid (latx): generate byte x = _n
              
              
              label define Responses 1 "less than 6 months" 2 "6 months to 1 year" 3 "more than 1 year"
              
              label values y Responses
              
              label variable y Response
              
              
              label define Questions 1 "without treatment" 2 "current treatment" 3 "with best treatment"
              
              label values x Questions
              
              label variable x Question
              
              
              table x y
              
              
              
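              * Start here
              
              gen start = .                        // Start of interval, in months
              replace start = 0  if y==1
              replace start = 6  if y==2
              replace start = 12 if y==3
              
              gen end = .                          // End of interval, in months
              replace end = 6   if y==1
              replace end = 12  if y==2
              * end stays missing for y==3: "more than 1 year" is right-censored
              
              * i.x treats the question as categorical, as in the meoprobit example
              intreg start end i.x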
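Because each patient contributes three rows, the interval regression above treats answers from the same patient as independent. A minimal sketch of one way to allow for that is to cluster the standard errors on patient. The data below are artificial, and the latent variable months and its parameters are assumptions made only for illustration.

```stata
* Artificial long-format data: 250 patients, 3 questions each.
clear
set seed 4711
set obs 250
generate int pid = _n
generate double u = rnormal(0, 3)                 // patient-level shift, in months
expand 3
bysort pid: generate byte x = _n                  // question asked (1, 2, 3)

* Hypothetical latent belief in months, then the reported interval.
generate double months = 3 + 3*x + u + rnormal(0, 4)
generate double start = cond(months < 6, 0, cond(months < 12, 6, 12))
generate double end   = cond(months < 6, 6, cond(months < 12, 12, .))

* Interval regression with standard errors clustered on patient.
intreg start end i.x, vce(cluster pid)
```

Clustering only corrects the standard errors. A random-effects interval regression, which models the patient effect directly, is available as xtintreg after `xtset pid`.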



              • #8
                Prof. Sebastian and Joseph: Thank you so much for the detailed explanations. Very helpful! I will try the models you suggested.

