Multivariate regression for discrete outcomes

Connie Gao

Join Date: Jun 2016

Posts: 40
#1


16 Oct 2016, 22:15

Hi, Experts:

I am working on a cancer study in which I need to estimate the factors that determine patients' beliefs about how long they will live. I have three discrete outcome variables as dependent variables, say A, B and C. What I need to estimate is A=X'b1+e1; B=X'b2+e2; C=X'b3+e3. The X variables are the same in all three equations.

I would prefer to use multivariate regression because I suspect the errors across the three questions are correlated. However, my dependent variables are discrete. In that case, can I still use multivariate regression? Does multivariate regression rely on assumptions such as normally distributed errors?

Thank you,

Connie
Stephen Jenkins

Join Date: Apr 2014

Posts: 1430
#2

17 Oct 2016, 00:38

-findit mvprobit-? If you mean "binary" when you refer to "discrete"
Sebastian Geiger

Join Date: Oct 2015

Posts: 124
#3

17 Oct 2016, 10:13

Connie,

I also have trouble understanding what you're asking for.

Since you wrote that you are studying patients' beliefs about how long they will live, I assume that you are not looking for a model with a binary dependent variable. What exactly do your dependent variables look like? I suppose they contain days, months, or something like that. In that case, I would call your outcomes continuous or quasi-continuous. As long as you don't have any issues with left- or right-censoring (a common problem with duration data, but one that probably does not apply to your dependent variables), you may use "normal" regressions in general.

If you want to model the correlation between the error terms of your equations explicitly, you may take a look at structural equation modeling (SEM), which is, however, no easy task. There may be other ways to deal with the correlation between the error terms, but I do not have enough insight into your project to suggest any specific models.
Connie Gao

Join Date: Jun 2016

Posts: 40
#4

17 Oct 2016, 20:11

Prof. Stephen & Prof. Sebastian: Thank you for your replies, and sorry for the confusion. My outcomes are not binary. They are un-ordered categorical variables with values such as less than 6 months, 6 months to 1 year, and more than 1 year. Each patient is asked three questions about how long they believe they will live: with current treatment, without treatment, and with the best treatment.

I think there may be some unobservable variables that influence the answers to all three questions, so I am thinking of using multivariate regression. However, since the outcomes are categorical, I am afraid I cannot use multivariate linear regression (I tried OLS at first, but the normality test fails). Stata does not seem to have a command for a multivariate multinomial logit model. I also considered SUR, but the test of whether the off-diagonal elements are zero (that is, whether the error terms are uncorrelated) requires the assumption of a normal distribution.

I am now using a multinomial logit for each equation. If I fit a multinomial logit to each equation separately, will the results be biased?
Joseph Coveney

Join Date: Apr 2014

Posts: 4386
#5

17 Oct 2016, 22:16

Your responses seem to have a natural ordering, and so you might want to exploit that more than multinomial logistic regression can. Because each patient is asked each of the three questions, it seems that a multilevel / hierarchical / mixed-effects ordered probit or logistic regression might be suitable. Treating patient as a random effect can help accommodate the lack of independence between the responses that you're concerned about.

Consider something like that below. (Begin at the "Begin here" comment. The first part of the do-file just creates the artificial dataset used in the illustration.)

. version 14.2

. 
. clear *

. set more off

. set seed 1360445

. 
. quietly set obs 250

. generate int pid = _n

. generate double u = rnormal()

. 
. quietly expand 3

. quietly drawnorm laty latx, double corr(1 0.5 \ 0.5 1)

. 
. foreach var of varlist lat? {
  2.         local manifest = substr("`var'", -1, 1)
  3.         generate byte `manifest' = 1
  4.         foreach cut in "0.333" "0.667" {
  5.                 quietly replace `manifest' = `manifest' + 1 if normal((`var' + u) / sqrt(2)) >= `cut'
  6.         }
  7. }

. 
. label define Responses 1 "less than 6 months" 2 "6 months to 1 year" 3 "more than 1 year"

. label values y Responses

. label variable y Response

. 
. label define Questions 2 "current treatment" 1 "without treatment" 3" with best treatment"

. label values x Questions

. label variable x Question

. 
. table x y

---------------------------------------------------------------------------------
                     |                          Response                         
            Question | less than 6 months  6 months to 1 year    more than 1 year
---------------------+-----------------------------------------------------------
   without treatment |                174                  61                  23
   current treatment |                 70                 123                  66
 with best treatment |                  7                  66                 160
---------------------------------------------------------------------------------

. 
. *
. * Begin here
. *
. meoprobit y i.x || pid: , nolog

Mixed-effects oprobit regression                Number of obs     =        750
Group variable:             pid                 Number of groups  =        250

                                                Obs per group:
                                                              min =          3
                                                              avg =        3.0
                                                              max =          3

Integration method: mvaghermite                 Integration pts.  =          7

                                                Wald chi2(2)      =     251.11
Log likelihood = -653.27099                     Prob > chi2       =     0.0000
----------------------------------------------------------------------------------------
                     y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-----------------------+----------------------------------------------------------------
                       |
                     x |
    current treatment  |   .9492233   .1154254     8.22   0.000     .7229937    1.175453
  with best treatment  |   2.129588   .1344049    15.84   0.000     1.866159    2.393017
-----------------------+----------------------------------------------------------------
                 /cut1 |   .3396987   .0921161     3.69   0.000     .1591544     .520243
                 /cut2 |   1.636695   .1067688    15.33   0.000     1.427432    1.845958
-----------------------+----------------------------------------------------------------
pid                    |
             var(_cons)|   .2198141   .0901879                      .0983592    .4912431
----------------------------------------------------------------------------------------
LR test vs. oprobit model: chibar2(01) = 9.76         Prob >= chibar2 = 0.0009

. 
. exit

end of do-file

.
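The var(_cons) estimate in the output above can be turned into a latent-scale intraclass correlation, which gives a rough sense of how strongly a patient's three answers hang together. What follows is only a minimal sketch: it assumes the usual latent-response formulation of the ordered probit, in which the residual variance is fixed at 1, and the scalar names var_u and icc are made up for illustration.

```stata
* Latent-scale intraclass correlation for a random-intercept ordered probit:
* var(_cons) / (var(_cons) + 1), assuming the probit residual variance is 1.
scalar var_u = 0.2198141          // var(_cons) copied from the output above
scalar icc   = var_u / (var_u + 1)
display "Latent-scale ICC = " %5.3f icc
```

With the estimate shown above this comes to roughly 0.18, which fits with the LR test rejecting the plain ordered probit in favor of the random-intercept model.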
Joseph Coveney

Join Date: Apr 2014

Posts: 4386
#6

17 Oct 2016, 22:35

It doesn't affect the suggestion, but the discretization step in creating the artificial dataset would have been better as something like

Code:

generate byte y = 1
foreach cut in "0.333" "0.667" {
    quietly replace y = y + 1 if normal((laty + u) / sqrt(2)) >= `cut'
}
bysort pid (latx): generate byte x = _n

Sebastian Geiger

Join Date: Oct 2015
Posts: 124
#7

18 Oct 2016, 04:43

Connie,

With categorical data like this, you may consider interval regression, which is a generalization of tobit regression. With interval regression you can specify the start and end of each interval. Stata's command for interval regression is intreg. It takes two dependent variables and uses the following convention:

Code:

        intreg depvar1 depvar2 [indepvars] [if] [in] [weight] [, options]

    depvar1 and depvar2 should have the following form:

             Type of data                  depvar1  depvar2
             ----------------------------------------------
             point data          a = [a,a]    a        a
             interval data           [a,b]    a        b
             left-censored data   (-inf,b]    .        b
             right-censored data   [a,inf)    a        .
             ----------------------------------------------

Your data are "interval data", so the first dependent variable should contain the beginning of the interval (e.g. 0 months) and the second dependent variable the end of the interval (e.g. 6 months). Your last category is probably right-censored ("more than x months"), so its second dependent variable should contain a missing value (.).

To illustrate, I use the artificial dataset created by Joseph and modify it for intreg:

Code:

* Setting up test dataset
version 14.2


clear *

set more off

set seed 1360445


quietly set obs 250

generate int pid = _n

generate double u = rnormal()


quietly expand 3

quietly drawnorm laty latx, double corr(1 0.5 \ 0.5 1)


generate byte y = 1
foreach cut in "0.333" "0.667" {
    quietly replace y = y + 1 if normal((laty + u) / sqrt(2)) >= `cut'
}
bysort pid (latx): generate byte x = _n


label define Responses 1 "less than 6 months" 2 "6 months to 1 year" 3 "more than 1 year"

label values y Responses

label variable y Response


label define Questions 2 "current treatment" 1 "without treatment" 3 "with best treatment"

label values x Questions

label variable x Question


table x y



* Start here

gen start = .                        // Start of interval
replace start = 0  if y==1
replace start = 6  if y==2
replace start = 12 if y==3

gen end = .                            // End of interval
replace end = 6   if y==1
replace end = 12  if y==2

intreg start end x
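Two variations on the intreg call may be worth trying: entering the question as a factor variable (i.x) so that each question gets its own coefficient, and clustering the standard errors on patient because each patient contributes three answers. The sketch below is self-contained and runs on made-up data; the variable months and its data-generating process are assumptions for illustration only, not the poster's data or Joseph's simulation.

```stata
* Hypothetical data: 250 patients, three questions each
clear
set seed 1360445
set obs 250
generate int pid = _n
generate double u = rnormal()            // patient-level effect shared by all three answers
expand 3
bysort pid: generate byte x = _n         // 1, 2, 3 = which question was asked
generate double months = 2 + 4*x + 3*u + 4*rnormal()   // assumed latent belief, in months

* Interval bounds; the open-ended top category gets a missing upper bound
generate double start = cond(months < 6, 0, cond(months < 12, 6, 12))
generate double end   = cond(months < 6, 6, cond(months < 12, 12, .))

* Question as a factor variable, standard errors clustered on patient
intreg start end i.x, vce(cluster pid)
```

The clustered standard errors only adjust inference for the within-patient dependence; they do not model it the way a patient-level random effect does.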


Connie Gao

Join Date: Jun 2016

Posts: 40
#8

18 Oct 2016, 19:25

Prof. Sebastian and Joseph: Thank you so much for the detailed explanations. Very helpful! I will try the models you suggested.