Logistic Regression

Ted Kaniuka

Join Date: Apr 2014

Posts: 33
#1

Logistic Regression

12 Feb 2016, 13:07

Hello - Searching for advice or readings. I have a data set where the probability of an event occurring (passing an exam) is approximately 95% for one group and 90% for the other. I wanted to run a logistic regression to determine whether group membership is related to this outcome. I vaguely remember reading that this type of data, that is, where the probability of success is highly skewed, presents a problem or may cause concerns with logistic regression. Any help pointing me in a direction is welcomed. Ted
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#2

12 Feb 2016, 14:02

Hello Ted,

You gave very little information about your data. That said, perhaps a complementary log-log regression can do the trick. You may wish to take a look at this text: http://www.stata.com/manuals13/rcloglog.pdf
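To make the difference between the two links concrete, here is a minimal sketch in plain Python (not Stata, and not taken from the manual linked above). The function names `logit` and `cloglog` are illustrative only. It shows that the logit link treats p and 1 - p symmetrically, while the complementary log-log link does not, which is the usual reason it gets mentioned for outcomes that are very common or very rare.

```python
import math


def logit(p):
    """Logit link: log-odds of probability p."""
    return math.log(p / (1.0 - p))


def cloglog(p):
    """Complementary log-log link: log(-log(1 - p))."""
    return math.log(-math.log(1.0 - p))


# Compare the two links at a few probabilities, including the
# 90% and 95% pass rates mentioned in #1.
for p in (0.05, 0.10, 0.50, 0.90, 0.95):
    print(f"p={p:.2f}  logit={logit(p):+.4f}  cloglog={cloglog(p):+.4f}")

# The logit is symmetric around p = 0.5: logit(p) = -logit(1 - p).
# The cloglog is not, so the two ends of the scale are treated differently.
print("logit symmetry gap:  ", logit(0.95) + logit(0.05))
print("cloglog symmetry gap:", cloglog(0.95) + cloglog(0.05))
```

Whether the asymmetric link actually fits these exam data better is an empirical question this sketch cannot answer.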

Best,

Marcos

Best regards,

Marcos

Ted Kaniuka

Join Date: Apr 2014
Posts: 33
#3

12 Feb 2016, 16:46

OK, here are the percentages passing for the three outcomes (algebra, English, and biology). The figures represent the percentage of students passing in two different types of schools. Questions were raised about the English scores, since their passing rates are in the 90 percent range, while for the other two tests we get closer to an 80/20 success/failure ratio. What other types of information would you like? Thanks for the help.

          2009          2010          2011          2012
          Trad   ECHS   Trad   ECHS   Trad   ECHS   Trad   ECHS
Algebra   81.5   81.1   77.3   80.8   78.6   80.8   82.9   83.3
English   90.7   94.6   85.9   90.9   87.9   91.4   87.4   92.9
Biology   65.7   71.4   77.1   82.8   80.8   85.3   83.1   88.4
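As a rough illustration of what a logistic regression on school type would be estimating, the English pass rates in the table can be turned into unadjusted odds ratios (ECHS versus Trad) for each year. This is only a sketch in plain Python: the dictionary name `english_pass_rates` is made up for the example, the percentages are copied from the table, and because no cell counts are given it cannot produce standard errors or account for matching or clustering within schools.

```python
# English pass rates (percent) from the table: year -> (Trad, ECHS)
english_pass_rates = {
    2009: (90.7, 94.6),
    2010: (85.9, 90.9),
    2011: (87.9, 91.4),
    2012: (87.4, 92.9),
}


def odds(percent_passing):
    """Convert a pass percentage into odds of passing."""
    p = percent_passing / 100.0
    return p / (1.0 - p)


# Unadjusted odds ratio of passing English, ECHS relative to Trad.
odds_ratios = {
    year: odds(echs) / odds(trad)
    for year, (trad, echs) in english_pass_rates.items()
}

for year, ratio in sorted(odds_ratios.items()):
    print(f"{year}: odds ratio (ECHS vs Trad) = {ratio:.3f}")
```

An odds ratio above 1 means higher odds of passing in ECHS schools in that year; it says nothing yet about whether the difference survives adjustment.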


Carlo Lazzaro

Join Date: Apr 2014

Posts: 17700
#4

13 Feb 2016, 02:50

Ted:
echoing Marcos' wise remark, you should also say, for instance, whether those results refer to the same pupils measured across years or to different pupils measured once during the 2009-2012 timespan.
In the first case, you would have panel data to analyze, whereas in the latter a survey analysis seems to be the best way to deal with your data.

Kind regards,
Carlo
(Stata 19.0)
Ted Kaniuka

Join Date: Apr 2014

Posts: 33
#5

13 Feb 2016, 07:58

Carlo and Marcos - Thanks for trying to help me even though I have provided too little. The data set contains the entire public school testing record for students, organized by the year they were seniors. The tests were taken at different times in a student's career, but the data are not panel data, as time is not a factor to be considered, and no student appears in more than one year. Essentially it is a cross-sectional data set, and I am looking to see whether, with different groups of students taking the same subject-area exams, attending one type of high school rather than another is associated with test score differences.

For each year, I created propensity-score-matched groups using pre-high school covariates (race/sex/wealth/middle school test scores) so I could analyze the test scores according to the type of high school attended. I ran a multilevel logistic regression with students nested in schools, but the full model's deviance statistic was not significantly different from the null model's (the level-two model is very simple, so I have considered adding some school-related covariates).

However, even if I do that, someone pointed out to me that since the data are highly skewed (many more successes than failures), especially for English, I need to explore an alternative method of analyzing the data. The complementary log-log approach appears to be designed to handle skewed data such as mine; I had never thought of the survey approach. My question is: why could that be a solution, and, with the additional information I have given, are there other alternatives? Again, thanks for all your time and help.
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#6

13 Feb 2016, 08:27

I absolutely agree with Carlo's recommendations.

To add, I fear important information about the study design has still not been presented properly, although things have become clearer since #3.

IMHO, the information given in #1 is tough to match with the information given in #3. The first points to a binary logistic regression ("pass" or "fail" outcomes), whereas the second suggests, at least to me, preliminary thoughts on multinomial regression.

Still trying to hazard a guess (maybe "guesses", in the plural), after reading #3 we get the impression the study deals with aggregate data. If so, please beware of the ecological fallacy. Moreover, if the study has data from different levels, well, a multilevel model is something to consider.

If the dependent variable is a percentage, please take a look at the recommendations for that sort of situation. You can find several threads on this subject, for it has already been broadly discussed in the Stata Forum.

To end, it was said in #1:

Any help pointing me in a direction is welcomed.

Well, there were several, including Carlo's warnings and tips.

Hopefully that helps!

Best,
Marcos

Last edited by Marcos Almeida; 13 Feb 2016, 08:35.

Best regards,

Marcos
Ted Kaniuka

Join Date: Apr 2014

Posts: 33
#7

13 Feb 2016, 09:02

Marcos - I am sorry I have failed to be clear. The outcome is binary (pass/fail). The data are not aggregate but individual student-level data. I have approximately 98k students as potential participants for each of the 4 years and 550 schools in which they could be nested.

Table 2
Variable Coding

Outcome EOG
Algebra   0 = Fail, 1 = Pass
English   0 = Fail, 1 = Pass
Biology   0 = Fail, 1 = Pass

Above is a section of a table that shows how the variables were coded for the outcomes (EOG). Thanks, Ted
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17700
#8

13 Feb 2016, 09:14

Ted:
I do agree with Marcos' concerns.
As per Marcos' reply, your last post adds some information (thanks for your effort), but things are still foggy for me.
You wrote that you fitted a mixed model but that your data are not organized as panel data (but can't a mixed model be interpreted as a two-way random-effects model?), in that the time-series dimension does not seem relevant for your research purposes (because your research design includes different pupils per year nested within the same schools each year?).
Skewness (of the dependent variable?) is not in itself a valid reason to shy away from fitting a mixed model; conversely, if you are referring to perfect prediction (which may creep up in logistic regression), I can't say whether or not you can fix that problem.
If your study was performed, say, in the UK, 90% success on an English test should not be surprising (in Italy, for instance, it should!).
As a closing-out remark, your chance of getting (more) helpful replies instead of educated guesses (as Marcos highlighted) is conditional on posting what you typed and what Stata gave you back (as per FAQ). Thanks.
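The distinction between a skewed outcome and perfect prediction can be sketched with a two-group table of pass/fail counts. The counts below are hypothetical (chosen only to mirror the 95% and 90% pass rates from #1), and the helper names `odds` and `odds_ratio` are illustrative. With high but imperfect pass rates the odds ratio is finite; it becomes infinite only when one group has no failures at all, which is the situation in which a logistic regression cannot produce a finite coefficient.

```python
import math


def odds(passes, fails):
    """Odds of passing; infinite when a group has no failures."""
    if fails == 0:
        return math.inf
    return passes / fails


def odds_ratio(pass_a, fail_a, pass_b, fail_b):
    """Odds ratio of group A versus group B.

    Assumes group B has at least one pass, so its odds are non-zero.
    """
    return odds(pass_a, fail_a) / odds(pass_b, fail_b)


# Hypothetical counts: 95% vs 90% passing. Skewed, but estimable.
skewed = odds_ratio(950, 50, 900, 100)
print("skewed outcome, odds ratio:", skewed)
print("log odds ratio:", math.log(skewed))

# Hypothetical counts: group A never fails. Group membership then
# predicts passing perfectly and the odds ratio is infinite.
separated = odds_ratio(1000, 0, 900, 100)
print("perfect prediction, odds ratio:", separated)
```

On this reading, pass rates of 90% and 95% with thousands of students are far from the perfect-prediction case.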

Kind regards,
Carlo
(Stata 19.0)


