problems with data

Bristish Election Reproduction

Join Date: Sep 2018

Posts: 3
#1

17 Sep 2018, 21:10

Hello,

Currently, I am working on an econometrics project at uni. I am using data from the British Election Study, specifically the BES2017_W13 dataset that is available online. We are trying to see which variables influence the probability of voting for the Conservative Party; however, there is a problem with the data set.
The variable profile_gross_personal, which is an int according to Stata, shows intervals as its values. To overcome this strange problem I used the following commands:

tostring profile_gross_personal, generate(gross)

tabulate gross

tabulate profile_gross_personal

gen income=0

replace income=2500 if gross=="1"

replace income=7500 if gross=="2"

replace income=12500 if gross=="3"

replace income=17500 if gross=="4"

replace income=22500 if gross=="5"

replace income=27500 if gross=="6"

replace income=32500 if gross=="7"

replace income=37500 if gross=="8"

replace income=42500 if gross=="9"

replace income=47500 if gross=="10"

replace income=55000 if gross=="11"

replace income=65000 if gross=="12"

replace income=85000 if gross=="13"

replace income=100000 if gross=="14"

drop if income==0   // gets rid of the observations left at 0 by the generate command earlier

I continued like this so that each interval is represented by the average value in that interval. As a sanity check, I then tried a basic regression: reg income england (england is a dummy variable I created), to see whether the recoding had worked. That ran and gave me regression output. However, when I run logit income england I get an r(2000) error: outcome does not vary.
Does someone know a way around this problem?

My first guess was that this happened because income is a float with strange values, since normally floats are only dummies. Hence, I used the following command:

recast int income, force

However, this cuts off all the values of income that are above 32500. So this workaround did not work, and it did not solve the r(2000) problem.
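For context, Stata's int storage type only holds integers from -32,767 to 32,740, which would explain why forcing income into int loses the larger values; the storage type itself has no bearing on r(2000). A minimal do-file sketch with made-up values (the three incomes are illustrative, not taken from the BES file), using -compress- rather than -recast, force-:

```stata
* Toy data: three hypothetical income values stored as float
clear
set obs 3
generate float income = 2500
replace income = 32500 in 2
replace income = 47500 in 3

* compress changes a storage type only when no information is lost;
* 47500 does not fit in int, so income is left as it is
compress income
describe income
list income
```

Unlike -recast- with the force option, -compress- never alters the values themselves.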

So my question is: does someone know a way to overcome the r(2000) error, or a solution to the initial problem that the variable appeared to be stored as a string but was classified as an int? I also got some strange classifications in other variables, so this r(2000) appears more often.
Thank you in advance!

Kind Regards
Tags: None
John Riveros

Join Date: Sep 2018

Posts: 43
#2

17 Sep 2018, 21:34

Hello, r(2000) problems are usually associated with the dependent variable. (Otherwise, Stata usually just drops an independent variable because of collinearity.)

I would like to see how you coded the logistic regression, and especially the specification of the variables.

Logistic regression takes a binary response as the dependent variable, which means that Y has only two values {0, 1}. X can be continuous, discrete, or a (binary) dummy.

However, when I ran some logistic regressions in my econometric studies, I noticed that if Y has very few 0s and lots of 1s, or vice versa, the regression sometimes cannot be estimated because of r(2000) problems. This means that there are too few observations with Y = 0 or Y = 1 across the values of X.

A solution is to start by looking at the structure of your dependent variable. I noticed that you're regressing income, but income has no binary responses like 0 or 1 (it has values such as 2500, 7500, etc.), and unless you're using a multinomial regression, I believe it cannot be estimated by logistic regression.

Keep in mind that the dependent variable must have binary responses; the independent variables then help you understand, in terms of probability, the change between Y=0 and Y=1.

Y has to be binary. X can be continuous, discrete, or binary.
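As a small illustration of that rule, here is a sketch using the auto dataset that ships with Stata as a stand-in (it is not the BES data): foreign is already coded 0/1, and the predictors are continuous.

```stata
* Stand-in data shipped with Stata; foreign is coded 0/1
sysuse auto, clear

* Confirm the outcome takes only the values 0 and 1
tabulate foreign, nolabel

* Binary Y, continuous X
logit foreign mpg weight
```

Checking the outcome with -tabulate, nolabel- before fitting the model is a cheap way to see what -logit- will actually be given.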

Last edited by John Riveros; 17 Sep 2018, 21:50.
Comment
Bristish Election Reproduction

Join Date: Sep 2018

Posts: 3
#3

17 Sep 2018, 21:55

Hey, thank you for your response.
Any recommendations about how to make a discrete variable from some string inputs?
I started with gen as the basic command and then just allocated more outcomes than the 0-1 limitation you just mentioned. I used that method since it was the only one they have taught us.
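One possible reading of the "stored as a string but classified as an int" puzzle is that the variable is numeric with value labels attached, so -tabulate- displays the label text while the data hold integer codes. The do-file sketch below assumes that reading; the label texts, the label name grosslbl, and the "prefer not to answer" code 15 are invented for illustration, and only the code-to-income mapping (1 to 2500, 2 to 7500, 14 to 100000) comes from the first post.

```stata
* Toy stand-in for the BES variable: integer codes plus hypothetical value labels
clear
input int profile_gross_personal
1
2
14
15
end
label define grosslbl 1 "under 5,000" 2 "5,000 to 9,999" 14 "100,000 or more" 15 "Prefer not to answer"
label values profile_gross_personal grosslbl

describe profile_gross_personal              // storage type int, value label attached
tabulate profile_gross_personal              // shows the label text
tabulate profile_gross_personal, nolabel     // shows the underlying codes

* Map codes to income values in one step; anything not listed becomes missing
recode profile_gross_personal (1 = 2500) (2 = 7500) (14 = 100000) (else = .), generate(income)
list profile_gross_personal income, nolabel
```

Sending unlisted codes to missing, rather than to 0 followed by -drop-, keeps those observations available for models that do not use income.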

My regression is the following right now:
vote2017 (voting for the Conservatives, a dummy in my dataset) = income (the discrete variable that I wanted to create with the commands in my original post) + BX, where X is a vector of other variables that I am still busy converting, because they too are stored in the wrong format in Stata. What I did in my original post was just logit income england, since I thought that any problems would surface using that regression, which was the case.

Hopefully, this gives some insight into what I am trying to do.

Kind Regards
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#4

17 Sep 2018, 22:08

However, when I run the logit income england I get a r(2000) error; outcome does not vary.
Does someone know a way around this problem?

The problem is that a -logit- model with income as your dependent variable is just inappropriate and the way to get around it is to use a suitable model.

The outcome variable of a -logit- model should be a dichotomous variable; the simplest and best way is for it to be coded as 0 for False and 1 for True. When Stata encounters a variable listed as the dependent variable in a -logit- model that is not coded 0/1, it interprets it as 0 = False and everything except 0 = True. Since you deleted all observations with income = 0, your data is being read as having a True outcome in every observation. Therefore the outcome does not vary. It is simply not possible to use this variable as an outcome in a -logit- model.
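The behaviour described above can be reproduced with the auto dataset that ships with Stata, used here purely as a stand-in: price is never 0, so -logit- reads every observation as a positive outcome. The variable expensive and the 6000 cutoff are arbitrary choices for illustration.

```stata
sysuse auto, clear
summarize price                    // minimum is above 0, so no "negative" outcomes exist

* capture noisily shows the error message without stopping the do-file
capture noisily logit price mpg
display "return code: " _rc        // should be 2000, as in the posts above

* A genuinely dichotomous outcome lets the same command run
generate byte expensive = price > 6000
logit expensive mpg
```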

It is unclear to me what you were thinking when you wrote that command. After all, the values of your income variable don't look like they were designed to be collapsed into a zero vs non-zero dichotomy. This looks like you have some kind of continuous variable in mind. So it puzzles me why you were even thinking about -logit- in connection with it.

Note: My response here is no different in substance from what was posted in #2. I'm just trying to make the same point that John Riveros made more emphatically and in greater detail.

Please note also that the norm in this community is to use our real given and surnames as our username, to promote collegiality and professionalism. Unfortunately, the Forum software does not allow you to change your username by editing your profile. Please click on the CONTACT US button in the lower right hand corner of this page and message the system administrator requesting a change of your user name. Thank you.
Comment
Bristish Election Reproduction

Join Date: Sep 2018

Posts: 3
#5

17 Sep 2018, 22:25

Hey, I just ran the intended regression with voting as the dependent variable, and indeed it works.
In hindsight I know it was a stupid mistake.
I will wait until this post is a bit further down before I change my name, so that the professor cannot trace my stupid mistake to me.

Kind Regards,
Comment
Joseph Coveney

Join Date: Apr 2014

Posts: 4421
#6

18 Sep 2018, 00:31

There's probably no need to be bashful.

1. Your professor has made similar mistakes over the years. Count on it.

2. If your professor is spending time scouring the Internet searching for names of students who are seeking to gain a better understanding in the use of some software in order to ding them in some way, then tick all that apply:

□ Your professor needs a more constructive purpose in life

□ You need a new professor

3. If your classmates would benefit by your professor's becoming aware of where students need more instruction or assistance in correct syntax or understanding of the use of some software, then you're better off in the long run (and everyone else is, too, including your professor) just standing up in class and letting everyone know.
Comment
Tom Norton

Join Date: Oct 2023

Posts: 2
#7

17 Oct 2023, 10:16

I am having a somewhat similar problem, at least as to the sort of error: r(2000), no observations.* I am attempting to use the General Social Survey (NORC) cumulative data file 1972-2022 and use logit analysis.

One can argue about the applicability of the statistical method to the concept, but in the past I got results at least. I am using Stata 18 SE. I had to redo the license the other day, but this does not remove Stata itself; it merely reactivates it and runs updates. I spent quite some time yesterday re-coding, as I was taught (I am very non-technical and not a programmer), a series of variables into dichotomous variables (see below for more information on this process).** After getting the error in multiple attempts using different re-coded variables, I assumed my re-coding was wrong. I selected a variable that is dichotomous by nature (no re-coding required): "should abortion be legal, yes or no", listed as "abany" in the GSS cumulative file 1972-2022. The IVs do not have to be dichotomous for logit to at least run and give a result. Same error. For example, I attempted "logit abany partyid." This has to return something! The data is there in the data browser. But "r2000." After downloading different data sets, including single-year cross-sectional sets, and having them all fail with exactly the same error, I tried a GSS set I downloaded in March of this year. And of course it works just fine using logit. I used the same re-coding process (I even copied it from my log). I have contacted NORC, but I wondered if anyone else is having issues like this right now.

Let me also say this is the best I can do describing things. I am a bit frantic since this analysis is for a few days from now, and with everything else, there is not a lot of time to redo everything between now and then. I am very non-technical. What I have learned about Stata is mostly from having others teach it to me. As an example, I find the help function and pdf documentation extremely cryptic. I am able to find assistance through it only after much trial and error attempting to interpret it. I need something more at the 'dummy' level: step 1 type this, step 2 type this. Not a lot of things are intuitive here, but that is my issue, not that of Stata. It has capacities for large data sets that make it really the singular option. Thanks for any assistance.

Thanks,
Tom

*The error is as follows:

. logit abany polviews

outcome does not vary; remember:
0 = negative outcome,
all other nonmissing values = positive outcome
r(2000);

** This re-coding involves changing a variable with 3 response categories ("a great deal," "somewhat," and "hardly any") into 3 different variables coded 1 or 0 (yes or no), with one of the variables serving as the marker for multicollinearity. So "a great deal" becomes 1 (yes) and "hardly any" becomes 0 for one variable, considered the "high" version, and all other 'off' responses (the .d, .i, .n non-response codes) get removed through the process. The coding is along the lines of "generate ABC variable", then "replace ABC variable = 1 if (original variable) == 1", then the same for 0 "if (original variable) == 3", etc. This has worked just fine in the past. There is data there in the data browser.
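The generate/replace pattern in the footnote can be sketched as follows, together with a cross-tabulation that shows whether the new variable really contains both 0s and 1s. The variable names confid and high, and the coding 1 = "a great deal", 2 = "somewhat", 3 = "hardly any", are assumptions for illustration and are not taken from the GSS codebook.

```stata
* Toy data; coding assumed: 1 = a great deal, 2 = somewhat, 3 = hardly any
clear
input byte confid
1
2
3
3
1
.
end

* Start from missing, then fill in only the two categories of interest
generate byte high = .
replace high = 1 if confid == 1
replace high = 0 if confid == 3

* Check the result against the original, including missing values
tabulate confid high, missing

* Equivalent one-line version: 1/0 for codes 1 and 3, missing otherwise
generate byte high2 = (confid == 1) if inlist(confid, 1, 3)
assert high == high2
```

If the cross-tabulation shows no observations with high equal to 0, a -logit- with high as the outcome would fail with the same "outcome does not vary" message.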
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#8

17 Oct 2023, 10:46

The -logit- command requires that your outcome variable be coded with two values, 0 and something not-0. It treats all non-zero values as a positive outcome. Your outcome variable is apparently not coded 0 and not-0, and Stata is trying to tell you that. I happen to know that in the GSS data file, abany and the various other yes/no abortion attitude variables are coded 1 (yes), 2 (no), which you can verify by using -tabulate-. Some of the previous responses in this thread touch on that issue, so you might now find it useful to read them again.

Most of us on Statalist follow the policy of not giving detailed help on homework, so I won't go farther than explaining what Stata expects regarding response variable coding. Beyond that, I'd encourage you to ask your professor for help, who is in a better position to know what they consider acceptable help on this assignment.
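A minimal sketch of turning a 1/2 yes/no variable into the 0/1 coding that -logit- expects, using toy data rather than the GSS file; the new variable name abany01 is hypothetical.

```stata
* Toy data mimicking a 1 = yes, 2 = no coding with one missing response
clear
input byte abany
1
2
2
1
.
end

tabulate abany, missing nolabel     // no zeros, so logit would see only positive outcomes

* 1 stays 1 (yes), 2 becomes 0 (no), anything else stays missing
generate byte abany01 = (abany == 1) if inlist(abany, 1, 2)
tabulate abany abany01, missing
```

On the real data, abany01 would then take the place of abany as the outcome in the -logit- command.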
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#9

17 Oct 2023, 10:47

You have already demonstrated that the code works properly with other data. So you know that the problem is with the data, at least to the extent of the data not being suitable for that code. You have given a reasonably clear description of the code. All you have said about the data is that it comes from the General Social Survey cumulative file for 1972-2022. I don't think anybody can give you useful advice based only on that. I think you need to post example data from your data set. So do this:

Code:

use type_the_name_of_your_data_set_here, clear
by abany, sort: keep if _n <= 20
count
local nobs = r(N)
dataex abany polviews, count(`nobs')

Then follow the instructions on the screen which tell you what part of the output to select, copy, and paste into the Forum editor when you post back.

Added: Crossed with #8. I do follow the homework policy alluded to there, although I did not perceive #7 as being about a homework problem. If it is, then please do not post back; seek help within your school's resources instead. But if it is not, I am happy to proceed and provide more detailed and specific help.

Last edited by Clyde Schechter; 17 Oct 2023, 10:50.
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#10

17 Oct 2023, 11:01

Clyde Schechter makes a good point. I presumed this was homework, having (like many other sociologists) used the GSS data for many years in a teaching context, and having had students run into problems with the 1/2 coding of this and some other yes/no outcome variables. ("frantic" and "for a few days from now" also sounded like a common experience in a homework situation, too.) My apologies to Tom for being unhelpful if this is *not* a homework assignment.
Comment
Tom Norton

Join Date: Oct 2023

Posts: 2
#11

17 Oct 2023, 13:02

Thank you everyone. Indeed, in my haste to diagnose, I forgot abany is 1 and 2 not 0 and 1. A little while ago I realized some variables I did not try with each other last night do in fact "work" in the sense that logit provides a result. And it set me on a course to correct things. After multiple trials taking things in order, I blew out my problem variables and started over. So far, so good. I am still not clear where the re-coding went wrong, but somewhere it did.

Thank you for your patience, and for your attempts to help without the usual specificity users bring to their questions and posts. As hard as I try not to panic, I still do, despite having been at this too many years to still do so. No, it is not homework. And no apologies needed at all. My post does have that tone. It is all a reminder of how far I still have to go, however, past what is available via formal instruction.
Comment
