Variable Too Many Values

David Adler

Join Date: Sep 2017

Posts: 2
#1


25 Sep 2017, 06:54

Hello all,

I am not exactly new to Stata, but I am quite new to importing my own files after converting them to a .dta format via the Data Editor.

Anyway, I have imported a fairly large data set with 20,000 observations in total. Within this data set I have a variable: income. Income contains the information you would think it does: the exact income for each of the 20,000 observations. When I attempt to

Code:

tabulate income

the error appears: "income has too many values." I have read other forums mentioning the creation of dummy variables, or even reshaping the variables. However, I have tried running these suggestions and I've failed in each attempt.

Could one of you Stata wizards help explain to me how I should go about shortening a variable without eliminating potentially valuable numeric data?

Thanks in advance,
David
Nick Cox

Join Date: Mar 2014

Posts: 35694
#2

25 Sep 2017, 07:16

The error message just tells you that you have too many distinct values to tabulate.

Code:

help limits

gives the limits for tabulate in your Stata. Usually when this happens you wouldn't want to scroll through a table with thousands of rows anyway, so it's no loss.
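
One way to check how many distinct values a variable has before trying to tabulate it is sketched below. This is a minimal sketch only: the data are simulated as a stand-in for the real file, and the variable name first is an arbitrary choice for illustration.

```stata
* Simulated stand-in for the real data (assumption: income is numeric and positive)
clear
set seed 12345
set obs 20000
generate income = round(exp(rnormal(11, 1)))

* Tag exactly one observation per distinct value of income, then count the tags
egen byte first = tag(income)
count if first

* codebook also reports the number of unique values in compact form
codebook income, compact
```

If the count is far above the row limit reported by help limits, a full tabulation is not going to be readable in any case.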

However, I have no idea what you mean by "shorten a variable"?

http://xyproblem.info/ appears pertinent. That is, tell us about what you really want to do. Then there will be some solution. Perhaps you'd be better off with a graph.
David Adler

Join Date: Sep 2017

Posts: 2
#3

26 Sep 2017, 07:06

Hi Nick,

Right now I am just trying to experiment with the data. I was trying to do a two-way tabulation between variables. I have checked the limits, which informed me that the limit is 1,200 observations, so thank you for that tip.

Just to give you an idea, here is the data set that I am working with. These are what the values are for each variable, which continue for 20,000 cells.

As you can see, there are variables that have values, which are restricting the possibility of tabulation.

However, I have previously worked with survey data that included 1,600 observations, in which the values were placed into categories on nominal or ordinal scales, resulting in 1-10 values. This structure made it far easier to remove duplicates and create indices, so that I could run regressions. So my hope would be that I could somehow re-code the variable income to be a ratio scale (>=50k, <=50k, <=100k, <=200k, <=500k, <=1m), so that I am not dealing with 20,000 specified values and instead am dealing with maybe 10 classifications and can actually view a tabulation. Is there any way to do this other than manually changing the values for 20,000 observations?

Regards,
David

Last edited by David Adler; 26 Sep 2017, 07:09.
Nick Cox

Join Date: Mar 2014

Posts: 35694
#4

26 Sep 2017, 07:47

The limits of tabulate are not in terms of the number of observations that can be handled, but on how many rows and/or columns can be shown.

Sure, you can categorise continuous variables but you're thereby just throwing away information. Regression isn't fazed even by every value being unique (literally) and there is little point in degrading continuous variables to categorical just because there are too many values to tabulate. I see no reason why e.g. income should not be left as it came. If there's a problem with its distribution, log of income may work better.
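
For illustration, both routes can be sketched as follows. This is a minimal sketch under stated assumptions: the data are simulated, the cutpoints are those mentioned in #3, and the names incomecat and lnincome are hypothetical.

```stata
* Simulated stand-in for the real data (assumption: income is numeric and positive)
clear
set seed 12345
set obs 20000
generate income = exp(rnormal(11, 1))

* irecode() returns 0 if income <= 50000, 1 if 50000 < income <= 100000, ...,
* and 5 if income > 1000000; no manual editing of observations is needed
generate byte incomecat = irecode(income, 50000, 100000, 200000, 500000, 1000000)
label define incomecat 0 "<=50k" 1 "50k-100k" 2 "100k-200k" 3 "200k-500k" 4 "500k-1m" 5 ">1m"
label values incomecat incomecat
tabulate incomecat

* The log transform keeps every distinct value; it is missing where income <= 0
generate lnincome = ln(income)
summarize income lnincome
```

The binned variable is only for display in a table; the original income (or its log) remains available for regression.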

For an overview, consider any reasonable graph.
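
A couple of possibilities are sketched here, again on simulated data with a hypothetical variable lnincome; which graph is reasonable depends on the real distribution.

```stata
* Simulated stand-in for the real data (assumption: income is numeric and positive)
clear
set seed 12345
set obs 20000
generate income = exp(rnormal(11, 1))
generate lnincome = ln(income)

* Histogram on the raw scale, typically strongly right-skewed for income
histogram income, frequency

* Histogram on the log scale
histogram lnincome, frequency

* Quantile plot: shows every value, with no binning at all
quantile lnincome
```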
