Creating categories of string variables from a continuous variable

Xaviera Cardenas

Join Date: Apr 2020

Posts: 22
#1


11 Apr 2020, 06:18

Hello Stata users,

I have a panel data set from survey data, in long format, that has to be kept confidential, which is why I can't show it here.

Nevertheless, I have the variable called "net yearly household income" which is numeric (float).

I would like to generate percentiles (p5, p10, p25 and so forth) and then create string variables for those percentiles ("p5", "p10", "p25", ...) in order to build categories labelled "categories of income percentiles" and run regressions on them.

The variable thinc_m is the net yearly household income.
The variable ccthinc_m is the name of the new variable.

The code I have been using to create centiles is this one:

Code:

xtile ccthinc_m = thinc_m, nq(10)
tab ccthinc_m
tabstat thinc_m, stat(n mean min max sd p50) by(ccthinc_m)

I found this code on another Stata forum and ran it as well; it generated similar values.

Code:

centile thinc_m, c(5 10 25 50 75 90 95)
local nc : word count `r(centiles)'
qui forval i = 1/`nc' {
    local list "`list'`r(c_`i')',"
}
summarize thinc_m, meanonly
gen cthinc_m = recode(thinc_m, `list'`r(max)')

In summary, I have two questions:
1) How can I turn the centiles into string categories? (I tried the recode and replace commands, but it seems that I need more complicated code.)
2) Which of the two pieces of code that I used for centiles is considered better?

If you have better ideas, I am all ears. And if you can help me on this I thank you in advance.

Regards
Clyde Schechter

Join Date: Apr 2014

Posts: 30354
#2

11 Apr 2020, 07:26

If by converting them to string variables you mean you want to change a value of, say 15243 into "15243" then the -tostring- command will do that for you. Read -help tostring- for details, including how to control the format of the resulting string variable. I'm not sure why you would want to do that--it's hard for me to imagine why it would be useful in this context, but that is how you could do it.
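As a minimal sketch of what -tostring- does, assuming Stata's bundled auto data as a stand-in for the confidential survey (price plays the role of thinc_m, and price_str is a hypothetical name for the new variable):

```stata
* Illustration only: price stands in for thinc_m
sysuse auto, clear

* Create a string copy of the numeric variable; the original is kept
tostring price, generate(price_str)

* Compare storage types and the first few values
describe price price_str
list price price_str in 1/5
```

For a float variable with non-integer values, -tostring- may need its format() or force option to avoid refusing the conversion; see -help tostring-.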

As between the two code approaches you show, they do different things, and since I don't really understand what your purpose is here, it would be hard to say which is more suitable to your goal.

The first approach attempts to break the data into 10 equally sized groups based on the value of thinc_m, and ccthinc_m contains a number from 1 to 10 indicating which group the observation's value of thinc_m belongs to. It then displays descriptive statistics of the value of thinc_m within each of those 10 groups.

The second approach identifies the 5th, 10th, 25th, 50th, 75th, 90th, and 95th percentile values of thinc_m (which will be values of thinc_m itself, not integers 1 through 7), and then creates a new variable which takes on the value of the 5th percentile for observations where thinc_m is at or below the 5th percentile, the value of the 10th percentile for observations where thinc_m is above the 5th percentile but at or below the 10th, etc., and finally takes on the maximum observed value of thinc_m whenever thinc_m is above the 95th percentile.
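To make the behavior of recode() concrete, here is a small sketch on the bundled auto data, with price standing in for thinc_m and only three cut points to keep it short. The scalar and variable names are made up for the illustration.

```stata
* Illustration only: price stands in for thinc_m
sysuse auto, clear

* Estimate three centiles and store them before r() is overwritten
centile price, centile(25 50 75)
scalar p25 = r(c_1)
scalar p50 = r(c_2)
scalar p75 = r(c_3)

* The maximum serves as the top cut point
summarize price, meanonly
scalar pmax = r(max)

* recode() returns the first cut point that is >= the value
generate double price_cat = recode(price, p25, p50, p75, pmax)

* The new variable holds centile values, not group numbers
tabulate price_cat
```

Each observation receives the smallest cut point that is greater than or equal to its price, so the categories are identified by income values rather than by codes 1, 2, 3, 4.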

Note that neither of those two approaches produces a string variable.
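If the aim is readable category names rather than genuine string variables, one possible alternative (an assumption about the goal, not something stated in the original question) is to attach value labels to the numeric group variable. A sketch on the bundled auto data, with hypothetical variable and label names:

```stata
* Illustration only: price stands in for thinc_m
sysuse auto, clear

* Four roughly equal-sized groups coded 1 to 4
xtile price_q = price, nq(4)

* Attach descriptive labels; the variable itself stays numeric
label define price_q_lbl 1 "Q1 (lowest)" 2 "Q2" 3 "Q3" 4 "Q4 (highest)"
label values price_q price_q_lbl
tabulate price_q

* A labelled numeric variable can be used directly as a factor variable
regress mpg i.price_q
```

Because the variable remains numeric, it can enter a regression with the i. prefix, which a string variable cannot.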
Xaviera Cardenas

Join Date: Apr 2020

Posts: 22
#3

14 Apr 2020, 09:51

Thank you Clyde for your response. It did help a lot.

The more I think about it, turning the percentiles into categories of string variables will not be useful. What might be useful is a command that answers a question such as "how is the value of income at the 5th percentile associated with education (let's call it y_variable here, a continuous variable)?" in a linear regression.
How would you write code that answers that question? I think that I must use the variable cthinc_m (from the second approach), but I don't know which command to use. The egen command? Or is my question already answered if I use the new variable generated by the second approach in an analysis?

With the second approach I find the percentiles that I am interested in, but now I would like to know whether they are associated with other variables, and I would like to check that for the 5th percentile, the 10th and so forth.

Hopefully you will understand me. It's hard to write this down, because of lack of experience.
Rich Goldstein

Join Date: Mar 2014

Posts: 4545
#4

14 Apr 2020, 11:51

Code:

help qreg
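A minimal sketch of what -qreg- looks like in use, assuming the bundled auto data, with price as a stand-in outcome and mpg as a stand-in predictor. Note that -qreg- models a quantile of the dependent variable conditional on the regressors, so it fits the question above only if income is the outcome.

```stata
* Illustration only: price and mpg are stand-ins for the real variables
sysuse auto, clear

* How the 5th percentile of price varies with mpg
qreg price mpg, quantile(.05)

* The same model at the median, for comparison
qreg price mpg, quantile(.50)
```

Changing the value in quantile() gives the corresponding model for the 10th, 25th, or any other percentile.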
