Create categorical variable from multiple continuous variables

Alex Caridyas

Join Date: Jul 2021
Posts: 27

Create categorical variable from multiple continuous variables

09 Jan 2023, 04:33

I would like to create a categorical variable from several continuous variables. In my dataset, I have 10 variables that identify a type of problem a child experiences (abuse, neglect, etc.). These variables include the number of children that fall into those categories for each region in each year.

I am trying to create instead a categorical variable with 10 categories for each of these problems, that way I can then use the data to start producing some graphs (for e.g. pie charts displaying the percentage for each category in each region for each year).

I am not sure where to begin with my code. Please see an example of my dataset. The variables of interest begin with N, I am only showing 3 of them as dataex needed me to specify fewer variables.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int time_period long(reg N1_Abuse_or_neglect N2_Childs_disability_or_illness) int N3_Parents_disability_or_illness
2022  5 13950 2050  230
2022  6 40170 3800  850
2022 11 30690 2140  510
2022  1 15920 2080 1040
2022 10 27350 3470  790
2022  2 18450 3950  550
2022  4 35970 7280 2540
2022  8 29430 5230 1440
2022  9 18900 2170  980
2022  3 13650 2590 1330
2022  7 22330 4690 1210
2021  5 13890 1970  260
2021  6 37700 3750  870
2021 11 29800 2590  480
2021  1 15920 1950 1130
2021 10 25640 3560  790
2021  2 18380 3820  540
2021  4 33940 7320 2790
2021  8 26830 5020 1500
2021  9 17100 2150 1090
2021  3 13390 2720 1480
2021  7 20550 4610 1310
2020  5 13740 2030  230
2020  6 37530 3710  890
2020 11 29600 2500  510
2020  1 15760 1530  970
2020 10 24100 3470  920
2020  2 19080 3750  640
2020  4 33690 7420 2850
2020  8 26800 5310 1410
2020  9 17150 2230 1010
2020  3 13950 2780 1500
2020  7 19740 4640 1350
2019  5 12280 2110  240
2019  6 38360 3900 1050
2019 11 28660 2530  570
2019  1 15980 1820 1090
2019 10 24190 3510 1010
2019  2 19050 3960  640
2019  4 34690 7160 2680
2019  8 23890 5220 1420
2019  9 19190 3390 1190
2019  3 14380 2850 1510
2019  7 20300 4300 1170
2018  5 11240 2220  310
2018  6 36760 3930 1140
2018 11 28180 2600  590
2018  1 15170 1890  980
2018 10 24830 4020 1220
2018  2 19420 3930  720
2018  4 36090 7700 2800
2018  8 24360 5690 1610
2018  9 19220 3250 1130
2018  3 14470 3040 1570
2018  7 21630 4670 1230
end
label values reg reg
label def reg 1 "East Midlands", modify
label def reg 2 "East of England", modify
label def reg 3 "Inner London", modify
label def reg 4 "London", modify
label def reg 5 "North East", modify
label def reg 6 "North West", modify
label def reg 7 "Outer London", modify
label def reg 8 "South East", modify
label def reg 9 "South West", modify
label def reg 10 "West Midlands", modify
label def reg 11 "Yorkshire and The Humber", modify

Tags: None

Nick Cox

Join Date: Mar 2014

Posts: 35724
#2

09 Jan 2023, 04:49

I wouldn't bin these data into categorical variables at all. You will just lose information thereby. A more obvious adjustment is that the data look like numbers of cases and the regions are manifestly regions of England (no Wales, let alone Scotland or Northern Ireland) and so some kind of scaling by population at risk could make sense.

Even if you were determined to categorize, it is far from obvious how best to do it, e.g. whether you should use quantile binning, values relative to mean and SD or other summary statistics (calculated for which reference set), and so on.

I don't like being told "That is a bad idea" (who does?) but it's an honest opinion here. For more on the evils of binning, see the references at the start of https://journals.sagepub.com/doi/pdf...867X1801800311
Comment
Alex Caridyas

Join Date: Jul 2021

Posts: 27
#3

09 Jan 2023, 09:30

Thank you for the article, Nick. I will heed your advice and keep the variables as is.
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35724

09 Jan 2023, 10:36

For all I know,

1. Your three counts may overlap.

2. You may have more than three counts.

Nevertheless, it's a guess that the proportions (percents) of each are more interesting than absolute numbers. Accordingly I did this:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int time_period long(reg N1_Abuse_or_neglect N2_Childs_disability_or_illness) int N3_Parents_disability_or_illness
2022  5 13950 2050  230
2022  6 40170 3800  850
2022 11 30690 2140  510
2022  1 15920 2080 1040
2022 10 27350 3470  790
2022  2 18450 3950  550
2022  4 35970 7280 2540
2022  8 29430 5230 1440
2022  9 18900 2170  980
2022  3 13650 2590 1330
2022  7 22330 4690 1210
2021  5 13890 1970  260
2021  6 37700 3750  870
2021 11 29800 2590  480
2021  1 15920 1950 1130
2021 10 25640 3560  790
2021  2 18380 3820  540
2021  4 33940 7320 2790
2021  8 26830 5020 1500
2021  9 17100 2150 1090
2021  3 13390 2720 1480
2021  7 20550 4610 1310
2020  5 13740 2030  230
2020  6 37530 3710  890
2020 11 29600 2500  510
2020  1 15760 1530  970
2020 10 24100 3470  920
2020  2 19080 3750  640
2020  4 33690 7420 2850
2020  8 26800 5310 1410
2020  9 17150 2230 1010
2020  3 13950 2780 1500
2020  7 19740 4640 1350
2019  5 12280 2110  240
2019  6 38360 3900 1050
2019 11 28660 2530  570
2019  1 15980 1820 1090
2019 10 24190 3510 1010
2019  2 19050 3960  640
2019  4 34690 7160 2680
2019  8 23890 5220 1420
2019  9 19190 3390 1190
2019  3 14380 2850 1510
2019  7 20300 4300 1170
2018  5 11240 2220  310
2018  6 36760 3930 1140
2018 11 28180 2600  590
2018  1 15170 1890  980
2018 10 24830 4020 1220
2018  2 19420 3930  720
2018  4 36090 7700 2800
2018  8 24360 5690 1610
2018  9 19220 3250 1130
2018  3 14470 3040 1570
2018  7 21630 4670 1230
end
label values reg reg
label def reg 1 "East Midlands", modify
label def reg 2 "East of England", modify
label def reg 3 "Inner London", modify
label def reg 4 "London", modify
label def reg 5 "North East", modify
label def reg 6 "North West", modify
label def reg 7 "Outer London", modify
label def reg 8 "South East", modify
label def reg 9 "South West", modify
label def reg 10 "West Midlands", modify
label def reg 11 "Yorkshire and The Humber", modify

local 1 "Abuse or neglect (%)" 
local 2 "Child's disability or illness (%)"
local 3 "Parent's disability or illness (%)" 

forval j = 1/3 {
    gen P`j' = 100 * N`j' / (N1 + N2 + N3)
    label var P`j' "``j''"
}

*  SJ 21(3):818--837 st0654 
myaxis REGION=reg, sort(mean P1) descending 

line P? time, ysc(log) by(REGION, note("")) lc(red blue black) yla(1 2 5 10 20 50 100, ang(h)) xtitle("") legend(col(1) symxsize(small) size(medsmall)) name(P123, replace)

* SJ 21(2):539--554 gr0087
fabplot line P1 time, xtitle("") by(REGION) name(P1, replace) frontopts(lw(thick))

fabplot line P2 time, xtitle("") by(REGION) name(P2, replace) frontopts(lw(thick))

fabplot line P3 time, xtitle("") by(REGION) name(P3, replace) frontopts(lw(thick))

The general points here are

3. Alphabetical order is good for dictionaries and directories. Otherwise, order the regions of some criterion of interest or importance.

4.. Spaghetti is a good foodstuff, but not so good as graphstuff.

See the 2021 Stata Journal references embedded in the code.

To my mind:

5. The plot of everything at once is quite hard work.

Click image for larger version

Name: P123.png
Views: 1
Size: 48.1 KB
ID: 1696510

6. A "front-and-back plot" of each category is more interesting. Here is one sample.

Click image for larger version

Name: P3.png
Views: 1
Size: 91.1 KB
ID: 1696511

Comment

Alex Caridyas

Join Date: Jul 2021

Posts: 27
#5

10 Jan 2023, 09:13

Thank you for this helpful example, Nick. I've been able to edit my code to include all 10 categories.

I am just slightly confused about the lines on the "front-and-back plot". If the blue line in the above example represents the parent's disability (%) for each of the four years in each region, what do the grey lines represent?

Reading the "Speaking Stata: Front-and-back plots to ease spaghetti and paella problems" article, it says subsets of data is shown separately while other subsets as backdrop. Are these subsets the other variables (family disability, abuse), but if so, should there not be only two grey lines shown in the backdrop?

Last edited by Alex Caridyas; 10 Jan 2023, 09:16.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35724
#6

10 Jan 2023, 09:20

Each blue line is for the region named. The grey lines are for all the other regions. For example, the line for Inner London is shown directly in the last panel and echoed in all the others.

SJ 21(2):539--554 gr0087 explains. https://journals.sagepub.com/doi/pdf...6867X211025838
Comment
Alex Caridyas

Join Date: Jul 2021

Posts: 27
#7

10 Jan 2023, 09:24

Ah, that makes sense now. Thank you very much.
Comment

Announcement

Create categorical variable from multiple continuous variables

Comment

Comment

Comment

Comment

Comment

Comment