Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Create categorical variable from multiple continuous variables

    I would like to create a categorical variable from several continuous variables. In my dataset, I have 10 variables that identify a type of problem a child experiences (abuse, neglect, etc.). These variables include the number of children that fall into those categories for each region in each year.

    I am trying to create instead a categorical variable with 10 categories for each of these problems, that way I can then use the data to start producing some graphs (for e.g. pie charts displaying the percentage for each category in each region for each year).

    I am not sure where to begin with my code. Please see an example of my dataset. The variables of interest begin with N, I am only showing 3 of them as dataex needed me to specify fewer variables.


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int time_period long(reg N1_Abuse_or_neglect N2_Childs_disability_or_illness) int N3_Parents_disability_or_illness
    2022  5 13950 2050  230
    2022  6 40170 3800  850
    2022 11 30690 2140  510
    2022  1 15920 2080 1040
    2022 10 27350 3470  790
    2022  2 18450 3950  550
    2022  4 35970 7280 2540
    2022  8 29430 5230 1440
    2022  9 18900 2170  980
    2022  3 13650 2590 1330
    2022  7 22330 4690 1210
    2021  5 13890 1970  260
    2021  6 37700 3750  870
    2021 11 29800 2590  480
    2021  1 15920 1950 1130
    2021 10 25640 3560  790
    2021  2 18380 3820  540
    2021  4 33940 7320 2790
    2021  8 26830 5020 1500
    2021  9 17100 2150 1090
    2021  3 13390 2720 1480
    2021  7 20550 4610 1310
    2020  5 13740 2030  230
    2020  6 37530 3710  890
    2020 11 29600 2500  510
    2020  1 15760 1530  970
    2020 10 24100 3470  920
    2020  2 19080 3750  640
    2020  4 33690 7420 2850
    2020  8 26800 5310 1410
    2020  9 17150 2230 1010
    2020  3 13950 2780 1500
    2020  7 19740 4640 1350
    2019  5 12280 2110  240
    2019  6 38360 3900 1050
    2019 11 28660 2530  570
    2019  1 15980 1820 1090
    2019 10 24190 3510 1010
    2019  2 19050 3960  640
    2019  4 34690 7160 2680
    2019  8 23890 5220 1420
    2019  9 19190 3390 1190
    2019  3 14380 2850 1510
    2019  7 20300 4300 1170
    2018  5 11240 2220  310
    2018  6 36760 3930 1140
    2018 11 28180 2600  590
    2018  1 15170 1890  980
    2018 10 24830 4020 1220
    2018  2 19420 3930  720
    2018  4 36090 7700 2800
    2018  8 24360 5690 1610
    2018  9 19220 3250 1130
    2018  3 14470 3040 1570
    2018  7 21630 4670 1230
    end
    label values reg reg
    label def reg 1 "East Midlands", modify
    label def reg 2 "East of England", modify
    label def reg 3 "Inner London", modify
    label def reg 4 "London", modify
    label def reg 5 "North East", modify
    label def reg 6 "North West", modify
    label def reg 7 "Outer London", modify
    label def reg 8 "South East", modify
    label def reg 9 "South West", modify
    label def reg 10 "West Midlands", modify
    label def reg 11 "Yorkshire and The Humber", modify

  • #2
    I wouldn't bin these data into categorical variables at all. You will just lose information thereby. A more obvious adjustment is that the data look like numbers of cases and the regions are manifestly regions of England (no Wales, let alone Scotland or Northern Ireland) and so some kind of scaling by population at risk could make sense.

    Even if you were determined to categorize, it is far from obvious how best to do it, e.g. whether you should use quantile binning, values relative to mean and SD or other summary statistics (calculated for which reference set), and so on.

    I don't like being told "That is a bad idea" (who does?) but it's an honest opinion here. For more on the evils of binning, see the references at the start of https://journals.sagepub.com/doi/pdf...867X1801800311

    Comment


    • #3
      Thank you for the article, Nick. I will heed your advice and keep the variables as is.

      Comment


      • #4
        For all I know,

        1. Your three counts may overlap.

        2. You may have more than three counts.

        Nevertheless, it's a guess that the proportions (percents) of each are more interesting than absolute numbers. Accordingly I did this:

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input int time_period long(reg N1_Abuse_or_neglect N2_Childs_disability_or_illness) int N3_Parents_disability_or_illness
        2022  5 13950 2050  230
        2022  6 40170 3800  850
        2022 11 30690 2140  510
        2022  1 15920 2080 1040
        2022 10 27350 3470  790
        2022  2 18450 3950  550
        2022  4 35970 7280 2540
        2022  8 29430 5230 1440
        2022  9 18900 2170  980
        2022  3 13650 2590 1330
        2022  7 22330 4690 1210
        2021  5 13890 1970  260
        2021  6 37700 3750  870
        2021 11 29800 2590  480
        2021  1 15920 1950 1130
        2021 10 25640 3560  790
        2021  2 18380 3820  540
        2021  4 33940 7320 2790
        2021  8 26830 5020 1500
        2021  9 17100 2150 1090
        2021  3 13390 2720 1480
        2021  7 20550 4610 1310
        2020  5 13740 2030  230
        2020  6 37530 3710  890
        2020 11 29600 2500  510
        2020  1 15760 1530  970
        2020 10 24100 3470  920
        2020  2 19080 3750  640
        2020  4 33690 7420 2850
        2020  8 26800 5310 1410
        2020  9 17150 2230 1010
        2020  3 13950 2780 1500
        2020  7 19740 4640 1350
        2019  5 12280 2110  240
        2019  6 38360 3900 1050
        2019 11 28660 2530  570
        2019  1 15980 1820 1090
        2019 10 24190 3510 1010
        2019  2 19050 3960  640
        2019  4 34690 7160 2680
        2019  8 23890 5220 1420
        2019  9 19190 3390 1190
        2019  3 14380 2850 1510
        2019  7 20300 4300 1170
        2018  5 11240 2220  310
        2018  6 36760 3930 1140
        2018 11 28180 2600  590
        2018  1 15170 1890  980
        2018 10 24830 4020 1220
        2018  2 19420 3930  720
        2018  4 36090 7700 2800
        2018  8 24360 5690 1610
        2018  9 19220 3250 1130
        2018  3 14470 3040 1570
        2018  7 21630 4670 1230
        end
        label values reg reg
        label def reg 1 "East Midlands", modify
        label def reg 2 "East of England", modify
        label def reg 3 "Inner London", modify
        label def reg 4 "London", modify
        label def reg 5 "North East", modify
        label def reg 6 "North West", modify
        label def reg 7 "Outer London", modify
        label def reg 8 "South East", modify
        label def reg 9 "South West", modify
        label def reg 10 "West Midlands", modify
        label def reg 11 "Yorkshire and The Humber", modify
        
        local 1 "Abuse or neglect (%)" 
        local 2 "Child's disability or illness (%)"
        local 3 "Parent's disability or illness (%)" 
        
        forval j = 1/3 {
            gen P`j' = 100 * N`j' / (N1 + N2 + N3)
            label var P`j' "``j''"
        }
        
        *  SJ 21(3):818--837 st0654 
        myaxis REGION=reg, sort(mean P1) descending 
        
        line P? time, ysc(log) by(REGION, note("")) lc(red blue black) yla(1 2 5 10 20 50 100, ang(h)) xtitle("") legend(col(1) symxsize(small) size(medsmall)) name(P123, replace)
        
        * SJ 21(2):539--554 gr0087
        fabplot line P1 time, xtitle("") by(REGION) name(P1, replace) frontopts(lw(thick))
        
        fabplot line P2 time, xtitle("") by(REGION) name(P2, replace) frontopts(lw(thick))
        
        fabplot line P3 time, xtitle("") by(REGION) name(P3, replace) frontopts(lw(thick))

        The general points here are

        3. Alphabetical order is good for dictionaries and directories. Otherwise, order the regions of some criterion of interest or importance.

        4.. Spaghetti is a good foodstuff, but not so good as graphstuff.

        See the 2021 Stata Journal references embedded in the code.

        To my mind:

        5. The plot of everything at once is quite hard work.

        Click image for larger version

Name:	P123.png
Views:	1
Size:	48.1 KB
ID:	1696510



        6. A "front-and-back plot" of each category is more interesting. Here is one sample.

        Click image for larger version

Name:	P3.png
Views:	1
Size:	91.1 KB
ID:	1696511




        Comment


        • #5
          Thank you for this helpful example, Nick. I've been able to edit my code to include all 10 categories.

          I am just slightly confused about the lines on the "front-and-back plot". If the blue line in the above example represents the parent's disability (%) for each of the four years in each region, what do the grey lines represent?

          Reading the "Speaking Stata: Front-and-back plots to ease spaghetti and paella problems" article, it says subsets of data is shown separately while other subsets as backdrop. Are these subsets the other variables (family disability, abuse), but if so, should there not be only two grey lines shown in the backdrop?
          Last edited by Alex Caridyas; 10 Jan 2023, 09:16.

          Comment


          • #6
            Each blue line is for the region named. The grey lines are for all the other regions. For example, the line for Inner London is shown directly in the last panel and echoed in all the others.

            SJ 21(2):539--554 gr0087 explains. https://journals.sagepub.com/doi/pdf...6867X211025838

            Comment


            • #7
              Ah, that makes sense now. Thank you very much.

              Comment

              Working...
              X