-joy_plot- command: Advices needed for plotting (plot the biggest five values of a variable)

Michael Duarte Goncalves

Join Date: Oct 2022
Posts: 500

03 Oct 2023, 01:41

Hi everyone,

I would like to draw a plot that shows only the 5 biggest values of a variable. I am currently using the beautiful -joy_plot- by FernandoRios, but I am more than open to better suggestions.
I would also welcome some feedback on the plot itself: my graph is not readable because there are many categories.

If there are more suitable graphs for that, please do not hesitate to correct me.

Code:

. tab tariff_ekon_id_encod

      Elec. |
     Tariff |
     Types, |
    Encoded |      Freq.     Percent        Cum.
------------+-----------------------------------
        20A |     18,384        1.44        1.44
      20DHA |    165,269       12.92       14.36
      20DHS |      1,183        0.09       14.45
       20TD |  1,039,954       81.31       95.77
        20a |         17        0.00       95.77
        21A |      2,345        0.18       95.95
      21DHA |      5,925        0.46       96.41
      21DHS |         43        0.00       96.42
         30 |     11,907        0.93       97.35
       30TD |     31,832        2.49       99.84
         31 |        424        0.03       99.87
        61A |         16        0.00       99.87
       61TD |      1,536        0.12       99.99
        62A |          1        0.00       99.99
       62TD |         29        0.00       99.99
       63TD |         17        0.00      100.00
       64TD |         51        0.00      100.00
         No |          1        0.00      100.00
------------+-----------------------------------
      Total |  1,278,934      100.00

.

Code:

preserve

recode tariff_ekon_id_encod (5 6 10 = 5) // recode these values as they contain only respectively 9, 7 and 1 observations

joy_plot kW_power_p1 if tariff_2 == 1, scheme(white_tableau) over(tariff_ekon_id_encod) color(%50) alegend gap0 range(0 15) ///
    title("{bf}Density Plot", pos(11) size(2.75)) ///
    subtitle("Tariff Types, 1{sup:st} Period", pos(11) size(2)) ///
        ytitle("Density", size(2) orient(horizontal)) ///
        ylabel(, nogrid labsize(2)) ///
        xlabel(0(1)15, labsize(tiny) nogrid format(%9.0fc)) ///
        xtitle("Contracted Powers", size(2))

        
graph export "../figures/distr_tariff_types_kdens.png", replace
graph export "../figures/distr_tariff_types_kdens.pdf", replace        
        
        
restore

Here is the plot:

[Figure: joy_plot-tariffs.png]

Could anyone give me a better suggestion, or advice on how to improve this graph, please?
Thank you in advance!

Michael


Nick Cox

Join Date: Mar 2014

Posts: 35775
#2

03 Oct 2023, 05:27

The main point of joy plots (the name is better avoided, for reasons documented at length elsewhere) is, in my understanding, to show smoothed density estimates. As you say, I can't see that it helps here -- your data seem to have curious minor modes that may result from rounding or reporting conventions and even selecting 8 categories doesn't help -- and I can't see that it will help at all to base density estimates on the largest 5 values. I am probably not understanding your goals, but no one else has jumped in to comment.

You don't show the largest 5 values for those 18 categories. That could be done in a data example. I would just plot them directly.

On "the largest 5" there is a dedicated paper https://journals.sagepub.com/doi/pdf...6867X221106436 that may help.
Michael Duarte Goncalves

Join Date: Oct 2022

Posts: 500
#3

03 Oct 2023, 06:13

Good afternoon Nick Cox,

Thank you for your feedback. I will take a look at the paper that you provided.

All the best,

Michael

Michael Duarte Goncalves

Join Date: Oct 2022
Posts: 500
#4

03 Oct 2023, 06:15

These are the "5 largest categories" :

Code:

groups tariff_ekon_id_encod if tariff_2 & tariff_2_more_15000_w == 0, select(5) order(h)

  +--------------------------------------+
  | tarif~od     Freq.   Percent     %<= |
  |--------------------------------------|
  |     20TD   1037146     84.30   84.30 |
  |    20DHA    165259     13.43   97.73 |
  |      20A     18380      1.49   99.23 |
  |    21DHA      5924      0.48   99.71 |
  |      21A      2343      0.19   99.90 |
  +--------------------------------------+

And here is a -dataex-:


Michael Duarte Goncalves

Join Date: Oct 2022

Posts: 500
#5

03 Oct 2023, 08:35

Hi Nick Cox,

You're totally right: density functions are not at all suited for my case.
I have the following question:
Would it be better to represent them as histograms (with -graph twoway spike-, as my "tariffs" variable is discrete)?
In all cases:

Wouldn't it be a mess to represent all the categories in a single graph (or even all the categories in several separate graphs...)?

Is there a choice to be made? For example, not showing all the tariffs, for greater clarity (in our discussion, for example, only the 5 most represented tariffs)?

Thank you again for your help.
Sorry again for the inconvenience.

Best,

Michael
Nick Cox

Join Date: Mar 2014

Posts: 35775
#6

03 Oct 2023, 09:05

I am still finding it hard to join the dots, as it were, and connect (a) your data and (b) what you want with (c) some code.

If you want to show just the largest five, period, you already have a little table and it could be turned into a bar or dot chart. (groups is from the Stata Journal, as you are asked to explain: FAQ Advice #12.)
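As a minimal sketch of that first option, the frequencies can be reduced to one observation per category and plotted as a dot chart. This uses the nlswork dataset and occ_code purely as a stand-in (an assumption for illustration, not the tariff data); the graph name top5 is arbitrary. With the tariff data, occ_code would presumably be replaced by tariff_ekon_id_encod, ideally inside preserve and restore, since contract replaces the data in memory.

```stata
* Stand-in data: nlswork and occ_code are illustrative assumptions
webuse nlswork, clear

* One observation per category, with its frequency in freq
contract occ_code if !missing(occ_code), freq(freq)

* Keep the five most frequent categories
gsort -freq
keep in 1/5

* Dot chart of the frequencies as they stand, largest at the top
graph dot (asis) freq, over(occ_code, sort(1) descending) ///
    ytitle("Frequency") name(top5, replace)
```

Replacing graph dot with graph hbar should give the bar chart version of the same display.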

If you want the largest 5 in each of several categories, that is a different problem, and your data example doesn't seem good to show technique.

Here is some technique. Much should seem simple but there is some trickery too, and some small points here are matters of taste.

Code:

webuse nlswork, clear
egen rank = rank(-ln_wage), unique by(occ_code)
gen largest5 = rank <= 5

myaxis yaxis=occ_code, sort(max ln_wage)
su yaxis, meanonly
scatter yaxis ln_wage if largest5, ms(Oh) yla(1/`r(max)', valuelabel ang(h)) xsc(alt) name(G1, replace)

myaxis yaxis2=occ_code, subset(rank==3) sort(max ln_wage)
su yaxis2, meanonly
scatter yaxis2 ln_wage if largest5, ms(Oh) yla(1/`r(max)', valuelabel ang(h)) xsc(alt) name(G2, replace)

Taking that more slowly:

1. Fundamental: We are looking at the largest 5 outcomes for wage by 13 categories of occupation. The rank() function of egen has all the bells and whistles generally needed.

2. Fundamental: The outcome is already on a log scale (and could be exponentiated). Using a logarithmic scale will usually be a good idea, so long as the (five largest) outcomes are all positive.

Code:

webuse nlswork, clear
egen rank = rank(-ln_wage), unique by(occ_code)
gen largest5 = rank <= 5

3. Optional: Showing the categories as they arrive in the data may not be the best idea. myaxis is focused on ordering categories by something other than their values, namely the value of something else, here the maximum in each category. myaxis is from the Stata Journal and maps categories to integers 1 up. https://journals.sagepub.com/doi/pdf...6867X211045582
We can find the highest value assigned using summarize if it is not otherwise known to us.

Code:

myaxis yaxis=occ_code, sort(max ln_wage)
su yaxis, meanonly

4. Fundamental: occ_code in the nlswork dataset doesn't come with value labels, but most categorical variables deserve value labels. Showing 13 (or more!) value labels on the horizontal axis wouldn't be a good idea, nor would showing them vertically or at a slant. So category should almost always go on the vertical axis.

5. Optional: Whenever graphs have table flavour, I often put the horizontal axis at the top. The point was argued at https://www.stata-journal.com/articl...article=gr0053 but is a matter of taste.

6. Optional: The ticks on the y axis could easily be considered unhelpful if not illogical, and you're welcome if you want to remove them.

Code:

scatter yaxis ln_wage if largest5, ms(Oh) yla(1/`r(max)', valuelabel ang(h)) xsc(alt) name(G1, replace)

7. Optional: Other orderings are possible. Here for example is an ordering on rank 3 (the Bronze Medal position) as a more resistant measure of typical high values. (The value at rank 3 is the median of the values at ranks 1 to 5.)

Code:

myaxis yaxis2=occ_code, subset(rank==3) sort(max ln_wage)
su yaxis2, meanonly
scatter yaxis2 ln_wage if largest5, ms(Oh) yla(1/`r(max)', valuelabel ang(h)) xsc(alt) name(G2, replace)
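Points 2 and 6 are described above but not shown in code. A possible sketch, under the assumption that myaxis is already installed: exponentiate the outcome, plot it on a logarithmic scale, and suppress the y axis ticks with the noticks suboption. The variable name wage and the graph name G3 are made up for this illustration.

```stata
* Assumes myaxis (Stata Journal) is installed; wage and G3 are illustrative names
webuse nlswork, clear
egen rank = rank(-ln_wage), unique by(occ_code)
gen largest5 = rank <= 5

* Point 2: back-transform the logged outcome, then show it on a log scale
gen wage = exp(ln_wage)

myaxis yaxis=occ_code, sort(max ln_wage)
su yaxis, meanonly

* Point 6: noticks removes the tick marks on the category axis
scatter yaxis wage if largest5, ms(Oh) xsc(log alt) ///
    yla(1/`r(max)', valuelabel ang(h) noticks) name(G3, replace)
```

The log scale only makes sense here because the values plotted are all positive, as point 2 notes.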
Nick Cox

Join Date: Mar 2014

Posts: 35775
#7

03 Oct 2023, 09:21

I was a long time writing what is now #6 and didn't see #5 until now. If #6 doesn't answer #5 by accident, I don't have much to add. But I can't see that histograms will improve on density functions. (Histograms in a sense are density function estimates too.)

I remain focused on the title of the thread -- showing the largest 5 -- whereas you seem to be jumping between that and the quite different problem of showing entire distributions.
Michael Duarte Goncalves

Join Date: Oct 2022

Posts: 500
#8

04 Oct 2023, 08:30

Good afternoon Nick Cox,

I have just three small points to make:
First, I apologize for my mistakes. Sometimes it's hard for me to explain what I want.

Second, your post at #6 fully answers my needs. All is clear now.

Third, and finally, thank you so much for the time devoted to answering me. I am very grateful for all your help since I joined Statalist. You have solved many of my headaches.

I wish you a beautiful afternoon's end.

All the best,

Michael
