  • Two-way histogram in percentage

    Hi Everyone,

    I am struggling to draw a categorized histogram. Suppose we have a population of 100 cars from Brand 1, Brand 2, Brand 3, and Brand 4 (brand is an encoded numeric variable). The price of each car falls into one of three bins: 50K-100K (bin 1), 100K-200K (bin 2), and >200K (bin 3) (also an encoded numeric variable).

    Assume that we have 50 cars in the first bin, 30 cars in the second bin, and 20 cars in the third bin.
    Also assume that, out of the 50 cars in the first bin, 10 are from Brand 1.

    Problem:
    I want to draw a bar chart showing what percentage of the cars within each bin are from Brand 1, Brand 2, Brand 3, and Brand 4 respectively. In other words, the bins form the main variable on the x-axis, and within each bin I would like four bars, one per brand. Hence, the first bar of bin 1, for Brand 1, should be 10/50 = 20 percent, and so on.

    I would greatly appreciate it if you could help me code this with an example.

    Regards,

  • #2
    The request is not consistent. On a histogram, it is area that represents probability, so if bars correspond to bins of unequal width, you need to plot frequency / bin width. But the third bin is open-ended, so no choice of width, and therefore of height, can be made on this information.
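
    As a rough illustration of the frequency / bin width point, using only the counts given in #1 (50 and 30 cars) and taking bin widths in thousands of dollars, the bar heights of a true histogram would be computed along these lines. This is a sketch of the arithmetic, not part of any suggested solution:

    ```stata
    * Frequency density = frequency / bin width (widths in thousands of dollars)
    display "bin 1 (50K-100K):  " 50/(100-50) " cars per 1K"
    display "bin 2 (100K-200K): " 30/(200-100) " cars per 1K"
    * bin 3 (>200K) has no upper limit, so its width, and hence its height, is undefined
    ```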

    If these were the data, the best chart would, I think, come from graph bar or graph hbar, on the stance that the bars simply correspond to ordered categories. Your different brands could then be segments of a stacked or divided bar chart, or bars side by side, as you prefer.
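
    A minimal sketch of the stacked variant follows. Everything in it is an assumption for illustration: the auto dataset stands in for the real data, brand is simulated, the price cutpoints are arbitrary, and the names price_bin, cell_n, bin_total and pct are made up here.

    ```stata
    * Sketch only: brand is simulated and the price cutpoints are arbitrary
    sysuse auto, clear
    set seed 123
    gen byte brand = runiformint(1,4)
    gen byte price_bin = cond(price < 5000, 1, cond(price < 10000, 2, 3)) if !missing(price)

    * Reduce to one row per bin-brand cell, with its frequency in cell_n
    contract price_bin brand, freq(cell_n)

    * Percentage of each brand within its own price bin
    bysort price_bin: egen bin_total = total(cell_n)
    gen pct = 100 * cell_n / bin_total

    * asyvars treats the brands as separate bars; stack piles them up within each bin
    graph bar (asis) pct, over(brand) over(price_bin) asyvars stack ytitle("Percent within price bin")
    ```

    Dropping the stack option should give the same percentages as bars side by side within each bin.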

    • #3
      Here is some sample code (ignore the first bit, which is just to produce a sample dataset)

      Code:
      sysuse auto, clear
      
      set seed 123
      label define BRANDS 1 "Ace" 2 "Brace" 3 "Craze" 4 "Daze"
      gen byte brand: BRANDS = runiformint(1,4)
      replace price = price * 20
      format price %9.0fc
      
      * use the code from here onwards
      
      label define BINS 1 "\$50K to 100K" 2 "\$100K+ to 200K" 3 "\$200K+"
      gen byte price_bins: BINS = cond(inrange(price,50000,100000),1,cond(inrange(price,100001,200000),2,cond(missing(price),.,3)))
      
      #delimit ;
      graph bar (percent), 
          over(brand)
          over(price_bins)
          ytitle(Percentage)
          title(Distribution of brands within price ranges)
          blabel(bar, format(%4.1f)) 
          scheme(s2color)
          ;
      #delimit cr
      which produces:
      [Image: Screenshot 2022-11-12 at 3.26.22 PM.png]

      • #4
        Thank you so much, Nick and Hemanshu. The example is really helpful. However, I want the percentages to sum to 100 within each bin, for instance within 50K to 100K in this example. In other words, I want the percentages of Ace, Brace, Craze, and Daze within 50K to 100K, within 100K to 200K, and so on.

        • #5
          How about this?

          Code:
          sysuse auto, clear
          
          set seed 123
          label define BRANDS 1 "Ace" 2 "Brace" 3 "Craze" 4 "Daze"
          gen byte brand: BRANDS = runiformint(1,4)
          replace price = price * 20
          format price %9.0fc
          
          * use the code from here onwards
          
          label define BINS 1 "\$50K to 100K" 2 "\$100K+ to 200K" 3 "\$200K+"
          gen byte price_bins: BINS = cond(inrange(price,50000,100000),1,cond(inrange(price,100001,200000),2,cond(missing(price),.,3)))
          
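          * number of cars in each bin-brand cell
          bys price_bins brand: egen _freq = count(brand)
          * flag exactly one observation per bin-brand cell
          egen byte tag = tag(price_bins brand)
          * cars per bin: _freq*tag counts each cell once
          bys price_bins: egen total = total(_freq*tag)
          * percentage of each brand within its bin
          gen perc = _freq/total*100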
          
          #delimit ;
          graph bar (asis) perc if tag, 
              over(brand)
              over(price_bins)
              ytitle(Percentage)
              title(Distribution of brands within price ranges)
              blabel(bar, format(%4.1f)) 
              scheme(s2color)
              ;
          #delimit cr
          
          drop _freq total perc tag
          which produces:
          [Image: Screenshot 2022-11-12 at 7.12.58 PM.png]

          • #6
            Thank you so much Hemanshu. That is exactly what I was looking for. Your help is much appreciated.

            • #7
              Stealing @Hemanshu Kumar's nice data example, I can get essentially the same graph fairly directly with catplot from SSC.


              Code:
              sysuse auto, clear
              
              set seed 123
              label define BRANDS 1 "Ace" 2 "Brace" 3 "Craze" 4 "Daze"
              gen byte brand: BRANDS = runiformint(1,4)
              replace price = price * 20
              format price %9.0fc
              
              * use the code from here onwards
              
              label define BINS 1 "\$50K to 100K" 2 "\$100K+ to 200K" 3 "\$200K+"
              gen byte price_bins: BINS = cond(inrange(price,50000,100000),1,cond(inrange(price,100001,200000),2,cond(missing(price),.,3)))
              
              * NJC starts here 
              * ssc inst catplot 
              
              catplot brand price_bins, var1opts(label(labsize(small))) percent(price_bins) recast(bar) blabel(bar, format(%2.1f)) scheme(s2color)
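
              As a numerical cross-check on any of these graphs, the within-bin percentages can also be listed as a table. The sketch below assumes the same simulated data example used in this thread; tabulate with the row option reports row percentages, and since each row is a price bin, each row should sum to 100.

              ```stata
              sysuse auto, clear

              set seed 123
              label define BRANDS 1 "Ace" 2 "Brace" 3 "Craze" 4 "Daze"
              gen byte brand: BRANDS = runiformint(1,4)
              replace price = price * 20

              label define BINS 1 "\$50K to 100K" 2 "\$100K+ to 200K" 3 "\$200K+"
              gen byte price_bins: BINS = cond(inrange(price,50000,100000),1,cond(inrange(price,100001,200000),2,cond(missing(price),.,3)))

              * rows are price bins, columns are brands; nofreq suppresses the raw counts
              tabulate price_bins brand, row nofreq
              ```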
