  • How to add the symmetry axis of the normal distribution when graphing the histograms by groups

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input double(tenaciousGoalPursuit group)
    2.9 2
    2.1 2
    3.5 2
    2.1 1
    3.9 2
    4 2
    3.4 2
    2.4 1
    3.3 2
    3.2 2
    3.3 1
    3.2 1
    2.2 1
    3.7 1
    4 2
    2.8 1
    2.9 2
    2.8 1
    3.2 2
    3.9 1
    2.8 1
    2.7 1
    3.5 2
    2.9 2
    3.5 2
    2.5 1
    2.6 1
    3.5 1
    3.4 1
    2.8 1
    3.2 2
    3.4 1
    3.3 2
    2.7 1
    2.9 2
    3.1 2
    3.3 1
    2.5 1
    4.5 1
    3.2 1
    4.5 2
    3.5 1
    2 1
    2.1 1
    3.4 2
    2.6 2
    2.6 2
    3.3 2
    2.8 1
    3.5 2
    3.6 1
    3.3 2
    3 2
    3.5 2
    2.5 1
    4.1 2
    2.8 2
    3.1 2
    2.5 1
    3.1 1
    2.9 2
    3.1 2
    3.4 1
    2.6 1
    1.7 1
    3.8 2
    3.3 1
    2.3 1
    3.5 1
    2.3 1
    2.6 1
    2.2 1
    2.7 1
    3.7 1
    2.3 1
    3.2 2
    3.6 2
    3.8 1
    3.8 1
    2.7 1
    3.2 1
    3.5 2
    3.3 1
    4 1
    2.9 2
    3.4 2
    2.4 1
    2.5 2
    2.8 1
    3 1
    3.6 2
    2.2 1
    3.1 1
    2.4 1
    2.9 2
    3.3 1
    3.2 2
    3.4 1
    3.3 1
    3 1
    end
    hist tenaciousGoalPursuit, by(group) normal

  • #2
    I use scheme s1color as a default. This is what your histogram looks like:
    [Image: fred_lee_1.png]



    It has the strengths and the weaknesses of histograms. The main strength is perhaps that anyone who got through a first course in statistics should have a rough idea of what is being shown. The main weakness is that it's hard to know what is a feature of the data and what is an artefact of arbitrary bin widths and bin starts. There could also be different views on whether showing density is a good choice, even though it's just the default and the researcher can choose something else.

    In particular, my eye is drawn to the shorter bar near the mean for group 1. Is that a hint of bimodality or just a quirk in the display to be put down to bin alignment?
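
    One way to probe that question, offered as a rough sketch only: redraw the histograms with a different bin start and see whether the dip survives. The widths and starts below are arbitrary choices for illustration, the graph names h1 to h3 are made up, and the data from #1 are assumed to be in memory.

```stata
* Assumes the example data from #1 are in memory.
* Bins of width 0.2 starting at 1.65: each bin holds two adjacent recorded values
histogram tenaciousGoalPursuit, by(group) normal start(1.65) width(0.2) name(h1, replace)

* Same width, shifted start: a dip that moves or vanishes is likely a binning artefact
histogram tenaciousGoalPursuit, by(group) normal start(1.55) width(0.2) name(h2, replace)

* Frequencies instead of the default density
histogram tenaciousGoalPursuit, by(group) normal frequency start(1.65) width(0.2) name(h3, replace)
```

    If the short bar near the mean of group 1 appears in one version and not in the other, that points to bin alignment rather than bimodality.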

    Side-by-side histograms are not the only possibility, but this one makes comparison of groups (presumably an interesting and important part of the exercise!) quite hard. One has to look closely to see that the mean of group 2 is a bit higher. Otherwise, the impression is that the distributions are broadly similar.

    A key point here is that you have, by modern standards, a small dataset, so there's scope to show all the data, and not just a reduction of it.

    What's the role of the normal distribution here? Sometimes there are grounds for thinking that data should be approximately normal. More commonly, it is just a reference distribution.

    I would like to persuade you to try something else. Here is one possibility, a normal quantile plot. So, we don't discard the idea of a normal distribution as reference, but we see all the individual values.

    For this, you need qplot from the Stata Journal.
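
    A minimal sketch of getting started, assuming an internet connection: search lists Stata Journal packages with install links, so the current qplot package can be picked from the results. The official command qnorm, which needs no installation, draws a normal quantile plot for one variable; it is shown here as a fallback, not as a replacement for qplot. The graph names qn1 and qn2 are made up for illustration.

```stata
* Locate qplot among Stata Journal software, then follow the install link shown
search qplot, sj

* Fallback with official Stata only: one normal quantile plot per group
* (assumes the example data are in memory)
qnorm tenaciousGoalPursuit if group == 1, name(qn1, replace)
qnorm tenaciousGoalPursuit if group == 2, name(qn2, replace)
```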

    Your dataex output in #1 is a little mangled, so I'll repeat it here.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input double(tenaciousGoalPursuit group)
    2.9 2
    2.1 2
    3.5 2
    2.1 1
    3.9 2
      4 2
    3.4 2
    2.4 1
    3.3 2
    3.2 2
    3.3 1
    3.2 1
    2.2 1
    3.7 1
      4 2
    2.8 1
    2.9 2
    2.8 1
    3.2 2
    3.9 1
    2.8 1
    2.7 1
    3.5 2
    2.9 2
    3.5 2
    2.5 1
    2.6 1
    3.5 1
    3.4 1
    2.8 1
    3.2 2
    3.4 1
    3.3 2
    2.7 1
    2.9 2
    3.1 2
    3.3 1
    2.5 1
    4.5 1
    3.2 1
    4.5 2
    3.5 1
      2 1
    2.1 1
    3.4 2
    2.6 2
    2.6 2
    3.3 2
    2.8 1
    3.5 2
    3.6 1
    3.3 2
      3 2
    3.5 2
    2.5 1
    4.1 2
    2.8 2
    3.1 2
    2.5 1
    3.1 1
    2.9 2
    3.1 2
    3.4 1
    2.6 1
    1.7 1
    3.8 2
    3.3 1
    2.3 1
    3.5 1
    2.3 1
    2.6 1
    2.2 1
    2.7 1
    3.7 1
    2.3 1
    3.2 2
    3.6 2
    3.8 1
    3.8 1
    2.7 1
    3.2 1
    3.5 2
    3.3 1
      4 1
    2.9 2
    3.4 2
    2.4 1
    2.5 2
    2.8 1
      3 1
    3.6 2
    2.2 1
    3.1 1
    2.4 1
    2.9 2
    3.3 1
    3.2 2
    3.4 1
    3.3 1
      3 1
    end
    
    qplot tenacious, over(group) trscale(invnormal(@)) aspect(1) ///
    xtitle(standard normal quantile) ytitle(tenacious goal pursuit) ///
    yla(1.5(0.5)4.5, grid ang(h)) legend(ring(0) order(2 1) pos(11) col(1))
    [Image: fred_lee_2.png]


    What do I see here?

    Granularity. All measurements are multiples of 0.1.

    Small spikes and gaps. A spike at 2.9 for group 2 and a gap at the same value for group 1. If there's a story, you should be able to tell it. Otherwise, it seems much less alarming than on the histograms. If you have more data than you showed, the gap may well be filled in.
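
    Both observations can be checked numerically. The lines below are a sketch that assumes the example data are in memory; the tolerance in the assert is an arbitrary small number chosen to absorb binary rounding.

```stata
* Every value should be a multiple of 0.1, up to floating-point rounding
assert abs(10 * tenaciousGoalPursuit - round(10 * tenaciousGoalPursuit)) < 1e-8

* Frequencies of each distinct value by group show spikes and gaps directly
tabulate tenaciousGoalPursuit group

* The value 2.9 in particular, group by group
count if tenaciousGoalPursuit == 2.9 & group == 1
count if tenaciousGoalPursuit == 2.9 & group == 2
```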

    Approximate normality. A feature of using the horizontal scale is that normal distributions will plot as straight lines.
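
    The reason is that a normal distribution with mean m and standard deviation s has quantile function Q(p) = m + s * invnormal(p), which is linear in invnormal(p) with intercept m and slope s. The sketch below illustrates this by hand. The plotting position (i - 0.5)/n is an assumption made for illustration and may differ from what qplot uses; p and z are new variable names.

```stata
* Assumes the example data are in memory.
* Plotting positions within each group, then the matching standard normal quantiles
bysort group (tenaciousGoalPursuit): generate double p = (_n - 0.5) / _N
generate double z = invnormal(p)

* For near-normal data the fit is close to a straight line:
* the intercept equals the group mean, the slope is roughly the group SD
regress tenaciousGoalPursuit z if group == 1
regress tenaciousGoalPursuit z if group == 2
```

    Because these plotting positions are symmetric about 0.5, z averages zero within each group, so the fitted intercept matches the group mean; the slope is only an approximation to the SD.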

    Group 2 has higher mean but lower SD.

    Code:
    . tabstat tenacious , by(group) s(n mean sd)
    
    Summary for variables: tenaciousGoalPursuit
         by categories of: group
    
       group |         N      mean        sd
    ---------+------------------------------
           1 |        58  2.943103  .5837303
           2 |        42  3.264286  .4621379
    ---------+------------------------------
       Total |       100     3.078  .5567909
    ----------------------------------------
    I didn't answer the question. If you want a vertical line on each histogram at the position of each mean, you will need to add it explicitly. That is a little hard with your chosen command.
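
    One possible way to add such lines, as a sketch rather than a definitive recipe: draw one histogram per group, pass that group's mean to xline(), and put the panels together with graph combine. The normal option fits the curve with the sample mean and SD, so a line at the mean is also the curve's axis of symmetry. The start and width values and the graph names g1 and g2 are arbitrary choices for illustration; the example data are assumed to be in memory.

```stata
* Assumes the example data are in memory.
forvalues g = 1/2 {
    * Mean of this group only
    summarize tenaciousGoalPursuit if group == `g', meanonly
    local m = r(mean)

    * Histogram with fitted normal curve and a vertical line at the group mean
    histogram tenaciousGoalPursuit if group == `g', normal ///
        start(1.5) width(0.25) xline(`m') ///
        title("group `g'") name(g`g', replace) nodraw
}

* Common axes keep the two panels comparable
graph combine g1 g2, xcommon ycommon
```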
    Last edited by Nick Cox; 01 Apr 2019, 04:00.



    • #3
      Originally posted by Nick Cox (#2, quoted in full)
      Thanks a lot, Nick! Your response gave me a lot of ideas. Maybe I will give up drawing a vertical line at each mean, since the values can be estimated approximately.



      • #4
        Thanks for the thanks; glad to think it was helpful.

        PS: Note that there really is no need to copy all of #2 in replying to it in #3. The point of quotation is to be selective in structuring a reply.



        • #5
          Aha, thanks for the tip.
