  • How can I know cut values of an equal-probability histogram?

    Hello,
    I want to transform a continuous variable to a categorical variable.
    I have only a histogram after doing this:
    eqprhistogram height, bin(5)
    How can I find the cutpoint values for each category? I want to know the exact ranges.
    If you have other advice on transforming continuous variables to categorical variables, please let me know.
    Thanks
    Last edited by Mostafa Ahmadi; 08 Feb 2023, 19:00.

  • #2
    If you type
    Code:
     viewsource eqprhistogram.ado
    , you will see that it uses _pctile together with r(min) and r(max) to get the bin boundaries.

    Here's an example:

    Code:
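    sysuse auto, clear
    drop in 1                 // leaves 73 observations, an odd number
    sum price
    scalar min = r(min)       // save the extremes before _pctile overwrites r()
    scalar max = r(max)
    _pctile price, nq(4)      // quartile cutpoints are left in r(r1), r(r2), r(r3)
    return list
    eqprhistogram price, bin(4) xlab(`=scalar(min)' `=r(r1)' `=r(r2)' `=r(r3)' `=scalar(max)', alternate)
    scalar list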
    These will match the actual values in each bin if the number of observations is odd, and almost exactly if the number of observations is even.
    Last edited by Dimitriy V. Masterov; 08 Feb 2023, 19:38.
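
    The same idea can be sketched for the five bins asked about in the original question. This is only an illustration: price from the auto data stands in for the poster's height variable (the poster's data are not shown), and the local names lo and hi are hypothetical.

    ```stata
    sysuse auto, clear
    * price is a stand-in for the poster's variable height (assumption)
    summarize price, meanonly
    local lo = r(min)
    local hi = r(max)
    * nq(5) leaves the four interior cutpoints in r(r1) ... r(r4)
    _pctile price, nq(5)
    display "bin 1: " `lo'  " to " r(r1)
    display "bin 2: " r(r1) " to " r(r2)
    display "bin 3: " r(r2) " to " r(r3)
    display "bin 4: " r(r3) " to " r(r4)
    display "bin 5: " r(r4) " to " `hi'
    ```

    Replacing price with height should list the ranges for the poster's own data, subject to the odd/even caveat above.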

    • #3
      Another option is

      Code:
      . xtile group = price, nq(4)
      
      . table group, stat(min price) stat( max price) stat(count group) nototal
      
      -----------------------------------------------------------------------------------
                           |  Minimum value   Maximum value   Number of nonmissing values
                           |          Price           Price          4 quantiles of price
      ---------------------+-------------------------------------------------------------
      4 quantiles of price |                                                            
        1                  |           3291            4195                            19
        2                  |           4296            4934                            18
        3                  |           5079            6342                            19
        4                  |           6486           15906                            18
      -----------------------------------------------------------------------------------
      This won't match the histogram exactly, but is one line of code.

      • #4
        eqprhistogram on SSC is an old program written in response to a thread on Statalist in 2003.

        It's a distraction here. As Dimitriy V. Masterov points out, quantile binning is provided by xtile and indeed some community-contributed commands.

        Despite a great deal of enthusiasm in various quarters, this method often disappoints.

        1. (Small print) Equal bin frequencies are only possible if the number of observations is a multiple of the number of bins.

        2. (Larger print) Tied values often frustrate the ideal of equal frequencies. Researchers are often surprised or puzzled by this, but Stata's rules include one that observations with identical values must be assigned to the same bin.

        3. (Largest print) Extreme bins in particular are remarkably, or according to the context, unsurprisingly heterogeneous given (especially) skewness or outliers. Whether this frustrates your goals or is entirely supportive of them has to be thought through in each project.

        Some examples and more importantly some sceptical references can be found within

        https://www.stata-journal.com/articl...article=pr0054 (Section 4)

        https://www.stata-journal.com/articl...article=dm0095 (Section 1 and Section 6)
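
        Points 2 and 3 can be checked with a short sketch on the auto data. The choice of rep78 (which takes only a handful of distinct values) and the new variable names rep_q and price_q are illustrative assumptions, not part of the original posts.

        ```stata
        sysuse auto, clear
        * rep78 is heavily tied, so the four bins cannot have equal frequencies
        xtile rep_q = rep78, nq(4)
        tabulate rep_q
        * price has few ties, so its bins come out close to equal in size
        xtile price_q = price, nq(4)
        tabulate price_q
        * range of price within each bin: compare the width of the top bin with the others
        tabstat price, by(price_q) statistics(min max n)
        ```

        The tabulation for rep78 should show very unequal counts, and the table in #3 already shows how wide the top price bin is (6486 to 15906) compared with the others.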

        • #5
          Thank you, Dimitriy V. Masterov and Nick Cox.

          I am sorry: I do not know much about statistics, so I may not be explaining my question properly. I am looking for a method in which the cut points are not necessarily values that occur in the data. I think it will be based on the distribution and density. An equal-probability histogram with a kernel density estimate is a good method for this purpose, but only for visual interpretation. I am looking for a more formal method that I can cite in my article.

          I want to run a binary logistic regression to model the factors affecting nest-site selection in a bird species. I think it is better to transform continuous variables to categorical variables because, for example, nest height above the ground can have different effects in different ranges (intervals): 0-200 cm might have a negative effect on nest-site selection while 200-250 cm has a positive effect, and the effect may change again in later ranges (+ - - + + - +). After running the binary logistic regression we would then have a coefficient for each category instead of a single coefficient for the variable.

          Thanks
          Last edited by Mostafa Ahmadi; 09 Feb 2023, 06:06.

          • #6
            With more information your question now looks quite different: it is all to do with a binary response and (e.g.) nest height as a control. A relationship can be quite complicated, but it is not always (in my experience, rarely) better handled by categorising a control than by treating it as continuous. It is hard for me to say more without seeing what the data look like.
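
            As one hedged illustration of keeping a control continuous, the sketch below fits a logit with a quadratic term in height, which allows the effect to rise and then fall without any binning. The data are simulated purely for illustration; the variable names presence and height, the sample size of 96, and the quadratic form are all assumptions, not the poster's data or a recommended model.

            ```stata
            clear
            set seed 2023
            set obs 96
            * simulated nest heights in cm (assumption, not real data)
            generate height = runiform(0, 400)
            * simulated presence whose probability peaks near 200 cm
            generate presence = runiform() < invlogit(-4 + 0.04*height - 0.0001*height^2)
            * quadratic in height via factor-variable notation
            logit presence c.height##c.height
            * predicted probability of presence along a grid of heights
            margins, at(height=(0(50)400))
            marginsplot
            ```

            The margins output gives a predicted probability at each height on the grid, so a change in the direction of the effect can be read off without choosing cut points.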

            • #7
              I am using presence/absence data as the response variable (presence and absence of nests), with 32 observations of presence and 64 of absence. The predictors are both continuous and categorical, but I want to transform the continuous variables to categorical. I am building the equal-probability histograms from the presence data. I am using Minitab to run the binary logistic regression because it separates continuous and categorical variables and also provides a coefficient for each category. If you think I could use another statistical method, I would be grateful to know.

              I now think that visual interpretation of an equal-probability histogram with a kernel density estimate is a good method for determining cut points and ranges, particularly as I can run pairwise comparisons of means to check whether each category was built properly.

              I have doubts about the absence data I collected because they were not collected randomly; I actually measured places where I thought birds could build nests.

              Thanks.
              Attached Files
              Last edited by Mostafa Ahmadi; 09 Feb 2023, 11:18.

              • #8
                If you want Stata advice, please follow https://www.statalist.org/forums/help#stata (which explains for example why spreadsheets are not a good idea for us).

                For general statistical advice, I recommend https://stats.stackexchange.com/ as the best forum I know (there may be other suggestions). If you ask there about binning continuous predictors, most people will advise against it.

                I've not used Minitab for some decades and can't advise on the best forum for Minitab support.

                • #9
                  I think discriminant analysis can be helpful alongside binary logistic regression.

                  Thank you so much.
                  Last edited by Mostafa Ahmadi; 10 Feb 2023, 04:05.
