  • ranksum test porder option

    Hi all,

    Good morning!
    I have run a Wilcoxon rank-sum test with the porder option in Stata.
    The variable we are testing is average plot size in male-owned plots (male_owned=1) vs female-owned plots (male_owned=0).
    In the descriptive statistics, the mean values obtained are:
    male-owned plots: 0.5 ha
    female-owned plots: 0.6 ha
    But the ranksum test results show:
    Two-sample Wilcoxon rank-sum (Mann-Whitney) test

      male_owned |      obs    rank sum    expected
    -------------+---------------------------------
               0 |      975     5914037   6244387.5
               1 |    11833    76114799    75784449
    -------------+---------------------------------
        combined |    12808    82028836    82028836

    unadjusted variance   1.231e+10
    adjustment for ties   -24263240
                         ----------
    adjusted variance     1.229e+10

    Ho: plotsize(male_owned==0) = plotsize(male_owned==1)
                 z =  -2.980
        Prob > |z| =   0.0029

    P{plotsize(male_owned==0) > plotsize(male_owned==1)} = 0.471

    I am confused about the interpretation of these results.
    As I understand it, the result is significant, and in 47 out of 100 comparisons the female-owned plot is larger,
    which means that in 53 out of 100 the male-owned plot is larger.
    But the averages show that mean plot size is 0.6 for female-owned plots and 0.5 for male-owned plots.
    So what is the conclusion in terms of direction?
    Can we say that plot size in male-owned plots > female-owned plots,
    or that plot size in female-owned plots > male-owned plots?

    kindly help
    thanks and regards
    Anupama

  • #2
    first, please use CODE delimiters (as explained in the FAQ) to make your posts easier to read

    second, my guess, but you don't provide sufficient information to assess this, is that the distributions across the 2 groups are very different in shape; you could look at graphs to check this; note also that the porder result is exactly equal to 1-c statistic from logistic regression; you could try
    Code:
    logistic male_owned plotsize
    lroc, nog
    to see this

    one way to think about the porder result is as follows: pair each male-owned plot with each female owned plot and count the number of times the female owned plot is larger (and then divide by the total number of pairs) - this is clearly a different question than a comparison of the mean values so the fact that the answer is also different should not be that surprising
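
    As a rough check of that reading, the reported probability can be recovered from the rank sums in the output in #1. This is only a sketch: it assumes that a tied pair counts as one half (the midrank convention), and the numbers are copied from the table in #1.

    ```stata
    * Mann-Whitney U for the female-owned group (male_owned == 0):
    * its rank sum minus the smallest possible rank sum, n0*(n0+1)/2
    display 5914037 - 975*976/2

    * Share of all n0*n1 female/male pairs in which the female-owned plot is larger
    display (5914037 - 975*976/2) / (975*11833)
    ```

    The second result is about 0.471, the porder figure, which supports reading it as a proportion of pairs rather than a statement about means.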

    • #3
      Thank you Rich for your quick response.
      The sample sizes are 11783 male-owned plots and 924 female-owned plots.
      I am presenting the descriptive stats
      with average plotsize as one of the variables.
                              pooled sample   female owned   male owned   difference   direction
      average plotsize (ha)        0.6            0.6           0.5         0.1**      female<male
      observations               12808            924         11783
      Is the table correct? especially with direction. Please guide

      • #4
        The table is incorrect if you are taking ranksum as testing a difference between means (averages). It's not that at all. It is a test of stochastic dominance: of whether typical male-owned plots are bigger (smaller) than typical female-owned plots, typical being made precise by comparing all possible pairs.

        A graphical interpretation of this probability can be seen through a so-called dominance diagram. See domdiag from SSC. Here's an example:

        Code:
        . sysuse auto, clear
        (1978 Automobile Data)
        
        . ranksum mpg, by(foreign) porder
        
        Two-sample Wilcoxon rank-sum (Mann-Whitney) test
        
             foreign |      obs    rank sum    expected
        -------------+---------------------------------
            Domestic |       52      1688.5        1950
             Foreign |       22      1086.5         825
        -------------+---------------------------------
            combined |       74        2775        2775
        
        unadjusted variance     7150.00
        adjustment for ties      -36.95
                             ----------
        adjusted variance       7113.05
        
        Ho: mpg(foreign==Domestic) = mpg(foreign==Foreign)
                     z =  -3.101
            Prob > |z| =   0.0019
            Exact Prob =   0.0016
        
        P{mpg(foreign==Domestic) > mpg(foreign==Foreign)} = 0.271
        
        . domdiag mpg, by(foreign) yla(, ang(h))

        There are 22 x 52 = 1144 possible pairs with one foreign car and one domestic car. The probability concerned is in principle calculated from all those pairs. The help for the domdiag command gives more detail and some key references. I particularly recommend

        Newson, R. 2002. Parameters behind "nonparametric" statistics: Kendall's tau, Somers' D and median differences.
        Stata Journal 2: 45-64. http://www.stata-journal.com/sjpdf.h...iclenum=st0007

        [Image: domdiag.png (dominance diagram)]

        That said: in your case, the diagram would not be so clear, as it would be based on 11 million pairs or so. I am mentioning it just as a way of explaining what the porder option calculates.
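
        To make the pair counting concrete, here is a minimal sketch that forms every foreign/domestic pair in the auto data by brute force. It assumes that a tied pair counts as one half, which is an assumption about the convention rather than something stated above. The names mpg_dom, mpg_for and score are made up for the illustration, and the lines need to be run together as one do-file because of the temporary file.

        ```stata
        * Domestic cars' mpg, saved to a temporary file
        sysuse auto, clear
        keep if foreign == 0
        keep mpg
        rename mpg mpg_dom
        tempfile dom
        save `dom'

        * Foreign cars' mpg, then every foreign car paired with every domestic car
        sysuse auto, clear
        keep if foreign == 1
        keep mpg
        rename mpg mpg_for
        cross using `dom'

        * 1 if the domestic car has higher mpg, 0.5 for a tie, 0 otherwise
        generate double score = (mpg_dom > mpg_for) + 0.5 * (mpg_dom == mpg_for)
        summarize score
        ```

        If the tie convention assumed here matches what ranksum does, the mean of score should agree with the 0.271 shown in the output above.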

        The mean plot size is what it is, but I would expect plot size comparisons to make most sense on logarithmic scale. What's highly typical -- in rich countries as well as poor -- is that many people have very small plots and a few have much larger plots. In these circumstances the geometric mean is often a better summary, and indeed -- what is also mentioned in most elementary texts I have encountered -- the median is pertinent too. In my view, geometric means should be used much more often than they are.
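
        A minimal sketch of those summaries for the data in #1, assuming the variables are named plotsize and male_owned as in the ranksum output and that all plot sizes are positive (a geometric mean needs positive values):

        ```stata
        * Arithmetic, geometric and harmonic means with confidence intervals, one group at a time
        ameans plotsize if male_owned == 0
        ameans plotsize if male_owned == 1

        * Count, mean, median and quartiles side by side
        tabstat plotsize, by(male_owned) statistics(n mean p25 median p75)
        ```

        With strongly right-skewed data the geometric mean and the median will often sit well below the arithmetic mean, which is one way the means and the rank-based comparison can point in different directions.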

        All that said, plotting the data too is the best way to put any summary measures or overall comparisons in full context. Here is one possibility that attempts to show detail that might be important as well as summary measures.



        What we have here, for each group:

        1. All values plotted versus an implied rank, in other words a quantile plot; we can see any outliers easily and some broad features. Here the staircase effect arises from a convention of reporting mpg rounded to integers, which we usually won't care about.

        2. A box with median and quartiles, in this case extended to minimum and maximum. I don't bother with fiddly rules such as plotting points individually if and only if they lie more than 1.5 IQR from the nearer quartile, as my quantile plot shows all the detail.

        3. A reference line showing (in this case) the mean of each group. (Some people want to superimpose a marker symbol for the mean on the box instead.)

        The graph uses stripplot from SSC.

        Code:
        stripplot mpg , over(foreign) vertical cumul box(barw(0.05)) boffset(-0.1)  pctile(0) refline xla(, tlc(none)) yla(, ang(h))


        In your case I would recommend trying the extra options cumprob reflevel(gmean) ysc(log)
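
        Combined with the command above, that would look something like the following. It is only a sketch: plotsize and male_owned are taken from the output in #1, and this option list has not been run on those data.

        ```stata
        * stripplot is community-contributed: ssc install stripplot
        stripplot plotsize, over(male_owned) vertical cumul cumprob box(barw(0.05)) boffset(-0.1) ///
            pctile(0) refline reflevel(gmean) ysc(log) xla(, tlc(none)) yla(, ang(h))
        ```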

        Incidentally, if memory serves me right it was Rich Goldstein and myself who gently pushed Stata (the company) into implementing the porder option, perhaps 20 or 25 years ago.
        Last edited by Nick Cox; 03 Sep 2020, 03:52.

        • #5
          Nick Cox's memory is the same as mine on this issue; note also that in the 9/94 issue of the STB I had written up a ranksum2 command that implemented this

          re: #3 above, the table is not relevant, nor is it what I suggested in #2 (you might try overlapping -kdensity- graphs); as I said in #2 and as Nick also said in #4, the porder option of -ranksum- is not comparable to a comparison of means
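
          A minimal sketch of such overlaid density graphs, assuming the variable names plotsize and male_owned from #1; ln_plotsize and the legend labels are made up for the illustration:

          ```stata
          * Overlaid kernel density estimates of plot size by ownership
          twoway (kdensity plotsize if male_owned == 0) ///
                 (kdensity plotsize if male_owned == 1), ///
                 legend(order(1 "Female-owned" 2 "Male-owned")) xtitle("Plot size (ha)")

          * The same comparison for logged plot size, which may be easier to read for skewed data
          generate double ln_plotsize = ln(plotsize)
          twoway (kdensity ln_plotsize if male_owned == 0) ///
                 (kdensity ln_plotsize if male_owned == 1), ///
                 legend(order(1 "Female-owned" 2 "Male-owned")) xtitle("ln plot size")
          ```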

          • #6

            Thanks a lot Nick and Rich for your valuable advice.

            Kindly let me know what the appropriate test of equality of means is in this case. I know that ttest is not appropriate.
            Is it appropriate to present the descriptive statistics in logarithmic values rather than the actual values, or some variables in actual values and some in logarithmic?

            Please advise.

            • #7
              These may seem simple questions but the answer depends on your context and what instructions or expectations may apply to what you do and how you present it. That might well vary depending on whether you are working at first degree, Master's, Ph.D. or postdoc level or are an independent researcher and what outcomes are envisaged for your project, including even the possibility of a presentation to a lay audience.

              For myself I would

              0. Always state units of measurement for a variable like area.

              1. Plot the data on logarithmic scale, noting that the median, quartiles and extremes work well with logarithmic scale (but see https://www.stata.com/support/faqs/g...ithmic-scales/ for a warning about details)

              1'. Look at all the information provided by summarize, detail

              2. Apply a t-test circumspectly. With sample sizes such as yours even with marked skewness the results might be clear-cut.

              2'. Use bootstrapping to get a confidence interval for difference of means.

              2''. Apply a t-test to logged data. Now you are comparing geometric means.

              There are yet other possibilities but I will stop there.
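
              Steps 1' to 2'' might be sketched as follows. The variable names plotsize and male_owned come from #1. The number of replications, the seed, the stratified resampling and the name log_plotsize are arbitrary choices made for the illustration, not recommendations.

              ```stata
              * 1'. Full summary by group, including skewness and percentiles
              bysort male_owned: summarize plotsize, detail

              * 2. t test on the raw scale, not assuming equal variances
              ttest plotsize, by(male_owned) unequal

              * 2'. Bootstrap confidence interval for the difference of means,
              *     resampling within each ownership group
              bootstrap diff = (r(mu_1) - r(mu_2)), reps(1000) seed(12345) strata(male_owned): ///
                  ttest plotsize, by(male_owned) unequal
              estat bootstrap, percentile

              * 2''. t test on logged plot size, which compares geometric means
              generate double log_plotsize = ln(plotsize)
              ttest log_plotsize, by(male_owned)
              ```

              Here r(mu_1) and r(mu_2) are the two group means stored by ttest, so diff is the female-owned mean minus the male-owned mean given the 0/1 coding in #1.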

              I know that Rich Goldstein is taking a vacation, which affects how likely he is to reply soon.

              Interpretation here very likely also requires social and legal information on land inheritance and ownership, gender roles, and so forth.

              • #8
                Thanks a lot for your response and your valuable time. This helps a lot

                • #9
                  I agree with what Nick Cox wrote but add the following:

                  many years ago I was interested in this also and wrote at least 2 STB contributions on testing means of skew data; search for and download -johnson- and -obrien-; the help files are very short so you will want to look at the STB write-ups that are freely available at the Stata web site

                  -glm- is certainly useful here with various families and links (e.g., normal and log link); more directly, poisson is a clear alternative (see Bill Gould's blog)
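
                  A minimal sketch of those two models, assuming the variable names plotsize and male_owned from #1; the robust standard errors and the exponentiated display are choices made for the illustration:

                  ```stata
                  * Gaussian family with log link; eform shows exp(coefficient),
                  * here the ratio of arithmetic mean plot sizes (male-owned / female-owned)
                  glm plotsize i.male_owned, family(gaussian) link(log) vce(robust) eform

                  * Poisson regression with robust standard errors, applied to a positive non-count outcome
                  poisson plotsize i.male_owned, vce(robust) irr
                  ```

                  With a single binary predictor both models estimate the same ratio of means; they differ in the standard errors and in how they would handle added covariates.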

                  • #10
                    Thanks a lot, I shall download and go through them
