ranksum test porder option

Anupama gv

Join Date: Mar 2020

Posts: 28
#1


02 Sep 2020, 18:25

Hi all,

Good morning!
I have run a Wilcoxon rank-sum test with the porder option in Stata.
The variable being tested is average plot size in male-owned plots (male_owned=1) vs female-owned plots (male_owned=0).
In the descriptive statistics, the mean values obtained are:
male owned plots: 0.5 ha
female owned plots: 0.6 ha
But ranksum test results show:
Two-sample Wilcoxon rank-sum (Mann-Whitney) test

  male_owned |      obs    rank sum    expected
-------------+---------------------------------
           0 |      975     5914037   6244387.5
           1 |    11833    76114799    75784449
-------------+---------------------------------
    combined |    12808    82028836    82028836

unadjusted variance    1.231e+10
adjustment for ties    -24263240
                       ----------
adjusted variance      1.229e+10

Ho: plotsize(male_owned==0) = plotsize(male_owned==1)
             z =  -2.980
    Prob > |z| =   0.0029

P{plotsize(male_owned==0) > plotsize(male_owned==1)} = 0.471

I am confused about the interpretation of the results.
As I understand it, the result is significant, and in 47 out of 100 comparisons the female-owned plot is larger.
That means that in 53 out of 100 comparisons the male-owned plot is larger.
But the averages show that the mean plot size is 0.6 ha for female-owned plots and 0.5 ha for male-owned plots.
So what is the conclusion in terms of direction?
Can we say that plot size in male-owned plots > female-owned plots,
or that plot size in female-owned plots > male-owned plots?

kindly help
thanks and regards
Anupama
Tags: None
Rich Goldstein

Join Date: Mar 2014

Posts: 4455
#2

02 Sep 2020, 18:57

first, please use CODE delimiters (as explained in the FAQ) to make your posts easier to read

second, my guess, though you don't provide sufficient information to assess this, is that the distributions in the 2 groups are very different in shape; you could look at graphs to check this; note also that the porder result is exactly equal to 1 minus the c statistic (the area under the ROC curve) from a logistic regression; you could try

Code:

logistic male_owned plotsize
lroc, nog

to see this

one way to think about the porder result is as follows: pair each male-owned plot with each female owned plot and count the number of times the female owned plot is larger (and then divide by the total number of pairs) - this is clearly a different question than a comparison of the mean values so the fact that the answer is also different should not be that surprising
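A minimal sketch of the arithmetic behind this, not taken from the posts above. With midranks, the rank sum of a group equals n1(n1+1)/2 plus the number of pairs in which that group's value is larger, with ties counted as one half. The first line plugs in the numbers posted in #1 and should come out at about 0.471. The rest is a cross-check on the auto data, which is an assumption here, chosen only because it ships with Stata.

```stata
* porder from the rank sum in #1: female-owned group, n1 = 975, n2 = 11833
display (5914037 - 975*976/2) / (975*11833)

* Cross-check on the auto data: porder versus 1 minus the area under the ROC curve
sysuse auto, clear
ranksum mpg, by(foreign) porder
quietly logistic foreign mpg
quietly lroc, nograph
display 1 - r(area)
```

The last display should agree with the probability that ranksum reports, because both count ties as half a win.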
Comment
Anupama gv

Join Date: Mar 2020

Posts: 28
#3

02 Sep 2020, 20:10

Thank you Rich for your quick response.
The sample sizes are 11783 male-owned plots and 924 female-owned plots.
I am presenting the descriptive statistics,
with average plot size as one of the variables.

                    pooled sample   female owned   male owned   difference   direction
average plotsize    0.6             0.6            0.5          0.1**        female<male
observations        12808           924            11783

Is the table correct, especially the direction? Please guide.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35601
#4

03 Sep 2020, 03:50

The table is incorrect if you are taking ranksum as testing a difference between means (averages). It's not that at all. It is a test of stochastic dominance: of whether typical male-owned plots are bigger (smaller) than typical female-owned plots, "typical" being made precise by comparing all possible pairs.

A graphical interpretation of this probability can be seen through a so-called dominance diagram. See domdiag from SSC. Here's an example:

Code:

. sysuse auto, clear
(1978 Automobile Data)

. ranksum mpg, by(foreign) porder

Two-sample Wilcoxon rank-sum (Mann-Whitney) test

     foreign |      obs    rank sum    expected
-------------+---------------------------------
    Domestic |       52      1688.5        1950
     Foreign |       22      1086.5         825
-------------+---------------------------------
    combined |       74        2775        2775

unadjusted variance     7150.00
adjustment for ties      -36.95
                     ----------
adjusted variance       7113.05

Ho: mpg(foreign==Domestic) = mpg(foreign==Foreign)
             z =  -3.101
    Prob > |z| =   0.0019
    Exact Prob =   0.0016

P{mpg(foreign==Domestic) > mpg(foreign==Foreign)} = 0.271

. domdiag mpg, by(foreign) yla(, ang(h))

There are 22 x 52 possible pairs with one foreign car and one domestic car. The probability concerned is in principle calculated from all those pairs. The help for the domdiag command gives more detail and some key references. I particularly recommend
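For readers who want to see the pair counting done literally, here is a rough sketch using the official cross command. The variable names mpg_dom, mpg_for and win and the tempfile name are made up for illustration, and counting a tie as half a win is an assumption, although it matches the midrank arithmetic of the rank-sum test.

```stata
sysuse auto, clear
keep mpg foreign

* Save the foreign cars' mpg under a new name in a temporary file
preserve
keep if foreign == 1
rename mpg mpg_for
keep mpg_for
tempfile foreigncars
save `foreigncars'
restore

* Keep the domestic cars and form every domestic-foreign pair
keep if foreign == 0
rename mpg mpg_dom
keep mpg_dom
cross using `foreigncars'
count

* Score 1 if the domestic car has higher mpg, 0.5 for a tie, 0 otherwise
generate double win = (mpg_dom > mpg_for) + 0.5 * (mpg_dom == mpg_for)
summarize win
```

The count should be 22 x 52 = 1144, and the mean of win should reproduce the 0.271 reported by porder. Run it from a do-file so that the temporary file name stays defined throughout.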

Newson, R. 2002. Parameters behind "nonparametric" statistics: Kendall's tau, Somers' D and median differences.
Stata Journal 2: 45-64. http://www.stata-journal.com/sjpdf.h...iclenum=st0007

That said: in your case, the diagram would not be so clear, as based on 11 million pairs or so. I am mentioning it just as a way of explaining what the porder option calculates.

The mean plot size is what it is, but I would expect plot size comparisons to make most sense on logarithmic scale. What's highly typical -- in rich countries as well as poor -- is that many people have very small plots and a few have much larger plots. In these circumstances the geometric mean is often a better summary, and indeed -- what is also mentioned in most elementary texts I have encountered -- the median is pertinent too. In my view, geometric means should be used much more often than they are.
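A small sketch of how the three summaries could be put side by side, using the variable names from #1. It assumes that plotsize is strictly positive, so that logarithms are defined. The new variable names ln_plotsize, mean_ln and gmean are only illustrative.

```stata
* Assumes plotsize > 0 for every plot, so that ln() is defined
generate double ln_plotsize = ln(plotsize)

* Arithmetic mean and median on the original scale, by group
tabstat plotsize, by(male_owned) statistics(n mean median) nototal

* Geometric mean = exp(mean of logs), computed within each group
egen double mean_ln = mean(ln_plotsize), by(male_owned)
generate double gmean = exp(mean_ln)
tabstat gmean, by(male_owned) statistics(mean) nototal
```

With right-skewed sizes the geometric mean and the median will usually sit well below the arithmetic mean, which is one way for a comparison of means and a rank-based comparison to point in different directions.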

All that said, plotting the data too is the best way to put any summary measures or overall comparisons in full context. Here is one possibility that attempts to show detail that might be important as well as summary measures.

What we have here, for each group:

1. All values plotted versus an implied rank, in other words a quantile plot; we can see any outliers easily and some broad features. Here the staircase effect arises from a convention of reporting mpg rounded to integers, which we usually won't care about.

2. A box with median and quartiles, in this case extended to minimum and maximum. I don't bother with fiddly rules such as plotting points individually if and only if they lie more than 1.5 IQR from the nearer quartile, as my quantile plot shows all the detail.

3. A reference line showing (in this case) the mean of each group. (Some people want to superimpose a marker symbol for the mean on the box instead.)

The graph uses stripplot from SSC.

Code:

stripplot mpg , over(foreign) vertical cumul box(barw(0.05)) boffset(-0.1) pctile(0) refline xla(, tlc(none)) yla(, ang(h))

In your case I would recommend trying the extra options cumprob reflevel(gmean) ysc(log)
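Put together for the plot-size data, the call might look something like the sketch below. This simply combines the options named in this post with the variable names from #1. It has not been run on the original data, and it assumes plotsize is strictly positive so that a logarithmic scale makes sense.

```stata
* stripplot is user-written; install it once from SSC
ssc install stripplot

* Quantile plot plus box per group, with geometric means as reference lines
stripplot plotsize, over(male_owned) vertical cumul cumprob ///
    box(barw(0.05)) boffset(-0.1) pctile(0) ///
    refline reflevel(gmean) ysc(log) ///
    xla(, tlc(none)) yla(, ang(h))
```

The /// line continuations work in a do-file; in the Command window the whole command has to go on one line.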

Incidentally, if memory serves me right it was Rich Goldstein and myself who gently pushed Stata (the company) into implementing the porder option, perhaps 20 or 25 years ago.

Last edited by Nick Cox; 03 Sep 2020, 03:52.
Comment
Rich Goldstein

Join Date: Mar 2014

Posts: 4455
#5

03 Sep 2020, 05:37

Nick Cox 's memory is the same as mine on this issue; note also that in 9/94 issue of the STB I had written up a ranksum2 command that implemented this

re: #3 above, the table is not relevant, nor is it what I suggested in #2 (you might try overlapping -kdensity- graphs); as I said in #2 and as Nick also said in #4, the porder option of -ranksum- is not comparable to a comparison of means
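A minimal sketch of the overlapping density graphs, using the variable names from #1. The legend labels and the axis title are assumptions.

```stata
* Overlay kernel density estimates of plot size for the two groups
twoway (kdensity plotsize if male_owned == 1) ///
       (kdensity plotsize if male_owned == 0), ///
       legend(order(1 "Male owned" 2 "Female owned")) ///
       xtitle("Plot size (ha)") ytitle("Density")
```

If the two curves differ mainly in shape or in the length of the right tail rather than by a simple shift, that would go some way to explaining why the means and the porder result disagree.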
Comment
Anupama gv

Join Date: Mar 2020

Posts: 28
#6

03 Sep 2020, 21:23

Thanks a lot Nick and Rich for your valuable advice.

Kindly let me know what an appropriate test is for equality of means in this case. I know that ttest is not appropriate.
Is it appropriate to present the descriptive statistics in logarithmic values rather than the actual values, or some variables in actual values and some in logarithmic?

Please advise.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35601
#7

04 Sep 2020, 03:17

These may seem simple questions, but the answer depends on your context and on what instructions or expectations may apply to what you do and how you present it. That might well vary depending on whether you are working at first degree, Master's, Ph.D. or postdoc level or are an independent researcher, and on what outcomes are envisaged for your project, including even the possibility of a presentation to a lay audience.

For myself I would

0. Always state units of measurement for a variable like area.

1. Plot the data on logarithmic scale, noting that the median, quartiles and extremes work well with logarithmic scale (but see https://www.stata.com/support/faqs/g...ithmic-scales/ for a warning about details)

1'. Look at all the information provided by summarize, detail

2. Apply a t-test circumspectly. With sample sizes such as yours even with marked skewness the results might be clear-cut.

2'. Use bootstrapping to get a confidence interval for difference of means.

2''. Apply a t-test to logged data. Now you are comparing geometric means.

There are yet other possibilities but I will stop there.
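The numbered suggestions 1' to 2'' could be sketched roughly as follows, with the variable names from #1. The unequal option, the number of replications, the seed and the name ln_size are assumptions added for illustration, not part of the advice itself, and plotsize is assumed to be strictly positive.

```stata
* 1'. Full summary by group, including median, quartiles and skewness
bysort male_owned: summarize plotsize, detail

* 2. t test on the original scale, not assuming equal variances
ttest plotsize, by(male_owned) unequal

* 2'. Bootstrap confidence interval for the difference of means,
*     resampling within each ownership group
bootstrap diff = (r(mu_1) - r(mu_2)), reps(1000) seed(12345) strata(male_owned): ///
    ttest plotsize, by(male_owned)
estat bootstrap, percentile

* 2''. t test on logged data, which compares geometric means
generate double ln_size = ln(plotsize)
ttest ln_size, by(male_owned)
```

Exponentiating the difference in mean logs from the last test gives the ratio of the two geometric means.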

I know that Rich Goldstein is taking a vacation, which affects how likely he is to reply soon.

Interpretation here very likely also requires social and legal information on land inheritance and ownership, gender roles, and so forth.
Comment
Anupama gv

Join Date: Mar 2020

Posts: 28
#8

04 Sep 2020, 04:12

Thanks a lot for your response and your valuable time. This helps a lot
Comment
Rich Goldstein

Join Date: Mar 2014

Posts: 4455
#9

04 Sep 2020, 04:15

I agree with what Nick Cox wrote but add the following:

many years ago I was interested in this also and wrote at least 2 STB contributions on testing means of skew data; search for and download -johnson- and -obrien-; the help files are very short so you will want to look at the STB write-ups that are freely available at the Stata web site

-glm- is certainly useful here with various families and links (e.g., normal and log link); more directly, poisson is a clear alternative (see Bill Gould's blog)
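As a rough sketch of what those two suggestions might look like with the variable names from #1 (the use of robust standard errors and of the eform and irr display options is an assumption here, not something stated above):

```stata
* Gaussian family with log link: models the log of the mean plot size
glm plotsize i.male_owned, family(gaussian) link(log) eform

* Poisson regression with robust standard errors on the same mean structure
poisson plotsize i.male_owned, vce(robust) irr
```

In both cases the exponentiated coefficient on male_owned estimates the ratio of the arithmetic means of the two groups, which is a different quantity from the ratio of geometric means that a t test on logged data addresses.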
Comment
Anupama gv

Join Date: Mar 2020

Posts: 28
#10

04 Sep 2020, 05:26

Thanks a lot, I shall download and go through them
Comment
