Sample subset of data with pre defined mean

Vincent Koppelmans

Join Date: Nov 2014

Posts: 14
#1


14 Nov 2014, 15:19

Hi Stata experts,

I have a group of cases and a group of controls that is x times as large. I want to match the groups on variable A. Variable A is normally distributed in the case group but follows a bimodal distribution in the control group, because the control group combines two groups in which A is normally distributed with slightly different means. I could use frequency matching with syntax I found in a previous Statalist post: http://www.stata.com/statalist/archi.../msg00326.html

However, I believe I could potentially include more control subjects if I were able to <<draw a sample from the total control group with predefined mean of variable A>>. Of course, the predefined mean would be the mean of the case group. I realize the standard deviation would likely differ between the control group and the case group (which would not happen with frequency matching).

Is it possible to draw a sample of an existing data set with a predefined mean?
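
To make the goal concrete: the "predefined mean" is simply the case-group mean, which can be stored in a local macro and compared with the control group afterwards. This is a minimal sketch on simulated data; the indicator `case`, the group sizes, and the simulated means are assumptions for illustration, not values from my data.

```stata
* Sketch on simulated data; all names and numbers here are hypothetical
clear
set seed 2014
set obs 1200

* hypothetical indicator: 1 = case (first 200 rows), 0 = control
gen byte case = _n <= 200
gen A = cond(case, rnormal(.5), rnormal(.75))

* store the case-group mean as the target for the control sample
quietly summarize A if case
local target = r(mean)
display "target mean for controls: " %6.4f `target'

* compare n, mean and SD across groups (rerun after any subsetting)
tabstat A, by(case) statistics(n mean sd)
```

Rerunning the `tabstat` line after selecting controls would show whether the means agree and how far apart the standard deviations end up.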
ben earnhart

Join Date: May 2014

Posts: 1027
#2

14 Nov 2014, 15:33

The crude way would be to simply lop off the top or bottom x% of cases to force the mean to what you want. Simply sort by A and figure out where the cutpoint is.
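
For the crude version, one way to find the cutpoint is a running mean over the sorted values. A minimal sketch, assuming simulated data and a target of .5 (the names `runmean`, `n`, and `cut` are just placeholders), for the case where the mean has to come down:

```stata
* Sketch only: simulated data stand in for the control group
clear
set seed 12345
set obs 1000
gen A = rnormal(.75)

* assumed target; in practice this would be the case-group mean
local target = .5

* sort ascending; runmean[n] is the mean of the n smallest values
sort A
gen double runmean = sum(A) / _n
gen long n = _n

* the running mean never decreases on sorted data, so the largest n
* with runmean <= target marks the cutpoint
quietly summarize n if runmean <= `target'
local cut = r(max)
keep if n <= `cut'

summarize A
```

Everything above the cutpoint is lost, so the upper tail is truncated rather than thinned.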

I think a more reasonable approach (though not necessarily that good) would be to divide the data into pieces, say quintiles. If you want to draw the mean up, randomly throw out two cases from each of the bottom two quintiles and one each from the top three quintiles. Repeat, and stop when the desired mean is achieved.

Hopefully somebody else will chime in with a better approach, or a good reason why not to do what you intend to do. Feels iffy to me, though I can't definitively say it's a bad thing to do.

ben earnhart

Join Date: May 2014
Posts: 1027

14 Nov 2014, 16:01

I was curious. See below. I needed to burn through about 50% of the cases to arrive at the new mean; you could be more aggressive about it and do it losing fewer cases (but at a cost of making the rest of the distribution even worse).

Code:

 
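clear
set seed 12345
set obs 1000

* simulated variable with mean .75 (and SD 1)
gen A = rnormal(.75)

*===== want to get the mean down to .5

*===== make quintiles
xtile quint = A, nq(5)

*===== starting value above the target, so the loop runs at least once
local themean = 1

gen randsort = .

while `themean' > .5 {
    * fresh random order within each quintile on every pass
    quietly replace randsort = runiform()
    sort quint randsort
    *===== drop one from every quintile
    quietly by quint: drop if _n == 1
    *===== extra helping from the top three quintiles
    quietly by quint: drop if _n == 1 & quint > 2
    quietly summarize A
    local themean = r(mean)
}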


ben earnhart

Join Date: May 2014
Posts: 1027

14 Nov 2014, 16:24

This appears to be a not-so-bad compromise: it keeps 75-80% of cases, with only a small increase in skew and kurtosis. YMMV (Your Mileage May Vary): if the change in mean is more dramatic, or the variable isn't continuous and normal, it could behave a lot differently.

Code:

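clear
set seed 12345
set obs 1000

* simulated variable with mean .75 (and SD 1)
gen A = rnormal(.75)

*===== want to get the mean down to .5

*===== make deciles
xtile decile = A, nq(10)

*===== starting value above the target, so the loop runs at least once
local themean = 1

gen randsort = .

while `themean' > .5 {
    * fresh random order within each decile on every pass
    quietly replace randsort = runiform()
    sort decile randsort
    *===== drop two from each of the top five deciles
    quietly by decile: drop if _n < 3 & decile > 5
    *===== drop two more from each of the top two deciles
    quietly by decile: drop if _n < 3 & decile > 8
    quietly summarize A
    local themean = r(mean)
}

* check the shape of what is left
sum A, det
hist A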


Vincent Koppelmans

Join Date: Nov 2014

Posts: 14
#5

15 Nov 2014, 16:02

Thank you, Ben. Your code works like a charm, but indeed, it burned up most of my cases.
ben earnhart

Join Date: May 2014

Posts: 1027
#6

15 Nov 2014, 16:26

Well, maybe play with the parameters and conditions for dropping cases. The first version uses quintiles, the second deciles. The first drops from the whole sample, more heavily on the upper tail; the second just from the upper tail. You might figure out a reasonable compromise. Is it a continuous variable or discrete? If it's discrete, then I'd feel even more iffy. If it's continuous, then maybe with the right parameters/conditions...
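
To make that kind of experimenting easier, the knobs can be pulled out into locals. A rough sketch on simulated data; the locals `target`, `nbins`, `lowbin`, and `perpass` are names I made up for this illustration, and the values are only a starting point:

```stata
clear
set seed 12345
set obs 1000
gen A = rnormal(.75)

* knobs to play with (all hypothetical values)
* target  = desired mean
* nbins   = number of quantile bins
* lowbin  = lowest bin that loses observations
* perpass = observations dropped per eligible bin on each pass
local target  = .5
local nbins   = 10
local lowbin  = 6
local perpass = 1

xtile bin = A, nq(`nbins')
gen double randsort = .

quietly summarize A
local themean = r(mean)
local n0 = r(N)

while `themean' > `target' {
    local nprev = _N
    quietly replace randsort = runiform()
    sort bin randsort
    quietly by bin: drop if _n <= `perpass' & bin >= `lowbin'
    * guard: stop if the eligible bins are empty
    if _N == `nprev' {
        display as error "nothing left to drop; target not reached"
        continue, break
    }
    quietly summarize A
    local themean = r(mean)
}

display "kept " _N " of `n0' observations"
summarize A, detail
```

A lower `lowbin` would be expected to spread the losses over more of the distribution at the cost of more dropped cases; a larger `perpass` reaches the target in fewer passes but can overshoot it by more.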

Good luck. I'm not certain it's the right approach anyway, though; hopefully somebody chimes in with a better approach or simply says "don't do it, and here's why."