  • Sample subset of data with pre defined mean

    Hi Stata experts,

    I have a group of cases and a group of controls that is x times as large. I want to match the groups on variable A. Variable A is normally distributed in the case group but follows a bimodal distribution in the control group, because the control group combines two subgroups in which A is normally distributed with slightly different means. I could use frequency matching, using syntax I found in a previous Statalist post: http://www.stata.com/statalist/archi.../msg00326.html

    However, I believe I could potentially include more control subjects if I were able to <<draw a sample from the total control group with a predefined mean of variable A>>. The predefined mean would of course be the mean of the case group. I realize the standard deviation would then likely differ between the control and case groups (which would not happen with frequency matching).

    Is it possible to draw a sample of an existing data set with a predefined mean?

  • #2
    The crude way would be to simply lop off the top or bottom x% of cases to force the mean to what you want. Simply sort by A and figure out where the cutpoint is.
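
    A minimal sketch of that cutpoint idea, on simulated data and for the case where the mean has to come down (so the top is lopped off). The target of .5, the seed, and the variable name runmean are assumptions made for illustration only. Because the data are sorted ascending, the running mean can never decrease, so keeping every row whose running mean is at or below the target keeps the largest possible bottom block:

    ```stata
    clear
    set seed 12345                      // arbitrary seed, for reproducibility only
    set obs 1000
    gen A = rnormal(.75)                // simulated stand-in for the control group

    local target = .5                   // assumed target mean (e.g. the case-group mean)

    * sort ascending, then compute the mean of rows 1.._n
    sort A
    gen double runmean = sum(A)/_n

    * the running mean is nondecreasing, so this keeps one contiguous block
    keep if runmean <= `target'

    summarize A
    ```

    The retained mean lands at or just below the target, but the upper tail of the distribution is cut off entirely, which is why this is the crude option.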

    I think a more reasonable (though not necessarily good) approach would be to divide the data into pieces, say quintiles. If you want to draw the mean up, randomly throw out two cases from each of the bottom two quintiles and one from each of the top three quintiles. Stop when the desired mean is achieved.

    Hopefully somebody else will chime in with a better approach, or a good reason why not to do what you intend to do. Feels iffy to me, though I can't definitively say it's a bad thing to do.


    • #3
      I was curious. See below. I needed to burn through about 50% of the cases to arrive at the new mean; you could be more aggressive and lose fewer cases, but at the cost of distorting the rest of the distribution even more.

      Code:
       
      clear
      set seed 12345   // added so the example is reproducible
      set obs 1000

      gen A = rnormal(.75)

      *===== want to get the mean down to .5

      *===== make quintiles
      xtile quint = A, nq(5)

      *===== track the current mean so we know when we get there
      summarize A, meanonly
      local themean = r(mean)

      gen randsort = .

      while `themean' > .5 {
          * runiform() is the current name for the older uniform()
          quietly replace randsort = runiform()
          sort quint randsort
          *===== drop one from every quintile
          by quint: drop if _n == 1
          *===== extra helping from the top three quintiles
          by quint: drop if _n == 1 & quint > 2
          summarize A
          local themean = r(mean)
      }


      • #4
        This appears to be a not-so-bad compromise: it keeps 75-80% of cases, with only a small increase in skew and kurtosis. YMMV (Your Mileage May Vary): if the change in mean is more dramatic, or the variable isn't continuous and normal, it could behave quite differently.

        Code:
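        clear
        set seed 12345   // added so the example is reproducible
        set obs 1000

        gen A = rnormal(.75)

        *===== want to get the mean down to .5

        *===== make deciles
        xtile decile = A, nq(10)

        *===== track the current mean so we know when we get there
        summarize A, meanonly
        local themean = r(mean)

        gen randsort = .

        while `themean' > .5 {
            * runiform() is the current name for the older uniform()
            quietly replace randsort = runiform()
            sort decile randsort
            *===== drop two from each of the top five deciles
            by decile: drop if _n < 3 & decile > 5
            *===== drop two more from each of the top two deciles
            by decile: drop if _n < 3 & decile > 8
            summarize A
            local themean = r(mean)
        }

        *===== inspect skewness, kurtosis, and the shape of what is left
        summarize A, detail
        histogram A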


        • #5
          Thank you, Ben. Your code works like a charm, but indeed, it burned up most of my cases.


          • #6
            Well, maybe play with the parameters and conditions for dropping cases. The first version uses quintiles, the second deciles. The first drops from the whole sample, more heavily on the upper tail; the second just from the upper tail. You might figure out a reasonable compromise. Is it a continuous variable or discrete? If it's discrete, then I'd feel even more iffy. If it's continuous, then maybe with the right parameters/conditions...
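
            One way to experiment is to pull the settings into local macros so they can be changed in one place. This is only a sketch on simulated data: the macro names (target, ngroups, cutoff, perpass) and their values are illustrative assumptions, not recommendations, and the loop assumes the target can actually be reached by trimming the upper groups.

            ```stata
            clear
            set seed 12345                  // arbitrary seed, for reproducibility only
            set obs 1000
            gen A = rnormal(.75)
            local n0 = _N                   // starting sample size

            * tuning knobs (hypothetical values, change and rerun)
            local target  = .5              // desired mean
            local ngroups = 10              // number of quantile groups
            local cutoff  = 5               // drop only from groups above this one
            local perpass = 1               // cases dropped per group per pass

            xtile grp = A, nq(`ngroups')
            gen double randsort = .

            summarize A, meanonly
            local themean = r(mean)

            * note: loops forever if the upper groups run out before the target is hit
            while `themean' > `target' {
                quietly replace randsort = runiform()
                sort grp randsort
                quietly by grp: drop if _n <= `perpass' & grp > `cutoff'
                summarize A, meanonly
                local themean = r(mean)
            }

            * report how much was kept and how distorted the result is
            summarize A, detail
            display "share kept: " r(N)/`n0' "   skewness: " r(skewness)
            ```

            Rerunning with different values of ngroups, cutoff, and perpass shows the trade-off between cases retained and distortion of the remaining distribution.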

            Good luck. I'm not certain it's the right approach anyway, though; hopefully somebody chimes in with a better approach or simply says "don't do it, and here's why."
