  • How do I randomly assign scores from a set of observed scores within groups?

    Dear Community,

    I have 106 groups, which I will call "strata groups", with subjects that come from two datasets, which I'll call Dataset A and Dataset B. The strata groups are the product of a stratification procedure, so the subjects are observationally similar across a vector of covariates. The subjects from Dataset A have a score that I am calling the "selection score". The subjects from Dataset B do not have this score.

    Within each strata group, I would like to randomly assign selection scores to subjects from Dataset B (who are missing scores) using the scores from subjects from Dataset A. In other words, I would like to randomly draw a number from a set of selection scores observed for Dataset A and assign that number to each subject in the strata group that is from Dataset B. The distribution of scores in each strata group is uniform (most scores are only observed once).

    The strata groups have different proportions of subjects from each dataset. So, in some strata groups there is only one subject from Dataset A and many subjects from Dataset B. In that case, all of the subjects from B should have the value from A.

    I have attempted to do this a number of different ways, but I have not been able to figure this out. Is someone able to offer a suitable looping code to help me generate these scores? I am using Stata/SE 14.2 on a Mac.

    Stacy

  • #2
    This is too complicated to do with imaginary data. Please use the -dataex- program to post a short sample of your data so we can see how your data is laid out and what it looks like, and experiment with it.

    • #3
      Hi Clyde,

      Sure! This is my first time using dataex so please let me know if I am doing this incorrectly. I appreciate your help with this.

      Here is output for subjects in "strata groups" 10 and 65. The selection score is centered on zero, which is why there are negative and positive values. The first two outputs pasted below contain data for each group separately. The third output has all of them together.

      Thank you!

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input float(subjectid strata_group Dataset_A selection_score)
       36685 10 1  -31
       39158 10 1 -151
       40609 10 1 -101
       40943 10 1  -21
       41262 10 1 -201
       41977 10 1 -174
       42972 10 1  -85
       50007 10 1  -13
      123680 10 0    .
      126954 10 0    .
      127022 10 0    .
      158919 10 0    .
      end

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input float(subjectid strata_group Dataset_A selection_score)
       29899 65 1  20
       40839 65 1 -11
       44613 65 1   3
       50843 65 1  42
      134743 65 0   .
      135596 65 0   .
      135608 65 0   .
      135614 65 0   .
      135879 65 0   .
      136154 65 0   .
      140548 65 0   .
      159957 65 0   .
      162065 65 0   .
      169702 65 0   .
      170428 65 0   .
      185070 65 0   .
      185220 65 0   .
      185228 65 0   .
      190342 65 0   .
      end

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input float(subjectid strata_group Dataset_A selection_score)
       29899 65 1   20
       36685 10 1  -31
       39158 10 1 -151
       40609 10 1 -101
       40839 65 1  -11
       40943 10 1  -21
       41262 10 1 -201
       41977 10 1 -174
       42972 10 1  -85
       44613 65 1    3
       50007 10 1  -13
       50843 65 1   42
      123680 10 0    .
      126954 10 0    .
      127022 10 0    .
      134743 65 0    .
      135596 65 0    .
      135608 65 0    .
      135614 65 0    .
      135879 65 0    .
      136154 65 0    .
      140548 65 0    .
      158919 10 0    .
      159957 65 0    .
      162065 65 0    .
      169702 65 0    .
      170428 65 0    .
      185070 65 0    .
      185220 65 0    .
      185228 65 0    .
      190342 65 0    .
      end


      • #4
        OK, the trick is to extract from the data a file that lists just the strata_groups and their associated non-missing selection_scores. Then we join that back to the original data with -joinby- (not -merge-), so that each observation in the original data is paired with every Dataset A score from the same strata_group. Then we pick one of those pairings at random for each subjectid.

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input float(subjectid strata_group Dataset_A selection_score)
         29899 65 1   20
         36685 10 1  -31
         39158 10 1 -151
         40609 10 1 -101
         40839 65 1  -11
         40943 10 1  -21
         41262 10 1 -201
         41977 10 1 -174
         42972 10 1  -85
         44613 65 1    3
         50007 10 1  -13
         50843 65 1   42
        123680 10 0    .
        126954 10 0    .
        127022 10 0    .
        134743 65 0    .
        135596 65 0    .
        135608 65 0    .
        135614 65 0    .
        135879 65 0    .
        136154 65 0    .
        140548 65 0    .
        158919 10 0    .
        159957 65 0    .
        162065 65 0    .
        169702 65 0    .
        170428 65 0    .
        185070 65 0    .
        185220 65 0    .
        185228 65 0    .
        190342 65 0    .
        end
        
        tempfile original
        save `original'
        
        //    CREATE A FILE THAT CROSSWALKS STRATA_GROUP WITH NON-MISSING
        //    SELECTION_SCORES
        keep if Dataset_A
        keep strata_group selection_score
        rename selection_score selection_score_2
        
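        //    FORM ALL POSSIBLE PAIRS WITHIN STRATA_GROUP
        //    NOTE: BY DEFAULT -JOINBY- KEEPS ONLY MATCHED OBSERVATIONS, SO A
        //    STRATA_GROUP WITH NO DATASET_A SUBJECTS WOULD BE DROPPED HERE
        joinby strata_group using `original'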
        
        //    SORT RANDOMLY THEN KEEP THE FIRST FOR EACH SUBJECTID
        set seed 1234 // OR YOUR PREFERRED RANDOM NUMBER SEED
        gen double shuffle1 = runiform()
        gen double shuffle2 = runiform()
        by subjectid (shuffle1 shuffle2), sort: keep if _n == 1
        replace selection_score = selection_score_2 if missing(selection_score)
        drop selection_score_2 shuffle*
        sort strata_group Dataset_A subjectid

        Note: I used two double-precision random numbers for the random sorting because, if your data set is large, a single random number might take duplicate values, which would make the sort order indeterminate and irreproducible. But if your data set is of only moderate size, you can get rid of shuffle2. And if your data set is small (just a few thousand observations), you can even shrink shuffle1 down to a float.
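
        One way to check whether a second random number is needed is sketched below on toy data. The variable names id and u1 and the data sizes are made up for illustration and are not from the thread. -isid- exits with an error when the listed variables do not uniquely identify the observations, so a nonzero return code means the random key has ties within some id and the sort on it alone would be indeterminate.

        Code:
        * Sketch only: toy data with hypothetical names (id, u1)
        clear
        set seed 1234
        set obs 100000
        gen long id = ceil(_n/100)    // 1,000 made-up subjects, 100 candidate rows each
        gen double u1 = runiform()    // the single random sort key being tested

        * -capture- suppresses the error so that _rc can be inspected
        capture isid id u1
        if _rc == 0 {
            display "no ties within id: one random key is enough here"
        }
        else {
            display "ties within id: add a second random key"
        }

        In the code above, the equivalent check would be -isid subjectid shuffle1-, run after the shuffle variables are generated and before the -keep if _n == 1- step.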

        • #5
          Thank you, Clyde! This worked perfectly.
