  • Random sample from a large dataset with set mean/ distribution

    Hello, I have a large data set of a population with some baseline characteristics, e.g. mean age, % men, mean BMI, etc. How should I draw a smaller random sample that has the same baseline characteristics across multiple parameters? For example, if in the larger data set the mean (SD) age of the population is 40.3 (15.0), with 89% men and mean (SD) BMI 20.8 (3.9), I want to draw a smaller random sample with the same mean age, % men, and mean BMI as the larger dataset.

    Thanks in advance,
    Preeti


  • #2
    Drawing a sample randomly should in principle ensure that the sample has characteristics similar to those of your full dataset, provided the sample is sufficiently large.

    A common way to do it is to generate a random variable, sort on that variable, and choose the first x observations. For example,
    Code:
    set seed 2435 // choose a random seed
    gen double u = runiform()
    sort u
    keep if _n <= 1000 // keep the first 1000 observations
    Last edited by Wouter Wakker; 26 Oct 2020, 13:39.



    • #3
      Thanks, Wouter!



      • #4
        Stata has a handy command for that: sample.

        If you wish to draw a random sample without replacement, the command sample will fit your needs. Should you wish to draw a random sample with replacement, just use the command bsample.
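
        A minimal sketch of both commands, not a definitive recipe. It assumes Stata's bundled nlsw88 example dataset as a stand-in for the real data; the seed and the sample sizes are arbitrary choices for illustration.

        ```stata
        * Sketch only: nlsw88 stands in for the real data; seed and sizes are arbitrary.
        sysuse nlsw88, clear
        set seed 2435

        * Without replacement: keep a random 10 percent of the observations
        preserve
        sample 10
        count
        restore

        * Without replacement: keep exactly 500 observations
        preserve
        sample 500, count
        count
        restore

        * With replacement: resample 500 observations
        preserve
        bsample 500
        count
        restore
        ```

        Each draw is wrapped in preserve/restore so that the full dataset is back in memory afterwards; drop those lines if the sample itself should be kept.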
        Best regards,

        Marcos



        • #5
          I think I thanked too soon. I have to draw a smaller random sample (with a certain mean/distribution) from a larger dataset that has a different mean/distribution. For example, say my larger data set has a mean age of 40y and 89% men. I want to draw a smaller sample with mean age 53y and 49% men. How do I do that?



          • #6
            I realized the mistake in my initial query now. Sorry for the inconvenience. I basically want output as described in #5.



            • #7
              Please help.



              • #8
                I think that your request cannot be solved. You can either have a population and use a random sampling procedure to create a subpopulation with given characteristics, or choose by yourself the particular cases that will make up the sample. What I can think of is to delete particular cases in your dataset by hand (for example, if you want a higher mean age in your sample than in the population, you delete the cases with low age) and then use random sampling (but this way you still don't create a sample from the "previous" population).
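
                The delete-then-sample idea could look roughly like the sketch below. It assumes Stata's bundled nlsw88 example dataset in place of the real data, and the age cutoff of 38 and the sample size of 200 are hypothetical values chosen only for illustration.

                ```stata
                * Sketch only: nlsw88 stands in for the real data; the age cutoff and n are hypothetical.
                sysuse nlsw88, clear
                set seed 2435
                summarize age              // mean age in the full data
                drop if age < 38           // remove low-age cases to push the mean up
                sample 200, count          // simple random sample from what remains
                summarize age              // mean age in the restricted sample
                ```

                The result is a random sample of the restricted data only, not of the original population.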
                Last edited by Karel Novak; 30 Oct 2020, 10:18.



                • #9
                  I don't have an answer to this, but I do have some comments.

                  First, what you're looking for at this point is not a random sample anymore. A random sample is about random selection, but doesn't say anything about the outcome. It appears that what you want is a sample with a specified outcome, which random selection by definition cannot ensure, unless you are looking for a sample with similar characteristics, in which case you can conveniently use the law of large numbers.

                  Second, getting a sample with a certain mean might still be possible. I can imagine using sample in a while loop, checking the means of your specified variables and stopping the loop if they are in the right range. This would work best if the means you want are close to the means of the full dataset. It wouldn't work well if you want means far away from the overall means, or if you need a large sample. Note that this still wouldn't be a random sample.

                  Getting a sample with a certain distribution is harder, I imagine. At this point it could be easier to just simulate data with a certain mean and distribution, if that is good enough for your purposes.
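
                  The loop described above might be sketched as follows. This is an illustration under assumptions, not a tested procedure: Stata's bundled nlsw88 dataset stands in for the real data, married stands in for a binary variable such as sex, and the target means, tolerances, sample size, and cap on the number of draws are all hypothetical.

                  ```stata
                  * Sketch only: nlsw88 stands in for the real data; targets and tolerances are hypothetical.
                  sysuse nlsw88, clear
                  set seed 2435

                  local n        = 200    // desired sample size
                  local maxtries = 5000   // give up after this many draws
                  local found    = 0
                  local try      = 0

                  while !`found' & `try' < `maxtries' {
                      local ++try
                      preserve
                      sample `n', count
                      quietly summarize age
                      local m_age = r(mean)
                      quietly summarize married
                      local p_mar = r(mean)
                      if abs(`m_age' - 39.5) < 0.1 & abs(`p_mar' - 0.68) < 0.02 {
                          local found = 1
                          restore, not        // keep the accepted sample in memory
                      }
                      else {
                          restore             // discard this draw and try again
                      }
                  }
                  display "accepted = `found' after `try' draws"
                  ```

                  If no draw falls inside the tolerances before the cap is reached, found stays 0 and the full dataset remains in memory. Tighter tolerances or targets further from the full-data means make that outcome more likely.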
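
                  A minimal simulation sketch, using the targets from #5 (mean age 53, 49% men). The normal distribution for age, the SD of 15, the sample size of 500, and the independence of age and sex are assumptions made for illustration; none of them comes from the thread.

                  ```stata
                  * Sketch only: normality, the SD of 15, and n = 500 are assumptions, not values from the thread.
                  clear
                  set seed 2435
                  set obs 500
                  gen double age = rnormal(53, 15)     // target mean 53 from #5; SD assumed
                  gen byte   men = runiform() < 0.49   // target 49% men from #5
                  summarize age men
                  ```

                  The simulated means match the targets only in expectation, so any single run will deviate somewhat.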

                  In any case, your question is quite a different problem from your question in #1, so my advice would be to start a new topic with another title, which would increase your chances of getting a useful answer.



                  • #10
                    I'd say that the procedure that Preeti wants may not actually be the preferred thing to do in relation to whatever her purpose might be. I'd encourage her to explain the context of her problem, with a particular focus on *why* she wants to do what she describes above, that is, what purpose she wants it to serve.



                    • #11
                      Sharing some background context:

                      I have data collected under a community-based diabetes screening program. The screening was done using a telemedicine-equipped mobile medical van. Patients who were diagnosed with diabetes or were at risk of diabetes complications were referred to a rural diabetic center for follow-up care.

                      The rural diabetic center also caters to many other patients who were not referred by the van.

                      Unfortunately, the patients who were referred to the center were given a new unique ID and there is no way of identifying those screened in the van from the follow-up data recorded in the center.

                      That said, I want to draw a sample population from the follow-up data in a way that the baseline characteristics of the sample (i.e. health profile of patients who visited the center the first time) match the baseline characteristics of those screened.

                      I am doing this to understand the long-term effect of the care provided in the mobile medical van and the diabetes center.

                      I understand that this is not an ideal way; however, due to the paucity of data on similar care delivery models, I don't have an alternative.

                      Please let me know if further information is required. And also, if I should start a new thread.

                      Regards,
                      Preeti

