
  • Split data into samples that are representative of the distribution of the original sample

    Hi there,

    I have a data set containing several variables. I would like to split the data into 10 groups (preferably roughly equal sized). The data within each group should contain a similar distribution of the variables from the original data set. Is there a way to do this in Stata?

    For example, I am using the Cattaneo data set. I have created a binary outcome (lbw). There are multiple variables, but to keep things simple I have chosen one variable (mbsmoke). The distribution of lbw in the original data set is 6.0% (lbw = 1) and 94.0% (lbw = 0). The distribution of mbsmoke in the original data set is 18.6% smokers and 81.4% non-smokers.

    Code:
    * Load the data
        use "http://www.stata-press.com/data/r14/cattaneo2.dta", clear
    
    * Create the binary outcome and view the distributions
        gen lbw = cond(bweight < 2500, 1, 0)
        lab var lbw "Low birthweight, <2500 g"
    
        tab lbw
        tab mbsmoke
    I have tried using the sample command with the by() option, but I am not sure whether this is the best way to go about it.

    Any help is much appreciated

  • #2
    A representative sample sounds like a good thing, but even a glance at

    Kruskal, W. H., and F. Mosteller. 1979. Representative sampling, I: Non-scientific literature. International Statistical Review 47: 13

    and its sequels underlines how elusive such a sample can be, both in principle and in practice. I doubt that you can easily improve on disjoint random samples obtained by

    Code:
    set seed 314159265
    gen sample = runiformint(1, 10)
    where naturally you should choose any seed you like, but do set one so that your sampling is repeatable.
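
    As a rough sketch only, here is one way to check how closely the ten random groups track the original distributions of lbw and mbsmoke. It also shows a possible stratified variant for when roughly equal group sizes matter. The variable names u and group are hypothetical, and stratifying on lbw and mbsmoke is an assumption made for illustration, not something the data require.

    Code:
    * Group sizes and within-group distributions for the simple random split
    tab sample
    tab sample lbw, row
    tab sample mbsmoke, row
    
    * Possible stratified variant (assumption: strata are lbw by mbsmoke)
    * Shuffle observations within each stratum, then deal them out in turn
    gen double u = runiform()
    bysort lbw mbsmoke (u): gen group = mod(_n - 1, 10) + 1
    
    tab group
    tab group lbw, row
    tab group mbsmoke, row

    With the dealing approach each stratum contributes to every group almost equally, so the group sizes can differ by at most the number of strata (four here). Whether that is worth the extra step over the plain random split is a judgment call.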
