Split data into samples that are representative of the distribution of the original sample

Matthew Smith Stata

Join Date: Feb 2022

Posts: 30
#1


19 Feb 2023, 05:17

Hi there,

I have a data set containing several variables. I would like to split the data into 10 groups, preferably of roughly equal size. Within each group, the variables should have a distribution similar to that in the original data set. Is there a way to do this in Stata?

For example, I am using the Cattaneo data set. I have created a binary outcome (i.e., lbw). There are multiple variables, but to keep things simple I have chosen one variable (i.e., mbsmoke). The distribution of lbw in the original data set is 6.0% (for lbw=1) and 94.0% (for lbw=0). The distribution of mbsmoke in the original data set is 18.6% smokers and 81.4% non-smokers.

Code:

* Load the data
use "http://www.stata-press.com/data/r14/cattaneo2.dta", clear

* View the data and recode variables
gen lbw = cond(bweight<2500,1,0.)
lab var lbw "Low birthweight, <2500 g"
tab mbsmoke

I have tried using the sample command with the by() option, but I am not sure whether this is the best way to go about this.

Any help is much appreciated.
Nick Cox

Join Date: Mar 2014

Posts: 35775
#2

19 Feb 2023, 05:40

A representative sample sounds a good thing, but even a glance at

Kruskal, W. H., and F. Mosteller. 1979. Representative Sampling, I: Non-Scientific Literature. International Statistical Review 47: 13.

and its sequels underlines how elusive such a sample can be, both in principle and in practice. I doubt that you can easily improve on disjoint random samples obtained by

Code:

set seed 314159265
gen sample = runiformint(1, 10)

where naturally you should choose any seed you like, but do set one choice to make your sampling repeatable.
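To see how closely such random groups track the overall distribution, a cross-tabulation with row percentages is one simple check. The sketch below repeats the setup from #1 and the assignment above, then tabulates each group against lbw and mbsmoke. The last part is an assumed alternative, not something proposed in this thread: it deals observations out in random order within each lbw by mbsmoke cell, which should give group sizes that differ by at most the number of cells. The names u and group are made up for illustration.

```stata
* Load the data and build the outcome as in #1
use "http://www.stata-press.com/data/r14/cattaneo2.dta", clear
gen lbw = cond(bweight < 2500, 1, 0)

* Simple random assignment to 10 groups (sizes only roughly equal)
set seed 314159265
gen sample = runiformint(1, 10)

* Compare each group's row percentages with the overall distribution
tab sample lbw, row
tab sample mbsmoke, row

* Assumed alternative (hypothetical names u and group):
* put each lbw x mbsmoke cell in random order, then deal its
* observations out to groups 1, 2, ..., 10, 1, 2, ...
gen double u = runiform()
bysort lbw mbsmoke (u): gen byte group = mod(_n - 1, 10) + 1

tab group lbw, row
tab group mbsmoke, row
```

With only two binary variables the cells are large, so either approach should give similar groups. With many variables or small cells, dealing within cells becomes impractical and plain random assignment is the easier route.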