Creating a subsample with predefined mean values of variables

Marvin Hanisch

Join Date: Oct 2017

Posts: 50
#1

18 Jul 2023, 10:58

Hello StataList community,

I am working with a dataset that contains two groups: "group1" and "group2." My goal is to create a smaller subsample of "group2" observations that have the same mean values of the binary variables "var1" and "var2" as found in "group1."

It's important to note that the mean values of both "var1" and "var2" are higher for "group1" than for "group2" in the full sample. Consequently, a random subsample of "group2" is inappropriate, and I must carefully select "semi-random" observations from group2 to ensure that the means of var1 and var2 in the group2 subsample are equal to (or close to) the variable means of group1.

(As a side note, the size of the group2 subsample should be about 1/100000 of the size of the full "group2" sample.)

The structure of the dataset is as follows:

Code:

clear
input byte(group var1 var2)
1 1 0
1 1 0
1 1 1
1 1 0
1 1 1
1 1 1
1 0 1
1 0 0
1 0 0
1 0 0
2 0 0
2 1 0
2 0 0
2 0 0
2 1 0
2 1 1
2 0 1
2 0 0
2 0 0
2 1 0
end

I would appreciate any guidance on how to achieve this in Stata. If you could provide me with the necessary code or steps, I would be extremely grateful.

Thank you in advance for your assistance!

Best regards,
Marvin

PS: I am aware of the commands splitsample and rsz, but neither of them seems to do the trick.
George Ford

Join Date: Aug 2014

Posts: 3187
#2

18 Jul 2023, 12:52

This sounds bad. There are potentially many subsamples that would satisfy that condition, and I'm not sure what any statistical results would mean once you've done it.

Here's one way, though you lose observations in group1.

Code:

tabstat var1 var2, by(group)
g group1 = group==1
cem var1 var2, tr(group1) k2k
tabstat var1 var2 if cem_matched, by(group) stats(mean N)
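
If losing group1 observations is a concern, a rough alternative is to sample group2 within each var1 x var2 cell in proportion to that cell's share in group1. Matching the cell shares also matches the means of var1 and var2. This is only a sketch: the seed, the target size n_target, and the new variable names (cell, n1, share1, u, pick, insample) are made up for illustration. If a cell has fewer group2 observations than requested, as can happen in the small example data, the match will only be approximate.

Code:

* Sketch only: sample group2 so its var1 x var2 cell shares mirror group1.
* The seed and n_target are hypothetical values chosen for illustration.
set seed 12345
local n_target = 10

* Label each var1 x var2 combination and compute its share within group1
egen cell = group(var1 var2)
bysort cell: egen n1 = total(group == 1)
quietly count if group == 1
generate double share1 = n1 / r(N)

* Put the group2 observations in random order within each cell
generate double u = runiform() if group == 2
bysort cell (u): generate long pick = _n if group == 2

* Keep the first round(share1 * n_target) group2 observations per cell
generate byte insample = group == 2 & pick <= round(share1 * `n_target')

* Compare full group1 with the selected group2 subsample
tabstat var1 var2 if group == 1 | insample, by(group) statistics(mean n)

Rounding the per-cell counts means the realized subsample size can differ slightly from n_target.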
Marvin Hanisch

Join Date: Oct 2017

Posts: 50
#3

19 Jul 2023, 01:54

Thank you, George! That was exactly what I was looking for.
And thanks for the warning - I am aware of the selection issues.