Creating sub-sample according to defined distribution, and how to extract summary statistics tables

FitzGerald Blindman

Join Date: Sep 2023

Posts: 41
#1

10 Sep 2023, 04:00

Hello everyone, I would like to have your help please.

I have data on 600 scholarship recipients: their income and their parents' income. I ran regressions on this sample to study intergenerational mobility (the effect of parents' income on the child's income).
Now I have a much larger dataset with the incomes of hundreds of thousands of people and their parents, but these are "regular people" who did not receive this scholarship. I want to compare the intergenerational mobility estimate between the two samples, but of course for this I need to reduce the larger sample so that parents' income has the same distribution as in the small sample of scholars (in terms of mean, standard deviation, minimum and maximum).
How can I write code that will help me drop observations so that the sample has the characteristics I want?

BTW, is there an easy way to export tables of summary statistics to an Excel/Word/PDF file after using the "summarize" command? After regressions I use "outreg2", but I haven't yet found an equivalent for descriptive statistics. I know it is a silly question that has probably been asked here plenty of times, but I would still like to find out how to do this.

Thank you in advance

FitzGerald
George Ford

Join Date: Aug 2014

Posts: 3152
#2

10 Sep 2023, 08:48

cem (ssc install cem) is an option. You can match on the variables of interest (either 1:1 or by producing weights). The weights are then used in the regression [aw=cem_weights].

Also look at kmatch. Some of its options allow you to create a pscore that you can use in regression. It also allows matching on multiple moments.
FitzGerald Blindman

Join Date: Sep 2023

Posts: 41
#3

11 Sep 2023, 01:38

Originally posted by George Ford

cem (ssc install cem) is an option. You can match on the variables of interest (either 1:1 or by producing weights). The weights are then used in the regression [aw=cem_weights].

Also look at kmatch. Some of its options allow you to create a pscore that you can use in regression. It also allows matching on multiple moments.

Thank you George!
But I'm not sure I understood you. Which file am I supposed to match with? I want a certain variable in one dataset to get a distribution that I define for it (according to the corresponding variable in another dataset, of course).
Anyway, thanks again.
George Ford

Join Date: Aug 2014

Posts: 3152
#4

11 Sep 2023, 09:10

Append the two datasets so all the info is in one dta file. You may have to rename variables if their names differ between the files.

With all the data in one dta, you can run cem or kmatch on the variables of interest. cem works on the mean, but you usually get variance ratios near 1 when you do so.
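
The append step could look something like the sketch below. The file names (scholars.dta, population.dta) and the variable name pinc are placeholders, not names from this thread; treat is the indicator used in the commands that follow.

```stata
* Hypothetical file and variable names; adjust to your own data
use scholars.dta, clear

* If the income variables are named differently in the two files, harmonize
* them first, e.g.:  rename parent_income pinc

* generate() creates a marker: 0 = master (scholars), 1 = appended file
append using population.dta, generate(source)

* Treatment indicator: 1 = scholarship recipient, 0 = regular population
generate byte treat = (source == 0)
tabulate treat
```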

Say treat is the treatment indicator and you have x1 x2 x3.

covbal treat x1 x2 x3 // gives you the standardized differences for the full sample. look for SDiff bigger than 0.25 and variance ratios far from 1.

cem x1 x2 x3, tr(treat)
covbal treat x1 x2 x3 , wt(cem_weights) // check covariate balance in matched sample. in some cases, balancing on a few x is ok, but in others it will create imbalance on the excluded x.
reg y treat x1 x2 x3 [aw=cem_weights]
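
For the question asked in #1, a rough adaptation might be the following. It assumes the appended data hold parents' income in pinc and the child's income in cinc (both hypothetical names) and that treat marks the scholars. The (#20) cutpoint request and the choice to run one regression per group are illustrative assumptions, not part of the recipe above.

```stata
* Assumed variables: pinc (parents' income), cinc (child's income), treat (1 = scholar)

* Coarsen parents' income into 20 equal-width bins and match within bins
cem pinc (#20), treatment(treat)

* Compare the distribution of parents' income across groups in the matched sample
tabstat pinc [aw=cem_weights] if cem_matched, by(treat) statistics(n mean sd min max)

* Mobility regression for each group; controls are weighted to mirror the scholars
regress cinc pinc if treat == 1 & cem_matched
regress cinc pinc [aw=cem_weights] if treat == 0 & cem_matched
```

More bins give a closer match on the distribution but leave more observations unmatched, so the number of cutpoints is worth experimenting with.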

cem x1 x2 x3, tr(treat) k2k
covbal treat x1 x2 x3 if cem_matched // check covariate balance in 1:1 matched sample
reg y treat x1 x2 x3 if cem_matched

You can create multiple matches (e.g., 3:1) by rerunning the k2k cem and excluding any already-matched controls from the additional rounds (you'll have to rename cem_matched after each round to keep track and put it all together at the end).
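
On the side question in #1 about exporting descriptive statistics: two commonly used routes are sketched below using Stata's built-in auto dataset. Both rely on community-contributed packages (outreg2 and estout from SSC), and the output file names are arbitrary.

```stata
* One-time installs:  ssc install outreg2   and   ssc install estout
sysuse auto, clear

* Route 1: outreg2 can write summary statistics instead of regression results
outreg2 using sumstats.doc, replace sum(log) keep(price mpg weight) eqkeep(N mean sd min max)

* Route 2: post the summarize results, then export them with esttab
estpost summarize price mpg weight
esttab using sumstats.rtf, cells("count mean sd min max") noobs nonumber replace
```

Route 1 stays within the outreg2 workflow already used for regressions; Route 2 gives finer control over which statistics appear and in what order.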