How to randomly select control sample for the treatment sample multiple times and run regressions?

Weiwei Wang

Join Date: Dec 2020

Posts: 3
#1


26 Dec 2020, 22:48

Dear all,
I am using Stata 14.0 and need to randomly select a control sample for my treatment sample multiple times.
My data are as follows:
gvkey: the company ID; fyear: the fiscal year; state: the state in which the company is located;
indicator: a dummy variable equal to one for treatment firms and zero for control firms;
performance: firm performance; interlock: a board interlock measure; plus other control variables.
The dataset has 1597 obs with indicator=1 and 9958 obs with indicator=0.

Now, I split the above dataset into two sub-samples: indicator=1 is the treatment sample (1597 obs); and indicator=0 is the control sample (9958 obs).

For each treatment firm, I need to randomly select one control firm from the same year (fyear) and the same state (state). If every treatment firm can be matched, this gives a random control sample of 1597 obs. I then run a regression in the control sample: reg performance interlock controls. I need to repeat this procedure 500 times (draw a random control sample matched on year and state, then run the regression in that control sample) and report summary statistics across the 500 regressions. Specifically, I need to report the mean, median, and sd of the coefficient and t-value on the independent variable (interlock).

I have thought about this for a long time but have no idea how to proceed. Any advice on how I should go about this would be appreciated. Merry Christmas! Many thanks!
Clyde Schechter

Join Date: Apr 2014

Posts: 30064
#2

27 Dec 2020, 10:23

Please read the Forum FAQ for excellent advice about making the most of your Statalist experience. You have evidently taken care to write a thorough explanation of your project. But you are, in effect, asking the Forum to do this entire project for you. And while your explanation of your data is pretty complete, you have not shown example data for anybody to work with. Also, the problem of creating case-control matched pairs comes up frequently on the Forum. Have you searched to see how others have solved this problem?

I think if you take some steps yourself, and reach a sticking point, and then come back showing example data and the code you tried using, and illustrating how it is failing to produce what you want, you will be more likely to get a timely and helpful response. If and when you do that, be sure to follow the advice in FAQ #12 on using -dataex- to show the example data, and code delimiters to show code/results.
Weiwei Wang

Join Date: Dec 2020

Posts: 3
#3

28 Dec 2020, 07:20

Dear Clyde,

Thank you very much for your suggestions. Following them, I obtained preliminary results by random selection without replacement, using the following code in Stata 14.0:

Code:

forvalue i=1/500{
    use "D:\sample.dta", clear
    tempfile control
    preserve
    drop if indicator==1
    rename performance performance1
    rename Interlock interlock1
    rename size size1
    rename lev lev1
    rename bm bm1
    gen random_digit=runiform()
    sort fyear state random_digit
    bys fyear state: gen random_id=_n
    save `control'
    restore
    tempfile regression`i'
    keep if indicator==1
    gen random_digit=runiform()
    sort fyear state random_digit
    bys fyear state: gen random_id=_n
    merge 1:1 random_id fyear state using `control', keepusing(performance1 interlock1 size1 lev1 bm1)
    keep if _merge==3
    drop _merge
    reg performance interlock1 size1 lev1 bm1
    gen b=_b[interlock1]
    gen se=_se[interlock1]
    gen t=b/se
    duplicates drop b, force
    save regression`i'
}
use regression1, clear
forvalue i=2/500{
    append using regression`i'
}
tabstat b se t, stats(n mean sd min p50 max) c(s) f(%6.2f)

My data can be described as follows:
input str6 gvkey float fyear str2 state byte indicator double(performance Interlock) str4 sic str2 sic2 double(size lev bm)
"020492" 2000 "QC" 1 .9643729289190076 .0222944729030132 "2835" "28" 8.482645223962336 .03654603005406876 .1417826554124976
"065489" 2000 "VA" 1 .9978485251363981 .0306548997759819 "4813" "48" 8.786419674221618 .44036623741599795 .6013082703356336
"028471" 2000 "NM" 1 1 .0274036228656769 "6798" "67" 5.275541197655774 .9519443876512898 1.624266450123726
"004060" 2000 "MI" 1 .9330531189977196 .0190431959927082 "2821" "28" 10.119145048125564 .41425742990980835 .4068745128385739
"117902" 2000 "PA" 1 .8173960070714734 .00905712973326445 "4832" "48" 7.351127738161053 .32150230415348624 .47222299781283456
"004093" 2000 "NC" 1 .40181423236136726 .0174175575375557 "4931" "49" 10.357738856551883 .5905803870980321 .4043825570736392
"062836" 2000 "NC" 1 .9931077740690379 .015327449887991 "2200" "22" 5.1476194690016035 .8816729898358108 1.3151553730202326
"065702" 2000 "VT" 1 .9505424162792266 .00696702254936099 "4953" "49" 5.351823061771014 .6831646088162015 1.0964744735123413
"062005" 2000 "CA" 1 .6371243818942564 .0517882034182549 "7372" "73" 7.512527596609644 .03640516136714749 .0815420147190328
"006335" 2000 "MO" 1 .9900092067472893 .0148629816249013 "4011" "40" 6.377861508123333 .6884973945207545 1.0929769351968641
"012122" 2000 "CA" 1 .4838709677419355 .0336739420890808 "5311" "53" 4.055533045336606 .8338459561372984 2.03709825642561
"005439" 2000 "TX" 1 .8748364816074512 .013934045098722 "1389" "13" 9.647223394529309 .2839133502191689 .25622224016797224
"005087" 2000 "NC" 1 .9634658959404535 .0164886210113764 "3585" "35" 8.095739354595818 .43524658037067965 .19399710200825032
"024747" 2000 "NJ" 1 1 .0232234094291925 "4812" "48" 7.11272540952219 .6157208625064742 -.3034643748975549
"005301" 2000 "NJ" 1 .26059283506092 .0102182999253273 "5411" "54" 5.903217452388136 .8727858669540212 2.1771350264333256
"006403" 2000 "OK" 1 .905607449299442 .0202043652534485 "1311" "13" 8.75218985620343 .44196320160031444 .42011075064913
"061439" 2000 "OR" 1 0 .0285647939890623 "3577" "35" 6.0905997883918825 .25936087913598915 .4060144503026752
"028919" 2000 "OK" 1 .5344488188976482 .0102182999253273 "4922" "49" 6.896125919348718 .5608476327825035 .8299680910689389
"009203" 2000 "WI" 1 .04241396146200357 .0164886210113764 "3620" "36" 8.621710851971475 .40132119986518366 .4808250951426577
"014363" 2000 "MN" 1 .9819834498558281 .0225267075002193 "2870" "28" 7.487811725922331 .6675304430932949 .37813401166790356
"004242" 2000 "TX" 1 .7742062579691339 .0116117047145963 "4922" "49" 9.730097339782795 .5616353531115935 .35085238136556696
"006502" 2000 "OH" 1 1 .0132373431697488 "5411" "54" 9.903899967481392 .43011457094640304 .15438631564479652
"004809" 2000 "GA" 1 1 .0176497902721167 "2050" "20" 7.355288005498907 .40393677822065444 .3211740886159491

Random selection is new to me, and I am afraid there are faults in my code. Could you help me with the following questions? (1) Do I need to randomly assign the id number (random_id) for the treatment group in each of the 500 iterations, or only once? (2) Can the random number (random_digit) be generated within each fyear-state cell? (3) Some treatment firms have no match under the method above; how can I implement the method with replacement? Many thanks.

Attached Files

sample.dta (673.2 KB)
Clyde Schechter

Join Date: Apr 2014

Posts: 30064
#4

28 Dec 2020, 09:39

So there are some problems. First, your approach is unnecessarily complicated: there is no need to randomly sort both the treatment and control group. Just sort the controls. And, to sample with replacement, just -joinby- the controls with all treatment cases that agree on fyear and state. Then generate a random number, sort on random number within case, and keep the first observation. That gives you a set of case control pairs.
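
Since no matchable example data was posted, the following is only a minimal sketch of the -joinby- approach on made-up toy data. The numeric gvkey, the toy values, and the names gvkey_c, performance_c, interlock_c, and shuffle are assumptions for illustration, not taken from the real data set. A case is identified here by gvkey and fyear, on the assumption that a firm can appear in more than one year.

```stata
* Toy data (hypothetical values): 6 treated firms and 12 control firms
clear
set seed 12345
set obs 18
gen long gvkey = 1000 + _n
gen byte indicator = _n <= 6
gen int fyear = 2000 + mod(_n, 2)
gen str2 state = cond(mod(_n, 3) == 0, "CA", "TX")
gen double performance = runiform()
gen double interlock = runiform()

* Controls: rename variables so they do not clash with the cases after -joinby-
preserve
keep if indicator == 0
rename (gvkey performance interlock) (gvkey_c performance_c interlock_c)
drop indicator
tempfile controls
save `controls'
restore

* Cases: pair each one with every control in the same fyear and state
keep if indicator == 1
joinby fyear state using `controls'

* Randomly keep one control per case; a control may serve several cases,
* so this samples with replacement
gen double shuffle = runiform()
bysort gvkey fyear (shuffle): keep if _n == 1
drop shuffle
list gvkey gvkey_c fyear state, noobs
```

By default -joinby- drops cases that have no control in the same fyear and state, so it is worth comparing the number of cases before and after this step.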

Now, you go on to regress performance in the case on the predictor variables from the matched control. Is that what you meant to do? It's unusual, but I don't know what your ultimate goal is here. The more usual situation, and what I understood you to want to do based on #1, is to assign each case-control pair a pair_id, and then reshape the data to long so that cases and controls are separate observations. Then you have to do the regression including i.pair_id as a predictor so you properly account for the matching.
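
The pair_id step can be sketched as follows, again on made-up data. The wide layout with suffix 1 for the case and 0 for the control, the variable names, and the simulated relationship between performance and interlock are all assumptions chosen only so that the example runs.

```stata
* Toy matched pairs in wide form (hypothetical values):
* suffix 1 = case, suffix 0 = control
clear
set seed 2020
set obs 50
gen long pair_id = _n
forvalues g = 0/1 {
    gen double interlock`g' = runiform()
    gen double performance`g' = 0.5*interlock`g' + rnormal(0, 0.1)
}

* Go long: one observation per firm within each pair
reshape long performance interlock, i(pair_id) j(indicator)

* Pair indicators in the model account for the matching
regress performance interlock i.pair_id
display "b = " _b[interlock] "   se = " _se[interlock]
```

With many pairs, something like -areg performance interlock, absorb(pair_id)- should give the same coefficient on interlock without printing one row per pair.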

Finally, storing the coefficient and standard error in the data set and then keeping only one observation, saving 500 separate files and then appending them, is also unnecessarily complicated. You can create a -postfile- (-help postfile- for details) to capture the b and se as you go along into a single file that is built up through the iterations.
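
A minimal sketch of the -postfile- loop, under stated assumptions: the data are made-up toy values, the loop runs 20 replications rather than 500 to keep it quick, and the regression is run in the drawn control sample as described in #1 (swap in whichever model is actually intended). The names handle, results, pairs, and the _c suffix are hypothetical.

```stata
* Toy data (hypothetical values): 40 treated and 160 control firm-years
clear
set seed 500
set obs 200
gen long gvkey = _n
gen byte indicator = _n <= 40
gen int fyear = 2000 + mod(_n, 2)
gen str2 state = cond(mod(_n, 4) < 2, "CA", "TX")
gen double interlock = runiform()
gen double performance = 0.5*interlock + rnormal(0, 0.1)
tempfile full
save `full'

* Build all admissible case-control pairs once; only the draw changes per replication
keep if indicator == 0
rename (gvkey performance interlock) (gvkey_c performance_c interlock_c)
drop indicator
tempfile controls
save `controls'
use `full', clear
keep if indicator == 1
joinby fyear state using `controls'
tempfile pairs
save `pairs'

* Results file: one row per replication
tempname handle
tempfile results
postfile `handle' int rep double(b se t) using `results'

* Use 1/500 in the actual application
forvalues i = 1/20 {
    use `pairs', clear
    gen double shuffle = runiform()
    bysort gvkey fyear (shuffle): keep if _n == 1
    * Regression in the drawn control sample (assumption, following #1)
    regress performance_c interlock_c
    post `handle' (`i') (_b[interlock_c]) (_se[interlock_c]) (_b[interlock_c]/_se[interlock_c])
}
postclose `handle'

* Summary statistics across replications
use `results', clear
tabstat b se t, stats(n mean sd min p50 max) columns(statistics) format(%9.3f)
```

Setting the seed once before the loop makes the whole sequence of draws reproducible, and no per-replication files are left on disk.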

I would be happy to illustrate the details with code, but to do that you will have to show example data that includes both treatment and control observations (your current example has only treatment), and includes some that can potentially match with each other. (It may be that your sample.dta file contains suitable data to work with, but I do not download attachments from people I do not know.)
Weiwei Wang

Join Date: Dec 2020

Posts: 3
#5

28 Dec 2020, 18:22

Hi, Clyde. Thank you for your helpful comments. I will revise my code following your suggestions.