Random sample based on year

Sebastian Grein

Join Date: Oct 2023
Posts: 28

Random sample based on year

09 Nov 2023, 02:10

Hi, I want to draw a random sample out of my existing data. In my sample are observations which are deals (deal =1) and observations which are peer companies (deal = 0). I want to to have 5 peer companies per acquiring company of the deal (cusip_a as company identifier). The observations drawn from the overall sample with deal = 0 should only match on the Year. I added the code I'm currently using, but I do not know how to include that the year should be the same.

Code:

quietly count if deal == 1
local num_deals = r(N)*5
display `num_deals'

sample `num_deals' if deal == 0, count

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str6 cusip_a float(Year deal)
"30231G" 2020 0
"084670" 2022 0
"03743Q" 2015 0
"165167" 2015 0
"02376R" 2020 0
"097023" 2020 0
"910047" 2020 0
"165167" 2009 0
"185899" 2014 0
"674599" 2015 0
"247361" 2020 0
"345370" 2006 0
"032511" 2015 0
"76954A" 2022 0
"26875P" 2015 0
"69047Q" 2015 0
"166764" 2016 0
"02376R" 2021 0
"20825C" 2015 0
"42809H" 2017 0
"69047Q" 2020 0
"925550" 2001 0
"53601P" 2015 0
"13057Q" 2015 0
"166764" 2020 0
"24790A" 2015 0
"651290" 2015 0
"844741" 2020 0
"674215" 2020 0
"20825C" 2016 0
"69331C" 2000 0
"549463" 2002 0
"910047" 2021 0
"03743Q" 2020 0
"69047Q" 2012 0
"966387" 2020 0
"25179M" 2009 0
"345370" 2008 0
"268785" 2015 0
"V7780T" 2021 0
"25179M" 2008 0
"92343E" 2001 1
"92343E" 2001 1
"165167" 2016 0
"651639" 2013 0
"166764" 2015 0
"42809H" 2016 0
"92343E" 2000 1
"626717" 2015 0
"V7780T" 2020 0
"903914" 2015 0
"90353T" 2021 0
"20825C" 2020 0
"845467" 2020 0
"25179M" 2020 0
"07134L" 2015 0
"03940R" 2015 0
"152312" 2007 0
"94770V" 2000 0
"565849" 2015 0
"36399D" 2008 0
"03743Q" 2019 0
"526057" 2007 0
"94770V" 2001 0
"02376R" 2002 0
"549498" 2022 0
"G66721" 2021 0
"69367F" 2001 0
"903914" 2012 0
"247361" 2021 0
"23331A" 2008 0
"629377" 2015 0
"42809H" 2020 0
"516806" 2015 0
"35671D" 2016 0
"075896" 2022 0
"42809H" 2015 0
"451100" 2008 0
"858098" 2001 0
"18538Q" 2011 0
"18538Q" 2010 0
"37611X" 2022 0
"858098" 2000 0
"552953" 2020 0
"812350" 2017 0
"477143" 2020 0
"75281A" 2019 0
"03840P" 2002 0
"559663" 2020 0
"723787" 2015 0
"745867" 2007 0
"812350" 2016 0
"G66721" 2020 0
"02376R" 2001 0
"69047Q" 2016 0
"925550" 2002 0
"451100" 2020 0
"69367F" 2000 0
"13057Q" 2020 0
"20605P" 2016 0
end

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 30150
#2

09 Nov 2023, 11:07

Code:

preserve keep if deal == 0 drop deal rename cusip_a cusip_ctrl tempfile controls save `controls' restore keep if deal == 1 joinby Year using `controls' set seed 1234 gen double shuffle = runiform() by cusip_a (shuffle), sort: keep if _n <= 5 drop shuffle

Added: Note that in your example data, there are deals for which there simply don't exist enough non-deals of the same year to come up with 5 matches. Presumably this problem does not arise in the real data.
Comment
Sebastian Grein

Join Date: Oct 2023

Posts: 28
#3

11 Nov 2023, 03:39

Hi, many thanks. Is there maybe a way without using that much memory? I have a dataset of about 60,000 observations. The joinby command is now running already for 15 min and nothing happened yet.
Comment
Joseph Coveney

Join Date: Apr 2014

Posts: 4436
#4

11 Nov 2023, 06:53

Originally posted by Sebastian Grein View Post

Is there maybe a way without using that much memory? I have a dataset of about 60,000 observations. The joinby command is now running already for 15 min and nothing happened yet.

Maybe try something like this:

Code:

assert inlist(deal, 0, 1) preserve quietly keep if deal drop deal quietly expand 5 bysort year (cuspid): generate int row = _n rename cuspid cuspid_deal tempfile deals quietly save `deals' restore quietly keep if !deal drop deal generate double randu = runiform() quietly isid year randu, sort by year: generate int row = _n rename cuspid cuspid_nodeal merge 1:1 year row using `deals', assert(match master) keep(match) /// nogenerate noreport

It's a modifcation of Clyde's suggestion that avoids joinby. On my machine with 60 000 observations with the marginal deal prevalence (3%) that your little snippet of data shows, it takes a fifth of a second start to finish. See the output displayed below below. (Begin at the "Begin here" comment; the top part is to create an artificial dataset for illustration.)

As Clyde mentions, this assumes that in each year there are at least five no-deal firms for every deal firm. (If that's not true, then omit the assert(match master) in the merge.)

For convenience, mine also assumes that the CUSIP is unique in the dataset, but in practice it would have to be unique only within each year. (But if it isn't unique overall, wouldn't that affect the interpretability of your result?)

.ÿ
.ÿversionÿ18.0

.ÿ
.ÿclearÿ*

.ÿ
.ÿ//ÿseedem
.ÿsetÿseedÿ123830903

.ÿ
.ÿlocalÿline_sizeÿ`c(linesize)'

.ÿsetÿlinesizeÿ80

.ÿ
.ÿmata:
-------------------------------------------------ÿmataÿ(typeÿendÿtoÿexit)ÿ------
:ÿmataÿsetÿmatastrictÿon

:ÿ
:ÿrealÿmatrixÿfunctionÿncu(realÿscalarÿn)ÿ{
>ÿ
>ÿÿÿÿÿrealÿmatrixÿMiddle,ÿEnds
>ÿÿÿÿÿMiddleÿ=ÿruniformint(n,ÿ4,ÿ(48),ÿ(57))
>ÿÿÿÿÿEndsÿ=ÿruniformint(floor(nÿ*ÿ0.48),ÿ2,ÿ(48),ÿ(57))ÿ\
>ÿÿÿÿÿÿÿÿÿÿÿÿÿ(runiformint(floor(nÿ*ÿ0.48),ÿ1,ÿ(48),ÿ(57)),ÿ
>ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿruniformint(floor(nÿ*ÿ0.48),ÿ1,ÿ(65),ÿ(90)))ÿ\
>ÿÿÿÿÿÿÿÿÿÿÿÿÿ(runiformint(floor(nÿ*ÿ0.02),ÿ1,ÿ(65),ÿ(90)),ÿ
>ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿruniformint(floor(nÿ*ÿ0.02),ÿ1,ÿ(48),ÿ(57)))
>ÿÿÿÿÿEndsÿ=ÿEndsÿ\ÿruniformint(nÿ-ÿrows(Ends),ÿ2,ÿ(65),ÿ(90))
>ÿÿÿÿÿreturn(Ends[.,ÿ1],ÿMiddle,ÿEnds[.,ÿ2])
>ÿ}

:ÿ
:ÿstringÿcolvectorÿfunctionÿcusip(realÿscalarÿn)ÿ{
>ÿÿÿÿÿ
>ÿÿÿÿÿrealÿmatrixÿCodePoints
>ÿÿÿÿÿCodePointsÿ=ÿncu(n)
>ÿ
>ÿÿÿÿÿstringÿcolvectorÿcusip
>ÿÿÿÿÿcusipÿ=ÿJ(n,ÿ1,ÿ"")
>ÿÿÿÿÿ
>ÿÿÿÿÿrealÿscalarÿrow
>ÿÿÿÿÿforÿ(row=1;row<=n;row++)ÿ{
>ÿÿÿÿÿÿÿÿÿcusip[row,ÿ1]ÿ=ÿchar(CodePoints[row,ÿ.])
>ÿÿÿÿÿ}
>ÿÿÿÿÿreturn(cusip)
>ÿ}

:ÿ
:ÿvoidÿfunctionÿup(realÿscalarÿn)ÿ{
>ÿ
>ÿÿÿÿÿst_addobs(max((0,ÿnÿ-ÿst_nobs())))
>ÿ
>ÿÿÿÿÿrealÿscalarÿidx
>ÿÿÿÿÿidxÿ=ÿst_addvar("str6",ÿ"cusip")
>ÿ
>ÿÿÿÿÿst_sstore(.,ÿidx,ÿcusip(n))
>ÿ}

:ÿ
:ÿend
--------------------------------------------------------------------------------

.ÿ
.ÿsetÿlinesizeÿ`line_size'

.ÿ
.ÿmata:ÿup(100000)

.ÿcontractÿcusip,ÿfreq(row)

.ÿquietlyÿkeepÿinÿ1/60000

.ÿquietlyÿreplaceÿrowÿ=ÿ_n

.ÿ
.ÿframeÿcreateÿYears

.ÿframeÿYearsÿ{
.ÿÿÿÿÿquietlyÿsetÿobsÿ60000
.ÿÿÿÿÿgenerateÿintÿyearÿ=ÿ2000ÿ+ÿmod(_n,ÿ23)
.ÿÿÿÿÿgenerateÿlongÿrowÿ=ÿ_n
.ÿ}

.ÿquietlyÿfrlinkÿ1:1ÿrow,ÿframe(Years)

.ÿquietlyÿfrgetÿyear,ÿfrom(Years)

.ÿframeÿdropÿYears

.ÿdropÿrowÿYears

.ÿ
.ÿgenerateÿbyteÿdealÿ=ÿrbinomial(1,ÿ0.03)

.ÿ
.ÿ*
.ÿ*ÿBeginÿhere
.ÿ*
.ÿtimerÿclearÿ1

.ÿtimerÿonÿ1

.ÿcount
ÿÿ60,000

.ÿassertÿinlist(deal,ÿ0,ÿ1)

.ÿ
.ÿpreserve

.ÿquietlyÿkeepÿifÿdeal

.ÿdropÿdeal

.ÿquietlyÿexpandÿ5

.ÿbysortÿyearÿ(cusip):ÿgenerateÿintÿrowÿ=ÿ_n

.ÿrenameÿcusipÿcusip_deal

.ÿtempfileÿdeals

.ÿquietlyÿsaveÿ`deals'

.ÿ
.ÿrestore

.ÿquietlyÿkeepÿifÿ!deal

.ÿdropÿdeal

.ÿgenerateÿdoubleÿranduÿ=ÿruniform()

.ÿquietlyÿisidÿyearÿrandu,ÿsort

.ÿbyÿyear:ÿgenerateÿintÿrowÿ=ÿ_n

.ÿrenameÿcusipÿcusip_nodeal

.ÿmergeÿ1:1ÿyearÿrowÿusingÿ`deals',ÿassert(matchÿmaster)ÿkeep(match)ÿ///
>ÿÿÿÿÿnogenerateÿnoreport

.ÿtimerÿoffÿ1

.ÿ
.ÿtimerÿlist
ÿÿÿ1:ÿÿÿÿÿÿ0.21ÿ/ÿÿÿÿÿÿÿÿ1ÿ=ÿÿÿÿÿÿÿ0.2120

.ÿ
.ÿsortÿyearÿrow

.ÿlistÿyearÿcusip_dealÿcusip_nodealÿ///
>ÿÿÿÿÿifÿinrange(_n,ÿ1,ÿ10)ÿ|ÿinrange(_n,ÿ_N-9,ÿ_N),ÿ///
>ÿÿÿÿÿÿÿÿÿnoobsÿsepby(cusip_deal)ÿabbreviate(20)

ÿÿ+----------------------------------+
ÿÿ|ÿyearÿÿÿcusip_dealÿÿÿcusip_nodealÿ|
ÿÿ|----------------------------------|
ÿÿ|ÿ2000ÿÿÿÿÿÿÿ01908Wÿÿÿÿÿÿÿÿÿ43346Eÿ|
ÿÿ|ÿ2000ÿÿÿÿÿÿÿ01908Wÿÿÿÿÿÿÿÿÿ477104ÿ|
ÿÿ|ÿ2000ÿÿÿÿÿÿÿ01908Wÿÿÿÿÿÿÿÿÿ063706ÿ|
ÿÿ|ÿ2000ÿÿÿÿÿÿÿ01908Wÿÿÿÿÿÿÿÿÿ23025Mÿ|
ÿÿ|ÿ2000ÿÿÿÿÿÿÿ01908Wÿÿÿÿÿÿÿÿÿ21328Nÿ|
ÿÿ|----------------------------------|
ÿÿ|ÿ2000ÿÿÿÿÿÿÿ02399Aÿÿÿÿÿÿÿÿÿ35268Lÿ|
ÿÿ|ÿ2000ÿÿÿÿÿÿÿ02399Aÿÿÿÿÿÿÿÿÿ457659ÿ|
ÿÿ|ÿ2000ÿÿÿÿÿÿÿ02399Aÿÿÿÿÿÿÿÿÿ16571Rÿ|
ÿÿ|ÿ2000ÿÿÿÿÿÿÿ02399Aÿÿÿÿÿÿÿÿÿ18284Jÿ|
ÿÿ|ÿ2000ÿÿÿÿÿÿÿ02399Aÿÿÿÿÿÿÿÿÿ381073ÿ|
ÿÿ|----------------------------------|
ÿÿ|ÿ2022ÿÿÿÿÿÿÿ626835ÿÿÿÿÿÿÿÿÿ063288ÿ|
ÿÿ|ÿ2022ÿÿÿÿÿÿÿ626835ÿÿÿÿÿÿÿÿÿ168286ÿ|
ÿÿ|ÿ2022ÿÿÿÿÿÿÿ626835ÿÿÿÿÿÿÿÿÿ18033Jÿ|
ÿÿ|ÿ2022ÿÿÿÿÿÿÿ626835ÿÿÿÿÿÿÿÿÿ54974Oÿ|
ÿÿ|ÿ2022ÿÿÿÿÿÿÿ626835ÿÿÿÿÿÿÿÿÿ54123Uÿ|
ÿÿ|----------------------------------|
ÿÿ|ÿ2022ÿÿÿÿÿÿÿ62877Oÿÿÿÿÿÿÿÿÿ366169ÿ|
ÿÿ|ÿ2022ÿÿÿÿÿÿÿ62877Oÿÿÿÿÿÿÿÿÿ210965ÿ|
ÿÿ|ÿ2022ÿÿÿÿÿÿÿ62877Oÿÿÿÿÿÿÿÿÿ46907Vÿ|
ÿÿ|ÿ2022ÿÿÿÿÿÿÿ62877Oÿÿÿÿÿÿÿÿÿ102371ÿ|
ÿÿ|ÿ2022ÿÿÿÿÿÿÿ62877Oÿÿÿÿÿÿÿÿÿ390381ÿ|
ÿÿ+----------------------------------+

.ÿ
.ÿexit

endÿofÿdo-file

.
1 like
Comment
Sebastian Grein

Join Date: Oct 2023

Posts: 28
#5

17 Nov 2023, 01:12

Clyde Schechter do you maybe have another solution in your mind.

#4: Unfortunately, I don't think that fits to what I need. First, I need to keep variable deal for further calculations. Second, even when using your command and keeping variable deal, it results in observations which are only peer companies (deal = 0).
Comment
Joseph Coveney

Join Date: Apr 2014

Posts: 4436
#6

17 Nov 2023, 02:59

Originally posted by Sebastian Grein View Post

#4: Unfortunately, I don't think that fits to what I need. First, I need to keep variable deal for further calculations.

After a standard merge, the variable deal will be superfluous, uninformative (always zero), because that is the value of deal for the master dataset, which absent the update replace options to the merge command will always take precedence. And if you use those options, then the variable deal will still be superfluous, uninformative (always one).

Refer to the help file for merge for further information.

Second, even when using your command and keeping variable deal, it results in observations which are only peer companies (deal = 0).

No it doesn't—it results in observations that contain deal companies (center column) matched against peer companies (rightmost column). See above and read the help file for merge linked to there.

After the merge, regardless of whether Clyde's or mine, you'll have a column of deal-makers and a column of peers. Either way, the deal variable is redundant. I recommend thinking your further calculations through a bit more as to the need for the deal variable.
Comment
Sebastian Grein

Join Date: Oct 2023

Posts: 28
#7

17 Nov 2023, 09:41

I know how the merge command is working. That is why merging the data is potentially not the right option to address my problem. There needs to be a deal observation (deal = 1) and 5 peer companies (deal = 0) respectively, resulting in 6 observations. The variable deal is necessary to distinguish between deals and peer companies and to investigate differences in further investigation.

Okay, I haven't noticed that since I expected a different output. I have additional information per observation like financial data about each deal company and peer company. Therefore, the output of having a year and the deal and peer company in the same line is not helpful for me.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30150
#8

17 Nov 2023, 10:52

Therefore, the output of having a year and the deal and peer company in the same line is not helpful for me.

So all you need to do is -reshape- the result. So after the code I originally proposed:

Code:

rename cusip_a cusip1 rename cusip_ctrl cusip0 drop deal egen `c(obs_t)' tuple = group(cusip1 Year) gen `c(obs_t)' obs_no = _n reshape long cusip, i(obs_no) j(deal) drop obs_no duplicates drop sort tuple cusip Year deal

will get you separate observations for acquirers and their matched controls, along with a deal variable, and also another variable that links all observations that are part of the same tuple.
1 like
Comment
Joseph Coveney

Join Date: Apr 2014

Posts: 4436
#9

17 Nov 2023, 21:11

Originally posted by Sebastian Grein View Post

I know how the merge command is working.
. . . I have additional information per observation like financial data about each deal company and peer company.

OK, sorry—it dawned on me afterward that what you need is what Clyde has posted since.
Comment
Sebastian Grein

Join Date: Oct 2023

Posts: 28
#10

18 Nov 2023, 03:37

Unfortunately, I think only reshaping won't work as well since each observation contains variables with financial information (e.g. total assets). This information would be lost when merging
Comment
Joseph Coveney

Join Date: Apr 2014

Posts: 4436
#11

18 Nov 2023, 04:08

Originally posted by Sebastian Grein View Post

. . . each observation contains variables with financial information (e.g. total assets). This information would be lost when merging

True, but you can then merge afterward back on the dataset that contains that information. Use the firm's CUSIP and year as the linking variables.
1 like
Comment
Joseph Coveney

Join Date: Apr 2014

Posts: 4436
#12

18 Nov 2023, 20:21

Actually, you can avoid the loss of information altogether by adding a suffix to the variables to distinguish the deal-makers' financial information from the peers'. See below for how you can keep the financial information together with the CUSIPs throughout the data-management process, including the merge step.

You don't show any financial variables above and so I use the auto dataset for illustration (foreign serves as the surrogate for your deal variable), but both the principle and method are identical to what you would use. I identify the method steps in comments in the code section below and in the results posted below the code section I show that the method preserves the ancillary information.

Code:

version 18.0 clear * // seedem set seed 939544781 quietly sysuse auto // deal == 1 preserve quietly keep if foreign drop foreign quietly ds local reshape_varlist `r(varlist)' rename (`reshape_varlist') =1 // <= here generate byte pair = _n tempfile foreign quietly save `foreign' // deal == 0 restore quietly drop if foreign drop foreign rename (`reshape_varlist') =0 // <= and here generate double randu = runiform() quietly isid randu, sort generate byte pair = _n drop randu // merge deal-makers against peers, preserving ancillary information merge 1:1 pair using `foreign', assert(match master) keep(match) nogenerate noreport quietly reshape long `reshape_varlist', i(pair) j(foreign) // <= finally list pair-trunk in 1/4, noobs abbreviate(20) sepby(pair) list pair-foreign weight-gear_ratio in 1/4, noobs abbreviate(20) sepby(pair) exit

.ÿlistÿpair-trunkÿinÿ1/4,ÿnoobsÿabbreviate(20)ÿsepby(pair)

ÿÿ+-----------------------------------------------------------------------------+
ÿÿ|ÿpairÿÿÿforeignÿÿÿmakeÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿpriceÿÿÿmpgÿÿÿrep78ÿÿÿheadroomÿÿÿtrunkÿ|
ÿÿ|-----------------------------------------------------------------------------|
ÿÿ|ÿÿÿÿ1ÿÿÿÿÿÿÿÿÿ0ÿÿÿPlym.ÿHorizonÿÿÿÿÿÿÿ4,482ÿÿÿÿ25ÿÿÿÿÿÿÿ3ÿÿÿÿÿÿÿÿ4.0ÿÿÿÿÿÿ17ÿ|
ÿÿ|ÿÿÿÿ1ÿÿÿÿÿÿÿÿÿ1ÿÿÿAudiÿ5000ÿÿÿÿÿÿÿÿÿÿÿ9,690ÿÿÿÿ17ÿÿÿÿÿÿÿ5ÿÿÿÿÿÿÿÿ3.0ÿÿÿÿÿÿ15ÿ|
ÿÿ|-----------------------------------------------------------------------------|
ÿÿ|ÿÿÿÿ2ÿÿÿÿÿÿÿÿÿ0ÿÿÿLinc.ÿVersaillesÿÿÿ13,466ÿÿÿÿ14ÿÿÿÿÿÿÿ3ÿÿÿÿÿÿÿÿ3.5ÿÿÿÿÿÿ15ÿ|
ÿÿ|ÿÿÿÿ2ÿÿÿÿÿÿÿÿÿ1ÿÿÿAudiÿFoxÿÿÿÿÿÿÿÿÿÿÿÿ6,295ÿÿÿÿ23ÿÿÿÿÿÿÿ3ÿÿÿÿÿÿÿÿ2.5ÿÿÿÿÿÿ11ÿ|
ÿÿ+-----------------------------------------------------------------------------+

.ÿlistÿpair-foreignÿweight-gear_ratioÿinÿ1/4,ÿnoobsÿabbreviate(20)ÿsepby(pair)

ÿÿ+---------------------------------------------------------------------+
ÿÿ|ÿpairÿÿÿforeignÿÿÿweightÿÿÿlengthÿÿÿturnÿÿÿdisplacementÿÿÿgear_ratioÿ|
ÿÿ|---------------------------------------------------------------------|
ÿÿ|ÿÿÿÿ1ÿÿÿÿÿÿÿÿÿ0ÿÿÿÿ2,200ÿÿÿÿÿÿ165ÿÿÿÿÿ36ÿÿÿÿÿÿÿÿÿÿÿÿ105ÿÿÿÿÿÿÿÿÿ3.37ÿ|
ÿÿ|ÿÿÿÿ1ÿÿÿÿÿÿÿÿÿ1ÿÿÿÿ2,830ÿÿÿÿÿÿ189ÿÿÿÿÿ37ÿÿÿÿÿÿÿÿÿÿÿÿ131ÿÿÿÿÿÿÿÿÿ3.20ÿ|
ÿÿ|---------------------------------------------------------------------|
ÿÿ|ÿÿÿÿ2ÿÿÿÿÿÿÿÿÿ0ÿÿÿÿ3,830ÿÿÿÿÿÿ201ÿÿÿÿÿ41ÿÿÿÿÿÿÿÿÿÿÿÿ302ÿÿÿÿÿÿÿÿÿ2.47ÿ|
ÿÿ|ÿÿÿÿ2ÿÿÿÿÿÿÿÿÿ1ÿÿÿÿ2,070ÿÿÿÿÿÿ174ÿÿÿÿÿ36ÿÿÿÿÿÿÿÿÿÿÿÿÿ97ÿÿÿÿÿÿÿÿÿ3.70ÿ|
ÿÿ+---------------------------------------------------------------------+

.ÿ
.ÿexit

endÿofÿdo-file

.

The ancillary information is listed in two batches in order to accommodate line length.
1 like
Comment

Announcement

Random sample based on year

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment