Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Random sample based on year

    Hi, I want to draw a random sample out of my existing data. In my sample are observations which are deals (deal =1) and observations which are peer companies (deal = 0). I want to to have 5 peer companies per acquiring company of the deal (cusip_a as company identifier). The observations drawn from the overall sample with deal = 0 should only match on the Year. I added the code I'm currently using, but I do not know how to include that the year should be the same.

    Code:
    quietly count if deal == 1
    local num_deals = r(N)*5
    display `num_deals'
    
    sample `num_deals' if deal == 0, count
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str6 cusip_a float(Year deal)
    "30231G" 2020 0
    "084670" 2022 0
    "03743Q" 2015 0
    "165167" 2015 0
    "02376R" 2020 0
    "097023" 2020 0
    "910047" 2020 0
    "165167" 2009 0
    "185899" 2014 0
    "674599" 2015 0
    "247361" 2020 0
    "345370" 2006 0
    "032511" 2015 0
    "76954A" 2022 0
    "26875P" 2015 0
    "69047Q" 2015 0
    "166764" 2016 0
    "02376R" 2021 0
    "20825C" 2015 0
    "42809H" 2017 0
    "69047Q" 2020 0
    "925550" 2001 0
    "53601P" 2015 0
    "13057Q" 2015 0
    "166764" 2020 0
    "24790A" 2015 0
    "651290" 2015 0
    "844741" 2020 0
    "674215" 2020 0
    "20825C" 2016 0
    "69331C" 2000 0
    "549463" 2002 0
    "910047" 2021 0
    "03743Q" 2020 0
    "69047Q" 2012 0
    "966387" 2020 0
    "25179M" 2009 0
    "345370" 2008 0
    "268785" 2015 0
    "V7780T" 2021 0
    "25179M" 2008 0
    "92343E" 2001 1
    "92343E" 2001 1
    "165167" 2016 0
    "651639" 2013 0
    "166764" 2015 0
    "42809H" 2016 0
    "92343E" 2000 1
    "626717" 2015 0
    "V7780T" 2020 0
    "903914" 2015 0
    "90353T" 2021 0
    "20825C" 2020 0
    "845467" 2020 0
    "25179M" 2020 0
    "07134L" 2015 0
    "03940R" 2015 0
    "152312" 2007 0
    "94770V" 2000 0
    "565849" 2015 0
    "36399D" 2008 0
    "03743Q" 2019 0
    "526057" 2007 0
    "94770V" 2001 0
    "02376R" 2002 0
    "549498" 2022 0
    "G66721" 2021 0
    "69367F" 2001 0
    "903914" 2012 0
    "247361" 2021 0
    "23331A" 2008 0
    "629377" 2015 0
    "42809H" 2020 0
    "516806" 2015 0
    "35671D" 2016 0
    "075896" 2022 0
    "42809H" 2015 0
    "451100" 2008 0
    "858098" 2001 0
    "18538Q" 2011 0
    "18538Q" 2010 0
    "37611X" 2022 0
    "858098" 2000 0
    "552953" 2020 0
    "812350" 2017 0
    "477143" 2020 0
    "75281A" 2019 0
    "03840P" 2002 0
    "559663" 2020 0
    "723787" 2015 0
    "745867" 2007 0
    "812350" 2016 0
    "G66721" 2020 0
    "02376R" 2001 0
    "69047Q" 2016 0
    "925550" 2002 0
    "451100" 2020 0
    "69367F" 2000 0
    "13057Q" 2020 0
    "20605P" 2016 0
    end

  • #2
    Code:
    preserve
    keep if deal == 0
    drop deal
    rename cusip_a cusip_ctrl
    tempfile controls
    save `controls'
    
    restore
    keep if deal == 1
    joinby Year using `controls'
    set seed 1234
    gen double shuffle = runiform()
    by cusip_a (shuffle), sort: keep if _n <= 5
    drop shuffle
    Added: Note that in your example data, there are deals for which there simply don't exist enough non-deals of the same year to come up with 5 matches. Presumably this problem does not arise in the real data.

    Comment


    • #3
      Hi, many thanks. Is there maybe a way without using that much memory? I have a dataset of about 60,000 observations. The joinby command is now running already for 15 min and nothing happened yet.

      Comment


      • #4
        Originally posted by Sebastian Grein View Post
        Is there maybe a way without using that much memory? I have a dataset of about 60,000 observations. The joinby command is now running already for 15 min and nothing happened yet.
        Maybe try something like this:
        Code:
        assert inlist(deal, 0, 1)
        
        preserve
        quietly keep if deal
        drop deal
        quietly expand 5
        bysort year (cuspid): generate int row = _n
        rename cuspid cuspid_deal
        tempfile deals
        quietly save `deals'
        
        restore
        quietly keep if !deal
        drop deal
        generate double randu = runiform()
        quietly isid year randu, sort
        by year: generate int row = _n
        rename cuspid cuspid_nodeal
        merge 1:1 year row using `deals', assert(match master) keep(match) ///
            nogenerate noreport
        It's a modifcation of Clyde's suggestion that avoids joinby. On my machine with 60 000 observations with the marginal deal prevalence (3%) that your little snippet of data shows, it takes a fifth of a second start to finish. See the output displayed below below. (Begin at the "Begin here" comment; the top part is to create an artificial dataset for illustration.)

        As Clyde mentions, this assumes that in each year there are at least five no-deal firms for every deal firm. (If that's not true, then omit the assert(match master) in the merge.)

        For convenience, mine also assumes that the CUSIP is unique in the dataset, but in practice it would have to be unique only within each year. (But if it isn't unique overall, wouldn't that affect the interpretability of your result?)

        .ÿ
        .ÿversionÿ18.0

        .ÿ
        .ÿclearÿ*

        .ÿ
        .ÿ//ÿseedem
        .ÿsetÿseedÿ123830903

        .ÿ
        .ÿlocalÿline_sizeÿ`c(linesize)'

        .ÿsetÿlinesizeÿ80

        .ÿ
        .ÿmata:
        -------------------------------------------------ÿmataÿ(typeÿendÿtoÿexit)ÿ------
        :ÿmataÿsetÿmatastrictÿon

        :ÿ
        :ÿrealÿmatrixÿfunctionÿncu(realÿscalarÿn)ÿ{
        >ÿ
        >ÿÿÿÿÿrealÿmatrixÿMiddle,ÿEnds
        >ÿÿÿÿÿMiddleÿ=ÿruniformint(n,ÿ4,ÿ(48),ÿ(57))
        >ÿÿÿÿÿEndsÿ=ÿruniformint(floor(nÿ*ÿ0.48),ÿ2,ÿ(48),ÿ(57))ÿ\
        >ÿÿÿÿÿÿÿÿÿÿÿÿÿ(runiformint(floor(nÿ*ÿ0.48),ÿ1,ÿ(48),ÿ(57)),ÿ
        >ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿruniformint(floor(nÿ*ÿ0.48),ÿ1,ÿ(65),ÿ(90)))ÿ\
        >ÿÿÿÿÿÿÿÿÿÿÿÿÿ(runiformint(floor(nÿ*ÿ0.02),ÿ1,ÿ(65),ÿ(90)),ÿ
        >ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿruniformint(floor(nÿ*ÿ0.02),ÿ1,ÿ(48),ÿ(57)))
        >ÿÿÿÿÿEndsÿ=ÿEndsÿ\ÿruniformint(nÿ-ÿrows(Ends),ÿ2,ÿ(65),ÿ(90))
        >ÿÿÿÿÿreturn(Ends[.,ÿ1],ÿMiddle,ÿEnds[.,ÿ2])
        >ÿ}

        :ÿ
        :ÿstringÿcolvectorÿfunctionÿcusip(realÿscalarÿn)ÿ{
        >ÿÿÿÿÿ
        >ÿÿÿÿÿrealÿmatrixÿCodePoints
        >ÿÿÿÿÿCodePointsÿ=ÿncu(n)
        >ÿ
        >ÿÿÿÿÿstringÿcolvectorÿcusip
        >ÿÿÿÿÿcusipÿ=ÿJ(n,ÿ1,ÿ"")
        >ÿÿÿÿÿ
        >ÿÿÿÿÿrealÿscalarÿrow
        >ÿÿÿÿÿforÿ(row=1;row<=n;row++)ÿ{
        >ÿÿÿÿÿÿÿÿÿcusip[row,ÿ1]ÿ=ÿchar(CodePoints[row,ÿ.])
        >ÿÿÿÿÿ}
        >ÿÿÿÿÿreturn(cusip)
        >ÿ}

        :ÿ
        :ÿvoidÿfunctionÿup(realÿscalarÿn)ÿ{
        >ÿ
        >ÿÿÿÿÿst_addobs(max((0,ÿnÿ-ÿst_nobs())))
        >ÿ
        >ÿÿÿÿÿrealÿscalarÿidx
        >ÿÿÿÿÿidxÿ=ÿst_addvar("str6",ÿ"cusip")
        >ÿ
        >ÿÿÿÿÿst_sstore(.,ÿidx,ÿcusip(n))
        >ÿ}

        :ÿ
        :ÿend
        --------------------------------------------------------------------------------

        .ÿ
        .ÿsetÿlinesizeÿ`line_size'

        .ÿ
        .ÿmata:ÿup(100000)

        .ÿcontractÿcusip,ÿfreq(row)

        .ÿquietlyÿkeepÿinÿ1/60000

        .ÿquietlyÿreplaceÿrowÿ=ÿ_n

        .ÿ
        .ÿframeÿcreateÿYears

        .ÿframeÿYearsÿ{
        .ÿÿÿÿÿquietlyÿsetÿobsÿ60000
        .ÿÿÿÿÿgenerateÿintÿyearÿ=ÿ2000ÿ+ÿmod(_n,ÿ23)
        .ÿÿÿÿÿgenerateÿlongÿrowÿ=ÿ_n
        .ÿ}

        .ÿquietlyÿfrlinkÿ1:1ÿrow,ÿframe(Years)

        .ÿquietlyÿfrgetÿyear,ÿfrom(Years)

        .ÿframeÿdropÿYears

        .ÿdropÿrowÿYears

        .ÿ
        .ÿgenerateÿbyteÿdealÿ=ÿrbinomial(1,ÿ0.03)

        .ÿ
        .ÿ*
        .ÿ*ÿBeginÿhere
        .ÿ*
        .ÿtimerÿclearÿ1

        .ÿtimerÿonÿ1

        .ÿcount
        ÿÿ60,000

        .ÿassertÿinlist(deal,ÿ0,ÿ1)

        .ÿ
        .ÿpreserve

        .ÿquietlyÿkeepÿifÿdeal

        .ÿdropÿdeal

        .ÿquietlyÿexpandÿ5

        .ÿbysortÿyearÿ(cusip):ÿgenerateÿintÿrowÿ=ÿ_n

        .ÿrenameÿcusipÿcusip_deal

        .ÿtempfileÿdeals

        .ÿquietlyÿsaveÿ`deals'

        .ÿ
        .ÿrestore

        .ÿquietlyÿkeepÿifÿ!deal

        .ÿdropÿdeal

        .ÿgenerateÿdoubleÿranduÿ=ÿruniform()

        .ÿquietlyÿisidÿyearÿrandu,ÿsort

        .ÿbyÿyear:ÿgenerateÿintÿrowÿ=ÿ_n

        .ÿrenameÿcusipÿcusip_nodeal

        .ÿmergeÿ1:1ÿyearÿrowÿusingÿ`deals',ÿassert(matchÿmaster)ÿkeep(match)ÿ///
        >ÿÿÿÿÿnogenerateÿnoreport

        .ÿtimerÿoffÿ1

        .ÿ
        .ÿtimerÿlist
        ÿÿÿ1:ÿÿÿÿÿÿ0.21ÿ/ÿÿÿÿÿÿÿÿ1ÿ=ÿÿÿÿÿÿÿ0.2120

        .ÿ
        .ÿsortÿyearÿrow

        .ÿlistÿyearÿcusip_dealÿcusip_nodealÿ///
        >ÿÿÿÿÿifÿinrange(_n,ÿ1,ÿ10)ÿ|ÿinrange(_n,ÿ_N-9,ÿ_N),ÿ///
        >ÿÿÿÿÿÿÿÿÿnoobsÿsepby(cusip_deal)ÿabbreviate(20)

        ÿÿ+----------------------------------+
        ÿÿ|ÿyearÿÿÿcusip_dealÿÿÿcusip_nodealÿ|
        ÿÿ|----------------------------------|
        ÿÿ|ÿ2000ÿÿÿÿÿÿÿ01908Wÿÿÿÿÿÿÿÿÿ43346Eÿ|
        ÿÿ|ÿ2000ÿÿÿÿÿÿÿ01908Wÿÿÿÿÿÿÿÿÿ477104ÿ|
        ÿÿ|ÿ2000ÿÿÿÿÿÿÿ01908Wÿÿÿÿÿÿÿÿÿ063706ÿ|
        ÿÿ|ÿ2000ÿÿÿÿÿÿÿ01908Wÿÿÿÿÿÿÿÿÿ23025Mÿ|
        ÿÿ|ÿ2000ÿÿÿÿÿÿÿ01908Wÿÿÿÿÿÿÿÿÿ21328Nÿ|
        ÿÿ|----------------------------------|
        ÿÿ|ÿ2000ÿÿÿÿÿÿÿ02399Aÿÿÿÿÿÿÿÿÿ35268Lÿ|
        ÿÿ|ÿ2000ÿÿÿÿÿÿÿ02399Aÿÿÿÿÿÿÿÿÿ457659ÿ|
        ÿÿ|ÿ2000ÿÿÿÿÿÿÿ02399Aÿÿÿÿÿÿÿÿÿ16571Rÿ|
        ÿÿ|ÿ2000ÿÿÿÿÿÿÿ02399Aÿÿÿÿÿÿÿÿÿ18284Jÿ|
        ÿÿ|ÿ2000ÿÿÿÿÿÿÿ02399Aÿÿÿÿÿÿÿÿÿ381073ÿ|
        ÿÿ|----------------------------------|
        ÿÿ|ÿ2022ÿÿÿÿÿÿÿ626835ÿÿÿÿÿÿÿÿÿ063288ÿ|
        ÿÿ|ÿ2022ÿÿÿÿÿÿÿ626835ÿÿÿÿÿÿÿÿÿ168286ÿ|
        ÿÿ|ÿ2022ÿÿÿÿÿÿÿ626835ÿÿÿÿÿÿÿÿÿ18033Jÿ|
        ÿÿ|ÿ2022ÿÿÿÿÿÿÿ626835ÿÿÿÿÿÿÿÿÿ54974Oÿ|
        ÿÿ|ÿ2022ÿÿÿÿÿÿÿ626835ÿÿÿÿÿÿÿÿÿ54123Uÿ|
        ÿÿ|----------------------------------|
        ÿÿ|ÿ2022ÿÿÿÿÿÿÿ62877Oÿÿÿÿÿÿÿÿÿ366169ÿ|
        ÿÿ|ÿ2022ÿÿÿÿÿÿÿ62877Oÿÿÿÿÿÿÿÿÿ210965ÿ|
        ÿÿ|ÿ2022ÿÿÿÿÿÿÿ62877Oÿÿÿÿÿÿÿÿÿ46907Vÿ|
        ÿÿ|ÿ2022ÿÿÿÿÿÿÿ62877Oÿÿÿÿÿÿÿÿÿ102371ÿ|
        ÿÿ|ÿ2022ÿÿÿÿÿÿÿ62877Oÿÿÿÿÿÿÿÿÿ390381ÿ|
        ÿÿ+----------------------------------+

        .ÿ
        .ÿexit

        endÿofÿdo-file


        .

        Comment


        • #5
          Clyde Schechter do you maybe have another solution in your mind.

          #4: Unfortunately, I don't think that fits to what I need. First, I need to keep variable deal for further calculations. Second, even when using your command and keeping variable deal, it results in observations which are only peer companies (deal = 0).

          Comment


          • #6
            Originally posted by Sebastian Grein View Post
            #4: Unfortunately, I don't think that fits to what I need. First, I need to keep variable deal for further calculations.
            After a standard merge, the variable deal will be superfluous, uninformative (always zero), because that is the value of deal for the master dataset, which absent the update replace options to the merge command will always take precedence. And if you use those options, then the variable deal will still be superfluous, uninformative (always one).

            Refer to the help file for merge for further information.

            Second, even when using your command and keeping variable deal, it results in observations which are only peer companies (deal = 0).
            No it doesn't—it results in observations that contain deal companies (center column) matched against peer companies (rightmost column). See above and read the help file for merge linked to there.

            After the merge, regardless of whether Clyde's or mine, you'll have a column of deal-makers and a column of peers. Either way, the deal variable is redundant. I recommend thinking your further calculations through a bit more as to the need for the deal variable.

            Comment


            • #7
              I know how the merge command is working. That is why merging the data is potentially not the right option to address my problem. There needs to be a deal observation (deal = 1) and 5 peer companies (deal = 0) respectively, resulting in 6 observations. The variable deal is necessary to distinguish between deals and peer companies and to investigate differences in further investigation.

              Okay, I haven't noticed that since I expected a different output. I have additional information per observation like financial data about each deal company and peer company. Therefore, the output of having a year and the deal and peer company in the same line is not helpful for me.

              Comment


              • #8
                Therefore, the output of having a year and the deal and peer company in the same line is not helpful for me.
                So all you need to do is -reshape- the result. So after the code I originally proposed:
                Code:
                rename cusip_a cusip1
                rename cusip_ctrl cusip0
                drop deal
                egen `c(obs_t)' tuple = group(cusip1 Year)
                gen `c(obs_t)' obs_no = _n
                reshape long cusip, i(obs_no) j(deal)
                drop obs_no
                duplicates drop
                sort tuple cusip Year deal
                will get you separate observations for acquirers and their matched controls, along with a deal variable, and also another variable that links all observations that are part of the same tuple.

                Comment


                • #9
                  Originally posted by Sebastian Grein View Post
                  I know how the merge command is working.
                  . . . I have additional information per observation like financial data about each deal company and peer company.
                  OK, sorry—it dawned on me afterward that what you need is what Clyde has posted since.

                  Comment


                  • #10
                    Unfortunately, I think only reshaping won't work as well since each observation contains variables with financial information (e.g. total assets). This information would be lost when merging

                    Comment


                    • #11
                      Originally posted by Sebastian Grein View Post
                      . . . each observation contains variables with financial information (e.g. total assets). This information would be lost when merging
                      True, but you can then merge afterward back on the dataset that contains that information. Use the firm's CUSIP and year as the linking variables.

                      Comment


                      • #12
                        Actually, you can avoid the loss of information altogether by adding a suffix to the variables to distinguish the deal-makers' financial information from the peers'. See below for how you can keep the financial information together with the CUSIPs throughout the data-management process, including the merge step.

                        You don't show any financial variables above and so I use the auto dataset for illustration (foreign serves as the surrogate for your deal variable), but both the principle and method are identical to what you would use. I identify the method steps in comments in the code section below and in the results posted below the code section I show that the method preserves the ancillary information.
                        Code:
                        version 18.0
                        
                        clear *
                        
                        // seedem
                        set seed 939544781
                        
                        quietly sysuse auto
                        
                        // deal == 1
                        preserve
                        quietly keep if foreign
                        drop foreign
                        
                        quietly ds
                        local reshape_varlist `r(varlist)'
                        
                        rename (`reshape_varlist') =1 // <= here
                        
                        generate byte pair = _n
                        
                        tempfile foreign
                        quietly save `foreign'
                        
                        // deal == 0
                        restore
                        quietly drop if foreign
                        drop foreign
                        
                        rename (`reshape_varlist') =0 // <= and here
                        
                        generate double randu = runiform()
                        quietly isid randu, sort
                        generate byte pair = _n
                        drop randu
                        
                        // merge deal-makers against peers, preserving ancillary information
                        merge 1:1 pair using `foreign', assert(match master) keep(match) nogenerate noreport
                        
                        quietly reshape long `reshape_varlist', i(pair) j(foreign) // <= finally
                        
                        list pair-trunk in 1/4, noobs abbreviate(20) sepby(pair)
                        list pair-foreign weight-gear_ratio in 1/4, noobs abbreviate(20) sepby(pair)
                        
                        exit
                        .ÿlistÿpair-trunkÿinÿ1/4,ÿnoobsÿabbreviate(20)ÿsepby(pair)

                        ÿÿ+-----------------------------------------------------------------------------+
                        ÿÿ|ÿpairÿÿÿforeignÿÿÿmakeÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿpriceÿÿÿmpgÿÿÿrep78ÿÿÿheadroomÿÿÿtrunkÿ|
                        ÿÿ|-----------------------------------------------------------------------------|
                        ÿÿ|ÿÿÿÿ1ÿÿÿÿÿÿÿÿÿ0ÿÿÿPlym.ÿHorizonÿÿÿÿÿÿÿ4,482ÿÿÿÿ25ÿÿÿÿÿÿÿ3ÿÿÿÿÿÿÿÿ4.0ÿÿÿÿÿÿ17ÿ|
                        ÿÿ|ÿÿÿÿ1ÿÿÿÿÿÿÿÿÿ1ÿÿÿAudiÿ5000ÿÿÿÿÿÿÿÿÿÿÿ9,690ÿÿÿÿ17ÿÿÿÿÿÿÿ5ÿÿÿÿÿÿÿÿ3.0ÿÿÿÿÿÿ15ÿ|
                        ÿÿ|-----------------------------------------------------------------------------|
                        ÿÿ|ÿÿÿÿ2ÿÿÿÿÿÿÿÿÿ0ÿÿÿLinc.ÿVersaillesÿÿÿ13,466ÿÿÿÿ14ÿÿÿÿÿÿÿ3ÿÿÿÿÿÿÿÿ3.5ÿÿÿÿÿÿ15ÿ|
                        ÿÿ|ÿÿÿÿ2ÿÿÿÿÿÿÿÿÿ1ÿÿÿAudiÿFoxÿÿÿÿÿÿÿÿÿÿÿÿ6,295ÿÿÿÿ23ÿÿÿÿÿÿÿ3ÿÿÿÿÿÿÿÿ2.5ÿÿÿÿÿÿ11ÿ|
                        ÿÿ+-----------------------------------------------------------------------------+

                        .ÿlistÿpair-foreignÿweight-gear_ratioÿinÿ1/4,ÿnoobsÿabbreviate(20)ÿsepby(pair)

                        ÿÿ+---------------------------------------------------------------------+
                        ÿÿ|ÿpairÿÿÿforeignÿÿÿweightÿÿÿlengthÿÿÿturnÿÿÿdisplacementÿÿÿgear_ratioÿ|
                        ÿÿ|---------------------------------------------------------------------|
                        ÿÿ|ÿÿÿÿ1ÿÿÿÿÿÿÿÿÿ0ÿÿÿÿ2,200ÿÿÿÿÿÿ165ÿÿÿÿÿ36ÿÿÿÿÿÿÿÿÿÿÿÿ105ÿÿÿÿÿÿÿÿÿ3.37ÿ|
                        ÿÿ|ÿÿÿÿ1ÿÿÿÿÿÿÿÿÿ1ÿÿÿÿ2,830ÿÿÿÿÿÿ189ÿÿÿÿÿ37ÿÿÿÿÿÿÿÿÿÿÿÿ131ÿÿÿÿÿÿÿÿÿ3.20ÿ|
                        ÿÿ|---------------------------------------------------------------------|
                        ÿÿ|ÿÿÿÿ2ÿÿÿÿÿÿÿÿÿ0ÿÿÿÿ3,830ÿÿÿÿÿÿ201ÿÿÿÿÿ41ÿÿÿÿÿÿÿÿÿÿÿÿ302ÿÿÿÿÿÿÿÿÿ2.47ÿ|
                        ÿÿ|ÿÿÿÿ2ÿÿÿÿÿÿÿÿÿ1ÿÿÿÿ2,070ÿÿÿÿÿÿ174ÿÿÿÿÿ36ÿÿÿÿÿÿÿÿÿÿÿÿÿ97ÿÿÿÿÿÿÿÿÿ3.70ÿ|
                        ÿÿ+---------------------------------------------------------------------+

                        .ÿ
                        .ÿexit

                        endÿofÿdo-file


                        .


                        The ancillary information is listed in two batches in order to accommodate line length.

                        Comment

                        Working...
                        X