Sampling and matching on the basis of specific characteristics

Andreas Mueller

31 Jul 2015, 01:36

Dear Stata Community

I am trying to implement a specific sampling procedure, but I am currently stuck on how to do so.

I have two data sets. File A contains my sample firms, including information on completion date, returns, book-to-market ratio and size, among other variables. File B contains potential matching firms, with similar information for each point in time. My goal is to "randomly" sample 1,000 matching firms from File B that are similar to my firms in File A, where similarity is determined on the basis of size and book-to-market ratio: the smaller the difference in both characteristics, the higher the similarity. One condition is that a matching firm can be used only once. Thus, the matching procedure should look as follows:
1. Sample the firm from File B that has the highest similarity to a sample firm in File A at the completion date.
2. Save the information from both File A and File B in a new file.
3. Delete the respective matched firm from File B.
4. Repeat from step 1, 1,000 times in total.
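
A rough sketch of how these steps might look in Stata, under several assumptions: all candidate pairs at the completion date, with the distance soapd already computed as in the code further below, have been saved in a hypothetical file pairs.dta; permnoc is numeric; and the "random" element is which sample firm is drawn in each round. The names pairs, firmid, draw and matching_greedy are illustrative only.

```stata
* Sketch only: pairs.dta is a hypothetical file with one row per
* sample-firm/candidate pair (permnoa, event, permnoc, soapd).
use pairs, clear
set seed 12345

* Consecutive id for each sample firm-event, used for the random draw
egen long firmid = group(permnoa event)
summarize firmid, meanonly
local nfirms = r(max)

tempfile matched
local nmatched = 0

forvalues i = 1/1000 {
    * Draw one sample firm at random (with replacement)
    local pick = ceil(runiform() * `nfirms')
    quietly count if firmid == `pick'
    if r(N) == 0 {
        continue                      // no unused candidates left for this firm
    }

    preserve
    quietly keep if firmid == `pick'
    sort soapd permnoc                // closest candidate first, ties by permnoc
    quietly keep in 1
    local used = permnoc[1]           // assumes permnoc is numeric
    gen long draw = `i'
    if `nmatched' > 0 {
        append using `matched'
    }
    quietly save `matched', replace   // accumulate the matched pairs
    local ++nmatched
    restore

    * A matched firm can be used only once: remove it from the pool
    quietly drop if permnoc == `used'
}

use `matched', clear
sort draw
save matching_greedy, replace
```

If a drawn firm has no unused candidates left, that round is skipped, so the resulting file can contain fewer than 1,000 matches.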

Right now, I have implemented the following procedure (see my code below):
1. Sample 1,000 firms from File A with replacement (i.e. by using -bsample-).
2. For each of these 1,000 firms, -joinby- the information of File B.
3. Minimize the difference between the characteristics of Firm A and Firm B.
4. Keep those firms where the difference is minimized and drop all others.

Unfortunately, the drawback of this procedure is that a firm that is sampled several times, due to sampling with replacement (which is necessary), will always get the same matched firm. It would therefore be great if the next-closest firm in terms of similarity were chosen for each repeated draw. I know that the right approach to this matching would be the first procedure; however, I have no clue how to implement it.
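
One possible way to give each repeated draw the next-closest firm is to number the copies of a sample firm and let the k-th copy keep its k-th closest candidate. The sketch below uses made-up toy data; the variables copy and rank are hypothetical. In the real workflow, copy could presumably be created right after -bsample- with something like "bysort permnoa event: gen copy = _n", before the -joinby-.

```stata
* Toy pairs data (made-up numbers): firm 10001 was drawn twice, firm 10002 once.
clear
input long permnoa byte copy long permnoc double soapd
10001 1 20001 0.30
10001 1 20002 0.10
10001 1 20003 0.20
10001 2 20001 0.30
10001 2 20002 0.10
10001 2 20003 0.20
10002 1 20001 0.05
10002 1 20003 0.40
end

* Rank the candidates within each draw by distance; ties are broken by permnoc
bysort permnoa copy (soapd permnoc): gen long rank = _n

* The k-th copy of a sample firm keeps its k-th closest candidate
keep if rank == copy
list permnoa copy permnoc soapd, noobs
```

In this toy data the second copy of firm 10001 ends up with candidate 20003 instead of 20002. Note that this only separates repeated draws of the same sample firm; it does not prevent two different sample firms from receiving the same matched firm.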

Does anybody have any suggestions?

Kind regards
Andreas

Code:

use File_A.dta, clear
drop if diff<0
drop if diff>0 // only data of completion date is needed
gen match=string(month(date), "%02.0f")+string(year(date), "%02.0f")
sort match
rename permno permnoa
rename size sizea
rename bmratio bmratioa
rename prc prca
rename shrout shrouta
rename return returna
sort permnoa event date
joinby match using File_B.dta // File B; file name assumed to parallel File_A.dta
drop if permnoc==permnoa // drops observations where a sample firm is its own matched firm
drop if bmratioc==. // drops firms with no book-to-market ratio
gen constrainedsize=sizea*0.9 // the size of the matched firm should not be smaller than 90% of the sample firm
gen deltasize=abs(sizea-sizec) // absolute difference between size of sample and matched firm
gen deltabmratio=abs(bmratioa-bmratioc) // absolute difference between book-to-market ratio of sample and matched firm
gen pdiffsize=abs(deltasize/sizea) // calculates the percentage difference
gen pdiffbmratio=abs(deltabmratio/bmratioa) // calculates the percentage difference
sort permnoa event diff
gen x=1 if sizec<constrainedsize // if the size of the matched firm is lower than the constrained size, mark it with x=1
egen minsize=min(deltasize) if x==1, by(permnoa event diff) // if the size is lower than the constrained size, choose the smallest difference
gen y=1 if minsize==deltasize & x==1 // mark the firm with the smallest difference if size is lower than constrained size
gen z=1 if x==1 & y==. // mark all other firms if size is lower than constrained size but not the smallest difference
drop if z==1 // drop those firms
gen soapd=(pdiffsize+pdiffbmratio) // generate the sum of the absolute percentage differences of size and book-to-market ratio
egen mindiff = min(soapd), by(permnoa event diff) // generate the minimum of the sum outlined above
keep if soapd==mindiff // keep those firms where the sum is minimized
save matching, replace