sample selection under many restrictions

Matthijs Tichem

Join Date: Apr 2019

Posts: 10
#1

sample selection under many restrictions

27 Apr 2019, 02:33

Hi,

I have a (panel) dataset that contains observations with the following variables: company (about 100) - buyer (7) - year (2012-2017) - procedure (about 4200)- procedure_type (65) - price (can be anything).
Companies differ with respect to which of the 4200 procedures they offer. Each buyer buys at least some of the offered procedures at each company.

1. I would like to select the procedures that are being offered in each of the 6 years, by every company and bought by each buyer.
2. In a later subsample I would like to restrict the sample above (under 1) even further by keeping the 10 different procedures with the highest median price.
3. In another subsample
I would like to restrict the sample above (under 1) by keeping the 10 procedures that appear most often.

Any ideas on how to do this? Thanks in advance!
Tags: data, panel, panel data, sample, sample selection
Clyde Schechter

Join Date: Apr 2014

Posts: 30050
#2

27 Apr 2019, 11:29

You are asking for help with code; a description of a data set in words does not provide all of the information needed to help you. Please post back with an example of your data set, using the -dataex- command. Please be sure the example you show contains both some observations that you want to retain in the sample, and some that you don't.

If you are running version 15.1 or a fully updated version 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

When asking for help with code, always show example data. When showing example data, always use -dataex-.
Comment
Matthijs Tichem

Join Date: Apr 2019

Posts: 10
#3

01 May 2019, 08:31

Dear Clyde,

Thanks for your reply. Below I provided example data, including a short explanation. Besides I had to make some changes to my data and question as well. Please find the new version of my question below as well. I hope you can provide me with some new insights, that can help me with sample selection.

Best,

Matthijs

Rephrased question and background information:

I have a (panel) dataset that contains observations with the following variables: company_id (10) - year (2012-2015) - procedure (1-3) - procedure_type (101 and 111) - price - amount (1 for every obervation - means 1 procedure) - buyer (2).

Companies differ with respect to which of the 3 procedures they offer. Some provide all, some don't. Furthermore, the procedures provided by every company vary by year. Some procedures that a company provides are offered in every year but others in just a few years or just once. As can be seen, there are 2 buyers, that do not necessarily buy the same procedures that a company offers (in each year).

1. I would like to select the procedures that appear at least once for every company-buyer combination (since my unit of analysis is the company-buyer combination).
2. In a later subsample I would like to restrict the sample of (1) further by keeping the 10 different procedures that appear most often
3. In another subsample I would like to restrict the sample of (1) further by keeping the 10 different procedures with the highest median price
4. In another sample I would like to restrict my sample to procedures that appear in every year for every company-buyer combination, instead of 'at least once- under sample 1' (balanced).

Example data

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(company_id procedure procedure_type price year amount buyer)
1 1 101 234 2012 1 1
1 1 101 245 2013 1 1
1 1 101 269 2014 1 1
1 1 101 221 2015 1 1
1 1 101 234 2012 1 2
1 1 101 234 2013 1 2
1 1 101 234 2014 1 2
1 1 101 234 2015 1 2
1 2 101 85 2013 1 1
1 2 101 78 2014 1 1
1 2 101 80 2015 1 1
1 2 101 85 2013 1 2
1 3 111 345 2012 1 1
1 3 111 346 2013 1 1
1 3 111 347 2014 1 1
1 3 111 348 2015 1 1
1 3 111 345 2012 1 2
2 1 101 167 2012 1 1
2 1 101 185 2013 1 1
2 1 101 167 2012 1 2
2 2 101 101 2012 1 1
2 2 101 103 2013 1 1
2 2 101 101 2012 1 2
2 3 111 321 2012 1 1
2 3 111 322 2013 1 1
2 3 111 321 2012 1 2
3 1 101 301 2012 1 1
3 1 101 289 2013 1 1
3 1 101 301 2012 1 2
3 2 101 125 2012 1 1
3 2 101 114 2013 1 1
3 2 101 114 2013 1 2
3 3 111 346 2012 1 1
3 3 111 347 2013 1 1
4 1 101 250 2014 1 1
4 1 101 250 2015 1 1
4 2 101 130 2014 1 1
4 2 101 130 2015 1 1
4 2 101 130 2015 1 2
4 3 111 311 2014 1 1
4 3 111 311 2015 1 1
4 3 111 311 2014 1 2
5 1 101 163 2012 1 1
5 1 101 163 2013 1 1
5 1 101 163 2013 1 2
5 2 101 111 2012 1 1
5 2 101 111 2013 1 1
5 2 101 111 2013 1 2
5 3 111 378 2012 1 1
5 3 111 330 2013 1 1
5 3 111 330 2013 1 2
6 1 101 198 2012 1 1
6 1 101 201 2013 1 1
6 1 101 178 2014 1 1
6 1 101 187 2015 1 1
6 1 101 198 2012 1 2
6 2 101 132 2012 1 1
6 2 101 131 2013 1 1
6 2 101 144 2014 1 1
6 2 101 144 2015 1 1
6 2 101 144 2015 1 2
6 3 111 380 2012 1 1
6 3 111 380 2013 1 1
6 3 111 345 2014 1 1
6 3 111 345 2015 1 1
6 3 111 380 2012 1 2
7 1 101 134 2012 1 1
7 1 101 143 2013 1 1
7 1 101 152 2014 1 1
7 2 101 105 2012 1 1
7 2 101 105 2013 1 1
7 2 101 106 2014 1 1
7 2 101 105 2012 1 2
7 2 101 106 2014 1 2
7 3 111 401 2012 1 1
7 3 111 406 2013 1 1
7 3 111 399 2014 1 1
7 3 111 401 2012 1 2
8 1 101 161 2015 1 1
8 2 101 106 2015 1 1
8 2 101 106 2015 1 2
8 3 111 400 2015 1 1
8 3 111 400 2015 1 2
9 1 101 204 2012 1 1
9 1 101 204 2012 1 2
9 2 101 134 2012 1 1
9 2 101 134 2012 1 2
9 3 111 267 2012 1 1
9 3 111 267 2012 1 2
10 1 101 166 2012 1 1
10 1 101 166 2013 1 1
10 1 101 145 2014 1 1
10 1 101 145 2015 1 1
10 1 101 145 2014 1 2
10 3 111 371 2013 1 1
10 3 111 376 2014 1 1
10 3 111 345 2015 1 1
10 3 111 345 2015 1 2
end

*example data - explanation/description:

*
*10 different company id's
* - 1 is active in all years (2012-2015)
* - 2 and 3 are two separate entities in 2012 and 2013. After a merger these companies form one entity that takes on company_id 4 as off 2014
* - 5 and 6 are two separate entities in 2012 and 2013. After a merger these companies form one entity that continues with the existing company_id 6
* - company id 7 applies to a company that is present form 2012-2014. The same company takes on a new id (8) in 2015
* - company id 9 applies to a company that is only present in 2012
* - company id 10 is active in all years
*
*procedure - 3 different ones in this example
*
*procedure_type - 2 different ones (101 & 111) in this example - procedure 1 & 2 belong to 101
*
*some procedures appear in every year for a specific company; f.i. procedure 1 for company(id) 1.
*but for others, one or more years are missing; f.i. procedure 2 for company(id) 1 in 2012
*
*some procedures appear at least once for every company that is ever active; f.i. procedure 1 and 3
*among these procedures there are, that appear in every year for each company; procedure 1
*(note: it may seem that the latter is not the case; f.i. 1 only appears in 2012 and 2013 for company_id 2 and 3; however, company_id 4 replaces both id 2 and 3, therefore procedure 1 appears in every year in for the entity ralting to id 2,3 and 4)
*There are some procedures that do not appear at all for a specific company; f.i. procedure 2 never appears for company_id 10
*
*As can be seen, there are 2 different buyers, that differ with respect to the procedures they buy. In reality there are more than 2, but for simplicity I have buyer (2) in this example data.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30050
#4

01 May 2019, 13:27

I may be misunderstanding what you want here. It appears to me that there is no procedure at all in your example that appears with every company:buyer pair. Procedure 1 never appears with company 4: buyer 2, company 7:buyer 2, nor company 8:buyer 2. Procedure 2 never appears with company 10 at all. And procedure 3 never appears with company 3 buyer 2. So did you intend to post an example where nothing would fit into your subsample? Because that is how the example is, I cannot consider this code to be adequately tested, but I believe it works.

Code:

levelsof procedure, local(procedures) gen byte to_drop = 0 foreach p of local procedures { display `"`p'"' by company_id buyer, sort: egen byte in_this_pair = max(procedure == `p') summ in_this_pair, meanonly if `r(min)' == 0 { replace to_drop = 1 if procedure == `p' } drop in_this_pair } drop if to_drop drop to_drop

will leave your first subsample in memory.

And this slight modification of the above, applied to the original data, will produce your 4th (balanced) sample:

Code:

fillin company_id buyer year levelsof procedure, local(procedures) gen to_drop = 0 foreach p of local procedures { by company_id buyer year, sort: egen in_this_pair_this_year = max(procedure == `p') summ in_this_pair_this_year, meanonly if `r(min)' == 0 { replace to_drop = 1 if procedure == `p' } drop in_this_pair_this_year } drop if to_drop drop to_drop

As for refining the first subsample to the ten most frequent or the ten with the highest median price:

Code:

// NOW FURTHER RESTRICFT TO TOP TEN MOST FREQUENT PROCEDURES by procedure, sort: gen frequency = _N replace frequency = -frequency // SO IT WILL SORT PROPERLY by frequency procedure, sort: gen freq_rank = 1 if _n == 1 replace freq_rank = sum(freq_rank) keep if freq_rank <= 10 // RESTRICT THE SAMPLE TO THE TOP TEN MEDIAN PRICES by procedure, sort: egen median_price = median(price) replace median_price = -median_price by median_price procedure, sort: gen price_rank = 1 if _n == 1 replace price_rank = sum(price_rank) keep if price_rank <= 10

(Don't run these consecutively: you go back to the original data for each one.)

Note that the notion of top 10 is, strictly speaking, ill defined as there can be ties for the tenth place. In this code, any such ties are broken by arbitrarily choosing as tenth the procedure which is the lowest number in the coding. So, if procedures 5 and 19 are both tied for tenth place, 5 will be selected.

It is possible that these blocks of code contain errors, as I could not adequately test them using your example data. If you find problems, when posting back, please show a new data example, one which demonstrates whatever problem you find.
Comment
Matthijs Tichem

Join Date: Apr 2019

Posts: 10
#5

02 May 2019, 02:33

Dear Clyde,

First off all, thanks for the advice/ solutions!

With respect to the example data, I indeed made a mistake. I accidentally dropped the observations for procedure 1: c
ompany 4: buyer 2, company 7:buyer 2, company 8: buyer 2. However, when I include these observations, the first sample shows all obervations belonging to procedure 1. Hence, the first code does in fact provide me with the sample I want. Besides, I also tested the codes that should provide me with the 4th sample, and they seem to work fine as well.

Once again, many thanks for your help!

Best,

Matthijs
Comment

Announcement

sample selection under many restrictions

Comment

Comment

Comment

Comment