alternative to keep if for a large list of firm id s

Ama Perera

Join Date: Mar 2019

Posts: 43
#1

alternative to keep if for a large list of firm id s

28 Oct 2021, 19:45

Hi all,

I have a list of firm ids that contains 17,000 firm ids. I have a big dataset which contains over 5 million observations of an even bigger set of firm ids (Dataset A). For dataset A, I want to only keep the firm ids that are mentioned in my list of firm ids (or drop the firm ids that aren't in the list). Can you please help me with this?

Thank you.
Tags: firmid, keep if, match
Joseph Coveney

Join Date: Apr 2014

Posts: 4383
#2

28 Oct 2021, 23:24

You can put into a separate dataset your list of firm IDs that you want, and then use -merge- to select those firms from the big dataset. I illustrate the basic method below. (Begin at the "Begin here" comment; the top part is just to create a dataset of five million observations, called Big, and a subset of that to represent your list of seventeen thousand members. If your form ID's variable names differ between the dataset and your list, then you could use -frames- to do the selection. It's a little slower on unsorted datasets, but it's more convenient if you want to keep the full dataset handy and just operate on the selected subset. I've illustrated its use, too, below.

.ÿ
.ÿversionÿ17.0

.ÿ
.ÿclearÿ*

.ÿ
.ÿsetÿseedÿ1354377003

.ÿ
.ÿsetÿobsÿ5000000
Numberÿofÿobservationsÿ(_N)ÿwasÿ0,ÿnowÿ5,000,000.

.ÿ
.ÿ//ÿFirmÿID
.ÿgenerateÿstrÿfidÿ=ÿstring(runiform(),ÿ"%16H")

.ÿisidÿfid

.ÿ
.ÿ//ÿData
.ÿforvaluesÿiÿ=ÿ1/10ÿ{
ÿÿ2.ÿÿÿÿÿÿÿÿÿgenerateÿdoubleÿvar`i'ÿ=ÿruniform()
ÿÿ3.ÿ}

.ÿquietlyÿsaveÿBig

.ÿ
.ÿsortÿvar1

.ÿquietlyÿkeepÿinÿ1/17000

.ÿkeepÿfid

.ÿ
.ÿquietlyÿsaveÿWant

.ÿ
.ÿclearÿ*

.ÿ
.ÿ*
.ÿ*ÿBeginÿhere
.ÿ*
.ÿtimerÿclearÿ1

.ÿtimerÿonÿ1

.ÿ
.ÿuseÿWant

.ÿmergeÿ1:1ÿfidÿusingÿBig,ÿassert(matchÿusing)ÿkeep(match)

ÿÿÿÿResultÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿNumberÿofÿobs
ÿÿÿÿ-----------------------------------------
ÿÿÿÿNotÿmatchedÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ0
ÿÿÿÿMatchedÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ17,000ÿÿ(_merge==3)
ÿÿÿÿ-----------------------------------------

.ÿ
.ÿtimerÿoffÿ1

.ÿtimerÿlistÿ1
ÿÿÿ1:ÿÿÿÿÿ17.12ÿ/ÿÿÿÿÿÿÿÿ1ÿ=ÿÿÿÿÿÿ17.1200

.ÿ
.ÿclearÿ*

.ÿ
.ÿtimerÿclearÿ2

.ÿtimerÿonÿ2

.ÿ
.ÿframeÿcreateÿWant

.ÿframeÿWant:ÿuseÿWant

.ÿ
.ÿuseÿBig

.ÿfrlinkÿ1:1ÿfid,ÿframe(Want)
ÿÿ(4,983,000ÿobservationsÿinÿframeÿdefaultÿunmatched)

.ÿquietlyÿkeepÿifÿ!mi(Want)

.ÿ
.ÿtimerÿoffÿ2

.ÿtimerÿlistÿ2
ÿÿÿ2:ÿÿÿÿÿ23.28ÿ/ÿÿÿÿÿÿÿÿ1ÿ=ÿÿÿÿÿÿ23.2790

.ÿ
.ÿeraseÿBig.dta

.ÿeraseÿWant.dta

.ÿ
.ÿexit

endÿofÿdo-file

.

I've assumed a 1:1 cardinality in the join, based upon your description, but the same principle would hold for a many-to-one relationship of firm IDs in the big dataset to your selection list.
2 likes
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35505
#3

29 Oct 2021, 00:54

See also https://www.stata.com/support/faqs/d...s-for-subsets/ for an FAQ by Kit Baum.
1 like
Comment
Ama Perera

Join Date: Mar 2019

Posts: 43
#4

31 Oct 2021, 21:39

Originally posted by Joseph Coveney View Post

You can put into a separate dataset your list of firm IDs that you want, and then use -merge- to select those firms from the big dataset. I illustrate the basic method below. (Begin at the "Begin here" comment; the top part is just to create a dataset of five million observations, called Big, and a subset of that to represent your list of seventeen thousand members. If your form ID's variable names differ between the dataset and your list, then you could use -frames- to do the selection. It's a little slower on unsorted datasets, but it's more convenient if you want to keep the full dataset handy and just operate on the selected subset. I've illustrated its use, too, below.

.ÿ
.ÿversionÿ17.0

.ÿ
.ÿclearÿ*

.ÿ
.ÿsetÿseedÿ1354377003

.ÿ
.ÿsetÿobsÿ5000000
Numberÿofÿobservationsÿ(_N)ÿwasÿ0,ÿnowÿ5,000,000.

.ÿ
.ÿ//ÿFirmÿID
.ÿgenerateÿstrÿfidÿ=ÿstring(runiform(),ÿ"%16H")

.ÿisidÿfid

.ÿ
.ÿ//ÿData
.ÿforvaluesÿiÿ=ÿ1/10ÿ{
ÿÿ2.ÿÿÿÿÿÿÿÿÿgenerateÿdoubleÿvar`i'ÿ=ÿruniform()
ÿÿ3.ÿ}

.ÿquietlyÿsaveÿBig

.ÿ
.ÿsortÿvar1

.ÿquietlyÿkeepÿinÿ1/17000

.ÿkeepÿfid

.ÿ
.ÿquietlyÿsaveÿWant

.ÿ
.ÿclearÿ*

.ÿ
.ÿ*
.ÿ*ÿBeginÿhere
.ÿ*
.ÿtimerÿclearÿ1

.ÿtimerÿonÿ1

.ÿ
.ÿuseÿWant

.ÿmergeÿ1:1ÿfidÿusingÿBig,ÿassert(matchÿusing)ÿkeep(match)

ÿÿÿÿResultÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿNumberÿofÿobs
ÿÿÿÿ-----------------------------------------
ÿÿÿÿNotÿmatchedÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ0
ÿÿÿÿMatchedÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ17,000ÿÿ(_merge==3)
ÿÿÿÿ-----------------------------------------

.ÿ
.ÿtimerÿoffÿ1

.ÿtimerÿlistÿ1
ÿÿÿ1:ÿÿÿÿÿ17.12ÿ/ÿÿÿÿÿÿÿÿ1ÿ=ÿÿÿÿÿÿ17.1200

.ÿ
.ÿclearÿ*

.ÿ
.ÿtimerÿclearÿ2

.ÿtimerÿonÿ2

.ÿ
.ÿframeÿcreateÿWant

.ÿframeÿWant:ÿuseÿWant

.ÿ
.ÿuseÿBig

.ÿfrlinkÿ1:1ÿfid,ÿframe(Want)
ÿÿ(4,983,000ÿobservationsÿinÿframeÿdefaultÿunmatched)

.ÿquietlyÿkeepÿifÿ!mi(Want)

.ÿ
.ÿtimerÿoffÿ2

.ÿtimerÿlistÿ2
ÿÿÿ2:ÿÿÿÿÿ23.28ÿ/ÿÿÿÿÿÿÿÿ1ÿ=ÿÿÿÿÿÿ23.2790

.ÿ
.ÿeraseÿBig.dta

.ÿeraseÿWant.dta

.ÿ
.ÿexit

endÿofÿdo-file

.

I've assumed a 1:1 cardinality in the join, based upon your description, but the same principle would hold for a many-to-one relationship of firm IDs in the big dataset to your selection list.

Thank you very much Joseph. It worked perfectly.
Comment
Ama Perera

Join Date: Mar 2019

Posts: 43
#5

31 Oct 2021, 21:40

Originally posted by Nick Cox View Post

See also https://www.stata.com/support/faqs/d...s-for-subsets/ for an FAQ by Kit Baum.

Thanks Nick. Very helpful.
Comment

Announcement

alternative to keep if for a large list of firm id s

Comment

Comment

Comment

Comment