Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • alternative to keep if for a large list of firm id s

    Hi all,

    I have a list of firm ids that contains 17,000 firm ids. I have a big dataset which contains over 5 million observations of an even bigger set of firm ids (Dataset A). For dataset A, I want to only keep the firm ids that are mentioned in my list of firm ids (or drop the firm ids that aren't in the list). Can you please help me with this?

    Thank you.

  • #2
    You can put into a separate dataset your list of firm IDs that you want, and then use -merge- to select those firms from the big dataset. I illustrate the basic method below. (Begin at the "Begin here" comment; the top part is just to create a dataset of five million observations, called Big, and a subset of that to represent your list of seventeen thousand members. If your form ID's variable names differ between the dataset and your list, then you could use -frames- to do the selection. It's a little slower on unsorted datasets, but it's more convenient if you want to keep the full dataset handy and just operate on the selected subset. I've illustrated its use, too, below.

    .ÿ
    .ÿversionÿ17.0

    .ÿ
    .ÿclearÿ*

    .ÿ
    .ÿsetÿseedÿ1354377003

    .ÿ
    .ÿsetÿobsÿ5000000
    Numberÿofÿobservationsÿ(_N)ÿwasÿ0,ÿnowÿ5,000,000.

    .ÿ
    .ÿ//ÿFirmÿID
    .ÿgenerateÿstrÿfidÿ=ÿstring(runiform(),ÿ"%16H")

    .ÿisidÿfid

    .ÿ
    .ÿ//ÿData
    .ÿforvaluesÿiÿ=ÿ1/10ÿ{
    ÿÿ2.ÿÿÿÿÿÿÿÿÿgenerateÿdoubleÿvar`i'ÿ=ÿruniform()
    ÿÿ3.ÿ}

    .ÿquietlyÿsaveÿBig

    .ÿ
    .ÿsortÿvar1

    .ÿquietlyÿkeepÿinÿ1/17000

    .ÿkeepÿfid

    .ÿ
    .ÿquietlyÿsaveÿWant

    .ÿ
    .ÿclearÿ*

    .ÿ
    .ÿ*
    .ÿ*ÿBeginÿhere
    .ÿ*
    .ÿtimerÿclearÿ1

    .ÿtimerÿonÿ1

    .ÿ
    .ÿuseÿWant

    .ÿmergeÿ1:1ÿfidÿusingÿBig,ÿassert(matchÿusing)ÿkeep(match)

    ÿÿÿÿResultÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿNumberÿofÿobs
    ÿÿÿÿ-----------------------------------------
    ÿÿÿÿNotÿmatchedÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ0
    ÿÿÿÿMatchedÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ17,000ÿÿ(_merge==3)
    ÿÿÿÿ-----------------------------------------

    .ÿ
    .ÿtimerÿoffÿ1

    .ÿtimerÿlistÿ1
    ÿÿÿ1:ÿÿÿÿÿ17.12ÿ/ÿÿÿÿÿÿÿÿ1ÿ=ÿÿÿÿÿÿ17.1200

    .ÿ
    .ÿclearÿ*

    .ÿ
    .ÿtimerÿclearÿ2

    .ÿtimerÿonÿ2

    .ÿ
    .ÿframeÿcreateÿWant

    .ÿframeÿWant:ÿuseÿWant

    .ÿ
    .ÿuseÿBig

    .ÿfrlinkÿ1:1ÿfid,ÿframe(Want)
    ÿÿ(4,983,000ÿobservationsÿinÿframeÿdefaultÿunmatched)

    .ÿquietlyÿkeepÿifÿ!mi(Want)

    .ÿ
    .ÿtimerÿoffÿ2

    .ÿtimerÿlistÿ2
    ÿÿÿ2:ÿÿÿÿÿ23.28ÿ/ÿÿÿÿÿÿÿÿ1ÿ=ÿÿÿÿÿÿ23.2790

    .ÿ
    .ÿeraseÿBig.dta

    .ÿeraseÿWant.dta

    .ÿ
    .ÿexit

    endÿofÿdo-file


    .


    I've assumed a 1:1 cardinality in the join, based upon your description, but the same principle would hold for a many-to-one relationship of firm IDs in the big dataset to your selection list.

    Comment


    • #3
      See also https://www.stata.com/support/faqs/d...s-for-subsets/ for an FAQ by Kit Baum.

      Comment


      • #4
        Originally posted by Joseph Coveney View Post
        You can put into a separate dataset your list of firm IDs that you want, and then use -merge- to select those firms from the big dataset. I illustrate the basic method below. (Begin at the "Begin here" comment; the top part is just to create a dataset of five million observations, called Big, and a subset of that to represent your list of seventeen thousand members. If your form ID's variable names differ between the dataset and your list, then you could use -frames- to do the selection. It's a little slower on unsorted datasets, but it's more convenient if you want to keep the full dataset handy and just operate on the selected subset. I've illustrated its use, too, below.

        .ÿ
        .ÿversionÿ17.0

        .ÿ
        .ÿclearÿ*

        .ÿ
        .ÿsetÿseedÿ1354377003

        .ÿ
        .ÿsetÿobsÿ5000000
        Numberÿofÿobservationsÿ(_N)ÿwasÿ0,ÿnowÿ5,000,000.

        .ÿ
        .ÿ//ÿFirmÿID
        .ÿgenerateÿstrÿfidÿ=ÿstring(runiform(),ÿ"%16H")

        .ÿisidÿfid

        .ÿ
        .ÿ//ÿData
        .ÿforvaluesÿiÿ=ÿ1/10ÿ{
        ÿÿ2.ÿÿÿÿÿÿÿÿÿgenerateÿdoubleÿvar`i'ÿ=ÿruniform()
        ÿÿ3.ÿ}

        .ÿquietlyÿsaveÿBig

        .ÿ
        .ÿsortÿvar1

        .ÿquietlyÿkeepÿinÿ1/17000

        .ÿkeepÿfid

        .ÿ
        .ÿquietlyÿsaveÿWant

        .ÿ
        .ÿclearÿ*

        .ÿ
        .ÿ*
        .ÿ*ÿBeginÿhere
        .ÿ*
        .ÿtimerÿclearÿ1

        .ÿtimerÿonÿ1

        .ÿ
        .ÿuseÿWant

        .ÿmergeÿ1:1ÿfidÿusingÿBig,ÿassert(matchÿusing)ÿkeep(match)

        ÿÿÿÿResultÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿNumberÿofÿobs
        ÿÿÿÿ-----------------------------------------
        ÿÿÿÿNotÿmatchedÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ0
        ÿÿÿÿMatchedÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ17,000ÿÿ(_merge==3)
        ÿÿÿÿ-----------------------------------------

        .ÿ
        .ÿtimerÿoffÿ1

        .ÿtimerÿlistÿ1
        ÿÿÿ1:ÿÿÿÿÿ17.12ÿ/ÿÿÿÿÿÿÿÿ1ÿ=ÿÿÿÿÿÿ17.1200

        .ÿ
        .ÿclearÿ*

        .ÿ
        .ÿtimerÿclearÿ2

        .ÿtimerÿonÿ2

        .ÿ
        .ÿframeÿcreateÿWant

        .ÿframeÿWant:ÿuseÿWant

        .ÿ
        .ÿuseÿBig

        .ÿfrlinkÿ1:1ÿfid,ÿframe(Want)
        ÿÿ(4,983,000ÿobservationsÿinÿframeÿdefaultÿunmatched)

        .ÿquietlyÿkeepÿifÿ!mi(Want)

        .ÿ
        .ÿtimerÿoffÿ2

        .ÿtimerÿlistÿ2
        ÿÿÿ2:ÿÿÿÿÿ23.28ÿ/ÿÿÿÿÿÿÿÿ1ÿ=ÿÿÿÿÿÿ23.2790

        .ÿ
        .ÿeraseÿBig.dta

        .ÿeraseÿWant.dta

        .ÿ
        .ÿexit

        endÿofÿdo-file


        .


        I've assumed a 1:1 cardinality in the join, based upon your description, but the same principle would hold for a many-to-one relationship of firm IDs in the big dataset to your selection list.
        Thank you very much Joseph. It worked perfectly.

        Comment


        • #5
          Originally posted by Nick Cox View Post
          Thanks Nick. Very helpful.

          Comment

          Working...
          X