Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • need a elegant sample matching

    Hi guys,

    I have a problem when matching my sample using "joinby"

    I have two data sets, one is the case data set and the other is control data set, they all have the same variables,
    and I want to find some matches(can be more than one) for each individual in case data set.
    I searched on statalist and there are several thread about this topic, I chose to use "joinby" given what I want to do .
    but the problem is when I use "joinby", STATA get extremely slow, even stuck, and two files I use both smaller than 50MB,

    WHEN I USE THE WHOLE DATASET, each CONTAINS 42 variables and no more than 10000 observation, STATA give me this:

    Code:
    op. sys. refuses to provide memory
    then I only keep four variables I need in both data sets, and stata get kind of stuck,

    below is a sketch of my data and code
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    this a sketch from case data set (privatize-sample1215)
    clear
    input str9 firmid float industrycode2 double inventory float IR
    "001584837" 2600   647   .09568175
    "001584837" 2600  1037    .1985069
    "001584837" 2600  1990    .0987593
    "001584837" 2600  3570   .04087942
    "001596563" 1300  1200   .14754704
    "001596563" 1300  3942   .07206186
    "001596563" 1300 14566   .20515493
    "001596563" 1300  4287    .0958846
    "001596563" 1300 14287     .906864
    "001596563" 1300  1206   .08642064
    "001596694" 1500 13239    .9516929
    "001596694" 1500 15056   .24014674
    "001596694" 1500 16560    .4221582
    "001596694" 1500 17328    .5035599
    "001596694" 1500 14320    .5021566
    "00159885X" 1500   101   .01288594
    "00159885X" 1500   372  .015686937
    "00159885X" 1500   517  .016111441
    "00159885X" 1500  2457   .08355153
    "00159885X" 1500     0 .0025296365
    end
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    this a sketch from control data set (privatize-control1009)
    clear
    input str9 firmid float industrycode2 double inventory float IR
    "016016226" 3100 1034    .6454432
    "016016226" 3100 1012   1.1269488
    "016016226" 3100  896     .688701
    "016016226" 3100  653     .449415
    "016016226" 3100 1685    .6443595
    "016016226" 3100 2036    .7181658
    "100003401" 3300 1091    .4313958
    "100003401" 3300  945    .2691541
    "100003401" 3300 1134    .2837838
    "100003401" 3300 1470   .23374145
    "101109750" 3900 2608    .9709606
    "101109750" 3900 1952    .4393428
    "101109750" 3900 2806   .54034275
    "101109750" 3900 4310     .611521
    "101131499" 4100  296     .186398
    "101131499" 4100  228   .11149144
    "101131499" 4100  220   .15182884
    "101131499" 4100    0 .0025296365
    "101133400" 3600  450   .11792453
    "101133400" 3600  507   .26120555
    end
    Code:
    clear
    use "E:\Research\privatization\privatize-sample1215.dta"
    
    joinby industrycode2 using "E:\Research\privatization\privatize-control1009.dta"
    any suggestions?
    Last edited by 高佳; 28 Jun 2016, 07:07.

  • #2
    I can think of two approaches.

    1. Break up each of your data sets into subsets defined by groups of industrycode2. For example, the first batch might consist of only observations with industrycode2 having a leading digit of 0, the second only those with industrycode2's leading digit 1, etc. Then run -joinby- separately on each batch, and append all the results together.

    2. It appears that industrycode2, although for most purposes best handled as a string variable since it is not used for calculations, is actually a 9-digit number. If that is so, you could convert it to a numeric variable stored as a -long- or -double- (-float- will not have enough precision). Then you could use -rangejoin-, a new program written by Robert Picard, and available from SSC. -rangejoin- is faster and uses less memory than joinby. If you take this approach, I would recommend that you convert the industrycode2 back to a string variable once the join is done to prevent accidentally doing something that changes its data type to float and loses information.

    Comment


    • #3
      Along with Clyde's helpful approaches, let me step back and ask what your objective is. I notice that within each industrycode2 you have multiple observations for each firmid. Similar questions on Statalist in the past have sought a list of each distinct combination of sample firmid and control firmid within the same industrycode2. If that is your objective, you should create temporary datasets containing just firmid and industrycode2, and then use the duplicates command to reduce each dataset to distinct observations, and then use joinby on these reduced datasets.

      Comment


      • #4
        thank you two Clyde and William for helpful suggestions.

        for Clyde, what you suggested seems good, I will try both of them, and post my result here once I get them for others to use.

        for William, I sorry for not stating my objective very clearly. You are right, what I want to do is first match cases with controls(one cases may have multiple matches, and one control can be matched to multiple cases) by industrycode2, one thing I didn't mention before is that for each firmid in the case data set have observations in year 1998,2000,2001,2002,2003 , there is a particular "event" occurred during in one of these five years. next step I need to do after I matched the cases with controls is to keep those controls which have an IR in the range of 90%-110% at the year the particular "event" occurred of these controls matched case.

        Comment


        • #5
          And is it the case that each observation in the case and control datasets also includes the year variable? That makes a HUGE difference in what you're doing.

          If indeed both the case and control datasets also include a year variable, then it seems possible to me that what you want may be something like
          Code:
          joinby industrycode2 year
          which would substantially reduce the size of the resulting dataset. As it stands, you are pairing case data from 1998 with control data from each of the 5 years, which is unlikely to be what you want and ultimately makes your output dataset 5 times the size it needs to be.

          Comment


          • #6
            thanks you two, and for others to know, the RANGEJOIN mentioned by Clyde earlier in this post works and it's very fast for my problem.

            now I complete the first step of my matching process, but here comes another problem, every firm in my case data set be matched with several firms in the control data set, and every case firm has a particular year. I need at least one of those matched firms in the control data set matched with the given case firm have at least one more year of observation.

            example: A, firm in the case data set has a particular year 2000, firm B,C in the control data set matched with firm A , each has observation in year 1999, 2000, 2001 and 1998 1999 2000, since firm C has no observation after 2000, so I only keep firm B as a match for A.
            Code:
            * Example generated by -dataex-. To install: ssc install dataex variables with _U in it is from control data set.
            clear
            input float year str9 firmid float industrycode2 str9 firmid_U float year_U
            2003 "101101628" 3300 "715910825" 2003
            2003 "101101628" 3300 "160236012" 2003
            2003 "101101628" 3300 "180963807" 2003
            1999 "101105573" 1300 "159622442" 1999
            1999 "101105573" 1300 "709600222" 1999
            1999 "101105573" 1300 "835542078" 1999
            1999 "101105573" 1300 "169441434" 1999
            1999 "101105573" 1300 "146019724" 1999
            1999 "101105573" 1300 "158664536" 1999
            1999 "101105573" 1300 "113751150" 1999
            1999 "101105573" 1300 "176927619" 1999
            1999 "101105573" 1300 "138894721" 1999
            1999 "101105573" 1300 "20245131X" 1999
            1999 "101105573" 1300 "898221693" 1999
            1999 "101105573" 1300 "165265752" 1999
            1999 "101105573" 1300 "215070508" 1999
            1999 "101105573" 1300 "177169346" 1999
            1999 "101105573" 1300 "138292827" 1999
            1999 "101105573" 1300 "860892619" 1999
            1999 "101105573" 1300 "215300908" 1999
            1999 "101105573" 1300 "109790675" 1999
            1999 "101105573" 1300 "175990106" 1999
            1999 "101105573" 1300 "101164960" 1999
            1999 "101105573" 1300 "700959167" 1999
            1999 "101105573" 1300 "113690669" 1999
            1999 "101105573" 1300 "172913869" 1999
            1999 "101105573" 1300 "161772052" 1999
            1999 "101105573" 1300 "162640537" 1999
            1999 "101105573" 1300 "614140155" 1999
            1999 "101105573" 1300 "168130108" 1999
            end

            and below is the code for first step matching
            Code:
            clear
            use "E:\Research\privatization\privatize-sample1215.dta"
            by firmid: keep if year ==priva_year
            gen lower_IR = IR*0.9
            gen upper_IR = IR*1.1
            // matching
            rangejoin IR lower_IR upper_IR using "E:\Research\privatization\privatize-control1009.dta", by(industrycode2) // industry and IR match
            bys firmid: keep if year ==year_U

            Comment

            Working...
            X