Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Using data not sorted (nearmrg)

    Hello,

    I am trying to use nearmrg on my data files and I keep getting the same error: "Using data not sorted". As I wanted to break down the problem I used very simple test data instead of my real data and the error message still shows up. Now I have the following:

    Master.dta
    Group Date
    A 15.01.2012
    A 15.02.2012
    B 15.01.2012
    B 15.02.2012
    C 15.01.2012
    C 15.02.2012

    Using.dta
    Group Date SVarOfInterest1 SVarOfInterest2
    A 01.01.2012 1 201
    A 15.01.2012 2 202
    A 03.02.2012 3 203
    A 23.02.2012 4 204
    B 03.01.2012 11 211
    B 19.01.2012 12 212
    B 03.02.2012 13 213
    C 20.01.2012 21 221
    C 25.01.2012 22 222
    C 04.02.2012 23 223
    C 03.01.2012 24 224

    This is the code:

    nearmrg Group using Using.dta, nearvar(Date) genmatch(SourceDate) lower
    using data not sorted
    r(5);

    Thanks in advance!
    Maryna

  • #2
    Cross-posted http://stackoverflow.com/questions/3...sorted-nearmrg

    Comment


    • #3
      Here's a solution using rangejoin, a new user-written program available from SSC. To install, type in Stata's Command window:

      Code:
      ssc install rangejoin
      Here's an example that uses the data you posted with a +/- 15 days window. Note that with rangejoin, more than 1 observation from the using dataset can match. And the same using observation may match more than one observation from the master. It's up to you what you want to do with the paired observations. I added code at the end to reduce to the closest matching observation by date.

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input str1 Group str10 Date int mdate
      "A" "15.01.2012" 19007
      "A" "15.02.2012" 19038
      "B" "15.01.2012" 19007
      "B" "15.02.2012" 19038
      "C" "15.01.2012" 19007
      "C" "15.02.2012" 19038
      end
      format %td mdate
      save "master.dta", replace
      
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input str1 Group str10 Date byte SVarOfInterest1 int(SVarOfInterest2 udate)
      "A" "01.01.2012"  1 201 18993
      "A" "15.01.2012"  2 202 19007
      "A" "03.02.2012"  3 203 19026
      "A" "23.02.2012"  4 204 19046
      "B" "03.01.2012" 11 211 18995
      "B" "19.01.2012" 12 212 19011
      "B" "03.02.2012" 13 213 19026
      "C" "20.01.2012" 21 221 19012
      "C" "25.01.2012" 22 222 19017
      "C" "04.02.2012" 23 223 19027
      "C" "03.01.2012" 24 224 18995
      end
      format %td udate
      save "using.dta", replace
      
      * pair obs in master with any obs in the using that is +/- 15 days
      use "master.dta", clear
      gen low = mdate - 15
      gen high = mdate + 15
      rangejoin udate low high using "using.dta", by(Group)
      
      * if desired, reduce to the obs from the using that is the closest in terms of
      * dates, in case of ties, pick the earlier one
      bysort Group mdate: gen diff = abs(mdate - udate)
      bysort Group mdate (diff udate): keep if _n == 1

      Comment


      • #4
        Thanks for this suggestion! It works perfectly with this data. In my actual data I have to match bond transaction prices with the most current bond rating. Maybe this approach will produce a lot of outcomes but it if my computer can take this amount of data, is definately worth a try, thanks!

        Comment


        • #5
          Thanks again, this seems to work well for my actual data, too. I have another short question, thought. If I wanted to leave any observations that had the closest match but also those that had no match, what would I write instead of keep if _n == 1? Sorry for this basic question.

          Comment


          • #6
            Glad to hear that this works for you. If an observation from the master has no match, it will remain in the resulting dataset, even with the keep if _n == 1 statement. For example, if you change the interval bounds to -5 and 5, some observations from the master will not find a match in the using:

            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input str1 Group str10 Date int mdate
            "A" "15.01.2012" 19007
            "A" "15.02.2012" 19038
            "B" "15.01.2012" 19007
            "B" "15.02.2012" 19038
            "C" "15.01.2012" 19007
            "C" "15.02.2012" 19038
            end
            format %td mdate
            save "master.dta", replace
            
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input str1 Group str10 Date byte SVarOfInterest1 int(SVarOfInterest2 udate)
            "A" "01.01.2012"  1 201 18993
            "A" "15.01.2012"  2 202 19007
            "A" "03.02.2012"  3 203 19026
            "A" "23.02.2012"  4 204 19046
            "B" "03.01.2012" 11 211 18995
            "B" "19.01.2012" 12 212 19011
            "B" "03.02.2012" 13 213 19026
            "C" "20.01.2012" 21 221 19012
            "C" "25.01.2012" 22 222 19017
            "C" "04.02.2012" 23 223 19027
            "C" "03.01.2012" 24 224 18995
            end
            format %td udate
            save "using.dta", replace
            
            * pair obs in master with any obs in the using that is +/- 15 days
            use "master.dta", clear
            gen low = mdate - 5
            gen high = mdate + 5
            rangejoin udate low high using "using.dta", by(Group)
            
            * if desired, reduce to the obs from the using that is the closest in terms of
            * dates, in case of ties, pick the earlier one
            bysort Group mdate: gen diff = abs(mdate - udate)
            bysort Group mdate (diff udate): keep if _n == 1
            list, sepby(Group) noobs
            
            
              +------------------------------------------------------------------------------------------------------+
              | Group         Date       mdate     low    high       Date_U   SVarOf~1   SVarOf~2       udate   diff |
              |------------------------------------------------------------------------------------------------------|
              |     A   15.01.2012   15jan2012   19002   19012   15.01.2012          2        202   15jan2012      0 |
              |     A   15.02.2012   15feb2012   19033   19043                       .          .           .      . |
              |------------------------------------------------------------------------------------------------------|
              |     B   15.01.2012   15jan2012   19002   19012   19.01.2012         12        212   19jan2012      4 |
              |     B   15.02.2012   15feb2012   19033   19043                       .          .           .      . |
              |------------------------------------------------------------------------------------------------------|
              |     C   15.01.2012   15jan2012   19002   19012   20.01.2012         21        221   20jan2012      5 |
              |     C   15.02.2012   15feb2012   19033   19043                       .          .           .      . |
              +------------------------------------------------------------------------------------------------------+
            As you can see, all observations from the master remain, the ones that did not match have missing values for variables that come from the using dataset.

            Comment


            • #7
              Indeed, they are all saved. Somehow, with my data, the resulting file only has about 15% of the original observations.

              Comment


              • #8
                That could happen if your master has multiple observations per Group and date. If that's the case with your real data, you need an extra identifier (variable) that can be used to uniquely identify observations in the master. This is something that you can check using:

                Code:
                isid Group mdate obs_id, sort
                Here's a reworked example that incorporates this check and then reduces the data to one observation per original observation:

                Code:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input str1 Group str10 Date int mdate
                "A" "15.01.2012" 19007
                "A" "15.02.2012" 19038
                "A" "15.02.2012" 19038
                "B" "15.01.2012" 19007
                "B" "15.02.2012" 19038
                "C" "15.01.2012" 19007
                "C" "15.02.2012" 19038
                end
                format %td mdate
                gen obs_id = _n
                isid Group mdate obs_id, sort
                save "master.dta", replace
                
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input str1 Group str10 Date byte SVarOfInterest1 int(SVarOfInterest2 udate)
                "A" "01.01.2012"  1 201 18993
                "A" "15.01.2012"  2 202 19007
                "A" "03.02.2012"  3 203 19026
                "A" "23.02.2012"  4 204 19046
                "B" "03.01.2012" 11 211 18995
                "B" "19.01.2012" 12 212 19011
                "B" "03.02.2012" 13 213 19026
                "C" "20.01.2012" 21 221 19012
                "C" "25.01.2012" 22 222 19017
                "C" "04.02.2012" 23 223 19027
                "C" "03.01.2012" 24 224 18995
                end
                format %td udate
                save "using.dta", replace
                
                * pair obs in master with any obs in the using that is +/- 15 days
                use "master.dta", clear
                gen low = mdate - 5
                gen high = mdate + 5
                rangejoin udate low high using "using.dta", by(Group)
                
                * if desired, reduce to the obs from the using that is the closest in terms of
                * dates, in case of ties, pick the earlier one
                gen diff = abs(mdate - udate)
                bysort Group mdate obs_id (diff udate): keep if _n == 1
                list, sepby(Group) noobs

                Comment


                • #9
                  You were right, I have multiple observations per Group and Date, which get lost in this step. Thanks a lot for your suggestion!

                  Comment

                  Working...
                  X