Using data not sorted (nearmrg)

Maryna Gulenko

Join Date: Apr 2016

Posts: 9
#1

Using data not sorted (nearmrg)

13 Apr 2016, 10:41

Hello,

I am trying to use nearmrg on my data files and I keep getting the same error: "Using data not sorted". As I wanted to break down the problem I used very simple test data instead of my real data and the error message still shows up. Now I have the following:

Master.dta
Group Date
A 15.01.2012
A 15.02.2012
B 15.01.2012
B 15.02.2012
C 15.01.2012
C 15.02.2012

Using.dta
Group Date SVarOfInterest1 SVarOfInterest2
A 01.01.2012 1 201
A 15.01.2012 2 202
A 03.02.2012 3 203
A 23.02.2012 4 204
B 03.01.2012 11 211
B 19.01.2012 12 212
B 03.02.2012 13 213
C 20.01.2012 21 221
C 25.01.2012 22 222
C 04.02.2012 23 223
C 03.01.2012 24 224

This is the code:

nearmrg Group using Using.dta, nearvar(Date) genmatch(SourceDate) lower
using data not sorted
r(5);

Thanks in advance!
Maryna
Tags: None
Nick Cox

Join Date: Mar 2014

Posts: 36058
#2

13 Apr 2016, 11:04

Cross-posted http://stackoverflow.com/questions/3...sorted-nearmrg
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

13 Apr 2016, 12:08

Here's a solution using rangejoin, a new user-written program available from SSC. To install, type in Stata's Command window:

Code:

ssc install rangejoin

Here's an example that uses the data you posted with a +/- 15 days window. Note that with rangejoin, more than 1 observation from the using dataset can match. And the same using observation may match more than one observation from the master. It's up to you what you want to do with the paired observations. I added code at the end to reduce to the closest matching observation by date.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str1 Group str10 Date int mdate
"A" "15.01.2012" 19007
"A" "15.02.2012" 19038
"B" "15.01.2012" 19007
"B" "15.02.2012" 19038
"C" "15.01.2012" 19007
"C" "15.02.2012" 19038
end
format %td mdate
save "master.dta", replace

* Example generated by -dataex-. To install: ssc install dataex
clear
input str1 Group str10 Date byte SVarOfInterest1 int(SVarOfInterest2 udate)
"A" "01.01.2012"  1 201 18993
"A" "15.01.2012"  2 202 19007
"A" "03.02.2012"  3 203 19026
"A" "23.02.2012"  4 204 19046
"B" "03.01.2012" 11 211 18995
"B" "19.01.2012" 12 212 19011
"B" "03.02.2012" 13 213 19026
"C" "20.01.2012" 21 221 19012
"C" "25.01.2012" 22 222 19017
"C" "04.02.2012" 23 223 19027
"C" "03.01.2012" 24 224 18995
end
format %td udate
save "using.dta", replace

* pair obs in master with any obs in the using that is +/- 15 days
use "master.dta", clear
gen low = mdate - 15
gen high = mdate + 15
rangejoin udate low high using "using.dta", by(Group)

* if desired, reduce to the obs from the using that is the closest in terms of
* dates, in case of ties, pick the earlier one
bysort Group mdate: gen diff = abs(mdate - udate)
bysort Group mdate (diff udate): keep if _n == 1

Comment

Maryna Gulenko

Join Date: Apr 2016

Posts: 9
#4

14 Apr 2016, 07:11

Thanks for this suggestion! It works perfectly with this data. In my actual data I have to match bond transaction prices with the most current bond rating. Maybe this approach will produce a lot of outcomes but it if my computer can take this amount of data, is definately worth a try, thanks!
Comment
Maryna Gulenko

Join Date: Apr 2016

Posts: 9
#5

14 Apr 2016, 09:45

Thanks again, this seems to work well for my actual data, too. I have another short question, thought. If I wanted to leave any observations that had the closest match but also those that had no match, what would I write instead of keep if _n == 1? Sorry for this basic question.
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

14 Apr 2016, 09:59

Glad to hear that this works for you. If an observation from the master has no match, it will remain in the resulting dataset, even with the keep if _n == 1 statement. For example, if you change the interval bounds to -5 and 5, some observations from the master will not find a match in the using:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str1 Group str10 Date int mdate
"A" "15.01.2012" 19007
"A" "15.02.2012" 19038
"B" "15.01.2012" 19007
"B" "15.02.2012" 19038
"C" "15.01.2012" 19007
"C" "15.02.2012" 19038
end
format %td mdate
save "master.dta", replace

* Example generated by -dataex-. To install: ssc install dataex
clear
input str1 Group str10 Date byte SVarOfInterest1 int(SVarOfInterest2 udate)
"A" "01.01.2012"  1 201 18993
"A" "15.01.2012"  2 202 19007
"A" "03.02.2012"  3 203 19026
"A" "23.02.2012"  4 204 19046
"B" "03.01.2012" 11 211 18995
"B" "19.01.2012" 12 212 19011
"B" "03.02.2012" 13 213 19026
"C" "20.01.2012" 21 221 19012
"C" "25.01.2012" 22 222 19017
"C" "04.02.2012" 23 223 19027
"C" "03.01.2012" 24 224 18995
end
format %td udate
save "using.dta", replace

* pair obs in master with any obs in the using that is +/- 15 days
use "master.dta", clear
gen low = mdate - 5
gen high = mdate + 5
rangejoin udate low high using "using.dta", by(Group)

* if desired, reduce to the obs from the using that is the closest in terms of
* dates, in case of ties, pick the earlier one
bysort Group mdate: gen diff = abs(mdate - udate)
bysort Group mdate (diff udate): keep if _n == 1
list, sepby(Group) noobs


  +------------------------------------------------------------------------------------------------------+
  | Group         Date       mdate     low    high       Date_U   SVarOf~1   SVarOf~2       udate   diff |
  |------------------------------------------------------------------------------------------------------|
  |     A   15.01.2012   15jan2012   19002   19012   15.01.2012          2        202   15jan2012      0 |
  |     A   15.02.2012   15feb2012   19033   19043                       .          .           .      . |
  |------------------------------------------------------------------------------------------------------|
  |     B   15.01.2012   15jan2012   19002   19012   19.01.2012         12        212   19jan2012      4 |
  |     B   15.02.2012   15feb2012   19033   19043                       .          .           .      . |
  |------------------------------------------------------------------------------------------------------|
  |     C   15.01.2012   15jan2012   19002   19012   20.01.2012         21        221   20jan2012      5 |
  |     C   15.02.2012   15feb2012   19033   19043                       .          .           .      . |
  +------------------------------------------------------------------------------------------------------+

As you can see, all observations from the master remain, the ones that did not match have missing values for variables that come from the using dataset.

Comment

Maryna Gulenko

Join Date: Apr 2016

Posts: 9
#7

14 Apr 2016, 10:18

Indeed, they are all saved. Somehow, with my data, the resulting file only has about 15% of the original observations.
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

14 Apr 2016, 10:40

That could happen if your master has multiple observations per Group and date. If that's the case with your real data, you need an extra identifier (variable) that can be used to uniquely identify observations in the master. This is something that you can check using:

Code:

isid Group mdate obs_id, sort

Here's a reworked example that incorporates this check and then reduces the data to one observation per original observation:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str1 Group str10 Date int mdate
"A" "15.01.2012" 19007
"A" "15.02.2012" 19038
"A" "15.02.2012" 19038
"B" "15.01.2012" 19007
"B" "15.02.2012" 19038
"C" "15.01.2012" 19007
"C" "15.02.2012" 19038
end
format %td mdate
gen obs_id = _n
isid Group mdate obs_id, sort
save "master.dta", replace

* Example generated by -dataex-. To install: ssc install dataex
clear
input str1 Group str10 Date byte SVarOfInterest1 int(SVarOfInterest2 udate)
"A" "01.01.2012"  1 201 18993
"A" "15.01.2012"  2 202 19007
"A" "03.02.2012"  3 203 19026
"A" "23.02.2012"  4 204 19046
"B" "03.01.2012" 11 211 18995
"B" "19.01.2012" 12 212 19011
"B" "03.02.2012" 13 213 19026
"C" "20.01.2012" 21 221 19012
"C" "25.01.2012" 22 222 19017
"C" "04.02.2012" 23 223 19027
"C" "03.01.2012" 24 224 18995
end
format %td udate
save "using.dta", replace

* pair obs in master with any obs in the using that is +/- 15 days
use "master.dta", clear
gen low = mdate - 5
gen high = mdate + 5
rangejoin udate low high using "using.dta", by(Group)

* if desired, reduce to the obs from the using that is the closest in terms of
* dates, in case of ties, pick the earlier one
gen diff = abs(mdate - udate)
bysort Group mdate obs_id (diff udate): keep if _n == 1
list, sepby(Group) noobs

Comment

Maryna Gulenko

Join Date: Apr 2016

Posts: 9
#9

19 Apr 2016, 08:12

You were right, I have multiple observations per Group and Date, which get lost in this step. Thanks a lot for your suggestion!
Comment

Announcement