  • Comparing each observation from one variable to a few hundred thousand in another

    Hi all,

    I am wondering if anyone has any ideas on how I can compare each observation from one variable to every observation in another variable. Essentially, imagine you have 10 observations in one variable A, and then another 1000 in another variable B. For each observation in variable A, I would like to compare it to every value in variable B. In reality, I have over 300,000 observations for each variable, so the computation becomes cumbersome quickly.

    I have currently figured out the problem in Python, but it takes over 2 minutes to run through 300 observations (roughly 10 hours for the whole dataset). The algorithm is straightforward enough: fix the first observation of variable 1 and compare it to every observation in variable 2, then fix the second observation and compare it to every observation in variable 2, and so on (a rough sketch of this brute-force loop is at the end of this post). Is there anything a bit more sophisticated in Stata?

    I am currently using Stata/IC 15 on macOS.
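
    Translated to Stata, the brute-force approach would be something like the sketch below (A, B, the new variable n_close, and the 0.01 tolerance are all placeholders for whatever comparison is intended); it is O(N x M), which is why it bogs down at 300,000 observations per variable:

    Code:
    * Brute-force sketch: for each observation of A, count the values of B
    * that lie within a tolerance. One full pass over the data per observation of A.
    local tol 0.01
    gen long n_close = .
    forvalues i = 1/`=_N' {
        if !missing(A[`i']) {
            quietly count if abs(B - A[`i']) <= `tol'
            quietly replace n_close = r(N) in `i'
        }
    }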

  • #2
    please read the FAQ and show us, using -dataex-, a sample of your data

    my guess is that the fastest way is to split the data into different data sets and try a merge; however, a lot depends on what your data actually look like

    • #3
      Following on Rich's advice, please be clearer about your objective. What is your objective? When you compare the 7th observation of A to the 23rd observation of B, what do you intend to do if they are equal? What do you intend to do if they are not equal?

      • #4
        To be more clear, the first dataset looks like:

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        lat_pV long_pV id_house
        53.63084 -113.42851   1
        53.61684 -113.50037   2
        53.50153  -113.6725   3
        53.61197 -113.37948   4
        53.63088 -113.45086   5
         53.5078 -113.50989   6
        53.53728 -113.49847   7
        end
        Then we also have these trees:

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        lat_t long_t id_tree
        53.42834 -113.59988   1
         53.4733 -113.67162   2
        53.48313   -113.645   3
        53.47302 -113.67377   4
         53.4744  -113.6316   5
        53.52736 -113.48496   6
        53.42819  -113.5982   7
        53.42513   -113.625   8
        53.42472 -113.43206   9
        53.63993   -113.509  10
        53.47438 -113.63228  11
        53.64042 -113.50605  12
         53.6399  -113.5102  13
        53.52481 -113.41901  14
        53.48009 -113.60895  15
         53.4246  -113.4201  16
        53.42474 -113.43127  17
        53.47242 -113.67254  18
         53.6326 -113.46173  19
        53.48316 -113.64454  20
        53.44591  -113.5503  21
         53.6321 -113.41103  22
        53.64055 -113.50427  23
        53.63245 -113.41138  24
        53.63988  -113.5108  25
        53.63943 -113.50077  26
        53.63988 -113.51573  27
         53.4746 -113.63153  28
        53.47489 -113.63205  29
        53.42503  -113.6213  30
        53.64054 -113.50667  31
        53.64027  -113.5075  32
        53.63268 -113.46015  33
        53.63978 -113.51205  34
        53.47506 -113.63287  35
        53.47413 -113.63136  36
        end
        For each line in the first set, we need to compare the latitude and longitude to every latitude and longitude in the second, using something simple like the absolute difference between the two. Ideally, the number of trees whose coordinate differences fall below a certain threshold (say 0.01) is counted and then appended as a new column to the first dataset. So after running once, the first row of the first dataset would contain an additional number `x' denoting the number of trees the program found to be within the 0.01 threshold. The objective is to add a column to the first dataset giving the number of trees that fell below the threshold on the difference in coordinates (or, alternatively, just a simple binary indicating that such a tree exists).

        • #5
          This will do it:
          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input lat_pV long_pV id_house
          53.63084 -113.42851   1
          53.61684 -113.50037   2
          53.50153  -113.6725   3
          53.61197 -113.37948   4
          53.63088 -113.45086   5
           53.5078 -113.50989   6
          53.53728 -113.49847   7
          end
          tempfile houses
          save `houses'
          
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input lat_t long_t id_tree
          53.42834 -113.59988   1
           53.4733 -113.67162   2
          53.48313   -113.645   3
          53.47302 -113.67377   4
           53.4744  -113.6316   5
          53.52736 -113.48496   6
          53.42819  -113.5982   7
          53.42513   -113.625   8
          53.42472 -113.43206   9
          53.63993   -113.509  10
          53.47438 -113.63228  11
          53.64042 -113.50605  12
           53.6399  -113.5102  13
          53.52481 -113.41901  14
          53.48009 -113.60895  15
           53.4246  -113.4201  16
          53.42474 -113.43127  17
          53.47242 -113.67254  18
           53.6326 -113.46173  19
          53.48316 -113.64454  20
          53.44591  -113.5503  21
           53.6321 -113.41103  22
          53.64055 -113.50427  23
          53.63245 -113.41138  24
          53.63988  -113.5108  25
          53.63943 -113.50077  26
          53.63988 -113.51573  27
           53.4746 -113.63153  28
          53.47489 -113.63205  29
          53.42503  -113.6213  30
          53.64054 -113.50667  31
          53.64027  -113.5075  32
          53.63268 -113.46015  33
          53.63978 -113.51205  34
          53.47506 -113.63287  35
          53.47413 -113.63136  36
          end
          tempfile trees
          save `trees'
          
          use `houses', clear
          isid id_house
          des using `trees'
          local threshold 0.01
          gen high = lat_pV + `threshold'
          gen low = lat_pV - `threshold'
          rangejoin lat_t low high using `trees'
          keep if abs(long_pV - long_t) <= `threshold'
          
          collapse (count) n_trees = id_tree (first) lat_pV long_pV, by(id_house)
          
          //    BRING BACK THE ORIGINAL DATA, DROP DUPLICATES AND SET COUNT TO 0 WHERE APPROPRIATE
          append using `houses'
          by id_house (n_trees), sort: keep if _n == 1
          replace n_trees = 0 if missing(n_trees)
          Notes:

          1. To use this code you must install -rangejoin-, by Robert Picard, available from SSC. To use -rangejoin- you must also have -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, also from SSC. (The installation commands are shown after these notes.)

          2. The 0.01 threshold you propose is very stringent; in your example data, only one house has any trees that close, and that one has only one. But the code does work more generally, as you can see by changing the threshold to, say, 0.1.

          3. Somehow you mangled your -dataex- output when bringing it here. The -input- command was lost. Not sure how that would have happened.

          Note: Although the -rangejoin- command is very well written and optimizes both speed and memory use to some extent, if your data sets are very large you may run out of memory, or it may take a long time. If the latter, my best advice is patience. If the former, you can break the house data set up into smaller subsets, run each of them separately (against the complete tree data set), and then append all the results together.
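
          The packages in note 1 are installed once, from within Stata:

          Code:
          * one-time setup: user-written commands from SSC
          ssc install rangestat
          ssc install rangejoin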

          • #6
            Thank you, Clyde, for the response. Each of the datasets I am using contains over 300K observations, so the threshold is more suitable for the full dataset. I'll get to work implementing the method you've shown here.

            • #7
              Given that latitude and longitude are involved here, I would think that -geonear- (by Robert Picard, available at ssc) could be helpful. It will find the N closest neighbors in file B for observations in file A, or all the neighbors within some specified distance. I have used it to find the 5 closest neighbors (lat/long) for each of 30,000 observations, with those neighbors to be drawn from a file of 400,000 neighbors, and it did the job in a few minutes.

              • #8
                With 300K observations in each data set, I think you are likely to run out of memory trying to run the code shown in #5, unless you have a rather gargantuan computer. I suggest splitting the house data into subsets and running each separately, then appending the results. The code below illustrates an efficient way of doing this. Given that the example data set is small, I illustrate it for 3 subsets, but just change the definition of local macro n_subsets to something more appropriate (maybe 100).

                Code:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input lat_pV long_pV id_house
                53.63084 -113.42851   1
                53.61684 -113.50037   2
                53.50153  -113.6725   3
                53.61197 -113.37948   4
                53.63088 -113.45086   5
                 53.5078 -113.50989   6
                53.53728 -113.49847   7
                end
                tempfile houses
                save `houses'
                
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input lat_t long_t id_tree
                53.42834 -113.59988   1
                 53.4733 -113.67162   2
                53.48313   -113.645   3
                53.47302 -113.67377   4
                 53.4744  -113.6316   5
                53.52736 -113.48496   6
                53.42819  -113.5982   7
                53.42513   -113.625   8
                53.42472 -113.43206   9
                53.63993   -113.509  10
                53.47438 -113.63228  11
                53.64042 -113.50605  12
                 53.6399  -113.5102  13
                53.52481 -113.41901  14
                53.48009 -113.60895  15
                 53.4246  -113.4201  16
                53.42474 -113.43127  17
                53.47242 -113.67254  18
                 53.6326 -113.46173  19
                53.48316 -113.64454  20
                53.44591  -113.5503  21
                 53.6321 -113.41103  22
                53.64055 -113.50427  23
                53.63245 -113.41138  24
                53.63988  -113.5108  25
                53.63943 -113.50077  26
                53.63988 -113.51573  27
                 53.4746 -113.63153  28
                53.47489 -113.63205  29
                53.42503  -113.6213  30
                53.64054 -113.50667  31
                53.64027  -113.5075  32
                53.63268 -113.46015  33
                53.63978 -113.51205  34
                53.47506 -113.63287  35
                53.47413 -113.63136  36
                end
                tempfile trees
                save `trees'
                
                use `houses', clear
                isid id_house
                
                //    IDENTIFY SMALLER DATA SETS TO WORK ON
                local n_subsets 3
                gen long subset = floor(_n/`n_subsets')
                gen treefile = `"`trees'"'
                
                
                capture program drop one_subset
                program define one_subset
                    local threshold 0.01
                    gen high = lat_pV + `threshold'
                    gen low = lat_pV - `threshold'
                   local trees = treefile[1]
                    rangejoin lat_t low high using `trees'
                    keep if abs(long_pV - long_t) <= `threshold'
                    if _N > 0 {
                        collapse (count) n_trees = id_tree (first) lat_pV long_pV, ///
                            by(id_house)
                    }
                    exit
                end
                
                runby one_subset, by(subset) status
                
                //    BRING BACK THE ORIGINAL DATA, DROP DUPLICATES AND SET COUNT TO 0 WHERE APPROPRIATE
                append using `houses'
                by id_house (n_trees), sort: keep if _n == 1
                replace n_trees = 0 if missing(n_trees)
                Notes:

                1. -runby- is written by Robert Picard and me, and is available from SSC.

                2. The dance with the local macro trees and the variable treefile is probably not necessary in your actual situation. I presume that your house and tree data sets are permanent Stata data sets, not tempfiles. The commands -gen treefile = ...- and -local trees = treefile[1]- were just a kludgy way of passing the name of the trees tempfile (a local macro) down to a program that does not accept arguments or options. So you can delete them and replace the `trees' in the -rangejoin- command with the actual name of your tree data set.
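
                For example, if the tree coordinates live in a permanent file (the name trees.dta below is just an assumption), the program reduces to:

                Code:
                capture program drop one_subset
                program define one_subset
                    local threshold 0.01
                    gen high = lat_pV + `threshold'
                    gen low = lat_pV - `threshold'
                    // the tree file is named directly, so no treefile variable is needed
                    rangejoin lat_t low high using trees.dta
                    keep if abs(long_pV - long_t) <= `threshold'
                    if _N > 0 {
                        collapse (count) n_trees = id_tree (first) lat_pV long_pV, ///
                            by(id_house)
                    }
                    exit
                end

                The -gen treefile = ...- line in the main code is then dropped as well.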

                Added: Crossed with #7. As I myself do not work with geographic data this way, I didn't think about -geonear-. It may well be better optimized for the task at hand.
                Last edited by Clyde Schechter; 01 Mar 2019, 16:47.

                • #9
                  Mike Lacy suggested that I use the -cross- command for a similar task.

                  • #10
                    -cross- could work, but -geonear- is likely to be a *lot* faster, as the latter involves a "divide and conquer" algorithm that avoids comparing all possible pairs.

                    • #11
                      shem shen
                      Actually, with 300K observations in each data set, -cross- would produce a data set with 300,000 x 300,000 = 9 x 10^10 observations, just under 10^11--typical computers won't have enough available memory for that. Moreover, even on a sufficiently large computer, -cross- would take a very long time. -geonear- or the -rangejoin- based approach I have shown are far more efficient and feasible.

                      -cross- would be workable with a smaller data set. Even then, the other approaches would be faster.
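
                      For what it is worth, a -cross- based version of the count would look roughly like the sketch below (the file names houses.dta and trees.dta are assumptions, and at 300K observations apiece the crossed data set would be far too large to hold in memory):

                      Code:
                      * pair every house with every tree, then count pairs within the threshold
                      * feasible only when the product of the two data set sizes is small
                      use houses.dta, clear
                      cross using trees.dta
                      gen byte close = abs(lat_pV - lat_t) <= 0.01 & abs(long_pV - long_t) <= 0.01
                      collapse (sum) n_trees = close (first) lat_pV long_pV, by(id_house)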

                      • #12
                        Originally posted by Mike Lacy:
                        Given that latitude and longitude are involved here, I would think that -geonear- (by Robert Picard, available at ssc) could be helpful. It will find the N closest neighbors in file B for observations in file A, or all the neighbors within some specified distance. I have used it to find the 5 closest neighbors (lat/long) for each of 30,000 observations, with those neighbors to be drawn from a file of 400,000 neighbors, and it did the job in a few minutes.
                        Thanks Mike - I'll take a look into this method as well.

                        • #13
                          Looks like you have a dataset of house locations and you want to count the number of trees within a certain distance from each house. The first thing to know is that a difference of .01 degree of latitude is not the same as a difference of .01 degree of longitude in terms of distance. Lines of longitude converge at the poles, while lines of latitude are parallel. You can test this using geodist (from SSC); the arguments are lat1 lon1 lat2 lon2:

                          Code:
                          . geodist 53.50 -113.67 53.51 -113.67
                          WGS 1984 ellipsoid(6378137,298.257223563) distance = 1.1129575 km
                          
                          . geodist 53.50 -113.67 53.50 -113.68
                          WGS 1984 ellipsoid(6378137,298.257223563) distance = .66359054 km
                          As Mike has pointed out, geonear (from SSC) is the most efficient tool to calculate the measure you want. It would go something like this (using the data example posted in #4 with a target distance of 3km):

                          Code:
                          use "statalist_houses.dta", clear
                          geonear id_house lat_pV long_pV using "statalist_trees.dta", ///
                              neighbor(id_tree lat_t long_t) long within(3)
                          bysort id_house (km_to_id_tree): gen wanted = _N
                          list, sepby(id_house)
                          by id_house: keep if _n == 1
                          and the list command results:
                          Code:
                          . list, sepby(id_house)
                          
                               +-----------------------------------------+
                               | id_house   id_tree   km_to_i~e   wanted |
                               |-----------------------------------------|
                            1. |        1        24   1.1439413        4 |
                            2. |        1        22   1.1609611        4 |
                            3. |        1        33   2.0961361        4 |
                            4. |        1        19   2.1989861        4 |
                               |-----------------------------------------|
                            5. |        2        26   2.5121058       10 |
                            6. |        2        10    2.629932       10 |
                            7. |        2        13   2.6448687       10 |
                            8. |        2        32   2.6474065       10 |
                            9. |        2        12   2.6484795       10 |
                           10. |        2        23   2.6488013       10 |
                           11. |        2        25   2.6528447       10 |
                           12. |        2        34   2.6643258       10 |
                           13. |        2        31   2.6679597       10 |
                           14. |        2        27   2.7550948       10 |
                               |-----------------------------------------|
                           15. |        3         3   2.7378414        2 |
                           16. |        3        20   2.7558569        2 |
                               |-----------------------------------------|
                           17. |        4        22   3.0563427        1 |
                               |-----------------------------------------|
                           18. |        5        33   .64459365        4 |
                           19. |        5        19   .74193154        4 |
                           20. |        5        24   2.6091402        4 |
                           21. |        5        22   2.6294348        4 |
                               |-----------------------------------------|
                           22. |        6         6   2.7285711        1 |
                               |-----------------------------------------|
                           23. |        7         6   1.4190729        1 |
                               +-----------------------------------------+
                          geonear should be able to handle the whole 300K * 300K problem in one go. The only possible problem is if the target distance picks up too many trees per house (say > 1000 trees). If that's an issue, you can split the house dataset into smaller parts and do each separately. You can even automate this with runby (from SSC).

                          • #14
                            Originally posted by Robert Picard:
                            Looks like you have a dataset of house locations and you want to count the number of trees within a certain distance from each house. The first thing to know is that a difference of .01 degree of latitude is not the same as a difference of .01 degree of longitude in terms of distance.
                            Thanks Robert - that output looks like a very elegant solution to the problem.

                            • #15
                              You may want to take a look at this post for a fully worked out example of how to break down the problem into smaller groups using runby (from SSC).
