  • Comparing each observation from one variable to a few hundred thousand in another

    Hi all,

    I am wondering if anyone has any ideas on how I can compare each observation from one variable to every observation in another variable. Essentially, imagine you have 10 observations in one variable A, and then another 1000 in another variable B. For each observation in variable A, I would like to compare it to every value in variable B. In reality, I have over 300,000 observations for each variable, so the computation becomes cumbersome quickly.

    I have currently figured out the problem in Python, but it takes over 2 minutes to run through 300 observations (roughly 10 hours for the whole dataset). The algorithm is straightforward enough: fix the first observation of variable 1 and compare it to every observation in variable 2, then fix the second observation and compare it to every observation in variable 2, and so on (a rough sketch of this brute-force loop is at the end of this post). Is there anything a bit more sophisticated in Stata?

    I am currently using Stata/IC 15 on macOS.
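
    Translated to Stata, the brute-force approach would be something like the sketch below (A, B, the new variable n_close, and the 0.01 tolerance are all placeholders for whatever comparison is intended); it is O(N x M), which is why it bogs down at 300,000 observations per variable:

    Code:
    * Brute-force sketch: for each observation of A, count the values of B
    * that lie within a tolerance. One full pass over the data per observation of A.
    local tol 0.01
    gen long n_close = .
    forvalues i = 1/`=_N' {
        if !missing(A[`i']) {
            quietly count if abs(B - A[`i']) <= `tol'
            quietly replace n_close = r(N) in `i'
        }
    }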

  • #2
    please read the FAQ and show us, using -dataex-, a sample of your data

    my guess is that the fastest way is to split the data into different data sets and try a merge; however, a lot depends on what your data actually look like

    • #3
      Following on Rich's advice, please be clearer about your objective. What is your objective? When you compare the 7th observation of A to the 23rd observation of B, what do you intend to do if they are equal? What do you intend to do if they are not equal?

      • #4
        To be more clear, the first dataset looks like:

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        lat_pV long_pV id_house
        53.63084 -113.42851   1
        53.61684 -113.50037   2
        53.50153  -113.6725   3
        53.61197 -113.37948   4
        53.63088 -113.45086   5
         53.5078 -113.50989   6
        53.53728 -113.49847   7
        end
        Then we also have these trees:

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        lat_t long_t id_tree
        53.42834 -113.59988   1
         53.4733 -113.67162   2
        53.48313   -113.645   3
        53.47302 -113.67377   4
         53.4744  -113.6316   5
        53.52736 -113.48496   6
        53.42819  -113.5982   7
        53.42513   -113.625   8
        53.42472 -113.43206   9
        53.63993   -113.509  10
        53.47438 -113.63228  11
        53.64042 -113.50605  12
         53.6399  -113.5102  13
        53.52481 -113.41901  14
        53.48009 -113.60895  15
         53.4246  -113.4201  16
        53.42474 -113.43127  17
        53.47242 -113.67254  18
         53.6326 -113.46173  19
        53.48316 -113.64454  20
        53.44591  -113.5503  21
         53.6321 -113.41103  22
        53.64055 -113.50427  23
        53.63245 -113.41138  24
        53.63988  -113.5108  25
        53.63943 -113.50077  26
        53.63988 -113.51573  27
         53.4746 -113.63153  28
        53.47489 -113.63205  29
        53.42503  -113.6213  30
        53.64054 -113.50667  31
        53.64027  -113.5075  32
        53.63268 -113.46015  33
        53.63978 -113.51205  34
        53.47506 -113.63287  35
        53.47413 -113.63136  36
        end
        For each line in the first set, we need to compare the latitude and longitude to every latitude and longitude in the second, using something simple like the absolute difference between the two. Ideally, the number of trees whose coordinate differences fall below a certain threshold (say 0.01) is counted and then appended as a new column to the first dataset. So after running once, the first row of the first dataset would contain an additional number `x' denoting the number of trees the program found to be within the 0.01 threshold. The objective is to add a column to the first dataset giving the number of trees that fell below the threshold on the difference in coordinates (or, alternatively, just a simple binary indicating that such a tree exists).

        • #5
          This will do it:
          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input lat_pV long_pV id_house
          53.63084 -113.42851   1
          53.61684 -113.50037   2
          53.50153  -113.6725   3
          53.61197 -113.37948   4
          53.63088 -113.45086   5
           53.5078 -113.50989   6
          53.53728 -113.49847   7
          end
          tempfile houses
          save `houses'
          
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input lat_t long_t id_tree
          53.42834 -113.59988   1
           53.4733 -113.67162   2
          53.48313   -113.645   3
          53.47302 -113.67377   4
           53.4744  -113.6316   5
          53.52736 -113.48496   6
          53.42819  -113.5982   7
          53.42513   -113.625   8
          53.42472 -113.43206   9
          53.63993   -113.509  10
          53.47438 -113.63228  11
          53.64042 -113.50605  12
           53.6399  -113.5102  13
          53.52481 -113.41901  14
          53.48009 -113.60895  15
           53.4246  -113.4201  16
          53.42474 -113.43127  17
          53.47242 -113.67254  18
           53.6326 -113.46173  19
          53.48316 -113.64454  20
          53.44591  -113.5503  21
           53.6321 -113.41103  22
          53.64055 -113.50427  23
          53.63245 -113.41138  24
          53.63988  -113.5108  25
          53.63943 -113.50077  26
          53.63988 -113.51573  27
           53.4746 -113.63153  28
          53.47489 -113.63205  29
          53.42503  -113.6213  30
          53.64054 -113.50667  31
          53.64027  -113.5075  32
          53.63268 -113.46015  33
          53.63978 -113.51205  34
          53.47506 -113.63287  35
          53.47413 -113.63136  36
          end
          tempfile trees
          save `trees'
          
          use `houses', clear
          isid id_house
          des using `trees'
          local threshold 0.01
          gen high = lat_pV + `threshold'
          gen low = lat_pV - `threshold'
          rangejoin lat_t low high using `trees'
          keep if abs(long_pV - long_t) <= `threshold'
          
          collapse (count) n_trees = id_tree (first) lat_pV long_pV, by(id_house)
          
          //    BRING BACK THE ORIGINAL DATA, DROP DUPLICATES AND SET COUNT TO 0 WHERE APPROPRIATE
          append using `houses'
          by id_house (n_trees), sort: keep if _n == 1
          replace n_trees = 0 if missing(n_trees)
          Notes:

          1. To use this code you must install -rangejoin-, by Robert Picard, available from SSC. To use -rangejoin- you must also have -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, also from SSC. (The installation commands are shown after these notes.)

          2. The 0.01 threshold you propose is very stringent; in your example data, only one house has any trees that close, and that one has only one. But the code does work more generally, as you can see by changing the threshold to, say, 0.1.

          3. Somehow you mangled your -dataex- output when bringing it here. The -input- command was lost. Not sure how that would have happened.

          Note: Although the -rangejoin- command is very well written and optimizes both speed and memory use to some extent, if your data sets are very large you may run out of memory, or it may take a long time. If the latter, my best advice is patience. If the former, you can break the house data set up into smaller subsets, run each of them separately (against the complete tree data set), and then append all the results together.
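
          The packages in note 1 are installed once, from within Stata:

          Code:
          * one-time setup: user-written commands from SSC
          ssc install rangestat
          ssc install rangejoin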

          • #6
            Thank you, Clyde, for the response. Each of the datasets I am using contains over 300K observations, so the threshold is more suitable for the full dataset. I'll get to work implementing the method you've shown here.

            • #7
              Given that latitude and longitude are involved here, I would think that -geonear- (by Robert Picard, available at ssc) could be helpful. It will find the N closest neighbors in file B for observations in file A, or all the neighbors within some specified distance. I have used it to find the 5 closest neighbors (lat/long) for each of 30,000 observations, with those neighbors to be drawn from a file of 400,000 neighbors, and it did the job in a few minutes.

              • #8
                With 300K observations in each data set, I think you are likely to run out of memory trying to run the code shown in #5, unless you have a rather gargantuan computer. I suggest splitting the house data into subsets and running each separately, then appending the results. The code below illustrates an efficient way of doing this. Given that the example data set is small, I illustrate it for 3 subsets, but just change the definition of local macro n_subsets to something more appropriate (maybe 100).

                Code:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input lat_pV long_pV id_house
                53.63084 -113.42851   1
                53.61684 -113.50037   2
                53.50153  -113.6725   3
                53.61197 -113.37948   4
                53.63088 -113.45086   5
                 53.5078 -113.50989   6
                53.53728 -113.49847   7
                end
                tempfile houses
                save `houses'
                
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input lat_t long_t id_tree
                53.42834 -113.59988   1
                 53.4733 -113.67162   2
                53.48313   -113.645   3
                53.47302 -113.67377   4
                 53.4744  -113.6316   5
                53.52736 -113.48496   6
                53.42819  -113.5982   7
                53.42513   -113.625   8
                53.42472 -113.43206   9
                53.63993   -113.509  10
                53.47438 -113.63228  11
                53.64042 -113.50605  12
                 53.6399  -113.5102  13
                53.52481 -113.41901  14
                53.48009 -113.60895  15
                 53.4246  -113.4201  16
                53.42474 -113.43127  17
                53.47242 -113.67254  18
                 53.6326 -113.46173  19
                53.48316 -113.64454  20
                53.44591  -113.5503  21
                 53.6321 -113.41103  22
                53.64055 -113.50427  23
                53.63245 -113.41138  24
                53.63988  -113.5108  25
                53.63943 -113.50077  26
                53.63988 -113.51573  27
                 53.4746 -113.63153  28
                53.47489 -113.63205  29
                53.42503  -113.6213  30
                53.64054 -113.50667  31
                53.64027  -113.5075  32
                53.63268 -113.46015  33
                53.63978 -113.51205  34
                53.47506 -113.63287  35
                53.47413 -113.63136  36
                end
                tempfile trees
                save `trees'
                
                use `houses', clear
                isid id_house
                
                //    IDENTIFY SMALLER DATA SETS TO WORK ON
                local n_subsets 3
                gen long subset = floor(_n/`n_subsets')
                gen treefile = `"`trees'"'
                
                
                capture program drop one_subset
                program define one_subset
                    local threshold 0.01
                    gen high = lat_pV + `threshold'
                    gen low = lat_pV - `threshold'
                   local trees = treefile[1]
                    rangejoin lat_t low high using `trees'
                    keep if abs(long_pV - long_t) <= `threshold'
                    if _N > 0 {
                        collapse (count) n_trees = id_tree (first) lat_pV long_pV, ///
                            by(id_house)
                    }
                    exit
                end
                
                runby one_subset, by(subset) status
                
                //    BRING BACK THE ORIGINAL DATA, DROP DUPLICATES AND SET COUNT TO 0 WHERE APPROPRIATE
                append using `houses'
                by id_house (n_trees), sort: keep if _n == 1
                replace n_trees = 0 if missing(n_trees)
                Notes:

                1. -runby- is written by Robert Picard and me, and is available from SSC.

                2. The dance with the local macro trees and the variable treefile is probably not necessary in your actual situation. I presume that your house and tree data sets are permanent Stata data sets, not tempfiles. The commands -gen treefile = ...- and -local trees = treefile[1]- were just a kludgy way of passing the name of the trees tempfile (a local macro) down to a program that does not accept arguments or options. So you can delete them and replace the `trees' in the -rangejoin- command with the actual name of your tree data set.
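
                For example, if the tree coordinates live in a permanent file (the name trees.dta below is just an assumption), the program reduces to:

                Code:
                capture program drop one_subset
                program define one_subset
                    local threshold 0.01
                    gen high = lat_pV + `threshold'
                    gen low = lat_pV - `threshold'
                    // the tree file is named directly, so no treefile variable is needed
                    rangejoin lat_t low high using trees.dta
                    keep if abs(long_pV - long_t) <= `threshold'
                    if _N > 0 {
                        collapse (count) n_trees = id_tree (first) lat_pV long_pV, ///
                            by(id_house)
                    }
                    exit
                end

                The -gen treefile = ...- line in the main code is then dropped as well.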

                Added: Crossed with #7. As I myself do not work with geographic data this way, I didn't think about -geonear-. It may well be better optimized for the task at hand.
                Last edited by Clyde Schechter; 01 Mar 2019, 16:47.

                • #9
                  Mike Lacy suggested that I use the -cross- command for a similar task.

                  • #10
                    -cross- could work, but -geonear- is likely to be a *lot* faster, as the latter involves a "divide and conquer" algorithm that avoids comparing all possible pairs.

                    • #11
                      shem shen
                      Actually, with 300K observations in each data set, -cross- would produce a data set with 300,000 x 300,000 = 9 x 10^10 observations, just under 10^11--typical computers won't have enough available memory for that. Moreover, even on a sufficiently large computer, -cross- would take a very long time. -geonear- or the -rangejoin- based approach I have shown are far more efficient and feasible.

                      -cross- would be workable with a smaller data set. Even then, the other approaches would be faster.
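
                      For what it is worth, a -cross- based version of the count would look roughly like the sketch below (the file names houses.dta and trees.dta are assumptions, and at 300K observations apiece the crossed data set would be far too large to hold in memory):

                      Code:
                      * pair every house with every tree, then count pairs within the threshold
                      * feasible only when the product of the two data set sizes is small
                      use houses.dta, clear
                      cross using trees.dta
                      gen byte close = abs(lat_pV - lat_t) <= 0.01 & abs(long_pV - long_t) <= 0.01
                      collapse (sum) n_trees = close (first) lat_pV long_pV, by(id_house)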

                      • #12
                        Originally posted by Mike Lacy:
                        Given that latitude and longitude are involved here, I would think that -geonear- (by Robert Picard, available at ssc) could be helpful. It will find the N closest neighbors in file B for observations in file A, or all the neighbors within some specified distance. I have used it to find the 5 closest neighbors (lat/long) for each of 30,000 observations, with those neighbors to be drawn from a file of 400,000 neighbors, and it did the job in a few minutes.
                        Thanks Mike - I'll take a look into this method as well.

                        • #13
                          Looks like you have a dataset of house locations and you want to count the number of trees within a certain distance from each house. The first thing to know is that a difference of .01 degree of latitude is not the same as a difference of .01 degree of longitude in terms of distance. Lines of longitude converge at the poles, while lines of latitude are parallel. You can test this using geodist (from SSC); the arguments are lat1 lon1 lat2 lon2:

                          Code:
                          . geodist 53.50 -113.67 53.51 -113.67
                          WGS 1984 ellipsoid(6378137,298.257223563) distance = 1.1129575 km
                          
                          . geodist 53.50 -113.67 53.50 -113.68
                          WGS 1984 ellipsoid(6378137,298.257223563) distance = .66359054 km
                          As Mike has pointed out, geonear (from SSC) is the most efficient tool to calculate the measure you want. It would go something like this (using the data example posted in #4 with a target distance of 3km):

                          Code:
                          use "statalist_houses.dta", clear
                          geonear id_house lat_pV long_pV using "statalist_trees.dta", ///
                              neighbor(id_tree lat_t long_t) long within(3)
                          bysort id_house (km_to_id_tree): gen wanted = _N
                          list, sepby(id_house)
                          by id_house: keep if _n == 1
                          and the list command results:
                          Code:
                          . list, sepby(id_house)
                          
                               +-----------------------------------------+
                               | id_house   id_tree   km_to_i~e   wanted |
                               |-----------------------------------------|
                            1. |        1        24   1.1439413        4 |
                            2. |        1        22   1.1609611        4 |
                            3. |        1        33   2.0961361        4 |
                            4. |        1        19   2.1989861        4 |
                               |-----------------------------------------|
                            5. |        2        26   2.5121058       10 |
                            6. |        2        10    2.629932       10 |
                            7. |        2        13   2.6448687       10 |
                            8. |        2        32   2.6474065       10 |
                            9. |        2        12   2.6484795       10 |
                           10. |        2        23   2.6488013       10 |
                           11. |        2        25   2.6528447       10 |
                           12. |        2        34   2.6643258       10 |
                           13. |        2        31   2.6679597       10 |
                           14. |        2        27   2.7550948       10 |
                               |-----------------------------------------|
                           15. |        3         3   2.7378414        2 |
                           16. |        3        20   2.7558569        2 |
                               |-----------------------------------------|
                           17. |        4        22   3.0563427        1 |
                               |-----------------------------------------|
                           18. |        5        33   .64459365        4 |
                           19. |        5        19   .74193154        4 |
                           20. |        5        24   2.6091402        4 |
                           21. |        5        22   2.6294348        4 |
                               |-----------------------------------------|
                           22. |        6         6   2.7285711        1 |
                               |-----------------------------------------|
                           23. |        7         6   1.4190729        1 |
                               +-----------------------------------------+
                          geonear should be able to handle the whole 300K * 300K problem in one go. The only possible problem is if the target distance picks up too many trees per house (say > 1000 trees). If that's an issue, you can split the house dataset into smaller parts and do each separately. You can even automate this with runby (from SSC).

                          • #14
                            Originally posted by Robert Picard:
                            Looks like you have a dataset of house locations and you want to count the number of trees within a certain distance from each house. The first thing to know is that a difference of .01 degree of latitude is not the same as a difference of .01 degree of longitude in terms of distance.
                            Thanks Robert - that output looks like a very elegant solution to the problem.

                            • #15
                              You may want to take a look at this post for a fully worked out example of how to break down the problem into smaller groups using runby (from SSC).
