Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Conditional sum

    Hi,

    I have a dataset that looks like this (also see attachment):

    ID area x_koord y_koord x_min x_max y_min y_max
    100m_61570_4423 115 4423 61570 4422 4424 61569 61571
    100m_61571_4423 291 4424 61571 4423 4425 61570 61572

    The four min/max variables are calculated as follows:

    gen x_min = x_koord-1
    gen x_max = x_koord+1
    gen y_min = y_koord-1
    gen y_max = y_koord+1

    Now I want to create a new variable that sums the column "area" conditional on the four calculated variables listed above.

    The new variable should take the sum of the variable "area" if (x_min <= x_koord <= x_max) & (y_min <= y_koord <= y_max)

    I tried using this formula (but it didn't work):

    egen total = sum(area) if (x_min <= x_koord <= x_max) & (y_min <= y_koord <= y_max)

    (I am more used to working in Excel, where it could have been solved using a SUMIFS formula)

    Any suggestions on how to make this work in Stata?

    Thanks in advance.

    /Soeren
    Attached Files

  • #2
    I do not use Excel so I have no idea what SUMIFS does but I assume that you want more than just add area since the condition for each observation will always be true as you defined it. It looks like you want to add, for each observation, the value of area for all observations that are within a certain distance of that observation. If that's the case, then the following should work

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str15 id float(area x_koord y_koord)
    "100m_61570_4423" 115 4423 61570 
    "100m_61570_4424" 200 4424 61570 
    "100m_61570_4425" 267 4425 61570 
    "100m_61571_4423" 291 4423 61571 
    "100m_61571_4424" 509 4424 61571 
    "100m_61571_4425" 154 4425 61571 
    "100m_61572_4424" 70 4424 61572 
    "100m_61572_4425" 183 4425 61572 
    end
    tempfile f
    save "`f'"
    
    * verify assumptions about the data
    isid id, sort
    
    * form all pairwise combinations
    rename * =_0
    cross using "`f'"
    
    * reduce to observations within the desired distance
    bysort id: keep if inrange(x_koord_0, x_koord-1, x_koord+1) & ///
        inrange(y_koord_0, y_koord-1, y_koord+1)
        
    * now you can calculate the sum of area 
    by id: egen area_sum = total(area_0)
    
    * and finally return to one obs per id
    by id: keep if _n == 1
    drop *_0

    Comment


    • #3
      Soeren, welcome to Statalist! Robert has suggested an approach that should yield the results you want. Let me add that your code
      Code:
      x_min <= x_koord <= x_max​
      is SAS syntax, not Stata syntax, which would be
      Code:
      x_min <= x_koord & x_koord <= x_max
      That may have been what caused your attempt to fail, although since you did not tell us what Stata reported in response to your attempt, we cannot know with certainty.

      As a general rule, we can better help you if we know what commands you have tried and what Stata told you to indicate that there was a problem. Please review the Statalist FAQ linked to from the top of every page, especially sections 9-12 on how to best pose your question. It's particularly helpful to copy commands and output from your Stata log window and paste them into your Statalist post using CODE delimiters, as described in section 12 of the FAQ.

      Comment


      • #4
        Thanks for your help, but the suggested code, doesn't seem to work for me.

        I am unable to run the line "rename * =_0" and if I run the rest of the do.file, then the result is 8 * area for each line.

        The outcome I was hoping for was is this for each row:

        row #1: egen area_sum_row1 = sum(area) if [(4423-1) <= x_koord] & [x_koord <= (4423+1)] & [(61570-1) <= y_koord] & [y_koord <= (61570+1)]
        row #2: egen area_sum_row2 = sum(area) if [(4424-1) <= x_koord] & [x_koord <= (4424+1)] & [(61570-1) <= y_koord] & [y_koord <= (61570+1)]
        etc.

        and then combined into one single variable, e.g. by:

        gen area_sum = 0

        replace area_sum = area_sum_row1 if count == 1
        replace area_sum = area_sum_row2 if count == 2
        etc.

        However, my full data set has 300,000 observations, so I would prefer using another approach.

        Hope you can help.

        /Soren

        Comment


        • #5
          I am unable to run the line "rename * =_0"
          Robert Picard is using a new powerful syntax of the rename command, (announcement of Stata 12)which is not supported in the older versions of Stata. Statalist requires you to specify in your query the version of Stata you are using, otherwise the list members will assume you are using the most recent version (at this time 14).

          The command as used by Robert renames variables to have indices _0: for example id is turned into id_0. You can do the same in an older Stata with a simple loop.

          Soren has posted an example portion of data. It can be directly imported into Stata with a command:
          Code:
          insheet using "http://www.statalist.org/forums/filedata/fetch?id=1305585"
          Best, Sergiy Radyakin

          Comment


          • #6
            You are asked in the Statalist FAQ to

            11. State the version of Stata used

            The current version of Stata is 14.0. Please specify if you are using an earlier version; otherwise, the answer to your question is likely to refer to commands or features unavailable to you. Moreover, as bug fixes and new features are issued frequently by StataCorp, make sure that you update your Stata before posting a query, as your problem may already have been solved.
            As Sergiy has explained, you can simply rename the variables one by one. However, if you really have 300K observations and you do not have any means of reducing the size of the problem, you'll have to look for another solution.

            Are these geographic coordinates (latitude and longitude)? If so, you could use geonear (from SSC) to find all neighbors within a certain distance.

            Comment


            • #7
              Here's a revised example, the way Sergiy likes to do it

              Code:
              do http://robertpicard.com/statalist/2015/08/11/Soeren_Staal.do
              For the record, I don't think this is an desirable way to post a solution on Statalist. I does confer several advantages to the poster:
              • the code can be changed/corrected/updated outside the normal 1 hour time window for each post without having to notify readers
              • the poster can monitor the number of users who download the do-file
              • the poster can track the IP address of people who download
              • the poster can remove the do-file at any time in the future
              • this reduces the chance that less interested viewers will notice the code
              • this hides the code from Statalist's search engine
              For the readers (not just the OP), they have the choice of running the do-file to "discover" the solution or open a new browser window and download the do-file. Mind you this would not be much less of an issue if the poster included the contents of the do-file in their post. In this case, it would look like

              Code:
              clear
              insheet using "http://www.statalist.org/forums/filedata/fetch?id=1305585"
              tempfile f
              save "`f'"
              
              * verify assumptions about the data
              isid id, sort
              
              * form all pairwise combinations
              rename id id_0
              rename area area_0
              rename x_koord x_koord_0
              rename y_koord y_koord_0
              cross using "`f'"
              
              * reduce to observations within the desired distance
              bysort id: keep if inrange(x_koord_0, x_koord-1, x_koord+1) & ///
                  inrange(y_koord_0, y_koord-1, y_koord+1)
                  
              * now you can calculate the sum of area 
              by id: egen area_sum = total(area_0)
              
              * and finally return to one obs per id
              by id: keep if _n == 1
              drop *_0
              Again however, the solution does not include all the information needed to understand it because the nature of the data is not revealed without the extra step of downloading a copy of the data, running the example, or running the part of the example that loads the data in memory. The whole point of dataex is to prepare a small data example that people can play with in view of presenting a self-contained solution, with everything needed to understand it in the post. There is no need to use Stata or any additional downloading steps to understand the following solution:

              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input str15 id float(area x_koord y_koord)
              "100m_61570_4423" 115 4423 61570 
              "100m_61570_4424" 200 4424 61570 
              "100m_61570_4425" 267 4425 61570 
              "100m_61571_4423" 291 4423 61571 
              "100m_61571_4424" 509 4424 61571 
              "100m_61571_4425" 154 4425 61571 
              "100m_61572_4424" 70 4424 61572 
              "100m_61572_4425" 183 4425 61572 
              end
              tempfile f
              save "`f'"
              
              * verify assumptions about the data
              isid id, sort
              
              * form all pairwise combinations
              rename id id_0
              rename area area_0
              rename x_koord x_koord_0
              rename y_koord y_koord_0
              cross using "`f'"
              
              * reduce to observations within the desired distance
              bysort id: keep if inrange(x_koord_0, x_koord-1, x_koord+1) & ///
                  inrange(y_koord_0, y_koord-1, y_koord+1)
                  
              * now you can calculate the sum of area 
              by id: egen area_sum = total(area_0)
              
              * and finally return to one obs per id
              by id: keep if _n == 1
              drop *_0
              Of course, I'm just one member and this is just an opinion.

              Comment


              • #8
                First of all thank you for your help the code is returning the result I want.

                However, it takes quite long to perform the required calculations. Therefore I will have another look at my data, to see if I can split the dataset into smaller parts.

                @Sergyi: It is correct that I am working in an older version (11). And now it is clear to me, that I should've mentioned this in my initial post.

                Robert #1: They are coordinates, but not in a standard projection format. Geonear will not return the solution that I need but I will try to contact some of my colleagues, who are working with GIS to see if they can help me.

                Robert #2: I get your point and will post code differently in the future.

                Again thank you for your help.

                /Soeren

                Comment


                • #9
                  If these are projected coordinates (as opposed to latitude and longitude in decimal degrees), you can indeed convert them back to lat/lon (using some GIS software) and then use geonear (from SSC) to find the nearest neighbors. That would require that you know what map projection was used to generate your coordinates. If you can't get the coordinates back to lat/lon, then you will essentially be measuring distances on a map and not on the surface of the earth. These are invariably inaccurate as you cannot represent a spheroid on a plane in a way that respect distances between points.

                  If you end up having to work with projected coordinates however, you can still use geonear to find the nearest neighbors. The trick is to scale your coordinates so that they appear to be near the equator. At those latitudes, the distance between one degree of latitude is about the same as the distance between one degree of longitude. Here's an example of how this would work with points that have coordinates with values that are similar to those you showed in your example.

                  Code:
                  * generate 10 points with coordinates similar to original post
                  version 11
                  set seed 43241
                  clear
                  set obs 10
                  gen id1 = _n
                  gen x1 = int(4000 + runiform() * 1000)
                  gen y1 = int(55000 + runiform() * 10000)
                  tempfile f
                  save "`f'"
                  
                  * map projection bounds
                  sum x1, meanonly
                  local xmin = r(min)
                  local xmax = r(max)
                  sum y1, meanonly
                  local ymin = r(min)
                  local ymax = r(max)
                  
                  * form all pairwise combinations to be able to calculate all distances
                  rename id1 id2
                  rename x1 x2
                  rename y1 y2
                  cross using "`f'"
                  sort id1 id2
                  
                  * cartesian distance between points
                  gen dc = sqrt((x1-x2)^2 + (y1-y2)^2)
                  
                  * now rescale so that coordinates are near the equator
                  gen lat1 = y1 / 100000
                  gen lon1 = x1 / 100000
                  gen lat2 = y2 / 100000
                  gen lon2 = x2 / 100000
                  
                  * calculate geographic distances; -geodist- is from SSC
                  * since these are not meaningful coordinates, use a radius of 1 
                  * to generate distances in radians
                  geodist lat1 lon1 lat2 lon2, gen(dg) radius(1)
                  
                  * use a point in the middle of the map to calculate a scaling factor 
                  * to convert back the geographic distance to units that are proportional
                  * to those used in the map
                  local xmid1 = (`xmax' - `xmin') / 2
                  local ymid1 = (`ymax' - `ymin') / 2
                  
                  * locate a point that is 1 unit away from the midpoint in both direction
                  local xmid2 = `xmid1' + 1
                  local ymid2 = `ymid2' + 1
                  
                  * the cartesian distance between the two points
                  local dmidc = sqrt((`xmid1'-`xmid2')^2 + (`ymid1'-`ymid2')^2)
                  
                  * repeat the scaling and calculate geographic distances
                  local xmid1 = `xmid1' / 100000
                  local ymid1 = `ymid1' / 100000
                  local xmid2 = `xmid2' / 100000
                  local ymid2 = `ymid2' / 100000
                  geodist `ymid1' `xmid1' `ymid2' `xmid2', radius(1)
                  local dmidg = r(distance)
                  
                  * show that the distances are very close once scaled properly
                  gen dgscaled = dg / `dmidg' * `dmidc'
                  gen error = abs(dgscaled - dc)
                  The use of cross to form all pairwise combination shows the brute force approach to find the nearest neighbors. You need to calculate for each point the distance to every other points in the data to determine the nearest neighbors of that point. This works well for small problems but quickly becomes unmanageable if the are more than 10K points (100 million distances). geonear was designed to tackle large problems with millions of points (which would require more than a trillion distances to be calculated). It uses a divide and conquer approach that dramatically reduces the number of distances that need to be calculated.

                  Comment


                  • #10
                    I just created a version of geonear (from SSC) that uses cartesian distances. I called it geonear2d. Here's an example that shows how it could be applied to the problem at hand:

                    Code:
                    * generate points with coordinates similar to the following post:
                    * http://www.statalist.org/forums/forum/general-stata-discussion/general/1305584-conditional-sum
                    
                    version 11
                    set seed 43241
                    clear
                    set obs 1000
                    gen id1 = _n
                    gen x1 = int(4000 + runiform() * 1000)
                    gen y1 = int(55000 + runiform() * 10000)
                    sum
                    tempfile f
                    qui save "`f'"
                    
                    * form all pairwise combinations to be able to calculate all distances
                    rename id1 id
                    rename x1 x
                    rename y1 y
                    cross using "`f'"
                    sort id id1
                    
                    * cartesian distance between points
                    gen double dc = sqrt((x1-x)^2 + (y1-y)^2)
                    
                    * reduce to neighbors within a specific distance
                    bysort id (dc id1): keep if dc <= 100
                    
                    tempfile cross
                    qui save "`cross'"
                    
                    * redo using -geonear2d-
                    use "`f'", clear
                    rename id1 id
                    rename x1 x
                    rename y1 y
                    geonear2d id y x using "`f'", n(id1 y1 x1) within(100)
                    
                    * the number of neighbors within 10 units of distance
                    bysort id: gen N = _N
                    sum N
                    
                    * confirm that -geonear2d- results match exactly those from using -cross-
                    sort id id1
                    merge 1:1 id id1 using "`cross'", nogen
                    
                    assert dc == d_to_id1
                    I rerun the geonear2d part of the example above with 800K points and it took less than 3 minutes to finish on my computer.

                    Code:
                    . clear
                    
                    . set obs 800000
                    number of observations (_N) was 0, now 800,000
                    
                    . gen id1 = _n
                    
                    . gen x1 = int(4000 + runiform() * 1000)
                    
                    . gen y1 = int(55000 + runiform() * 10000)
                    
                    . sum
                    
                        Variable |        Obs        Mean    Std. Dev.       Min        Max
                    -------------+---------------------------------------------------------
                             id1 |    800,000    400000.5    230940.3          1     800000
                              x1 |    800,000    4499.845     288.758       4000       4999
                              y1 |    800,000       59996    2886.523      55000      64999
                    
                    . tempfile f
                    
                    . qui save "`f'"
                    
                    . 
                    . * get all neighbors within 10 units of distance using cartesian distances
                    . rename id1 id
                    
                    . rename x1 x
                    
                    . rename y1 y
                    
                    . geonear2d id y x using "`f'", n(id1 y1 x1) within(10)
                    Warning: This is a stripped down version of -geonear- (from SSC).
                    It only works in long mode and calculates cartesian distances
                    using d = sqrt((y_n - y_b)^2 + (x_n - x_b)^2)
                    
                    current   base    nbor      ops in       cumul   cumul %     remaining
                     region   locs    locs      region     ops (M)    % done    (hh:mm:ss)
                      1311      25     151       3,775         4.7       4.1      00:03:54
                      3567      20     121       2,420        12.8      11.1      00:02:40
                      5764      26     173       4,498        20.5      17.9      00:02:18
                      8013      18     113       2,034        28.7      25.0      00:02:01
                     10167      28     154       4,312        36.4      31.7      00:01:49
                     12485      29     162       4,698        44.7      38.9      00:01:35
                     14813      33     154       5,082        53.0      46.2      00:01:22
                     16669      23     129       2,967        59.7      52.0      00:01:14
                     19004      31     125       3,875        67.9      59.2      00:01:02
                     21260      30     168       5,040        76.0      66.2      00:00:51
                     23675      26     130       3,380        84.7      73.8      00:00:39
                     25832      32     129       4,128        92.5      80.5      00:00:29
                     28061      29     110       3,190       100.4      87.5      00:00:19
                     30327      30     165       4,950       108.7      94.6      00:00:08
                    
                    -------------------------------------------------------------------------------
                    Unique base locations   = 800,000        Unique neighbor locations = 800,000   
                    Bases * Neighbors  (M)  = 640,000.0      Number of regions         = 32,046    
                    Computed distances (M)  = 114.89         Total run time (seconds)  = 162.582
                    -------------------------------------------------------------------------------
                    
                    . 
                    . * the number of neighbors within 10 units of distance
                    . bysort id: gen N = _N
                    
                    . sum N
                    
                        Variable |        Obs        Mean    Std. Dev.       Min        Max
                    -------------+---------------------------------------------------------
                               N | 20,990,398    27.24447    5.118558          4         54
                    I'll probably roll this functionality in a future version of geonear. In the mean time, anyone interested can email me at [email protected] to get geonear2d.

                    Comment

                    Working...
                    X