Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Working with two datasets simultaneously.

    Hi,

    I have two datasets and would need to work with them simultaneously. The first dataset has the coordinates of the center of grid cells (let's call them "stations"), and the latitude and longitude of the surrounding border of the grid cells. The second dataset (shown on the right) has the coordinates of households (call them "households").

    The first dataset is much bigger than the second one, and I would now like to only keep those stations that have at least one household in their grid. I thought, if I could relate these datasets, I could tell Stata to only keep those stations for which a household falls inside their grid. I managed to do it on QGIS, but I wondered if it is also possible to do it on Stata.

    "Stations" dataset:
    Code:
    input long v1 float(lat_center lon_center lat_south lat_north lon_west lon_east)
    4018224 37.958332 -24.958334    37.875 38.041664 -25.04167 -24.875
    4018225    37.875 -24.958334 37.791668 37.958332 -25.04167 -24.875
    4018226 37.791668 -24.958334 37.708336    37.875 -25.04167 -24.875
    4018227 37.708332 -24.958334    37.625 37.791664 -25.04167 -24.875
    4018228    37.625 -24.958334 37.541668 37.708332 -25.04167 -24.875
    4018229 37.541668 -24.958334 37.458336    37.625 -25.04167 -24.875
    4018230 37.458332 -24.958334    37.375 37.541664 -25.04167 -24.875
    4018231    37.375 -24.958334 37.291668 37.458332 -25.04167 -24.875
    4018232 37.291668 -24.958334 37.208336    37.375 -25.04167 -24.875
    4018233 37.208332 -24.958334    37.125 37.291664 -25.04167 -24.875
    "Household" dataset:
    Code:
    input float uniquecluster double(lat_household lon_household)
      1 -12.169283 13.859255
      2 -12.167753 13.845458
      3 -12.747312 13.607121
      4 -11.502919 14.060841
      5 -13.130185 13.109428
      6  -12.42766 13.573896
      7 -13.435833 14.287285
      8   -12.3617 14.778201
      9 -10.827451 14.342114
     10  -9.991311 14.898295
    For visualization, this is how it looks like on QGIS: i.e., I'd like to keep only those green points (stations) that have a household in their cell (here, all these cells are already marked in orange).

    Click image for larger version

Name:	Screenshot 2022-09-13 at 07.05.20.png
Views:	1
Size:	1.83 MB
ID:	1681686



    Does anyone know whether it's possible to do that in Stata, and if yes, how to do it?
    Thank you!

  • #2
    Your example data does not include any households that lie within the grids of any station, so this code is not thoroughly tested. But I believe it works properly.
    Code:
    clear
    input long v1 float(lat_center lon_center lat_south lat_north lon_west lon_east)
    4018224 37.958332 -24.958334    37.875 38.041664 -25.04167 -24.875
    4018225    37.875 -24.958334 37.791668 37.958332 -25.04167 -24.875
    4018226 37.791668 -24.958334 37.708336    37.875 -25.04167 -24.875
    4018227 37.708332 -24.958334    37.625 37.791664 -25.04167 -24.875
    4018228    37.625 -24.958334 37.541668 37.708332 -25.04167 -24.875
    4018229 37.541668 -24.958334 37.458336    37.625 -25.04167 -24.875
    4018230 37.458332 -24.958334    37.375 37.541664 -25.04167 -24.875
    4018231    37.375 -24.958334 37.291668 37.458332 -25.04167 -24.875
    4018232 37.291668 -24.958334 37.208336    37.375 -25.04167 -24.875
    4018233 37.208332 -24.958334    37.125 37.291664 -25.04167 -24.875
    end
    tempfile stations
    save `stations'
    
    clear
    input float uniquecluster double(lat_household lon_household)
      1 -12.169283 13.859255
      2 -12.167753 13.845458
      3 -12.747312 13.607121
      4 -11.502919 14.060841
      5 -13.130185 13.109428
      6  -12.42766 13.573896
      7 -13.435833 14.287285
      8   -12.3617 14.778201
      9 -10.827451 14.342114
     10  -9.991311 14.898295
     end
     tempfile households
     save `households'
     
     use `households', clear
     cross using `stations'
     keep if inrange(lat_household, lat_south, lat_north) ///
        & inrange(lon_household, lon_west, lon_east)
    Evidently, you would not need to save your data sets as tempfiles, and you would replace the references to those tempfiles by the names of your actual data sets.

    If your data sets are large, this code will be slow and may blow through the limits of available memory. You could improve both the efficiency and memory considerations if there is some other variable available in both data sets that could restrict the number of potential combinations, e.g. a county. Then instead of -cross- you could use -joinby county- and Stata would not waste time or memory matching up households with stations in a different county.

    The code shown will leave you with a data set containing all combinations of station and household where the household is within the station's grid. To get from there to just a data set of stations that contain such a household you could then do:
    Code:
    drop uniquecluster lat_household lon_household
    duplicates drop

    Comment


    • #3
      Hi Clyde, thank you very much!

      Yes, the example I provided is just a very small subset of the whole sample, but I just added it so that you know how the structure of the datasets is. I'm afraid the data sets are indeed very large, so the code will run slowly, but I don't really see an option to restrict the number of possible combinations. But I will think about it!

      Thank you very much again.

      Comment

      Working...
      X