Working with two datasets simultaneously.

Nora Fingado

Join Date: Jan 2022

Posts: 15
#1

Working with two datasets simultaneously.

12 Sep 2022, 23:10

Hi,

I have two datasets and would need to work with them simultaneously. The first dataset has the coordinates of the center of grid cells (let's call them "stations"), and the latitude and longitude of the surrounding border of the grid cells. The second dataset (shown on the right) has the coordinates of households (call them "households").

The first dataset is much bigger than the second one, and I would now like to only keep those stations that have at least one household in their grid. I thought, if I could relate these datasets, I could tell Stata to only keep those stations for which a household falls inside their grid. I managed to do it on QGIS, but I wondered if it is also possible to do it on Stata.

"Stations" dataset:

Code:

input long v1 float(lat_center lon_center lat_south lat_north lon_west lon_east) 4018224 37.958332 -24.958334 37.875 38.041664 -25.04167 -24.875 4018225 37.875 -24.958334 37.791668 37.958332 -25.04167 -24.875 4018226 37.791668 -24.958334 37.708336 37.875 -25.04167 -24.875 4018227 37.708332 -24.958334 37.625 37.791664 -25.04167 -24.875 4018228 37.625 -24.958334 37.541668 37.708332 -25.04167 -24.875 4018229 37.541668 -24.958334 37.458336 37.625 -25.04167 -24.875 4018230 37.458332 -24.958334 37.375 37.541664 -25.04167 -24.875 4018231 37.375 -24.958334 37.291668 37.458332 -25.04167 -24.875 4018232 37.291668 -24.958334 37.208336 37.375 -25.04167 -24.875 4018233 37.208332 -24.958334 37.125 37.291664 -25.04167 -24.875

"Household" dataset:

Code:

input float uniquecluster double(lat_household lon_household) 1 -12.169283 13.859255 2 -12.167753 13.845458 3 -12.747312 13.607121 4 -11.502919 14.060841 5 -13.130185 13.109428 6 -12.42766 13.573896 7 -13.435833 14.287285 8 -12.3617 14.778201 9 -10.827451 14.342114 10 -9.991311 14.898295

For visualization, this is how it looks like on QGIS: i.e., I'd like to keep only those green points (stations) that have a household in their cell (here, all these cells are already marked in orange).

Does anyone know whether it's possible to do that in Stata, and if yes, how to do it?
Thank you!
Tags: None

Clyde Schechter

Join Date: Apr 2014
Posts: 30354

13 Sep 2022, 09:15

Your example data does not include any households that lie within the grids of any station, so this code is not thoroughly tested. But I believe it works properly.

Code:

clear
input long v1 float(lat_center lon_center lat_south lat_north lon_west lon_east)
4018224 37.958332 -24.958334    37.875 38.041664 -25.04167 -24.875
4018225    37.875 -24.958334 37.791668 37.958332 -25.04167 -24.875
4018226 37.791668 -24.958334 37.708336    37.875 -25.04167 -24.875
4018227 37.708332 -24.958334    37.625 37.791664 -25.04167 -24.875
4018228    37.625 -24.958334 37.541668 37.708332 -25.04167 -24.875
4018229 37.541668 -24.958334 37.458336    37.625 -25.04167 -24.875
4018230 37.458332 -24.958334    37.375 37.541664 -25.04167 -24.875
4018231    37.375 -24.958334 37.291668 37.458332 -25.04167 -24.875
4018232 37.291668 -24.958334 37.208336    37.375 -25.04167 -24.875
4018233 37.208332 -24.958334    37.125 37.291664 -25.04167 -24.875
end
tempfile stations
save `stations'

clear
input float uniquecluster double(lat_household lon_household)
  1 -12.169283 13.859255
  2 -12.167753 13.845458
  3 -12.747312 13.607121
  4 -11.502919 14.060841
  5 -13.130185 13.109428
  6  -12.42766 13.573896
  7 -13.435833 14.287285
  8   -12.3617 14.778201
  9 -10.827451 14.342114
 10  -9.991311 14.898295
 end
 tempfile households
 save `households'
 
 use `households', clear
 cross using `stations'
 keep if inrange(lat_household, lat_south, lat_north) ///
    & inrange(lon_household, lon_west, lon_east)

Evidently, you would not need to save your data sets as tempfiles, and you would replace the references to those tempfiles by the names of your actual data sets.

If your data sets are large, this code will be slow and may blow through the limits of available memory. You could improve both the efficiency and memory considerations if there is some other variable available in both data sets that could restrict the number of potential combinations, e.g. a county. Then instead of -cross- you could use -joinby county- and Stata would not waste time or memory matching up households with stations in a different county.

The code shown will leave you with a data set containing all combinations of station and household where the household is within the station's grid. To get from there to just a data set of stations that contain such a household you could then do:

Code:

drop uniquecluster lat_household lon_household
duplicates drop

Comment

Nora Fingado

Join Date: Jan 2022

Posts: 15
#3

14 Sep 2022, 23:48

Hi Clyde, thank you very much!

Yes, the example I provided is just a very small subset of the whole sample, but I just added it so that you know how the structure of the datasets is. I'm afraid the data sets are indeed very large, so the code will run slowly, but I don't really see an option to restrict the number of possible combinations. But I will think about it!

Thank you very much again.
Comment

Announcement