Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Spatial matching with geonear

    Hi all,

    can someone confirm if my code is doing what I think it is? I've tried to run it using little extracts of the data to check and it seems to be working the way I expect it to, but I'm a liitle sceptical of the results I get for my entire dataset.

    I have two datasets, A and B. Each contain coordinates for most observations and a unique identifier, id_A or id_B. There is a value for start_date and end_date for most observations in data_B. In data_A I want to create:
    - a dummy "active", equal to 1 if in data_B there is an observation within 50km of distance for which start_date <= 2008 <= end_date
    - a dummy "inactive", equal to 1 if in data_B there is an observation within 50km of distance for which start_date > 2008

    Further, I want to drop all observations in data_A for which end_date < 2008.

    I have already been helped in this forum to construct the parts of this code, but I'm unsure if I put it all together correctly.

    Code:
    cd "..."
    
    * Find all pairs of observations between A and B that are within 50km of each other
    use "data_A.dta", clear
    geonear id_A latitude longitude using "data_B.dta", n(id_B latitude longitude) long within(50) near(0)
    save "geonear.dta", replace
    
    * Merge datasets using geonear file
    use "data_A.dta", clear
    
    merge 1:m id_A using "geonear.dta", keep(master match) keepusing(id_B)
    drop _merge
    merge m:1 id_B using "data_B", keep(master match) keepusing(start_date end_date)
    drop _merge
    
    Drop all observations with a certain id_A if one of them has end_date < 2008
    
    gen closed=0
    replace closed = 1 if !missing(end_date) & end_date < 2008
    bysort id_A: egen closed1 = max(closed)
    drop closed
    drop if closed1 == 1
    drop closed1
    
    * Create dummies
    
    gen active1= 0
    replace active1 = 1 if start_date <= 2008 & 2008 <= end_date & !missing(start_date)
    * (I know I could just set the both =2008, but I will want to vary the years)
    gen inactive1 = 0
    replace inactive1 = 1 if start_date > 2008 & !missing(start_date)
    sort id_A
    
    * Make active = 1 if any of active1 equals 1
    by id_a: egen active = max(active1)
    drop active1
    by id_A: egen inactive = max(inactive1)
    drop inactive1
    
    * Keep only one of each id_A
    bysort RESPNO (active): keep if _n == _N
    Thank you very much in advance!

    Best regards,
    Marco

  • #2
    You are leaving us guessing about your variables so it's hard to say if you are doing the right thing throughout. I've created mock datasets that conform to your description of your variables. It looks like you are doing the geonear part correctly, including the merging of its results with both starting datasets. After that, I'm not sure what you are trying to do but I've simplified your code and that may help you decide if it is doing what you want. Note that a boolean is true if it is not 0. The way you calculate active, it is constant per id_A group so which observation you keep at the end is random because the sort order is indeterminate.

    Code:
    clear
    set seed 123456
    set obs 1000
    gen id_A = _n
    gen hascoor = runiform() < .95
    gen double lat_A = runiform(37,45) if hascoor
    gen double lon_A = runiform(-111,-102) if hascoor
    drop hascoor
    save "statalist_data_A.dta", replace
    
    keep if mi(lat_A)
    list in 1/10
    
    clear
    set obs 1000
    gen id_B = _n
    gen hascoor = runiform() < .95
    gen double lat_B = runiform(37,41) if hascoor
    gen double lon_B = runiform(-109,-102) if hascoor
    gen start_date = runiformint(2005,2011)
    gen end_date = start_date + runiformint(1,7) if runiform() < .95
    drop hascoor
    save "statalist_data_B.dta", replace
    sum
    
    * Find all pairs of observations between A and B that are within 50km of each other
    use "statalist_data_A.dta", clear
    geonear id_A lat_A lon_A using "statalist_data_B.dta", n(id_B lat_B lon_B) long within(50) near(0)
    save "statalist_geonear.dta", replace
    
    * Merge datasets using geonear file
    use "statalist_data_A.dta", clear
    merge 1:m id_A using "statalist_geonear", keep(master match) keepusing(id_B km_to_id_B) nogen
    merge m:1 id_B using "statalist_data_B.dta", keep(master match) keepusing(start_date end_date) nogen
    
    * this is the full id_A dataset, with all matches of id_B within 50km, by increasing distance
    sort id_A km_to_id_B id_B
    by id_A: gen one = _n == 1
    count if one
    
    * Drop all observations with a certain id_A if one of them has end_date < 2008
    by id_A: egen ignore = total(end_date < 2008)
    
    * Create dummies
    gen in2008 = inrange(2008, start_date, end_date)
    by id_A: egen active1 = total(in2008)
    by id_A: egen active2 = total(start_date > 2008)

    Comment


    • #3
      Robert Picard Thank you for your answer!

      In the second last paragraph I would need to add a line of the kind
      Code:
      drop if ignore>0
      , right?

      Also, this way, both active1 and active2 falsely register observations lacking the dates as being in the desired range. I don't quite understand why, but that is why I added the !missing() parts in my code.

      Lastly, if I just care about whether or not there is an "active" or and "inactive" observation in data_B,
      Code:
       bysort RESPNO (active): keep if _n == _N
      will do the job, right?
      Last edited by Marco Gallo; 11 Jun 2017, 04:06.

      Comment


      • #4
        It's important to understand that all the egen calls here create a constant for all observations within a group of id_A. With respect to missing dates, a missing numeric value is stored as a number that is larger than any number that can be stored in that data type. So end_date < 2008 cannot be true if end_date is missing so the following statement correctly counts the number of observations with a date prior to 2008:
        Code:
        by id_A: egen ignore = total(end_date < 2008)
        One question here is if in your data a missing end_date means something: in other words, is it used to indicate that the object is still "active"? If so, you must account for that and treat those missing values differently from those in start_date. That's up to you to formulate the conditions. You are correct that the inrange() statement I used will be true if start_date is missing and the end_date is later than 2008 (or missing) but the fix depends on what to do if end_date is missing as well.

        With respect to your last question, I do not know because I cannot see your data and do not understand what you are doing. My spidey sense tells me that it's wrong because active is constant within id_A groups so there must be more that one observation per RESPNO active group (that's why you are trying to reduce it to one observation per group), in which case you are picking an observation at random within the group.

        Comment


        • #5
          If end_date is missing, it does mean that the object is still "active". I have modifed data_B such that every observation has start_date, hence, if start_date is missing in the merged dataset, it means that there was no match, thus both dummies should be 0.

          In the end, really all I need is id_A and the two dummies, indicating whether in the two periods there was an active object within 50km. It does not matter how many, nor does id_B or the date that is left. In fact, I could just drop them. I'm sorry for not sharing the data, but the access to that data is restricted.

          Comment

          Working...
          X