
  • Find matching observations based on Euclidean distance

    Dear Statalisters,

    I am stuck with the following problem. I have data of the following form:

    Code:
    input id group xvar1 xvar2
    1 1 0 1
    2 1 0.5 1.2
    3 0 0 1.9
    4 0 0.25 1.3
    5 0 0.15 1.1
    6 0 0.1 0.7
    7 0 0.6 1.7
    8 0 0.8 0.5
    9 0 0.5 0.8
    10 0 0.8 1
    end
    where group is a dummy indicating which group an observation belongs to, and xvar1 and xvar2 are continuous measures. For each observation in Group 1, I would like to find the three observations in Group 0 that are closest in terms of the Euclidean distance computed over xvar1 and xvar2.

    A quick check in R tells me that for observation 1 these are observations 5, 6 and 4.

    Code:
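    # rows are ids 1 to 10 in order; columns are xvar1 and xvar2
    x <- matrix(c(0, 1, 0.5, 1.2, 0, 1.9, 0.25, 1.3, 0.15, 1.1, 0.1, 0.7,
                  0.6, 1.7, 0.8, 0.5, 0.5, 0.8, 0.8, 1), ncol=2, byrow=TRUE)
    # pairwise Euclidean distances (p is only used by method = "minkowski")
    d <- dist(x, method = "euclidean", diag = TRUE, upper = FALSE)
    d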
    Now I would like Stata to keep only those Group 0 observations that are matched to Group 1 observations. Does anyone have an idea how to achieve this in Stata? I tried teffects nnmatch, but it does not work for more than the maximum of observations in Group 1, and it also estimates the ATE or ATT, which is not what my exercise is about: I only want to restrict my sample to the matched observations before doing further analyses. In addition, is there a way to choose between matching with and without replacement?

    Any help is highly appreciated!

  • #2
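    I think this works, with the caveat that Stata temporarily expands your dataset: joinby forms all pairwise combinations, so N observations become N^2 rows before any are dropped. Whether it is feasible therefore depends on the size of your original dataset.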

    Code:
    clear all
    set more off
    
    *----- example data -----
    
    input ///
    id group xvar1 xvar2
    1 1 0 1
    2 1 0.5 1.2
    3 0 0 1.9
    4 0 0.25 1.3
    5 0 0.15 1.1
    6 0 0.1 0.7
    7 0 0.6 1.7
    8 0 0.8 0.5
    9 0 0.5 0.8
    10 0 0.8 1
    end
    
    list
    
    gen byte i = 1
    
    tempfile orig
    save "`orig'"
    
    *----- all pairwise -----
    
    rename (id group xvar*) =0
    
    joinby i using "`orig'"
    drop if group >= group0    // keep only pairs: Group 1 obs (suffix 0) with Group 0 obs
    drop i
    
    *----- compute distance -----
    
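    * Euclidean distance between each Group 1 obs (suffix 0) and each Group 0 obs
    gen eucld = ((xvar10 - xvar1)^2 + (xvar20 - xvar2)^2) ^ (1/2)
    * flag the three nearest Group 0 obs for each Group 1 obs
    bysort id0 (eucld): gen nearest = _n <= 3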
    
    list, sepby(id0)
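
    The code above only flags the nearest observations. To get from there to the restricted sample asked for in #1, one possible continuation is sketched below. It assumes it runs in the same do-file, so that the tempfile `orig' still exists; the names matched and matches are made up for illustration.

    Code:
    *----- keep only matched observations -----
    
    keep if nearest
    keep id
    duplicates drop     // a Group 0 obs matched to both Group 1 obs is kept once
    gen byte matched = 1
    
    tempfile matches
    save "`matches'"
    
    * back to the original data, restricted to Group 1 plus matched Group 0
    use "`orig'", clear
    drop i
    merge 1:1 id using "`matches'", nogenerate
    keep if group == 1 | matched == 1
    
    list
    This is matching with replacement, since the same Group 0 observation can be among the three nearest for more than one Group 1 observation. By my hand calculation, the example data should end up with ids 1, 2, 4, 5, 6 and 10.
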
    You should:

    1. Read the FAQ carefully.

    2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

    3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

    4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.



    • #3
      I think looping through observations is another strategy that could work. Because nothing is expanded, the size of the data is not a problem in that case.
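
      A minimal sketch of such a loop, assuming the example data from #1 are in memory. The variable names matched and d are made up for illustration, and the loop is greedy if the no-replacement variant in the comment is used, so results then depend on the order of the Group 1 observations.

      Code:
      gen byte matched = 0
      gen double d = .
      
      levelsof id if group == 1, local(treated)
      foreach t of local treated {
          * coordinates of the current Group 1 observation
          summarize xvar1 if id == `t', meanonly
          local x1 = r(mean)
          summarize xvar2 if id == `t', meanonly
          local x2 = r(mean)
      
          * distance from each Group 0 observation to it
          * (for matching without replacement, add "& !matched" to the if)
          replace d = .
          replace d = sqrt((xvar1 - (`x1'))^2 + (xvar2 - (`x2'))^2) if group == 0
      
          * missing values sort last, so the three nearest come first
          sort d id
          replace matched = 1 if _n <= 3 & !missing(d)
      }
      
      keep if group == 1 | matched
      sort id
      list
      The loop makes one pass over the data per Group 1 observation, so memory use stays at the original size at the cost of more run time when Group 1 is large.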


      • #4
        Thanks, Roberto. Using the joinby command is a nice solution I had not thought of. Thanks for your help!
