Kth nearest neighbour on continuous variables

Rolf Lund

Join Date: Mar 2016

Posts: 43
#1


27 Apr 2017, 00:19

Hello Statalist.

I am working with register data for the total Danish population, and I have individual-level data on educational attainment, income, age and many other variables that could help characterise groups of people. I also have geographical information (the area in which people live), and I want to find areas that are similar with regard to sociodemographic traits but that may be physically far apart.

My work process so far has been this: I have worked with propensity scores and matching before, so I thought I could reuse some of that, but my problem is that I do not have a specific outcome. I only have the traits on which the areas need to be similar. I then tried the discrim knn command in Stata, but since I have both categorical and continuous variables, I am unable to get the program to finish (I have 4.5 million observations in 8000 geographical groups). Next, I thought I could compute distributional measures for each area (such as mean, median, skewness, kurtosis and range) for all the variables I want to use to define the nearest neighbours, and aggregate these to area level before any KNN analysis. That way, I retain information from the individual level but reduce the dataset from 4.5 million to 8000 observations. Unfortunately, this still stalls Stata because of the continuous variables, and I have yet to produce any output. My guess is that this is because I use 6 different distributional measures for each sociodemographic variable (take age: I generate 6 variables for age that contain the mean, median, skewness and so on for each area), and since they are all continuous, I am asking too much of the program.
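For illustration, the aggregate-then-match idea described above can be sketched as follows. This is only a minimal sketch in Python rather than Stata, and it makes several assumptions that are not in the post: the toy records, the function names (area_profiles, standardize, nearest_areas), the choice of mean, median and range as the per-area summaries, z-scoring of each summary, and plain Euclidean distance with a brute-force search are all hypothetical choices made for the example.

```python
import math
import statistics
from collections import defaultdict

# Hypothetical toy records: (area_id, age, income). Real data would have
# millions of rows and many more variables.
records = [
    ("A", 30, 300), ("A", 40, 320), ("A", 50, 340),
    ("B", 31, 305), ("B", 41, 318), ("B", 49, 342),
    ("C", 70, 150), ("C", 75, 160), ("C", 80, 170),
    ("D", 68, 155), ("D", 77, 158), ("D", 82, 172),
]


def area_profiles(rows):
    """Collapse individual rows to one feature vector per area.

    For every variable the vector holds mean, median and range.
    """
    by_area = defaultdict(list)
    for area, *values in rows:
        by_area[area].append(values)
    profiles = {}
    for area, people in by_area.items():
        features = []
        for column in zip(*people):  # one column per variable
            features += [
                statistics.mean(column),
                statistics.median(column),
                max(column) - min(column),
            ]
        profiles[area] = features
    return profiles


def standardize(profiles):
    """z-score each feature across areas so that no single scale dominates."""
    areas = list(profiles)
    scaled_columns = []
    for column in zip(*(profiles[a] for a in areas)):
        mu = statistics.mean(column)
        sd = statistics.pstdev(column)
        # A constant feature carries no information; map it to zero.
        scaled_columns.append([(x - mu) / sd if sd > 0 else 0.0 for x in column])
    return {a: [col[i] for col in scaled_columns] for i, a in enumerate(areas)}


def nearest_areas(profiles, k):
    """Brute-force k nearest areas by Euclidean distance between profiles."""
    result = {}
    for area, vector in profiles.items():
        ranked = sorted(
            (math.dist(vector, other_vector), other)
            for other, other_vector in profiles.items()
            if other != area
        )
        result[area] = [other for _, other in ranked[:k]]
    return result


neighbours = nearest_areas(standardize(area_profiles(records)), k=1)
print(neighbours)
```

With 8000 areas the brute-force search compares about 64 million pairs of short vectors, which is normally manageable once the data are at area level. Categorical variables would first have to be turned into numeric summaries, for example the share of residents in each category.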

This is where I am now - not quite sure how to solve this. Any input would be appreciated.
Tags: nearest neighbor
