
  • Clustering analysis in Stata

    Hi all,

    I have a question about how to do cluster analysis in Stata.

    My data looks like this:
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte consumer float(longitude latitude) str1 estate_name
     1 114.154 22.2488 "A"
     2 114.154 22.2488 "A"
     3 114.154 22.2488 "A"
     4 114.154 22.2488 "A"
     5 113.975 22.4032 "B"
     6 113.975 22.4032 "B"
     7 113.975 22.4032 "B"
     8 114.244 22.4287 "C"
     9 114.244 22.4287 "C"
    10 114.229 22.2811 "D"
    11 114.229 22.2811 "D"
    12 114.229 22.2811 "D"
    13 114.104 22.3783 "E"
    14 114.217 22.3251 "F"
    15 114.217 22.3251 "F"
    16 114.151 22.2445 "G"
    17 114.147  22.334 "H"
    18 114.147  22.334 "H"
    19 114.147  22.334 "H"
    20 114.147  22.334 "H"
    21 114.262 22.3064 "I"
    22 114.262 22.3064 "I"
    23 114.061 22.3674 "J"
    end
    I want to group consumers whose estates are within 5 kilometers of each other into one cluster, and then calculate some variables of interest inside and outside each cluster.

    I have run into two difficulties. The first is how to write a loop that calculates the distances among estates automatically, since there are many estates in the dataset. I know that the command to calculate the distance between two locations is
    Code:
    geodist lat1 lon1 lat2 lon2
    Previously, I had a dataset with one consumer per location, and I used the loop below. Now there are multiple consumers in each building, and I don't know how to revise the code:

    Code:
    forval i = 1/`=_N' {
      local olat = latitude[`i']     // latitude of observation `i'
      local olong = longitude[`i']   // longitude of observation `i'
      geodist latitude longitude `olat' `olong', gen(dist`i')
    }
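
    One possible revision, as a minimal sketch rather than a tested solution: reduce the data to one row per estate and pair every estate with every other estate using -cross-, so that -geodist- runs once on the paired coordinates instead of once per consumer. This assumes that -geodist- is installed from SSC and that coordinates are identical for all consumers in the same estate. The names estate1, estate2, lat1, lon1, lat2, lon2 and dist_km are made up for illustration.

    ```stata
    * Sketch only: assumes -geodist- is installed (ssc install geodist)
    preserve
    keep estate_name latitude longitude
    duplicates drop                      // one row per estate
    rename (estate_name latitude longitude) (estate2 lat2 lon2)
    tempfile estates
    save `estates'
    rename (estate2 lat2 lon2) (estate1 lat1 lon1)
    cross using `estates'                // every estate paired with every estate
    geodist lat1 lon1 lat2 lon2, gen(dist_km)   // kilometers by default
    * show each pair once, keeping only pairs within 5 km
    list estate1 estate2 dist_km if estate1 < estate2 & dist_km <= 5
    restore
    ```

    The paired dataset has one row per ordered pair of estates, so its size grows with the square of the number of estates, not the number of consumers.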

    Second, I don't know how to cluster the consumers based on these distances, because the loop creates one distance variable per observation, which seems to produce high-dimensional data.
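
    One way this might be approached, again only as a sketch: store the estate-to-estate distances in a matrix, run single-linkage clustering on that matrix with -clustermat-, and cut the dendrogram at 5 km. This rests on the assumption that "within 5 kilometers" may chain, so that estates A and C can share a cluster when each is within 5 km of B even if A and C are further apart. If every pair in a cluster must be within 5 km, complete linkage would be the closer match. The names D, estcl and cluster_id are made up for illustration.

    ```stata
    * Sketch only: single-linkage grouping of estates at a 5 km cutoff.
    * Assumes -geodist- is installed; D, estcl, cluster_id are hypothetical names.
    preserve
    keep estate_name latitude longitude
    duplicates drop                          // one row per estate
    local n = _N
    matrix D = J(`n', `n', 0)                // pairwise distances in km
    forvalues i = 1/`n' {
        tempvar d
        geodist latitude longitude `=latitude[`i']' `=longitude[`i']', gen(`d')
        forvalues j = `=`i'+1'/`n' {
            local dij = `d'[`j']
            matrix D[`i', `j'] = `dij'
            matrix D[`j', `i'] = `dij'       // mirror to keep D symmetric
        }
        drop `d'
    }
    clustermat singlelinkage D, shape(full) add name(estcl)
    cluster generate cluster_id = cut(5), name(estcl)   // cut dendrogram at 5 km
    keep estate_name cluster_id
    tempfile groups
    save `groups'
    restore
    * attach the estate-level cluster id to every consumer
    merge m:1 estate_name using `groups', nogenerate
    ```

    Because the distance matrix has one row and one column per estate, this is only practical for a moderate number of estates.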

    I'm not sure whether I have made myself clear. Please let me know if you have any ideas on how to achieve this in Stata.

    Thank you so much!


