Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging two data sets with common id but different structure

    Hi Statalisters

    I am working on merging two data set with quite different structures, although they have a common id. I want to plot maps in R, but I'm doing the data management in Stata.

    I have a map data set with an id for Norwegian counties with their longitude and latitude.

    I have another data set with the data of interest for the plots with an id coded identically to the map data set. This is panel data at county level for both genders, boys and girls separately, and at different age groups over several years.

    Although the data sets have a common id, the data structure is quite different, so merging the two data sets is not straight forward to me.


    Map data set:

    Click image for larger version

Name:	geodata.png
Views:	1
Size:	23.5 KB
ID:	1537615



    Data of interest to visualize:

    Click image for larger version

Name:	adhddata.png
Views:	1
Size:	28.8 KB
ID:	1537616


    I've tried both:
    Code:
    . merge 1:m id using "C:~ADHD_merge.dta"
    variable id does not uniquely identify observations in the master data
    r(459);
    and
    Code:
    . merge 1:1 id using "C:~ADHD_merge.dta"
    variable id does not uniquely identify observations in the master data
    r(459);
    I would appreciate any input on how I may solve this issue.

  • #2
    Something here makes no sense. If the map data set is simply a list of counties with their longitudes and latitudes, then why does each county have more than one observation (indeed, up to tens of thousands?) Evidently, there's something I'm not getting from your explanation of this "map" data. What exactly is in this "map" data set?

    Comment


    • #3
      Hi Clyde

      Thank you for your response. I am relatively new to map data, but I'll try to provide some more details. I have already made a map plot for only one year, one age group and one gender group (I reduced the data set in Stata and imported to R).

      Now I want to combine the map data set with a larger data set with observations of the variable of interest (adhd medication rate per 1000 at county level) for all age groups, gender groups and all years, so I can edit groups more efficiently alongside plotting. The data set that works, looks like this:
      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input str6 group float id_county double(longitude latitude) float adhd_pr_1000
      "01.1" 1  10.68531156276028  59.48859956745566 33.16384
      "01.1" 1 10.692866471099695 59.490463187251976 33.16384
      "01.1" 1 10.696115331962785 59.493375407845775 33.16384
      "01.1" 1 10.708315889128597 59.490451533272335 33.16384
      "01.1" 1 10.713956978021121   59.4778906633188 33.16384
      "01.1" 1 10.718585124375561  59.47630827739814 33.16384
      "01.1" 1 10.720418520606842   59.4736656505165 33.16384
      "01.1" 1  10.72591715474232 59.471882092417836 33.16384
      "01.1" 1  10.73608854099581  59.47583202527449 33.16384
      "01.1" 1 10.741668744010008  59.47906380163044 33.16384
      "01.1" 1  10.74368716721441  59.48182185799269 33.16384
      "01.1" 1 10.752992614083421 59.484744371252226 33.16384
      end
      I not sure whether the code example is helpful. Each county has multiple observations of longitude and latitude. My understanding is that these are used to define borders (as x-axis and y-axis) when plotting the maps, then these areas are "filled" with the variable of interest. In R this data set provides this plot:
      Click image for larger version

Name:	basic.png
Views:	1
Size:	15.1 KB
ID:	1537685

      Using this code:
      Code:
      plot <- ggplot(data = ADHD_visualisering, aes(x=long, y = lat, group = group)) +
        geom_polygon(aes(fill = adhd_pr_1000))  
      plot
      When I revisit the code, it seem to be the variable -group- not -id_county- that gives a valid plot. I may have to revisit my merging strategy based on this. As a preliminary consideration, I think the map data set has to be multiplied by each combination of age group, gender and year, because each combination will have a different adhd medication rate. This will result in a very large data set.

      Comment


      • #4
        I don't know R, so I'm not able to visualize what kind of input R is expecting. But now that I understand that your map data set gives longitude and latitude for many contour points rather than just a single centroid location, the only way to combine the two files is not to -merge- them but to join them with the -joinby- command. The outline would be

        Code:
        use other_data, clear
        joinby id using map_data
        See -help joinby- for additional information. In particular, I suspect this joined data set will be very large and you may want to at least trim out some of the unmatched observations. -joinby- has an option that lets you control that process, and some others that may be useful as well.

        As stated, I don't know if the resulting combined data set will be what you need to prepare your input for R as I have no understanding of the latter. But I can't think of any other meaningful way to combine these two data sets.

        Comment


        • #5
          Thank you so much, Clyde! I did not know about the -joinby- command and I will definitely look into this.

          Comment


          • #6
            As an update, -joinby- was exactly what I needed. Now I want to do some simple spatial regressions using the -spmap- package, but Stata seem to require polygons for this purpose, and I'm unsure whether there's an easy transformation from my data structure to polygons. I'll make a separate post on this.

            Comment

            Working...
            X