Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Generate observations by creating pairs of every possible combination of values within subgroup and calculate differences between them

    Hi all,

    I have a dataset with aggregated data for several European countries and NUTS-2 regions within them, inluding data on the average ideology on a left-right scale (ideology) and the average satisfaction with democracy within each region (demosat). I also have spatial lat-long data for each region. This is what the data looks like:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str24 country str5 nuts_id float(latitude longitude ideology demosat)
    "Austria"  "AT11" 2700481.5 4796796.5 4.1857142       3.5
    "Austria"  "AT12"   2810612 4748394.5  5.024038  3.608511
    "Austria"  "AT13"   2808263   4795769 4.7064934  3.718182
    "Austria"  "AT21" 2632784.5 4662314.5 4.6597223  3.352941
    "Austria"  "AT22"   2689864 4739647.5  5.038869  3.105769
    "Austria"  "AT31"   2788914   4614961  4.568915 3.3439155
    "Austria"  "AT32"   2690008 4548079.5  4.790698 3.6267605
    "Austria"  "AT33"   2674450   4378918         5  3.564516
    "Austria"  "AT34" 2680342.5 4312294.5 4.7352943 3.6756756
    "Belgium"  "BE1"    3096269   3923952   4.69028 3.3281004
    "Belgium"  "BE2"    3122059 3970053.5  5.134692   3.25513
    "Belgium"  "BE3"    3051593   3953864 4.7217956  3.309789
    "Bulgaria" "BG31" 2360773.3   5440434  5.358757 1.9232876
    "Bulgaria" "BG32" 2366455.5   5581033  5.441341  2.101523
    "Bulgaria" "BG33" 2413181.8   5716947  5.718487  1.855372
    end

    I would like to create new observations that consist of every possible combination of pairs of regions within a country. Additionally, I want the explanatory variables of those newly created region pairs (ideology, demosat) to take on the difference in values between those two regions and - on the basis of the lat-long variables - also create a new variable that contains the geographical distance between those two regions in km for each pair. Is something like that possible?

    I would greatly appreciate any help.


    I'm currently using STATA 15.1 if that matters.


    Kind regards,
    Florian

  • #2
    I can get you most of the way there:
    Code:
    preserve
    ds country, not
    local vbles `r(varlist)'
    rename (`vbles') =2
    tempfile copy2
    save `copy2'
    
    restore
    ds country latitude longitude nuts_id, not
    local demostubs `r(varlist)'
    rename (`vbles') =1
    joinby country using `copy2'
    drop if nuts_id1 == nuts_id2
    foreach v of local demostubs {
        gen diff_`v' = `v'2 - `v'1
    }
    This gets you everything but the geographical distance between the paired regions. Ordinarily, I would have finished the job with -geodist latitude1 longitude1 latitude2 longitude2, gen(distance12)- (-geodist is by Robert Picard and is available from SSC). But your longitude and latitude variables are in some metric I don't recognize, and neither does -geodist-. I would expect these numbers to be between -90 and 90, and -180 to 180. I don't know what to make of the numbers you are showing there. If you can convert them to degrees, then -geodist- will bring you home.

    Comment


    • #3
      Dear Clyde,

      thank you very much for your answer! You're right about the lat-long data -- I'm still quite new when it comes to spatial data and didn't notice that I downloaded the wrong geographical data, my mistake.

      When I downloaded the correct spatial data and included your geodist command after the loop, it seems to have output the correct distance between the regions.

      The pairing and calculation of the explanatory variable's difference also worked like a charm, so thank you very much for that!

      Only one small thing missing: Currently, the code not only pairs AT11 with AT32, but also AT32 with AT11 (as an example) which is not needed. Normally, I would simply remove the mathematical sign from all values and eliminate all duplicates, but there must be a more elegant way to prevent STATA from creating redundant pairs, right?


      Kind regards,
      Florian

      Comment


      • #4
        It wasn't clear whether you wanted the pairs in both directions or not, so I left them in. But to run the code and not end up with the reversed versions of the same pairs change
        Code:
        drop if nuts_id1 == nuts_id2
        to:
        Code:
        drop if nuts_id1 >= nuts_id2
        That way you will only get each pair just once, with the first nuts code alphabetically preceding the second.

        Comment


        • #5
          Clyde's code worked perfectly for what I wanted to achieve (thank you again, Clyde!). What I'm wondering now is: What is the best way to account for this structure in a regression model? My current data structure currently looks as follows with multiple years per region pair (simplified example):

          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input str26 country double year str7(nuts_id1 nuts_id2) double distance float(diff_ap diff_gdp)
          "Austria" 2008 "AT11" "AT31" 209.02007450947184 -.007558346 -14739
          "Austria" 2008 "AT11" "AT21"  214.1117111627732    .5297625  -8449
          "Austria" 2008 "AT11" "AT34"   494.297205956563    .6200213 -16391
          "Austria" 2008 "AT11" "AT13"  79.28116352742613   -.2108135 -26965
          "Austria" 2008 "AT11" "AT33"  381.9807871580956   .22640324 -15662
          "Austria" 2008 "AT11" "AT22" 119.14053100781692  -.22346807  -9934
          "Austria" 2008 "AT11" "AT12"  96.88331030841978   -.1886282  -7308
          "Austria" 2008 "AT11" "AT32" 258.22351648915054   .07046771 -20850
          "Austria" 2013 "AT11" "AT31" 209.02007450947184 -.020678043 -15951
          "Austria" 2013 "AT11" "AT21"  214.1117111627732   .12894654  -7480
          "Austria" 2013 "AT11" "AT33"  381.9807871580956  -.04224801 -17549
          "Austria" 2013 "AT11" "AT12"  96.88331030841978  -.05305886  -6085
          "Austria" 2013 "AT11" "AT32" 258.22351648915054 -.066107035 -23453
          "Austria" 2013 "AT11" "AT22" 119.14053100781692   .06557059  -9765
          "Austria" 2013 "AT11" "AT13"  79.28116352742613  -.19122386 -26502
          end
          Next to other explanatory variables not included in this example, I would like to regress the difference of affective polarization between two regions (diff_ap) on their difference of average GDP per capita (diff_gdp).

          Can I just use
          Code:
           reg diff_ap diff_gdp
          for that and include i.year in the model to account for the structure of the data or are more elaborate model specifications necessary? I've read that gravity models could be more fitting for this kind of regression, but I would appreciate any insight as I wasn't quite able to grasp the logic of these models so far and don't know if they are really necessary in this case.

          Comment

          Working...
          X