Generate observations by creating pairs of every possible combination of values within subgroup and calculate differences between them

Florian Feick

Join Date: Jul 2023

Posts: 5
#1

Generate observations by creating pairs of every possible combination of values within subgroup and calculate differences between them

01 Aug 2023, 06:36

Hi all,

I have a dataset with aggregated data for several European countries and NUTS-2 regions within them, inluding data on the average ideology on a left-right scale (ideology) and the average satisfaction with democracy within each region (demosat). I also have spatial lat-long data for each region. This is what the data looks like:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str24 country str5 nuts_id float(latitude longitude ideology demosat) "Austria" "AT11" 2700481.5 4796796.5 4.1857142 3.5 "Austria" "AT12" 2810612 4748394.5 5.024038 3.608511 "Austria" "AT13" 2808263 4795769 4.7064934 3.718182 "Austria" "AT21" 2632784.5 4662314.5 4.6597223 3.352941 "Austria" "AT22" 2689864 4739647.5 5.038869 3.105769 "Austria" "AT31" 2788914 4614961 4.568915 3.3439155 "Austria" "AT32" 2690008 4548079.5 4.790698 3.6267605 "Austria" "AT33" 2674450 4378918 5 3.564516 "Austria" "AT34" 2680342.5 4312294.5 4.7352943 3.6756756 "Belgium" "BE1" 3096269 3923952 4.69028 3.3281004 "Belgium" "BE2" 3122059 3970053.5 5.134692 3.25513 "Belgium" "BE3" 3051593 3953864 4.7217956 3.309789 "Bulgaria" "BG31" 2360773.3 5440434 5.358757 1.9232876 "Bulgaria" "BG32" 2366455.5 5581033 5.441341 2.101523 "Bulgaria" "BG33" 2413181.8 5716947 5.718487 1.855372 end

I would like to create new observations that consist of every possible combination of pairs of regions within a country. Additionally, I want the explanatory variables of those newly created region pairs (ideology, demosat) to take on the difference in values between those two regions and - on the basis of the lat-long variables - also create a new variable that contains the geographical distance between those two regions in km for each pair. Is something like that possible?

I would greatly appreciate any help.

I'm currently using STATA 15.1 if that matters.

Kind regards,
Florian
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30093
#2

01 Aug 2023, 09:25

I can get you most of the way there:

Code:

preserve ds country, not local vbles `r(varlist)' rename (`vbles') =2 tempfile copy2 save `copy2' restore ds country latitude longitude nuts_id, not local demostubs `r(varlist)' rename (`vbles') =1 joinby country using `copy2' drop if nuts_id1 == nuts_id2 foreach v of local demostubs { gen diff_`v' = `v'2 - `v'1 }

This gets you everything but the geographical distance between the paired regions. Ordinarily, I would have finished the job with -geodist latitude1 longitude1 latitude2 longitude2, gen(distance12)- (-geodist is by Robert Picard and is available from SSC). But your longitude and latitude variables are in some metric I don't recognize, and neither does -geodist-. I would expect these numbers to be between -90 and 90, and -180 to 180. I don't know what to make of the numbers you are showing there. If you can convert them to degrees, then -geodist- will bring you home.
1 like
Comment
Florian Feick

Join Date: Jul 2023

Posts: 5
#3

01 Aug 2023, 10:42

Dear Clyde,

thank you very much for your answer! You're right about the lat-long data -- I'm still quite new when it comes to spatial data and didn't notice that I downloaded the wrong geographical data, my mistake.

When I downloaded the correct spatial data and included your geodist command after the loop, it seems to have output the correct distance between the regions.

The pairing and calculation of the explanatory variable's difference also worked like a charm, so thank you very much for that!

Only one small thing missing: Currently, the code not only pairs AT11 with AT32, but also AT32 with AT11 (as an example) which is not needed. Normally, I would simply remove the mathematical sign from all values and eliminate all duplicates, but there must be a more elegant way to prevent STATA from creating redundant pairs, right?

Kind regards,
Florian
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30093
#4

01 Aug 2023, 11:42

It wasn't clear whether you wanted the pairs in both directions or not, so I left them in. But to run the code and not end up with the reversed versions of the same pairs change

Code:

drop if nuts_id1 == nuts_id2

to:

Code:

drop if nuts_id1 >= nuts_id2

That way you will only get each pair just once, with the first nuts code alphabetically preceding the second.
Comment

Florian Feick

Join Date: Jul 2023
Posts: 5

06 Jan 2024, 07:16

Clyde's code worked perfectly for what I wanted to achieve (thank you again, Clyde!). What I'm wondering now is: What is the best way to account for this structure in a regression model? My current data structure currently looks as follows with multiple years per region pair (simplified example):

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str26 country double year str7(nuts_id1 nuts_id2) double distance float(diff_ap diff_gdp)
"Austria" 2008 "AT11" "AT31" 209.02007450947184 -.007558346 -14739
"Austria" 2008 "AT11" "AT21"  214.1117111627732    .5297625  -8449
"Austria" 2008 "AT11" "AT34"   494.297205956563    .6200213 -16391
"Austria" 2008 "AT11" "AT13"  79.28116352742613   -.2108135 -26965
"Austria" 2008 "AT11" "AT33"  381.9807871580956   .22640324 -15662
"Austria" 2008 "AT11" "AT22" 119.14053100781692  -.22346807  -9934
"Austria" 2008 "AT11" "AT12"  96.88331030841978   -.1886282  -7308
"Austria" 2008 "AT11" "AT32" 258.22351648915054   .07046771 -20850
"Austria" 2013 "AT11" "AT31" 209.02007450947184 -.020678043 -15951
"Austria" 2013 "AT11" "AT21"  214.1117111627732   .12894654  -7480
"Austria" 2013 "AT11" "AT33"  381.9807871580956  -.04224801 -17549
"Austria" 2013 "AT11" "AT12"  96.88331030841978  -.05305886  -6085
"Austria" 2013 "AT11" "AT32" 258.22351648915054 -.066107035 -23453
"Austria" 2013 "AT11" "AT22" 119.14053100781692   .06557059  -9765
"Austria" 2013 "AT11" "AT13"  79.28116352742613  -.19122386 -26502
end

Next to other explanatory variables not included in this example, I would like to regress the difference of affective polarization between two regions (diff_ap) on their difference of average GDP per capita (diff_gdp).

Can I just use

Code:

 reg diff_ap diff_gdp

for that and include i.year in the model to account for the structure of the data or are more elaborate model specifications necessary? I've read that gravity models could be more fitting for this kind of regression, but I would appreciate any insight as I wasn't quite able to grasp the logic of these models so far and don't know if they are really necessary in this case.

Announcement

Generate observations by creating pairs of every possible combination of values within subgroup and calculate differences between them

Comment

Comment

Comment

Comment