Creating a new variable for each ID/group separately, containing the mean distance between each pair among all possible pairs of its parts

Joost Koning

Join Date: May 2022
Posts: 4

Creating a new variable for each ID/group separately, containing the mean distance between each pair among all possible pairs of its parts

11 May 2022, 01:27

Hello everyone!

For my master-thesis at the Erasmus University Rotterdam in The Netherlands, I have a large data-set (45455 rows and 7755 unique ID’s in total) containing information on corporate owners and their subsidiaries with the geographic coordinates for every subsidiary.

I want to create a new variable for every corporate owner containing the mean difference in geographic location of its subsidiaries relative to each other (so between each pair among all possible pairs of its subsidiaries for every corporate owner separately), to then use this as independent variable in a regression.

On the web I found "writing a loop by installing and using a related user-written command from SSC, -geonear-" could be helpful for me.

Unfortunately, I really don’t know how to do this. Does someone maybe see a solution to this problem? It would mean a lot!

Here is a data example:

Code:

dataex ID_x CompanynameLatinalphabet Noofsubsidiaries SubsidiaryName Latitude Longitude in 8235/8248

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str4 ID_x str199 CompanynameLatinalphabet str4 Noofsubsidiaries str199 SubsidiaryName str18(Latitude Longitude)
"211" "DOW BENELUX B.V."                                                      "3" "TERNEUZEN PARTNERSHIP SERVICES B.V."                                                            "51.3088989257812"  "3.77801609039306" 
"211" "DOW BENELUX B.V."                                                      "3" "VALUEPARK TERNEUZEN BEHEER B.V."                                                                "51.3371200561523"  "3.82822895050048" 
"211" "DOW BENELUX B.V."                                                      "3" "POLYOL BELGIUM"                                                                                 "51.1929359436035"  "4.42132902145385" 
"212" "BOEHRINGER INGELHEIM VETMEDICA GESELLSCHAFT MIT BESCHRAENKTER HAFTUNG" "9" "BOEHRINGER INGELHEIM ANIMAL HEALTH INTERNATIONAL GMBH"                                          "49.975830078125"   "8.065308570861809"
"212" "BOEHRINGER INGELHEIM VETMEDICA GESELLSCHAFT MIT BESCHRAENKTER HAFTUNG" "9" "BOEHRINGER INGELHEIM ANIMAL HEALTH PARTICIPATIONS GMBH"                                         "49.975830078125"   "8.065308570861809"
"212" "BOEHRINGER INGELHEIM VETMEDICA GESELLSCHAFT MIT BESCHRAENKTER HAFTUNG" "9" "BOEHRINGER INGELHEIM VETERINARY RESEARCH CENTER GMBH & CO. KG"                                  "49.975830078125"   "8.065308570861809"
"212" "BOEHRINGER INGELHEIM VETMEDICA GESELLSCHAFT MIT BESCHRAENKTER HAFTUNG" "9" "BOEHRINGER INGELHEIM VRC GMBH"                                                                  "49.975830078125"   "8.065308570861809"
"212" "BOEHRINGER INGELHEIM VETMEDICA GESELLSCHAFT MIT BESCHRAENKTER HAFTUNG" "9" "BOEHRINGER INGELHEIM ANIMAL HEALTH FRANCE"                                                      "not found"         "not found"        
"212" "BOEHRINGER INGELHEIM VETMEDICA GESELLSCHAFT MIT BESCHRAENKTER HAFTUNG" "9" "BOEHRINGER INGELHEIM ANIMAL HEALTH ITALIA S.P.A., IN BREVE BOEHRINGER INGELHEIM AH IT S.P.A."   "45.4636192321777"  "9.18811607360839" 
"212" "BOEHRINGER INGELHEIM VETMEDICA GESELLSCHAFT MIT BESCHRAENKTER HAFTUNG" "9" "BOEHRINGER INGELHEIM ANIMAL HEALTH JAPAN CO. LTD"                                               "35.6869621276855"  "139.749465942382" 
"212" "BOEHRINGER INGELHEIM VETMEDICA GESELLSCHAFT MIT BESCHRAENKTER HAFTUNG" "9" "MERIAL BARCELONETA LLC"                                                                         "18.45068359375"    "-66.5386276245117"
"212" "BOEHRINGER INGELHEIM VETMEDICA GESELLSCHAFT MIT BESCHRAENKTER HAFTUNG" "9" "MERIAL SAUDE ANIMAL LTDA"                                                                       "-23.5506515502929" "-46.6333808898925"
"213" "BASF ITALIA S.P.A."                                                    "6" "BASF COATINGS SERVICES ITALY S.R.L."                                                            "45.6282539367675"  "9.146426200866699"
"213" "BASF ITALIA S.P.A."                                                    "6" "NUNHEMS ITALY SOCIETA' A RESPONSABILITA' LIMITATA, O IN FORMA ABB REVIATA NUNHEMS ITALY S.R.L." "44.6651649475097"  "11.1350002288818" 
end

Thanks a lot in advance. Kind regards, Joost Koning

Tags: data, group, mean distance

Mike Lacy

Join Date: Apr 2014
Posts: 2421

11 May 2022, 10:18

I believe this does what Joost wants. The crucial commands here are 1) the built-in -joinby-, which creates pairings of observations among those observations that share a value of company name; and 2) -geodist-, a community-contributed command from SSC.

A side note: I'd be concerned about possible variations in spelling etc. of the subsidiary and company names, but I'm ignoring that.

Code:

// Some numeric variables are accidentally presented as strings in the example data.
// This is likely not true in the actual data, but we need to fix that here.
destring Noofsubsidiaries Latitude Longitude, replace force
//
// Make a file of all possible distinct pairs within company.
// This will be smaller and less messy using numeric ids, which are created here.
egen conum = group(Companyname*)
egen subnum = group(Sub*Name)
// The original file, with the numeric ids, will be needed for merging.
tempfile original
save `original'
//
//  Real work starts here.
//
// We need a copy of the file of subsidaries to "join" with itself.
// Variable names must be different in this copy.
keep conum subnum Latitude Longitude
preserve
rename (subnum Latitude Longitude) =2
tempfile sub2
save `sub2'
restore
//
// Now make pairs within company.
joinby conum using `sub2'
keep if subnum < subnum2 // remove duplicates and self-pairs
//
// With this file of all of pairs of subsidiaries within each company, -geodist- does the hard work.
geodist Latitude Longitude Latitude2 Longitude2, gen(dist)
//
egen meandist = mean(dist), by(conum)
bysort conum: keep if _n ==1  //  Only one meandist per company, right?
// Put the mean distance for each company onto the original file
keep conum meandist
merge 1:m conum using `original'

Last edited by Mike Lacy; 11 May 2022, 10:20.

Comment

Joost Koning

Join Date: May 2022

Posts: 4
#3

12 May 2022, 02:42

Hello Mike and readers,

Thanks you for you answer: It indeed solved the problem perfectly!

For everybody's information: The command

Code:

geodist Latitude Longitude Latitude2 Longitude2, gen(dist)

gave the following warning:

Code:

Warning: failed to converge due to near-antipodal points Replaced distance(s) with missing value .a Number of distance(s) affected = 3

Only 3 failed point-convergings in over 40.000 is very little!

Furthermore, after merging in the final step with the syntax

Code:

merge 1:m conum using `original'

the result was:

Code:

Result Number of obs ----------------------------------------- Not matched 2,699 from master 0 (_merge==1) from using 2,699 (_merge==2) Matched 42,757 (_merge==3) -----------------------------------------

I don't know why these 2699 didn't match, but maybe those were the empty places in the coordinates variable because there wan't geographic information about those subsidiaries?

I am extremely grateful and happy! Thanks! It means a lot!
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2421
#4

12 May 2022, 07:19

The merge was made using the company identifier as the key variable, so any problem with the match arises from that variable, so no, the problem would not arise from problems with the coordinates variables. One likely possibility would be differences in spelling of the company names in different observations. To diagnose the problem, I'd start by simply browsing the non-matched observations and see what's going on, e.g. browse if _merge == 2.
Comment

Announcement

Creating a new variable for each ID/group separately, containing the mean distance between each pair among all possible pairs of its parts

Comment

Comment

Comment