Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging datasets

    Dear all,

    I am computing some diversity indices and for that issue I need to match two datasets. The first one contains all the different nationalities of people which are living in different counties (dataset1.dta). It looks like the following example:
    County Origin_1 Origin_2 Share_1 Share_2 d
    1 A B 0.2 0.6 AB
    1 A C 0.2 0.1 AC
    1 A D 0.2 0.1 AD
    1 B C 0.6 0.1 BC
    1 B D 0.6 0.1 BD
    1 C D 0.1 0.1 CD
    2 A B 0.4 0.3 AB
    2 A C 0.4 0.2 AC
    2 A D 0.4 0.1 AD
    2 B C 0.3 0.2 BC
    2 B D 0.3 0.1 BD
    2 C D 0.2 0.1 CD
    The variable "County" is the county every person is living at. For each county all the permutations of the different populations are matched. So, origin_1 is the nationality of one part and origin_2 of the matched part to compute the Herfindahl-Hirtschmann-Index. Share_1 and Share_2 are the different shares regarding the origin of the population. d is a string variable, which contains all the permutations of the different origins of the people.

    Now I would like to match it to a distance dataset (dataset2.dta), which looks like the following example:
    Origin_1 Origin_2 d distance
    A B AB 0.3
    A C AC 0.2
    A D AD 0.7
    B C BC 0.9
    B D BD 0.1
    C D CD 0.2
    Origin_1 and Origin_2 are again the countries of origin where people were born in. d is the same like above and distance is a distance measure between two persons from different countries.

    Now, I need to get a dataset, which looks like the following example (I need to compute: bysort county: Herfindahl-Hirschmann-Index= 1-total(share_1*share_2*distance)):
    County Origin_1 Origin_2 Share_1 Share_2 d distance
    1 A B 0.1 ... AB 0.3
    1 A C 0.1 ... AC 0.2
    1 A D ... ... AD 0.7
    1 B C ... ... BC 0.9
    1 B D ... ... BD 0.1
    1 C D ... ... CD 0.2
    2 A B ... ... AB 0.3
    2 A C ... ... AC 0.2
    2 A D ... ... AD 0.7
    2 B C ... ... BC 0.9
    2 B D ... ... BD 0.1
    2 C D ... ... CD 0.2
    I tried to match dataset1 and dataset2 by using the following command:

    use "dataset1.dta", clear

    sort d

    merge n:n d using "dataset2.dta"

    drop _merge

    But using this command I do not get the dataset, I want. I've got something like that:
    County Origin_1 Origin_2 Share_1 Share_2 d distance
    1 A B ... ... AB 0.3
    2 A C ... ... AC 0.2
    2 B C ... ... BC 0.9
    1 D A ... ... AD 0.7
    2 B D ... ... BD 0.1
    1 D C ... ... CD 0.2
    And getting this dataset I cannot compute the Herfindahl-Hirschmann-Index, because I do not get a merged dataset for every county. Can someone help me?

    Thanks in advance!

  • #2
    Here's how you can merge the data you show. Note that using dataex from SSC to present data examples helps in both understanding the question and presenting a possible solution.

    Code:
    clear
    input byte County str1(Origin_1 Origin_2) float(Share_1 Share_2) str2 d
    1 "A" "B" .2 .60000002 "AB" 
    1 "A" "C" .2 .1 "AC" 
    1 "A" "D" .2 .1 "AD" 
    1 "B" "C" .60000002 .1 "BC" 
    1 "B" "D" .60000002 .1 "BD" 
    1 "C" "D" .1 .1 "CD" 
    2 "A" "B" .40000001 .30000001 "AB" 
    2 "A" "C" .40000001 .2 "AC" 
    2 "A" "D" .40000001 .1 "AD" 
    2 "B" "C" .30000001 .2 "BC" 
    2 "B" "D" .30000001 .1 "BD" 
    2 "C" "D" .2 .1 "CD" 
    end
    tempfile f
    save "`f'"
    
    clear
    input str1(Origin_1 Origin_2) str2 d float distance
    "A" "B" "AB" .30000001 
    "A" "C" "AC" .2 
    "A" "D" "AD" .69999999 
    "B" "C" "BC" .89999998 
    "B" "D" "BD" .1 
    "C" "D" "CD" .2 
    end
    
    * check that d uniquely identifies observations
    isid d, sort
    
    merge 1:m d using "`f'", keep(match using) nogen
    sort County d
    list, sepby(County) noobs

    Comment


    • #3
      Thanks again, Robert!
      Ok, next time I try to use dataex!

      Have a nice weekend,

      Hans

      Comment

      Working...
      X