Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging datasets with multiple identify observations

    Hello everyone,
    as part of my Master thesis, I am trying to convert observations (products) classified with a code Harmonized System Revision 2002 to a code Harmonized System Revision 2007. To do so I would use the Stata command merge. I post here my master dataset:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
     reporter  HS2002  ntmcode
    "GIN" "290319" "F4"
    "GIN" "290319" "F4"
    "GIN" "290319" "F6"
    "GIN" "290319" "F4"
    "GIN" "290319" "F4"
    "GIN" "290319" "F4"
    "GIN" "290319" "F4"
    "GIN" "290319" "F4"
    "GIN" "290319" "F4"
    "GIN" "290319" "F4"
    "GIN" "290319" "F4"
    "GIN" "290319" "F4"
    "GIN" "290319" "F4"
    "GIN" "290319" "F4"
    "GIN" "290319" "F4"
    "GIN" "290319" "F4"
    "GIN" "290319" "F4"
    "GIN" "290319" "F4"
    "GIN" "290319" "F4"
    "GIN" "290319" "F4"
    "GIN" "290319" "F4"
    "GIN" "290319" "F4"
    "GIN" "290319" "F4"
    "GIN" "290319" "F4"
    "GIN" "290319" "G9"
    "GIN" "290319" "F4"
    "GIN" "290319" "F4"
    "GIN" "290319" "F4"
    "GIN" "290319" "F4"
    end
    In which I have thousands of combinations between country (reporter), products classified as HS Revision 2002 code and a quality standard for products imported. Basically, it lists all countries that establish a particular quality standard on a particular product.

    My using dataset is a conversion table taken from https://unstats.un.org/unsd/trade/cl...ce-tables.asp:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    HS2007 HS2002
    "010110" "010110"
    "010190" "010190"
    "010210" "010210"
    "010290" "010290"
    "010310" "010310"
    "010391" "010391"
    "010392" "010392"
    "010410" "010410"
    "010420" "010420"
    "010511" "010511"
    "010512" "010512"
    "010519" "010519"
    "010594" "010592"
    "010594" "010593"
    "010599" "010599"
    "010611" "010611"
    "010612" "010612"
    "010619" "010619"
    "010620" "010620"
    "020732" "020732"
    "030193" "030193"
    "030194" "030199"
    "030195" "030199"
    "030199" "030199"
    "030211" "030211"
    "030212" "030212"
    "030219" "030219"
    "030221" "030221"
    "030222" "030222"
    "030223" "030223"
    "030229" "030229"
    "030231" "030231"
    "030232" "030232"
    "030233" "030233"
    "030234" "030234"
    end
    I would like to merge the variable HS2002 in order to have the products classified as HS 2007 Revision in the master dataset. However, the using data does not contain unique observations in any of the two columns, to give an example, in the using dataset you can find the following three cases, namely 1) one observation in column HS2002 equal different values in HS2007, 2) observations are equal in both column, and 3) different values in HS2002 equal the same value in HS2007. Therefore, I would ask which syntax of merge I should use among 1:1, 1:m, ... m:m (the variable I would merge is HS2002).

    I thank in advance anyone who is willing to help and I apologize for the length of the message.
    HS2002 HS2007
    x y
    x z
    m m
    w w
    r w

  • #2
    So, to reiterate, it's likely that some older class in 2002 now have evolved into multiple classes in the 2007 version, making the merging a bit hazy? If so, this seems to be more of a data problem than a Stata problem. But nonetheless we can merge whatever we have:

    Code:
    clear
    input str6 HS2007 str6 HS2002
    "010110" "010110"
    "010190" "010190"
    "010210" "010210"
    "010290" "010290"
    "010310" "010310"
    "010391" "010391"
    "010392" "010392"
    "010410" "010410"
    "010420" "010420"
    "010511" "010511"
    "010512" "010512"
    "010519" "010519"
    "010594" "010592"
    "010594" "010593"
    "010599" "010599"
    "010611" "010611"
    "010612" "010612"
    "010619" "010619"
    "010620" "010620"
    "020732" "020732"
    "030193" "030193"
    "030194" "030199"
    "030195" "030199"
    "030199" "030199"
    "030211" "030211"
    "030212" "030212"
    "030219" "030219"
    "030221" "030221"
    "030222" "030222"
    "030223" "030223"
    "030229" "030229"
    "030231" "030231"
    "030232" "030232"
    "030233" "030233"
    "030234" "030234"
    end
    
    bysort HS2002 (HS2007): gen defnum = _n
    reshape wide HS2007, i(HS2002) j(defnum)
    The last two line will create a crosswalk file that contains unique HS2002. The HS2007, however, would be broken down into columns. If it's unique, then only HS20071 is filled out. If there are multiple new versions, then you'll see multiple columns filled out (e.g. see HS 2002 = 030199)

    With that, you can use a m:1 merge. As of how to deal with the multiple versions of HS2007 within a HS2002 class, that you'd probably have to figure out.
    Last edited by Ken Chui; 31 Mar 2021, 11:29.

    Comment

    Working...
    X