Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Problem with merging on CN8-product codes

    Good afternoon,

    I have run into a problem while trying to merge two datasets.

    The first dataset contains two variables. CN8, which is a variable that contains unique values of eight-digit codes. Like 2012090. The other variable "sanctioned" equals 1 for each of the 2837 observations. So the dataset is a list of products that are covered by the sanction I study.

    The second dataset also contains two variables. The first one is "product_id", which is ranging from 1 to 3124 for all observations. The second is CN8. The way this dataset differs from the first dataset is that the creators here aggregated to some extent on CN8-product codes. So where the first dataset contained only one eight-digit code, the second one contains a group of CN8-product codes for each observation. To illustrate, one observation has a value that equals 27 for product_id and 61061000 61062000 61069010 61069030 61069050 61069090 for CN8. The number of CN8 codes differs across observations. Some product_id's only contain one CN8 code, most are a list of multiple codes and some even contain 50 codes or more.

    I want to merge on CN8. What I want to achieve is that for each product_id, the variable "sanctioned" equals one if at least one of the list of CN8 codes it is comprised of (e.g. 61061000 61062000 61069010 61069030 61069050 61069090) is part of the CN8 codes of the first dataset, and equals zero otherwise. As for as I know, there is no command to directly merge when a unique value of CN8-product codes is part of a list of codes in the variable that I am merging on. Is there anyone who has a suggestion on how I can achieve this.

    Thanks in advance

    Kind Regards,
    Nels Westerhuis

  • #2
    Welcome to the forum, Nels! Please take a look at the forum FAQ to understand how you can pose questions here more effectively. For instance, as you will read in point #12, it really helps us when you post an extract of your data using the dataex command. In your case, a short excerpt of each of the two datasets would help enormously.

    Comment


    • #3
      Dear Hemanshu,

      Thank you for your help. I have indeed tried to extract data using dataex using the following command:

      dataex product_id CN8

      However, for the second dataset, Stata gives the following error: data width (32772 chars) exceeds max linesize. Try specifying fewer variables

      Therefore, I had decided to try to explain what the dataset looks like. Please let me know if I have to describe the dataset in more detail or whether some things have to be clarified.

      Best,
      Nels

      Comment


      • #4
        If I understand this correctly, it is indeed some distance from what merge offers. You might need to split the variable with many items and reshape long before you can merge and then reshape wide

        If all else fails, then as the FAQ indicates, the minimum we need is a simple data example showing the main problem with the two datasets. Invented data could be fine.

        Someone may be able and willing to read your question and invent a suitable data example. Otherwise I am with Hemanshu Kumar. A realistic data example would greatly improve your chances of a good answer.

        EDIT: Well, a minimal instant later in terms of posting, Hemanshu did just that. (Almost.)
        Last edited by Nick Cox; 05 May 2025, 05:16.

        Comment


        • #5
          Well, it is possible to manually create a small data extract to replicate your problem. I create some dummy data below and show a possible solution:

          Code:
          clear
          input long CN8 byte sanctioned
          2012090 1
          2012091 1
          2012092 1
          2012093 1
          2012094 1
          end
          
          tempfile data1
          save `data1' 
          
          clear
          input byte product_id str50 CN8
          1 "2012094"
          2 "2012091 2012095"
          3 "2012096 2012097 2012098"
          4 "2012091 2012092 2012099"
          end
          
          split CN8, gen(CN8_)
          rename CN8 all_CN8
          reshape long CN8_, i(product_id all_CN8) j(num)
          
          drop if missing(CN8_)
          rename CN8_ CN8
          destring CN8, replace
          merge m:1 CN8 using `data1', keep(master match) nogen
          
          egen wanted = max(sanctioned) , by(product_id)
          replace wanted = 0 if missing(wanted)
           
          duplicates drop product_id, force
          drop num sanctioned CN8
          rename all_CN8 CN8
          sort product_id
          which produces:
          Code:
          . list, noobs abbrev(10)
          
            +-----------------------------------------------+
            | product_id                       CN8   wanted |
            |-----------------------------------------------|
            |          1                   2012094        1 |
            |          2           2012091 2012095        1 |
            |          3   2012096 2012097 2012098        0 |
            |          4   2012091 2012092 2012099        1 |
            +-----------------------------------------------+

          Comment


          • #6
            Thank you Hemanshu Kumar and Nick Cox. This has been quite helpful for me.

            Comment

            Working...
            X