Problem with merging on CN8-product codes

Nels Westerhuis

Join Date: May 2025

Posts: 3
#1

Problem with merging on CN8-product codes

05 May 2025, 04:34

Good afternoon,

I have run into a problem while trying to merge two datasets.

The first dataset contains two variables. CN8, which is a variable that contains unique values of eight-digit codes. Like 2012090. The other variable "sanctioned" equals 1 for each of the 2837 observations. So the dataset is a list of products that are covered by the sanction I study.

The second dataset also contains two variables. The first one is "product_id", which is ranging from 1 to 3124 for all observations. The second is CN8. The way this dataset differs from the first dataset is that the creators here aggregated to some extent on CN8-product codes. So where the first dataset contained only one eight-digit code, the second one contains a group of CN8-product codes for each observation. To illustrate, one observation has a value that equals 27 for product_id and 61061000 61062000 61069010 61069030 61069050 61069090 for CN8. The number of CN8 codes differs across observations. Some product_id's only contain one CN8 code, most are a list of multiple codes and some even contain 50 codes or more.

I want to merge on CN8. What I want to achieve is that for each product_id, the variable "sanctioned" equals one if at least one of the list of CN8 codes it is comprised of (e.g. 61061000 61062000 61069010 61069030 61069050 61069090) is part of the CN8 codes of the first dataset, and equals zero otherwise. As for as I know, there is no command to directly merge when a unique value of CN8-product codes is part of a list of codes in the variable that I am merging on. Is there anyone who has a suggestion on how I can achieve this.

Thanks in advance

Kind Regards,
Nels Westerhuis
Tags: None
Hemanshu Kumar

Join Date: Mar 2015

Posts: 1396
#2

05 May 2025, 04:39

Welcome to the forum, Nels! Please take a look at the forum FAQ to understand how you can pose questions here more effectively. For instance, as you will read in point #12, it really helps us when you post an extract of your data using the dataex command. In your case, a short excerpt of each of the two datasets would help enormously.
1 like
Comment
Nels Westerhuis

Join Date: May 2025

Posts: 3
#3

05 May 2025, 04:52

Dear Hemanshu,

Thank you for your help. I have indeed tried to extract data using dataex using the following command:

dataex product_id CN8

However, for the second dataset, Stata gives the following error: data width (32772 chars) exceeds max linesize. Try specifying fewer variables

Therefore, I had decided to try to explain what the dataset looks like. Please let me know if I have to describe the dataset in more detail or whether some things have to be clarified.

Best,
Nels
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35697
#4

05 May 2025, 05:08

If I understand this correctly, it is indeed some distance from what merge offers. You might need to split the variable with many items and reshape long before you can merge and then reshape wide

If all else fails, then as the FAQ indicates, the minimum we need is a simple data example showing the main problem with the two datasets. Invented data could be fine.

Someone may be able and willing to read your question and invent a suitable data example. Otherwise I am with Hemanshu Kumar. A realistic data example would greatly improve your chances of a good answer.

EDIT: Well, a minimal instant later in terms of posting, Hemanshu did just that. (Almost.)

Last edited by Nick Cox; 05 May 2025, 05:16.
Comment

Hemanshu Kumar

Join Date: Mar 2015
Posts: 1396

05 May 2025, 05:08

Well, it is possible to manually create a small data extract to replicate your problem. I create some dummy data below and show a possible solution:

Code:

clear
input long CN8 byte sanctioned
2012090 1
2012091 1
2012092 1
2012093 1
2012094 1
end

tempfile data1
save `data1' 

clear
input byte product_id str50 CN8
1 "2012094"
2 "2012091 2012095"
3 "2012096 2012097 2012098"
4 "2012091 2012092 2012099"
end

split CN8, gen(CN8_)
rename CN8 all_CN8
reshape long CN8_, i(product_id all_CN8) j(num)

drop if missing(CN8_)
rename CN8_ CN8
destring CN8, replace
merge m:1 CN8 using `data1', keep(master match) nogen

egen wanted = max(sanctioned) , by(product_id)
replace wanted = 0 if missing(wanted)
 
duplicates drop product_id, force
drop num sanctioned CN8
rename all_CN8 CN8
sort product_id

which produces:

Code:

. list, noobs abbrev(10)

  +-----------------------------------------------+
  | product_id                       CN8   wanted |
  |-----------------------------------------------|
  |          1                   2012094        1 |
  |          2           2012091 2012095        1 |
  |          3   2012096 2012097 2012098        0 |
  |          4   2012091 2012092 2012099        1 |
  +-----------------------------------------------+

Comment

Nels Westerhuis

Join Date: May 2025

Posts: 3
#6

05 May 2025, 06:22

Thank you Hemanshu Kumar and Nick Cox. This has been quite helpful for me.
Comment

Announcement

Problem with merging on CN8-product codes

Comment

Comment

Comment

Comment

Comment