Finding common values from a variable corresponding to two different values in another variable

Ale Asad

Join Date: Oct 2019

Posts: 2
#1

Finding common values from a variable corresponding to two different values in another variable

11 Oct 2019, 08:41

Hello,

I'm interested in finding common values from a variable corresponding to two different values in another variable. For example, given two variables country_name and product_export_id as follows:
country_name product_export_id

USA 1001

USA 1201

China 1230

China 1001

I am interested in knowing the common product_export_id for USA and China and how many they are. Something like USA and China export only 1 product that is the same and what that product's product_export_id is.

Can anyone please help me do this in stata? I later want to be able to create an index using this information.
Any help will be much appreciated.

Thank you!

Last edited by Ale Asad; 11 Oct 2019, 08:44.
Tags: Suggestion, syntax

William Lisowski

Join Date: Dec 2014
Posts: 10150

11 Oct 2019, 09:04

Welcome to Statalist.

Perhaps this example will start you in a useful direction.

Code:

clear
input str5 country_name int product_export_id
"USA"   1001
"USA"   1201
"China" 1230
"China" 1001
end
sort product_export_id country_name
by product_export_id: generate common = _N==2
list, clean abbreviate(20)

Code:

. list, clean abbreviate(20)

       country_name   product_export_id   common  
  1.          China                1001        1  
  2.            USA                1001        1  
  3.            USA                1201        0  
  4.          China                1230        0

Comment

Ale Asad

Join Date: Oct 2019

Posts: 2
#3

11 Oct 2019, 16:06

Thank you for your response William Lisowski . Unfortunately, the data I am working with has a lot of different countries and id's that I cannot really input. Is there any other way to get around it?
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

11 Oct 2019, 17:26

I do not understand your question in post #3 about inputing your data.

Perhaps you have yet to read the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post. If not, you should take a few mintues to review it. Note especially sections 9-12 on how to best pose your question. You'll learn that the dataex command is how we provide sample data, as described in section 12 of the FAQ. I needed to read into the Stata the sample data you gave in post #1, and that is how I presented it.

With that said, it seems as if you are saying that you have the possibility of more than just two countries in your data. In that case, perhaps the following is will start you in a useful direction.

Code:

// create example data
cls
clear
input str5 country_name int product_export_id
"USA"   1001
"USA"   1201
"China" 1230
"China" 1001
"Italy" 1230
"Italy" 1001
end

// do the work
rename country_name country_name1
save set1, replace

rename country_name1 country_name2
save set2, replace

use set1, clear
joinby product_export_id using set2
drop if country_name1 == country_name2
sort country_name1 country_name2 product_export_id
list, sepby(country_name1 country_name2) abbreviate(20) noobs

Code:

. list, sepby(country_name1 country_name2) abbreviate(20) noobs

  +---------------------------------------------------+
  | country_name1   product_export_id   country_name2 |
  |---------------------------------------------------|
  |         China                1001           Italy |
  |         China                1230           Italy |
  |---------------------------------------------------|
  |         China                1001             USA |
  |---------------------------------------------------|
  |         Italy                1001           China |
  |         Italy                1230           China |
  |---------------------------------------------------|
  |         Italy                1001             USA |
  |---------------------------------------------------|
  |           USA                1001           China |
  |---------------------------------------------------|
  |           USA                1001           Italy |
  +---------------------------------------------------+

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#5

11 Oct 2019, 17:27

If you read the Forum FAQ, as all Forum members are requested to do before posting, you would have recognized that the beginning part of the code in #2 that says

Code:

clear input str5 country_name int product_export_id "USA" 1001 "USA" 1201 "China" 1230 "China" 1001 end

is an abbreviated version of -dataex- output and that -dataex- is the preferred way of showing example data here on the Forum. It is not intended that you will actually create your real data set in this way. It's just a highly efficient way of creating an example data set to illustrate some code.

So you should disregard those lines in the code in #2 and instead -use- your actual data set. Then run the rest of the code from #2

Code:

sort product_export_id country_name by product_export_id: generate common = _N==2 list, clean abbreviate(20)

Now, if your real data set is very large, the -list- output will go on for a long time and may not be useful to you. Also, it may be that there are some products that are exported by more than just two countries, and these groupings would not be identified by the code from #2. So another approach is:

Code:

by product_export_id (country_name), sort: generate common = _N > 1 browse product_export_id country_name if common

Added: Crossed with #4.
1 like
Comment

Announcement