How to create new columns for each country observation and count this number of countries?

Christoforos Zardis

Join Date: Jul 2021
Posts: 11

How to create new columns for each country observation and count this number of countries?

24 Oct 2021, 12:32

Dear all,

I wanted your help with the following. I am working on a patent project and in the following link (https://1drv.ms/f/s!APdCpn4GgD8tfg) you can find the Excel dataset with the patents of one of the companies, under ‘Sheet1’ Excel sheet.

Columns:
column A: pn (patent number. two first characters of each patent number refer to the country the patent was registered in. eg. For AU8425677-A the country code is AU. However, ‘EP’ as country code refers to European patents, so this refers to all country codes under ‘EP_codes’ Excel sheet)
column B: year
column C: environ_patents (1 for environmental patents, 0 for non-environmental patents)

The variables I want to create are the following
1. Which countries it has registered patents in a year.
2. The number of countries all patents in a year have been registered in.
3. Which countries it has registered non-environmental patents in a year
4.The number of countries non-environmental patents have been registered in a year
5. Which countries it has registered environmental patents in a year
6. The number of countries environmental patents have been registered in a year

The most important variables are the first and second one, since once created, the rest variables will be much easier to be created.
For example, for the below observations:

pn	year	environ_patents
WO2000077147-A1; BR9001712-U2; BR9001712-Y1	2000	0
AU9050094-A; ZA9001404-A; ZA9001404-B; BR9003529-T1	2000	0
AU8425677-A; EP126524-A; JP59231039-A; ZA8401939-A; EP126524-B; CA1215991-A; DE3462027-G; EP126524-B2	2000	1
WO2000096237-A1; BE766843-A; FR2135376-A	2000	1
IT766843-A; FR2135376-A	2001	0
BE766843-A; WO2001102655-A1; FR2135376-A	2001	1

I would like the result as below:

Which countries it has registered patents in a year.

2000: BR AU AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR JP ZA CA DE
2001: IT FR BE

The number of countries all patents in a year have been registered in.

2000: 44
2001: 3

Which countries it has registered non-environmental patents in a year.

2000: AU ZA BR
2001: IT FR
4.The number of countries non-environmental patents have been registered in a year
2000: 3
2001: 2

Which countries it has registered environmental patents in a year

2000: AU AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR JP ZA CA DE
2001: BE FR

The number of countries environmental patents have been registered in a year

2000: 43
2001: 2

The above variables will ignore ‘WOxxxx’ codes(meaning World Organization), show only identical countries(not duplicates) per row and per year, and in case of ‘EPxxxx’ (stands for environmental patents, which includes European countries the company chose the patent to be registered in), to show the country codes under the ‘EP_codes’ Excel sheet.
For the first variable, I would like to generate new columns(columns named: country_1, country_2, etc) for each identical country identified in the ‘pn’ column as in the results below -the below are columns that should appear on the right side of the previous table's columns:

country_1

country_2

country_3

country_4

country_5

country_6

country_7

country_8

country_9

country_10

country_11

country_12

country_13

country_14

country_15

country_16

country_17

country_18

country_19

country_20

country_21

country_22

country_23

country_24

country_25

country_26

country_27

country_28

country_29

country_30

country_31

country_32

country_33

country_34

country_35

country_36

country_37

country_38

country_39

country_40

country_41

country_42

country_43

.Therefore, could you advise on the Stata commands required the get the results above?

Thank you very much,
C.

Last edited by Christoforos Zardis; 24 Oct 2021, 12:41.

Tags: None

Ken Chui

Join Date: Aug 2014
Posts: 1058

24 Oct 2021, 17:56

I think some structure like this may be easier for later analysis:

Code:

clear
input str105 pn year environ_patents
"WO2000077147-A1; BR9001712-U2; BR9001712-Y1" 2000 0
"AU9050094-A; ZA9001404-A; ZA9001404-B; BR9003529-T1" 2000 0
"AU8425677-A; EP126524-A; JP59231039-A; ZA8401939-A; EP126524-B; CA1215991-A; DE3462027-G; EP126524-B2" 2000 1
"WO2000096237-A1; BE766843-A; FR2135376-A" 2000 1
"IT766843-A; FR2135376-A" 2001 0
"BE766843-A; WO2001102655-A1; FR2135376-A" 2001 1
end

save temp01, replace

* Split:
split pn, gen(tpn) p(;)

* Extract only the first two symbols:
foreach x of varlist tpn*{
replace `x' = trim(`x')
replace `x' = substr(`x', 1, 2)
}

* Add an unique id and then reshape to get a list of unique countries
gen uid = _n
reshape long tpn, i(uid) j(instance)
levelsof tpn, local(country_list)
reshape wide

* Create country variables
foreach co in `country_list'{
    gen country_`co' = ""
    foreach ins of varlist tpn*{
        replace country_`co' = "`co'" if `ins' == "`co'"
    }
}
drop tpn* country_WO

list year environ_patents country_AU-country_DE, sep(0)

Results:

Code:

     +------------------------------------------------------------------------+
     | year   enviro~s   countr~U   count~BE   count~BR   count~CA   count~DE |
     |------------------------------------------------------------------------|
  1. | 2000          0                               BR                       |
  2. | 2000          0         AU                    BR                       |
  3. | 2000          1         AU                               CA         DE |
  4. | 2000          1                    BE                                  |
  5. | 2001          0                                                        |
  6. | 2001          1                    BE                                  |
     +------------------------------------------------------------------------+

You will need to fix the EP problem yourself with a simple loop that creates/replaces the EU countries. I (and probably most users here) am not very excited about downloading data from an external link, so I was not able to find out the list is.

Comment

Christoforos Zardis

Join Date: Jul 2021
Posts: 11

25 Oct 2021, 06:14

Hi Ken,

Many thanks for your help. You are correct, I am attaching here the full dataset split into two txt files('Sheet1-part1 and Sheet1-part2') and the 'EP_codes'..

I used the codes on the full dataset(Sheet1) and I can see the new columns created corresponding to new identical countries. However would it be possible to advise on the EP issue? Please see attached the 'EP_codes'. In the following example,(copied from above):

pn	year	environ_patents
AU8425677-A; EP126524-A; JP59231039-A; ZA8401939-A; EP126524-B; CA1215991-A; DE3462027-G; EP126524-B2	2000	1

Can you advise how can the 'EPxxxx' codes from above be replaced with the 'EP_codes' as below, but to show only identical countries(not duplicates) per row? For example 'DE' will show only once as below(same for 'EP_code' countries which will replace the three 'EPxxxx' codes but which will only appear once and not three times)

country_1

country_2

country_3

country_4

country_5

country_6

country_7

country_8

country_9

country_10

country_11

country_12

country_13

country_14

country_15

country_16

country_17

country_18

country_19

country_20

country_21

country_22

country_23

country_24

country_25

country_26

country_27

country_28

country_29

country_30

country_31

country_32

country_33

country_34

country_35

country_36

country_37

country_38

country_39

country_40

country_41

country_42

Regarding the six questions from my initial message, would it be possible to advise of the commands I can get the countries (identical ones/not duplicates) for questions: 1,3 and 5 and their number for questions: 2,4 and 6?

Thank you very much,
C.

Attached Files

Last edited by Christoforos Zardis; 25 Oct 2021, 06:19.

Comment

Christoforos Zardis

Join Date: Jul 2021
Posts: 11

25 Oct 2021, 07:23

Hi Ken,
Please see below a revised version of my previous post, as I believe this is more explainable: Thank you!

I used the codes on the full dataset(Sheet1) and I can see the new columns created corresponding to new identical countries. I am also attaching here the full dataset split into two txt files('Sheet1-part1 and Sheet1-part2') and the 'EP_codes'.,as you are correct about external links,.

However, as a continuation to your codes, would it be possible to advise on the 'EPxxxx' codes (you can find attached the 'EP_codes')? In the following example,(copied from above):

pn	year	environ_patents
AU8425677-A; EP126524-A; JP59231039-A; ZA8401939-A; EP126524-B; CA1215991-A; DE3462027-G; EP126524-B2	2000	1

Can you advise how can the 'EPxxxx' codes from above be replaced with the 'EP_codes' as below, but to show only identical countries(not duplicates) per row(as you already did with the existing initial example dataset)? For example 'DE' will show only once as below(same for 'EP_code' countries which will replace the three 'EPxxxx' codes but which will only appear once and not three times)

country_1

country_2

country_3

country_4

country_5

country_6

country_7

country_8

country_9

country_10

country_11

country_12

country_13

country_14

country_15

country_16

country_17

country_18

country_19

country_20

country_21

country_22

country_23

country_24

country_25

country_26

country_27

country_28

country_29

country_30

country_31

country_32

country_33

country_34

country_35

country_36

country_37

country_38

country_39

country_40

country_41

country_42

Regarding the six questions from my initial message, would it be possible to advise of the commands on how I can get the countries (identical ones/not duplicates) for questions: 1,3 and 5 and their number for questions: 2,4 and 6? Questions 1 and 2 are the most critical ones since then questions 3-6 I guess should be get with a sort of 'environ_patents' column.

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35727
#5

25 Oct 2021, 07:29

Please don’t use attachments without good reason. This is explained in detail in the FAQ Advice all posters are asked to read before posting. Your thread is already asking a lot and expecting that people open attachments with other software and then work out what that might mean in Stata is asking even more.
Comment
Christoforos Zardis

Join Date: Jul 2021

Posts: 11
#6

25 Oct 2021, 08:13

Hi Nick, understood - thanks for explaining. I will only be providing small example datasets as my initial message, with which I still get the answers.
Comment

Ken Chui

Join Date: Aug 2014
Posts: 1058

25 Oct 2021, 13:34

This revised version should be able to convert the EP into the list of the countries in EP:

Code:

clear
input str105 pn year environ_patents
"WO2000077147-A1; BR9001712-U2; BR9001712-Y1" 2000 0
"AU9050094-A; ZA9001404-A; ZA9001404-B; BR9003529-T1" 2000 0
"AU8425677-A; EP126524-A; JP59231039-A; ZA8401939-A; EP126524-B; CA1215991-A; DE3462027-G; EP126524-B2" 2000 1
"WO2000096237-A1; BE766843-A; FR2135376-A" 2000 1
"IT766843-A; FR2135376-A" 2001 0
"BE766843-A; WO2001102655-A1; FR2135376-A" 2001 1
end

save temp01, replace

* Split:
split pn, gen(tpn) p(;)

* Extract only the first two symbols:
foreach x of varlist tpn*{
replace `x' = trim(`x')
replace `x' = substr(`x', 1, 2)
}

* Add an unique id and then reshape to get a list of unique countries
gen uid = _n
reshape long tpn, i(uid) j(instance)
levelsof tpn, local(country_list)
reshape wide

* Create country variables
foreach co in `country_list'{
    gen country_`co' = ""
    foreach ins of varlist tpn*{
        replace country_`co' = "`co'" if `ins' == "`co'"
    }
}
drop tpn* country_WO

* Set up the list of coutnries in EP:
global eplist AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS ///
              IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

foreach x of global eplist{
    capture gen country_`x' = ""
    replace country_`x' = "`x'" if country_EP == "EP"
}

order country_*, alpha last
drop country_EP

And to be honest I didn't even realize that you're making this data set in order to answer those questions. For those questions, this format you requested is not the best way to get at them. A simply long format with repeated years is way better. My major issue is: for those questions, I'm very confused if you want the output to look that way, or if they are just data summary. This draft codes may be helpful, but you'd need to check the answers:

Code:

* Change to long form
reshape long country_, i(uid) j(ins, string)
drop if country_ == ""
drop pn ins
* Generate year-country tag
egen yr_co = tag(year country_)
* Generate year-country-env tag
egen yr_co_en = tag(year country_ environ_patents)


* Q1
list country if year == 2000 & yr_co == 1, sep(0)
list country if year == 2001 & yr_co == 1, sep(0)

* Q2
tab year yr_co if yr_co == 1

* Q3
list country if year == 2000 & yr_co_en == 1 & environ_patents == 0, sep(0)
list country if year == 2001 & yr_co_en == 1 & environ_patents == 0, sep(0)

* Q4
tab year yr_co_en if environ_patents == 0 & yr_co_en == 1

* Q5
list country if year == 2000 & yr_co_en == 1 & environ_patents == 1, sep(0)
list country if year == 2001 & yr_co_en == 1 & environ_patents == 1, sep(0)

* Q6
tab year yr_co_en if environ_patents == 1 & yr_co_en == 1

Comment

Christoforos Zardis

Join Date: Jul 2021

Posts: 11
#8

26 Oct 2021, 10:05

Hi Ken,

Many thanks, the codes return the results I wanted. I know there are more straight forward ways to get them, I just wanted this format as it will help in the future to see which of these countries are developing and underdeveloped.

Thank you again
C.
Comment

Announcement

How to create new columns for each country observation and count this number of countries?

Comment

Comment

Comment

Comment

Comment

Comment

Comment