Extracting two letter country names from a string variable

Karishma DSouza

Join Date: Oct 2016
Posts: 111

Extracting two letter country names from a string variable

22 Oct 2020, 15:02

Hi. I want to extract country names from a string variable that has a significant amount of text. I've created a loop with the country names, but run into problems with words like "us"/"eu" which could be both a country name, but also letter combinations in other words. Is there anyway for me to extract the country names as stand alone words as in the loop below?

#delim ;

local list `""afghanistan" "albania" "algeria" "andorra" "angola" ///
"tanzania" "thailand" "timor-leste" "togo" "tonga" "trinidad and tobago" ///
"tunisia" "turkey" "turkmenistan" "tuvalu" "uganda" "ukraine" ///
"united arab emirates" "united kingdom" "united states of america" "uruguay" ///
"uzbekistan" "vanuatu" "venezuela" "vietnam" "yemen" "zambia" "zimbabwe" "uae"///
"uk" "usa" "asia" "australasia" "europe" "america" "s africa" "us" "eu" ///
"hong kong" "the us" "the eu" "russian federation" "burma" "n&s america" ///
"s america" "selected markets" "';

#delim cr
macro list _list

Drug

Preclinical

risperidone

Infection, HIV/AIDS Enrollment is complete in a non-randomized, open-label Phase II trial (TMC-278-C213; PAINT) in 35 treatment-naive adult HIV-1 patients to evaluate 25mg once-daily x48wk, in India, S Africa, Thailand, Uganda, Ukraine and the US (ClinicalTrials.gov Web Page, 5 Nov 2009 & 11 Sep 2013, http://clinicaltrials.gov/show/NCT00799864). In open-label, controlled, proof-of-concept German Phase II trial (TMC278-TiDP15-C221; 2008-006009-18) in 10 treatment-naive HIV-1 patients, to evaluate the antiviral activity of monotherapy over 7 days following single dose of TMC-278LA 600mg im has been suspended (EU Clinical Trial Registry, 2008-006009-18, 10 May 2012, https://www.clinicaltrialsregister.e...8-006009-18/DE).

PS: I am unable to use dataex because of the following error. The amount of text in the variable is significantly long.
" Example generated by -dataex-. To install: ssc install dataex
clear
input strL Preclinical
data width (2847 chars) exceeds max linesize. Try specifying fewer variables"

Tags: None

Karishma DSouza

Join Date: Oct 2016

Posts: 111
#2

22 Oct 2020, 18:52

An additional request to this: as shown in the example above, a single observation can have more than one country mentioned, in different places across the text. I'd like a separate variable to be created for each of the countries mentioned. So, in the above example: Variable1 would have India, variable2 would have S Africa, variable3 Thailand, etc....ending with the last variable having German
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30116
#3

22 Oct 2020, 20:05

Well, I originally passed this one over because I saw there were multiple countries mentioned in your example, and I had no idea how you wanted to handle that situation.

But now I'm confused. German isn't in your local macro list, so shouldn't it be omitted? And why do you not want EU to be the last value, when eu is in your list? Did you also mean to skip India, which appears in your example of preclinical but not in your list? This is all very confusing.

Last edited by Clyde Schechter; 22 Oct 2020, 20:30.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30116
#4

22 Oct 2020, 21:01

This is actually a pretty difficult problem. Apart from things like us and eu appearing as character sequences in the middle of words, there is the additional problem of having to recognize multiword sequences like "united states of america." The following code works with your example. Note that I have modified your local macro list. There is no point in having "the eu" and "eu" both in it: eu will be picked up, whether it appears after "the" or not, and there is no reason to have the result come out differently in the two cases.

Code:

replace preclinical = lower(preclinical) replace preclinical = subinstr(preclinical, "timor-leste", "timorleste", .) replace preclinical = subinstr(preclinical, "trinidad and tobago", "trinidadandtobago", .) replace preclinical = subinstr(preclinical, "united arab emirates", "unitedarabemirates", .) replace preclinical = subinstr(preclinical, "united kingdom", "unitedkingdom", .) replace preclinical = subinstr(preclinical, "united states of america", "unitedstatesofamerica", .) replace preclinical = subinstr(preclinical, "s africa", "safrica", .) replace preclinical = subinstr(preclinical, "hong kong", "hongkong", .) replace preclinical = subinstr(preclinical, "russian federation", "russianfederation", .) replace preclinical = subinstr(preclinical, "n&s america", "nsamerica", .) replace preclinical = subinstr(preclinical, "s america", "samerica", .) replace preclinical = subinstr(preclinical, "selected markets", "selectedmarkets", .) split preclinical, parse(" ") gen(country) order preclinical, last gen long obs_no = _n reshape long country, i(obs_no) replace country = regexs(1) if regexm(country, "[^a-zA-Z]*([a-zA-Z]*)[^a-zA-Z]*") local list afghanistan albania algeria andorra angola /// tanzania thailand timorleste togo tonga trinidadandtobago /// tunisia turkey turkmenistan tuvalu uganda ukraine /// unitedarabemirates unitedkingdom unitedstatesofamerica uruguay /// uzbekistan vanuatu venezuela vietnam yemen zambia zimbabwe uae /// uk usa asia australasia europe america safrica us eu /// hongkong russianfederation burma nsamerica /// samerica selectedmarkets gen byte keeper = 0, after(country) foreach l of local list { replace keeper = 1 if lower(country) == `"`l'"' } keep if keeper drop keeper by obs_no (_j), sort: replace _j = _n reshape wide order drug, after(obs_no)

Note also that the code replaces the original value of preclinical by a lower case version. This simplifies the subsequent code. If you need to retain the original version of preclinical, I suggest you make a copy of the variable under another name for that purpose.

Your use of -delim ;- makes the use of the /// separators pointless. Since I prefer using ///, I've ignored the -delim ;- here. I've also edited the last to contain versions of the country names that have all non-letter characters removed from them. The -replace- statements at the top of the code do the same thing to the variable preclinical. This is key to the functioning of the code as it relies on having spaces delimit country names, and it also relies on being able to ignore non-letter characters immediately preceding or following a country name (e.g. in your example there are commas after several country names, and there is a left parenthesis preceding EU.) So if non-letter characters or spaces occur within country names, the code breaks. That's why all those replace statements are needed.

If you want neater results, you can write some more -replace- statements at the end of the code to undo the de-spacing and de-hyphenation, (and removal of an & character) and perhaps restore proper case.

The code only pulls out countries mentioned in the local macro list. So it doesn't catch German and it doesn't catch India as they don't appear there.

The country* variables created give the countries mentioned in order of appearance in pre-clinical. If you want them in alphabetical order instead, add a -sort country- command immediately after -drop keeper-.

Last edited by Clyde Schechter; 22 Oct 2020, 21:19.
1 like
Comment
Karishma DSouza

Join Date: Oct 2016

Posts: 111
#5

23 Oct 2020, 20:01

Clyde, I can’t thank you enough for helping me with this! I spent way too many hours on it with little success. I also greatly appreciate your notes and suggestions made after the code.

You are also right, my original macro was incomplete and confusing. For the sake of brevity, I shortened what is in reality an extremely lengthy macro of all worldwide countries. In doing so I was shortsighted about the confusion it could pose, and I apologize for that.
Comment
Karishma DSouza

Join Date: Oct 2016

Posts: 111
#6

25 Oct 2020, 07:55

Hi Clyde. I haven't been able to run the code on my entire dataset of ~35,000 observations. In fact, even when I create a sample dataset of 250 observations, Stata freezes. And this is despite running the code for over 24 hours. I suspect that it's happening because there's just too much to process here. Would you happen to know how I can resolve this?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30116
#7

25 Oct 2020, 11:36

At what point does it appear to freeze? How far into the code does it get?

One place where it may appear to be freezing is at the -reshape long- command: that's a very slow command. There is a user-written command, -tolong- available from SSC that does the same thing (and has the same syntax) and works faster. If it's freezing at the -reshape-, using -tolong- instead would probably solve it.
1 like
Comment
Karishma DSouza

Join Date: Oct 2016

Posts: 111
#8

27 Oct 2020, 07:44

Thanks, Clyde. It freezes at the point of splitting itself. There are times when it does manage to split, but doesn't appear to generate the country variable. Would you have any suggestions for that?
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30116

27 Oct 2020, 17:33

Interesting. Yes, I guess those strings in preclinical are very long and -split- is choking on them. Here's a different approach that should work. It is likely to be very slow as well, but it should run (if you have version 16 of Stata):

Code:

gen long obs_no = _n

local list afghanistan albania algeria andorra angola ///
tanzania thailand timor-leste togo tonga "trinidad and tobago" ///
tunisia turkey turkmenistan tuvalu uganda ukraine ///
united arab emirates united kingdom "united states of america" uruguay ///
uzbekistan vanuatu venezuela vietnam yemen zambia zimbabwe uae ///
uk usa asia australasia europe america "s africa" us eu ///
"hong kong" "russian federation" burma "n&s america" ///
"s america" "selected markets"

frame create matches obs_no seq str144 match

forvalues i = 1/`=_N' {
    local seq = 0
    foreach l of local list {
        display `"`l'"'
        if regexm(lower(preclinical[`i']), "[^a-zA-Z]*`l'[^a-zA-Z]") {
            frame post matches (obs_no[`i']) (`++seq') (`"`l'"')
        }
    }
}

frame matches {
    reshape wide match, i(obs_no) j(seq)
    quietly compress
}

frlink 1:1 obs_no, frame(matches)
frget match*, from(matches)
drop obs_no matches

If you do not have version 16, you can modify this code to use a -postfile- instead of a -frame- to hold the intermediate results, and then bring those in using -merge 1:1 obs_no- with the postfile.

The version of the local macro list that I use here is from #1, slightly cleaned up.

Comment

Karishma DSouza

Join Date: Oct 2016
Posts: 111

#10

28 Oct 2020, 13:48

Clyde Schechter I'm so sorry to repeatedly bother you with this. I work off Stata15 and so needed to modify the -frames- to -postfile-. I'm not at all familiar with either but tried reading up and modifying the code to the below. At the end of the first loop (forvalues i = 1/`=_N'), I get the following r(198) error: "( invalid name". I'm also attaching here a screenshot of the error because I'm not sure how to best describe it.

Code:

 gen long obs_no = _n  local list afghanistan albania algeria andorra angola /// tanzania thailand timor-leste togo tonga "trinidad and tobago" /// tunisia turkey turkmenistan tuvalu uganda ukraine /// united arab emirates united kingdom "united states of america" uruguay /// uzbekistan vanuatu venezuela vietnam yemen zambia zimbabwe uae /// uk usa asia australasia europe america "s africa" us eu /// "hong kong" "russian federation" burma "n&s america" /// "s america" "selected markets"  postfile create matches obs_no seq str144 match using results, replace  forvalues i = 1/`=_N' {     local seq = 0     foreach l of local list {         display `"`l'"'         if regexm(lower(preclinical[`i']), "[^a-zA-Z]*`l'[^a-zA-Z]") {             postfile post matches (obs_no[`i']) (`++seq') (`"`l'"')         }     } }  postfile matches {     reshape wide match, i(obs_no) j(seq)     quietly compress }  merge 1:1 obs_no using postfile matches frget match*, from(matches) drop obs_no matches

Attached Files

Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30116

#11

28 Oct 2020, 17:08

Well, you didn't really get the -postfile- syntax right. It's pretty different from the syntax used with -frame- commands. Here's a fixed version:

Code:

gen long obs_no = _n  
local list afghanistan albania algeria andorra angola ///
tanzania thailand timor-leste togo tonga "trinidad and tobago" ///
tunisia turkey turkmenistan tuvalu uganda ukraine ///
united arab emirates united kingdom "united states of america" uruguay ///
uzbekistan vanuatu venezuela vietnam yemen zambia zimbabwe uae ///
uk usa asia australasia europe america "s africa" us eu ///
"hong kong" "russian federation" burma "n&s america" ///
"s america" "selected markets"  

tempfile results
postfile matches obs_no seq str144 match using `results', replace

forvalues i = 1/`=_N' {
     local seq = 0    
     foreach l of local list {
        display `"`l'"'        
        if regexm(lower(preclinical[`i']), "[^a-zA-Z]*`l'[^a-zA-Z]*") {
            post matches (obs_no[`i']) (`++seq') (`"`l'"')         
        }    
    }
}
postclose matches  


merge 1:m obs_no using `results', nogenerate
reshape wide match, i(obs_no) j(seq)
drop obs_no
order drug, first

Changes to your code shown in bold face.

Added: By the way, for future reference, when you are not using the current version of Stata, it is best to say what version you are using so that you will get a response that is appropriate to your situation. Do read the Forum FAQ to see this and other helpful advice for getting the most out of Statalist.

Last edited by Clyde Schechter; 28 Oct 2020, 17:19.

Comment

Karishma DSouza

Join Date: Oct 2016

Posts: 111
#12

29 Oct 2020, 12:44

Thank you, Clyde. That worked perfectly!
Comment

Rana Deljavan Anvari

Join Date: Sep 2021
Posts: 13

#13

28 Sep 2021, 13:27

Hi. I want to separate the countries, which exist in Stata, to importer countries and exporters countries which the Importers will contain 30 largest economies such as Germany, Japan, US, Spain, Belgium, Canada, Italy, India, Australia, Brazil, Mexico, Netherlands, Switzerland, Turkey, Ireland, Poland, Sweden, Israel, Denmark, Philippines, Egypt, .... and 149 Exporters countries like Afghanistan, Albania, Algeria, Andorra, Armenia, Areas, nes, Argentina, Armenia, Australia, Austria, Azerbaijan, Bahamas, Bahrain, Bangladesh, Barbados, Belarus, Belgium, Belize, Bosnia Herzegovina, Brazil, Bulgaria, Cambodia, Cameroon ....

Now, my question is how could I run it which shows Importers and Exporters in separate columns?

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str33 country float mdate double(n_cases n_deaths) str3 country_code
"Afghanistan"     720        0      0 "AFG"
"Afghanistan"     721        1      0 "AFG"
"Afghanistan"     722      113      4 "AFG"
"Afghanistan"     723     2590     81 "AFG"
"Afghanistan"     724    12501    172 "AFG"
"Afghanistan"     725    15762    464 "AFG"
"Afghanistan"     726     5743    563 "AFG"
"Afghanistan"     727     1452    118 "AFG"
"Afghanistan"     728     1179     60 "AFG"
"Afghanistan"     729     2292     79 "AFG"
"Afghanistan"     730     4211    222 "AFG"
"Afghanistan"     731     7065    467 "AFG"
"Afghanistan"     732     2450    183 "AFG"
"Afghanistan"     733      517     38 "AFG"
"Afghanistan"     734      903     61 "AFG"
"Afghanistan"     735     5063    174 "AFG"
"Africa (total)"  720        0      0 ""   
"Africa (total)"  721        6      0 ""   
"Africa (total)"  722     4623    143 ""   
"Africa (total)"  723    39806   1656 ""   
"Africa (total)"  724   102561   2423 ""   
"Africa (total)"  725   235734   5438 ""   
"Africa (total)"  726   575100  10611 ""   
"Africa (total)"  727   288355   9314 ""   
"Africa (total)"  728   270721   6922 ""   
"Africa (total)"  729   279842   6695 ""   
"Africa (total)"  730   367415   8524 ""   
"Africa (total)"  731   668590  15554 ""   
"Africa (total)"  732   840428  27848 ""   
"Africa (total)"  733   297189  10660 ""   
"Africa (total)"  734   298318   8010 ""   
"Africa (total)"  735   366131  10629 ""   
"Albania"         722      212     10 "ALB"
"Albania"         723      583     21 "ALB"
"Albania"         724      342      2 "ALB"
"Albania"         725     1265     22 "ALB"
"Albania"         726     3117    111 "ALB"
"Albania"         727     3861    114 "ALB"
"Albania"         728     4886    116 "ALB"
"Albania"         729     6936    122 "ALB"
"Albania"         730    16423    280 "ALB"
"Albania"         731    21813    395 "ALB"
"Albania"         732    25898    267 "ALB"
"Albania"         733    27561    479 "ALB"
"Albania"         734    13634    326 "ALB"
"Albania"         735     5192    147 "ALB"
"Algeria"         720        0      0 "DZA"
"Algeria"         721        3      0 "DZA"
"Algeria"         722      508     31 "DZA"
"Algeria"         723     3963    432 "DZA"
"Algeria"         724     4920    190 "DZA"
"Algeria"         725     3879    244 "DZA"
"Algeria"         726    18192    334 "DZA"
"Algeria"         727    12681    270 "DZA"
"Algeria"         728     7990    259 "DZA"
"Algeria"         729     6136    213 "DZA"
"Algeria"         730    23949    437 "DZA"
"Algeria"         731    18187    362 "DZA"
"Algeria"         732     8680    142 "DZA"
"Algeria"         733     5294    104 "DZA"
"Algeria"         734     3357     90 "DZA"
"Algeria"         735     6365    220 "DZA"
"America (total)" 720       17      0 ""   
"America (total)" 721      120      2 ""   
"America (total)" 722   165815   3041 ""   
"America (total)" 723  1319861  87393 ""   
"America (total)" 724  1443698  89631 ""   
"America (total)" 725  2243103  94643 ""   
"America (total)" 726  4630298 129494 ""   
"America (total)" 727  3623472 108691 ""   
"America (total)" 728  3823823 111132 ""   
"America (total)" 729  3529362  72709 ""   
"America (total)" 730  6005330  83967 ""   
"America (total)" 731 10302162 156619 ""   
"America (total)" 732 10461529 236289 ""   
"America (total)" 733  4600047 156329 ""   
"America (total)" 734  5066572 144829 ""   
"America (total)" 735  6967170 201575 ""   
"Andorra"         722      334      6 "AND"
"Andorra"         723      414     39 "AND"
"Andorra"         724       16      6 "AND"
"Andorra"         725       91      1 "AND"
"Andorra"         726       70      0 "AND"
"Andorra"         727      199      1 "AND"
"Andorra"         728      986      0 "AND"
"Andorra"         729     2715     22 "AND"
"Andorra"         730     1887      1 "AND"
"Andorra"         731     1480      8 "AND"
"Andorra"         732     2059     22 "AND"
"Andorra"         733      818      6 "AND"
"Andorra"         734     1162      5 "AND"
"Andorra"         735     1192     10 "AND"
"Angola"          722        7      2 "AGO"
"Angola"          723       28      0 "AGO"
"Angola"          724       51      2 "AGO"
"Angola"          725      181      7 "AGO"
"Angola"          726      881     41 "AGO"
"Angola"          727     1476     55 "AGO"
"Angola"          728     2746     86 "AGO"
"Angola"          729     5435     91 "AGO"
end
format %tm mdate

Also I ran the below code to create zero for all months of 2019, but I could not get the result.
replace case=0 if mdata==2019

Could you please guide me in both parts of my question?

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30116
#14

28 Sep 2021, 14:20

I ran the below code to create zero for all months of 2019, but I could not get the result.
replace case=0 if mdata==2019

The syntax for that would be:

Code:

replace case = 0 if year(dofm(mdate)) == 2019

Thank you for explaining how to identify Importers vs Exporters. But I still do not understand what you mean when you say you want them in separate columns? What would that look like? Which rows of which exporters would be paired up with which rows of the importers? Why do you want this? What do you plan to do with it?
Comment
Rana Deljavan Anvari

Join Date: Sep 2021

Posts: 13
#15

29 Sep 2021, 01:02

Thanks a lot for your response, Clyde. I mean, for instance, I have various countries in my datasets in a column. Now, I want to have two independent variables; for example, for Importers and Exporters for n_cases as well as I should have Importers and Exporters for n_deaths variables.
For more explanation:

Importers mdate n_cases n_deaths Exporters mdate n_cases n_deaths
Germany Albania
Japan Colombia
Belgium Bangladesh
United States Barbados
etc. etc

This is exactly what I want. Because my model is bilateral trade which I need import countries and export countries consider separately.
Comment

Announcement

Extracting two letter country names from a string variable

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment