Creating a dummy to compare two columns of string variables

Steffen Mauch

Join Date: Dec 2021

Posts: 37
#1

Creating a dummy to compare two columns of string variables

09 Jun 2023, 09:26

Here is an excerpt from my data:

Code:

input id_notice_cn iso_country_code admin_languages_tender OfficialLanguages 20221025 "IT" "IT" "IT" 20221026 "PL" "PL" "PL" 20221027 "FR" "FR" "FR" 20221028 "FR" "FR" "FR" 20221029 "FR" "FR" "FR" 20221030 "FI" "FI" "FI|SV" 20221031 "DE" "DE" "DE" 20221032 "IT" "IT" "IT" 20221033 "IT" "DE|IT" "IT" 20221034 "IT" "DE|IT" "IT" 20221035 "ES" "CA|ES" "ES|CA|GL|EU" 20221036 "LT" "LT" "LT"

I would like to create a dummy that is 1 for each id_notice_cn if there are additional ("overlapping") languages contained in admin_languages_tender that are not contained in OfficialLanguages. For example, 20221033 and 20221034 contain "DE" and "IT" for admin_languages_tender. In OfficialLanguages there is only IT contained. Hence, the dummy should be 1 for those two observations.

The tricky thing is that the dummy should not be sensitive to the order of languages in the language codes (e.g. for some observations it might be "DE|IT", for others "IT|DE" and so on but that is not supposed to change the result)..

What I tried so far is the following:

Code:

sort id_notice_cn gen lang_diff = 0 split OfficialLanguages, p("|") foreach langvar in OfficialLanguages1 OfficialLanguages2 OfficialLanguages3 OfficialLanguages4 { replace lang_diff = 1 if strpos(admin_languages_tender, `langvar') == 0 & admin_languages_tender != "" }

However, that produces not the desired result.

Any help would be greatly appreciated.

Last edited by Steffen Mauch; 09 Jun 2023, 09:31.
Tags: dummy, strings

Clyde Schechter

Join Date: Apr 2014
Posts: 30168

09 Jun 2023, 09:50

I can't help thinking that there is a simpler, better way to do this. But the following works:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input long id_notice_cn str4 iso_country_code str7 admin_languages_tender str13 OfficialLanguages
20221025 "IT" "IT"    "IT"         
20221026 "PL" "PL"    "PL"         
20221027 "FR" "FR"    "FR"         
20221028 "FR" "FR"    "FR"         
20221029 "FR" "FR"    "FR"         
20221030 "FI" "FI"    "FI|SV"      
20221031 "DE" "DE"    "DE"         
20221032 "IT" "IT"    "IT"         
20221033 "IT" "DE|IT" "IT"         
20221034 "IT" "DE|IT" "IT"         
20221035 "ES" "CA|ES" "ES|CA|GL|EU"
20221036 "LT" "LT"    "LT"         
end

capture program drop one_notice
program define one_notice
    tempname off_lang
    frame put OfficialLanguages, into(`off_lang')
    frame `off_lang' {
        split OfficialLanguages, parse("|") gen(language)
        gen `c(obs_t)' obs_no = _n
        reshape long language, i(obs_no)
        levelsof language, local(official_languages)
    }
    tempname admin_lang
    frame put admin_languages_tender, into(`admin_lang')
    frame `admin_lang' {
        split admin_languages_tender, parse("|") gen(language)
        gen `c(obs_t)' obs_no = _n
        reshape long language, i(obs_no)
        levelsof language, local(admin_languages)
    }
    local extra: list admin_languages - official_languages
    gen byte wanted = (`"`extra'"' != "")

end

isid id_notice_cn
runby one_notice, by(id_notice_cn)

-runby- is written by Robert Picard and me. It is available from SSC.

Your example data is not true -dataex- output, though it looks somewhat like it. The point is, that as shown, it is not directly importable to Stata. It took me longer to clean up the example input than it did to solve the actual problem posed. In the future, always use the -dataex- command to show example data--do not try to imitate it. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

Comment

Hemanshu Kumar

Join Date: Mar 2015
Posts: 1478

10 Jun 2023, 11:10

I second Clyde's remark about using actual dataex output.

Here is a somewhat modified version of the OP's code that I think does the job:

Code:

sort id_notice_cn
split admin_languages_tender, p("|")

gen lang_diff = 0
foreach langvar of varlist admin_languages_tender? {
    replace lang_diff = 1 if strpos("|" + OfficialLanguages + "|", "|" + `langvar' + "|") == 0 ///
            & `langvar' != ""
}

drop admin_languages_tender?

which produces:

Code:

. list, noobs sep(0) abbrev(30)

  +------------------------------------------------------------------------------------------+
  | id_notice_cn   iso_country_code   admin_languages_tender   OfficialLanguages   lang_diff |
  |------------------------------------------------------------------------------------------|
  |     20221025                 IT                       IT                  IT           0 |
  |     20221026                 PL                       PL                  PL           0 |
  |     20221027                 FR                       FR                  FR           0 |
  |     20221028                 FR                       FR                  FR           0 |
  |     20221029                 FR                       FR                  FR           0 |
  |     20221030                 FI                       FI               FI|SV           0 |
  |     20221031                 DE                       DE                  DE           0 |
  |     20221032                 IT                       IT                  IT           0 |
  |     20221033                 IT                    DE|IT                  IT           1 |
  |     20221034                 IT                    DE|IT                  IT           1 |
  |     20221035                 ES                    CA|ES         ES|CA|GL|EU           0 |
  |     20221036                 LT                       LT                  LT           0 |
  +------------------------------------------------------------------------------------------+

Comment

Steffen Mauch

Join Date: Dec 2021

Posts: 37
#4

13 Jun 2023, 08:38

First of all, let me apologize for my embarrassing mistake of not inputting the data properly. Apparently, I didn't 'copy and paste' correctly and I only realized that after you flagged this to me. I'm terribly sorry for having wasted your time.

I thank you very much for your responses, though!

@Clyde: Thank you very much for your solution. Unfortunately, I find myself unable to install the -runby- package. I believe my institution does not allow me to access this 'external' source, at least telling from the output I get.

Code:

. ssc install runby remote connection failed http://fmwww.bc.edu/repec/bocode/r/ either 1) is not a valid URL, or 2) could not be contacted, or 3) is not a Stata download site (has no stata.toc file). r(677);

I will try to run your code as soon as I have access to another Stata license that allows me to do so.

@Hemanshu Your code seems to do the trick! Thanks a lot for this solution! I'll have to understand what your code does line by line (I'm not very familiar with handling strings) and see if it works for all of the data I have at hand but that's a great start. Thank you again!
Comment

Announcement