Generating variables based on unique values of parsed variables & merging

Laura Hill

Join Date: Aug 2021
Posts: 42


05 Oct 2022, 09:58

Dear all,

I have generated one indicator variable for each nationality recorded in a variable.
I am able to do so using the code

Code:

tab DMNationality, generate(nationality)

But this variable can contain multiple nationalities per person, separated by a semicolon (;).
(For example, person A might have 'France; Italy' as the value of DMNationality, and person B 'Italy;France;Canada'.)
I can parse these with the following code

Code:

split DMNationality, parse(;) generate(Parse)

Which results in the following output

Code:

. split DMNationality, parse(;) generate(Parse)
variables created as string:
Parse1  Parse2  Parse3  Parse4  Parse5

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str63 DMNationality str37 Parse1 str27(Parse2 Parse3) str20 Parse4 str16 Parse5
"Czech Republic;Germany"        "Czech Republic" "Germany" ""          "" ""
"Czech Republic;Germany"        "Czech Republic" "Germany" ""          "" ""
"Italy;Austria"                 "Italy"          "Austria" ""          "" ""
"Italy;Austria"                 "Italy"          "Austria" ""          "" ""
"Austria"                       "Austria"        ""        ""          "" ""
"Austria"                       "Austria"        ""        ""          "" ""
"Austria"                       "Austria"        ""        ""          "" ""
"Austria"                       "Austria"        ""        ""          "" ""
"Germany"                       "Germany"        ""        ""          "" ""
"Germany"                       "Germany"        ""        ""          "" ""
"Austria;Germany"               "Austria"        "Germany" ""          "" ""
"Austria"                       "Austria"        ""        ""          "" ""
"Austria"                       "Austria"        ""        ""          "" ""
"Austria"                       "Austria"        ""        ""          "" ""
"Austria"                       "Austria"        ""        ""          "" ""
"Switzerland"                   "Switzerland"    ""        ""          "" ""
"Switzerland"                   "Switzerland"    ""        ""          "" ""
"Switzerland"                   "Switzerland"    ""        ""          "" ""
"Switzerland"                   "Switzerland"    ""        ""          "" ""
"Switzerland;Austria;Australia" "Switzerland"    "Austria" "Australia" "" ""
"Switzerland;Austria;Australia" "Switzerland"    "Austria" "Australia" "" ""
"Austria"                       "Austria"        ""        ""          "" ""
"Austria"                       "Austria"        ""        ""          "" ""
"Switzerland"                   "Switzerland"    ""        ""          "" ""
"Switzerland"                   "Switzerland"    ""        ""          "" ""
"Switzerland"                   "Switzerland"    ""        ""          "" ""
"Switzerland"                   "Switzerland"    ""        ""          "" ""
"Switzerland"                   "Switzerland"    ""        ""          "" ""
"Austria;Germany"               "Austria"        "Germany" ""          "" ""
"Germany"                       "Germany"        ""        ""          "" ""
end

So the issue I now have is that Parse1 - Parse5 can contain the same values (nationalities). If I used tabulate with generate() on each of them, several of the resulting variables would refer to the same nationality.
For example, the variable 'nationality_Parse1_1' created from 'Parse1' might represent the same nationality as the variable 'nationality_Parse2_13' created from 'Parse2'; both might be France (e.g. the two variables would have similar labels: 'Parse1 == France' and 'Parse2 == France').

My final goal in creating these variables is to compute a Blau diversity index.
Does anyone have a solution for my issue?
For example, is there a way to update an existing variable, based on the last part of the label (e.g. '== France'), instead of creating a new one?
Thank you in advance for your time and help!

Best regards,
Laura

Last edited by Laura Hill; 05 Oct 2022, 10:01.

Tags: country dummies, label, merging variables, parse, tabulate

Mike Lacy

Join Date: Apr 2014

Posts: 2449
#2

05 Oct 2022, 12:02

I believe there might be a data structure that would be useful here, but my answer would depend on what you have in mind when you say you want "to create a Blau diversity index."
Whatever you call that measure (IQV, Simpson's Index, etc.), it presumes a multinomial variable with exactly one outcome for each unit measured. I don't get what you have in mind in that regard: You could create a single categorical variable encompassing all nationality categories, but then some persons would have to have multiple observations, which to my understanding does not work for such a variable. Or, you could have one variable for each person with categories for all possible combinations of 1 nationality, 2 nationalities, ..., k nationalities, which I can't imagine being of any possible use. I wonder if you might just be interested in examining the amount of variation in the number of nationalities a person has, which would be something different. So, yes, I can think of various ways to create a consistent variable indicating a person's nationalities, but how to do that in a useful way would depend on what you wanted the Blau index to measure. If you can show an example or otherwise explain a bit how you want to apply the Blau index, that would likely lead to a better answer.
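
For reference, the index usually called Blau's (the same form as the Gini-Simpson index) is defined over the category shares of a single categorical variable. Writing p_k for the share of units in category k out of K categories (notation assumed here, not taken from the posts above):

```latex
B = 1 - \sum_{k=1}^{K} p_k^{2}, \qquad \sum_{k=1}^{K} p_k = 1
```

With shares of 0.6, 0.2 and 0.2, for instance, B = 1 - (0.36 + 0.04 + 0.04) = 0.56. The shares are proportions of units, which is why each unit is normally expected to fall in exactly one category.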
Laura Hill

Join Date: Aug 2021

Posts: 42
#3

06 Oct 2022, 01:55

Hi Mike Lacy, thank you for your reply. Looking at your message,
I realize that what I am trying to achieve might not make sense from a logical standpoint.
The idea would be to create, per company, a variable holding a diversity measure for each year (some people only joined the team later, as recorded in the dummy variables t0-t10).

I have used the method described by David Benson in the following post .
Initially, I used only Parse1 (the first-mentioned nationality) to create the Blau index,
so each person had only one nationality attributed to them.
I wanted to see what would change if a person were "counted" more than once, once for each of their nationalities.

However, I do wonder whether this is a logical step to take.
Thanks in advance for your advice!
Mike Lacy

Join Date: Apr 2014

Posts: 2449
#4

06 Oct 2022, 08:37

OK, so you *do* want to use multiple observations per person, once for each of their nationalities, and to then calculate a diversity index on that data. It's possible to do this, but I have never seen the diversity index used with more than one observation per object (individual). Perhaps someone else knows of an application like that, but I don't, and it would (to my knowledge) contradict the logic on which that measure is based.

Nevertheless, what you want can be done by reshaping to a long format, and then calculating the index on the nationality variable in the resulting data set:

Code:

// Create a separate observation for each nationality an individual reports
rename Parse* nationality* // meaningful names
drop DM* // redundant
gen id = _n
reshape long nationality, i(id) j(seq)
drop if strtrim(nationality) == ""
tab nationality
entropyetc nationality

The community-contributed program -entropyetc-, available at SSC and used in the code of the posting you referenced, can calculate various diversity indices, but be careful because its return list accidentally exchanges the Simpson and the exp(H) indices it displays. -entropyetc- allows an -if- qualifier and a -by()- option, so you could do calculations separately by year and company.
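
For anyone who wants to see the arithmetic, here is a minimal sketch that computes 1 - sum(p^2) by hand on long-format data and then builds one indicator per nationality, whatever position the nationality was listed in. It is only an illustration, not how -entropyetc- works internally: the data are made up, and the names team, id, n_rows, p2, blau and the nat_ prefix are all assumptions.

```stata
* Made-up long-format data: one row per person-nationality pair
clear
input byte(team id) str12 nationality
1 1 "Austria"
1 1 "Germany"
1 2 "Austria"
1 3 "Switzerland"
1 4 "Austria"
2 5 "Germany"
2 6 "Germany"
end

* Number of person-nationality rows in each team
bysort team: gen long n_rows = _N

* Squared share of each nationality, stored once per team-nationality group
bysort team nationality: gen double p2 = cond(_n == 1, (_N / n_rows)^2, 0)

* Blau index per team: 1 minus the sum of squared shares
bysort team: egen double blau = total(p2)
replace blau = 1 - blau
* team 1: 1 - (0.36 + 0.04 + 0.04) = 0.56; team 2: 1 - 1 = 0

* One 0/1 indicator per nationality, named after the nationality itself
levelsof nationality, local(nats)
foreach n of local nats {
    local v = strtoname("nat_`n'")
    egen byte `v' = max(nationality == "`n'"), by(id)
}

list team id nationality blau nat_*, sepby(id)
```

Because the indicators are built from the single long variable, 'France' listed first and 'France' listed third end up in the same variable. Note that a person with two nationalities contributes two rows to the shares here, which is exactly the departure from the usual one-observation-per-unit logic discussed above.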