Please need help creating a new variable based on the difference in values of another variable

Daniel McAdams

Join Date: Nov 2022
Posts: 26

Please need help creating a new variable based on the difference in values of another variable

26 Apr 2024, 11:10

I have a dataset with the 5 variables: country, gender, year, indicator, and rate. Only year and rate are numerical. Not all countries have the same number of years. There are 16 different countries, 2 values for gender (female and male), and there are 5 values for indicator (Security, Safety, Benefits, Income, and JQI). I want to:

1. Create a new variable named "Gap" that finds the difference between the maxrate for male and female for EVERY indicator that corresponds to each country and year.
2. Change the shape of data so the value of the indicators become their own variables. For example a column for "Security" "Safety" "Benefits", etc, in addition to a column for "Gap"

Thank you!

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str18 country str6 gender int year str9 indicator float maxrate
"Argentina" "Female" 2003 "Security"    .360841
"Argentina" "Female" 2004 "Security"   .3781143
"Argentina" "Female" 2006 "Security"   .4311728
"Argentina" "Female" 2019 "Security"    .472563
"Argentina" "Female" 2005 "Security"   .4101961
"Argentina" "Female" 2020 "Security"  .45858145
"Argentina" "Female" 2018 "Security"   .4777635
"Argentina" "Female" 2008 "Security"    .484028
"Argentina" "Female" 2012 "Security"   .5111926
"Argentina" "Female" 2009 "Security"   .4936739
"Argentina" "Female" 2007 "Security"   .4653374
"Argentina" "Female" 2021 "Security"   .4874307
"Argentina" "Female" 2017 "Security"   .4982486
"Argentina" "Female" 2010 "Security"   .5049549
"Argentina" "Female" 2016 "Security"   .5038539
"Argentina" "Female" 2011 "Security"    .509892
"Argentina" "Female" 2014 "Security"   .5088673
"Argentina" "Female" 2013 "Security"    .527341
"Argentina" "Female" 2003 "Satisfied"  .7271592
"Argentina" "Female" 2004 "Satisfied"  .7422624
"Argentina" "Female" 2006 "Satisfied"  .7584438
"Argentina" "Female" 2019 "Satisfied"  .7617159
"Argentina" "Female" 2005 "Satisfied"  .7644056
"Argentina" "Female" 2020 "Satisfied"  .7731675
"Argentina" "Female" 2018 "Satisfied"  .7758492
"Argentina" "Female" 2008 "Satisfied"  .7853383
"Argentina" "Female" 2012 "Satisfied"  .7867281
"Argentina" "Female" 2009 "Satisfied"  .7886654
"Argentina" "Female" 2007 "Satisfied"  .7890391
"Argentina" "Female" 2021 "Satisfied"  .7914971
"Argentina" "Female" 2017 "Satisfied"  .7929662
"Argentina" "Female" 2010 "Satisfied"  .7958922
"Argentina" "Female" 2016 "Satisfied"  .7992005
"Argentina" "Female" 2011 "Satisfied"   .799603
"Argentina" "Female" 2014 "Satisfied"  .8047739
"Argentina" "Female" 2013 "Satisfied"  .8058099
"Argentina" "Female" 2003 "Benefits"  .33799475
"Argentina" "Female" 2004 "Benefits"  .36472055
"Argentina" "Female" 2006 "Benefits"   .4192061
"Argentina" "Female" 2019 "Benefits"  .50244933
"Argentina" "Female" 2005 "Benefits"   .3962191
"Argentina" "Female" 2020 "Benefits"   .5088221
"Argentina" "Female" 2018 "Benefits"  .50271964
"Argentina" "Female" 2008 "Benefits"   .4685473
"Argentina" "Female" 2012 "Benefits"   .5114124
"Argentina" "Female" 2009 "Benefits"   .4891078
"Argentina" "Female" 2007 "Benefits"   .4521289
"Argentina" "Female" 2021 "Benefits"   .5173345
"Argentina" "Female" 2017 "Benefits"  .51861537
"Argentina" "Female" 2010 "Benefits"    .506619
"Argentina" "Female" 2016 "Benefits"   .5274991
"Argentina" "Female" 2011 "Benefits"  .50157076
"Argentina" "Female" 2014 "Benefits"   .5362883
"Argentina" "Female" 2013 "Benefits"  .53849596
"Argentina" "Female" 2003 "Income"     .5835037
"Argentina" "Female" 2004 "Income"     .6242281
"Argentina" "Female" 2006 "Income"     .7246749
"Argentina" "Female" 2019 "Income"     .7672296
"Argentina" "Female" 2005 "Income"     .6779246
"Argentina" "Female" 2020 "Income"     .7178292
"Argentina" "Female" 2018 "Income"     .7929612
"Argentina" "Female" 2008 "Income"     .7810154
"Argentina" "Female" 2012 "Income"      .814217
"Argentina" "Female" 2009 "Income"     .7847396
"Argentina" "Female" 2007 "Income"     .7614899
"Argentina" "Female" 2021 "Income"     .7862559
"Argentina" "Female" 2017 "Income"     .8185806
"Argentina" "Female" 2010 "Income"     .8052228
"Argentina" "Female" 2016 "Income"      .814315
"Argentina" "Female" 2011 "Income"     .8215767
"Argentina" "Female" 2014 "Income"     .8145064
"Argentina" "Female" 2013 "Income"     .8366426
"Argentina" "Female" 2003 "JQI"       .50237465
"Argentina" "Female" 2004 "JQI"       .52733135
"Argentina" "Female" 2006 "JQI"        .5833744
"Argentina" "Female" 2019 "JQI"        .6259894
"Argentina" "Female" 2005 "JQI"       .56218636
"Argentina" "Female" 2020 "JQI"        .6146001
"Argentina" "Female" 2018 "JQI"        .6373234
"Argentina" "Female" 2008 "JQI"        .6297323
"Argentina" "Female" 2012 "JQI"        .6558875
"Argentina" "Female" 2009 "JQI"        .6390467
"Argentina" "Female" 2007 "JQI"        .6169989
"Argentina" "Female" 2021 "JQI"        .6456295
"Argentina" "Female" 2017 "JQI"        .6571026
"Argentina" "Female" 2010 "JQI"        .6531722
"Argentina" "Female" 2016 "JQI"        .6612171
"Argentina" "Female" 2011 "JQI"        .6581606
"Argentina" "Female" 2014 "JQI"         .666109
"Argentina" "Female" 2013 "JQI"        .6770724
"Argentina" "Male"   2003 "Security"   .4442297
"Argentina" "Male"   2004 "Security"  .46667385
"Argentina" "Male"   2005 "Security"   .4926981
"Argentina" "Male"   2020 "Security"   .5125269
"Argentina" "Male"   2019 "Security"   .5426688
"Argentina" "Male"   2018 "Security"  .56642944
"Argentina" "Male"   2006 "Security"   .5288204
"Argentina" "Male"   2021 "Security"   .5523707
"Argentina" "Male"   2016 "Security"  .57382536
"Argentina" "Male"   2009 "Security"   .5579912
end

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

26 Apr 2024, 11:53

You don't say in which direction you want to calculate the gap. I did it as the male maxrate minus the female.

The code doesn't work well with your example data, because for all of the indicators other than Security, you have only Female data, so no gap could be calculated. I assume you won't have that problem in your real data set.

Code:

by country year indicator (gender), sort: assert _N <= 2 by country year indicator (gender): gen gap_ = maxrate[2] - maxrate[1] rename maxrate maxrate_ reshape wide maxrate_ gap_, i(country gender year) j(indicator) string

Caution, if the first command gives you an error message and halts execution, do not proceed. That command is there to verify that you have unique observations for each combination of country, year, indicator, and gender. That assumption, true in the example data, must hold in the full data in order for the code to produce correct answers. Also, if that assumption doesn't hold, then there are probably errors in your data management or in the data itself, as there should be only one correct value of maxrate for each such combination.

Finally, all of the values of indicator in your example data are suitable for inclusion in variable names: they are short enough, and they consist only of letters. But if there are other indicators whose names include, for example, embedded blanks, or special characters, the -reshape- will fail because it will be unable to create variables with the desired names. If you encounter this difficulty, then before the -reshape- command, add -replace indicator = strtoname(indicator)-, which will edit indicator into a similar value that meets Stata's restrictions on variable names. (If the resulting string is too long, you will need yet another -replace indicator = ...- command using the -substr()- function to shorten it to the allowable size.
Comment
Daniel McAdams

Join Date: Nov 2022

Posts: 26
#3

26 Apr 2024, 12:05

Thanks so much, Clyde Schechter . As your predicted, I got the following message after running:

Code:

by country year indicator (gender), sort: assert _N <= 2 850 contradictions in 850 by-groups assertion is false

Is there any way to get around it? The data use is raw data so I did not make any changes!

I have re-thought how I want to calculate the gap and I would like to have 5 new columns that calculate the gap for each indicator. So having the gap in the vertical way. Could you help me with this too? Thanks
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#4

26 Apr 2024, 12:50

So, the first thing to do is identify the offending observations. Since there are violations of this assumption in all 850 combinations of country year and indicator, it seems that everything violates this. One possibility is that the data has more than two values for the gender variable. Your example data didn't, but maybe in the full data there is a third value for missing or unspecified? Or perhaps they are using gender in the post-modern sense and there are many different genders recorded? Look at your data in the browser and see whether that's what's happening. If so, assuming you still want to define the gap as the difference between male and female, ignoring other gender variants, we can modify the code as follows*:

Code:

isid country year indicator gender, sort by country year indicator (gender): egen max_female /// = max(cond(gender == "Female", maxrate, .)) by country year indicator (gender): egen max_male /// = max(cond(gender == "Male", maxrate, .)) gen gap_ = max_male - max_female drop max_female max_male rename maxrate maxrate_ reshape wide gap_ maxrate_, i(country gender year) j(indicator) string

Note: if you get an error message at the -isid- command then we are not, in fact, simply in this post-modern gender designation situation: you have surplus observations that need to be eliminated even taking into account the multiple genders. The following paragraphs deal with this situation and others:

Alternatively, there may be surplus observations for values of country year indicator and gender, that is, more than one observation with the same values on all four of those variables. You can identify them with:

Code:

duplicates tag country year indicator gender, gen(flag) sort country year indicator gender browse if flag

There are several possible situations.
The observations having the same values of country year indicator and gender are all identical on all variables in the data set. They are, that is, pure and complete duplicates. In that case you can remove them by running -duplicates drop-. Then you can go back to using the code shown above in this post. I should point out, however, that usually the presence of pure duplicates reflects an error in the data management that created this data set. I understand that you have used the original raw source data without modification. So the error would have been made by your source, and you should notify them of this so that they can recheck how they created this data set, or perhaps explain why it is supposed to be the way you have it (see, for example, 2. below). While you can get past this problem with -duplicates drop-, you really should pursue this question of why there are pure duplicates, because whatever led to their presence may also reflect the presence of other errors in the data. In short, the data is not trustworthy, though it may turn out to be correct on investigation.

The surplus observations disagree on some other variables. It may be that what they disagree on is, for example, a variable indicating a subdivision of the country, or month of the year, or subtypes of the indicators, or something else indicating that you are grouping your data too coarsely and you have to work with it at a more fine-grained level. In that case, fixing the code requires adding these subdivididng variables into the -by- prefixes, before -(gender)-, and the -i()- option of the -reshape- command.

The surplus observations disagree on some other variables, but these other variables do not point to subdivision of your data to a more fine-grained level. In this case, you are working with data that is self-contradictory. You cannot proceed with this. Something is, without question, wrong with the data. You will have to take the matter up with the source and have them fix it. N.B.: It may turn out that, in fact, the apparent contradictions do arise from subdivision, but the mistake was that they forgot to give you the subdividing variable. That's the best case scenario, and it puts you back in case 2 just above. But don't be surprised if it turns out that the data are genuinely self-contradictory and it may take your source some time and effort to figure out which, if any, of the contradictory observations is correct, and how to fix all of the errors.

*Added: One caution about this. When Stata tests strings for equality, it is case-sensitive. So if your gender variable contains some values that are Male and others that are male, then this is an error, not post-modernism. This is the same gender being coded in two different ways and it's messing up the data. If it turns out you have this situation, run -replace gender = proper(gender)- and it will turn male into Male and female into Female, and then you should be able to run the original code in #2 of this thread without getting an error message.

Additional addition: Using raw data unmodified is usually a bad idea. Most data sets contain things that interfere with proper analysis, and even outright errors. Best practice is to import the raw data source and then modify it to remove these problems, saving the resulting data set under a different name. (Never overwrite the original source data.) This process is known as data cleaning. It involves inspecting all of the data, with various commands designed for that purpose, to remove impossible values, values of variables within observations or across observations that are inconsistent, changing "magic number" missing value codes to Stata missing values, changing string variables that look like dates to Stata date variables, and other processes. Attempting to analyze data that has not been cleaned usually ends in tears.

Last edited by Clyde Schechter; 26 Apr 2024, 12:57.
Comment
Daniel McAdams

Join Date: Nov 2022

Posts: 26
#5

26 Apr 2024, 13:18

This was super helpful! I managed to fix the issue
Comment

Announcement

Please need help creating a new variable based on the difference in values of another variable

Comment

Comment

Comment

Comment