Creating new variable by subtraction of two existing ones.

Alfonso Toribio

Join Date: May 2022

Posts: 15
#1

09 May 2022, 10:16

Hi, there!

I'm new to the Stata forum. I've searched for a similar question, but the procedures I found gave me different results than I expected.

I have two variables, called "subjective" (a 5-point scale of subjective social class) and "income" (also a 5-point scale, standardized like the first one). I want to create a new variable (for example, "mismatch") that is the result of subtracting one from the other at each value. The new variable should therefore also have 5 values, so that the fifth step of "mismatch" is the difference between the fifth steps of "subjective" and "income". I expect to have both negative and positive values, because the outcome variable should let me see underrepresentation of a social class when the value is negative, and overrepresentation when it is positive.

In the photo I've attached, the blacked row shows (in percentages) what I've calculated manually and what results for the new variable should look like (they are the results of subtracting the first by the second row).

I'm really sorry if I'm repeating the question, I searched for something even close but I could find nothing. Hope you can help me. Thanks in advance!
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#2

09 May 2022, 10:29

What you show in your screenshot, even apart from the final two columns of that table, is clearly not Stata data. It is some kind of summary table you presumably created from Stata data, and then fed into some pretty-print application. While this illustrates what you want to end up with, nobody can get you from here to there, as we have no information about what "here" is. So please post back showing an example from the Stata data set that underlies these calculations. Use the -dataex- command to do that. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

If you do that, I think you will get a timely and helpful response.
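
For the two variables described in the first post, the call might look like the following. This is only a sketch: it assumes the variables are named subjective and income, as stated above, and that the data set is already open in Stata.

```stata
* Uncomment the next line only if -dataex- is not already installed
* ssc install dataex

* Produce pasteable code for the two variables
* (by default -dataex- shows at most the first 100 observations)
dataex subjective income
```

Copy the output from the Results window and paste it into your reply between code delimiters.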

Alfonso Toribio

Join Date: May 2022

Posts: 15
#3

09 May 2022, 11:14

Thanks, Clyde! Let me know if I've done it right.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte(subjective income)
3 3
4 4
3 3
4 4
3 3
3 3
3 3
2 2
4 4
4 4
3 2
4 4
4 4
4 4
5 4
3 3
4 4
4 4
4 3
4 3
2 1
5 5
3 2
3 2
3 3
3 3
3 3
5 5
5 4
3 2
2 2
5 4
3 3
3 3
2 2
5 5
3 3
5 5
3 3
2 2
2 2
4 4
3 3
4 4
1 1
4 3
4 4
4 4
3 3
4 4
2 1
4 4
3 3
3 3
4 4
3 3
4 4
4 4
4 4
4 4
4 4
4 4
3 3
4 4
3 3
4 4
4 4
3 3
3 3
5 4
4 4
5 5
4 4
5 5
3 3
4 4
5 5
3 3
4 4
3 3
4 4
5 5
4 4
4 4
3 2
5 4
5 4
5 4
4 4
5 4
4 4
4 4
4 4
4 4
5 4
5 4
5 4
5 4
4 4
4 4
end
label values subjective X045
label def X045 1 "Upper class", modify
label def X045 2 "Upper middle class", modify
label def X045 3 "Lower middle class", modify
label def X045 4 "Working class", modify
label def X045 5 "Lower class", modify
label values income income
label def income 1 "Fifth step", modify
label def income 2 "Forth step", modify
label def income 3 "Third step", modify
label def income 4 "Second step", modify
label def income 5 "Lower step", modify


Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#4

09 May 2022, 14:30

Your use of -dataex- was perfect, thank you!

The following will create a data set in frame results that contains the figures you want, laid out as you want. You can tinker with the details of the appearance.

Code:

count if !missing(subjective)
by subjective, sort: gen pct_subjective = 100*_N/`r(N)'
count if !missing(income)
by income, sort: gen pct_income = 100*_N/`r(N)'

tempfile labeler
label save X045 using `labeler'

frame create results int class_num float(subjective income)
forvalues i = 1/5 {
    summ pct_subjective if subjective == `i', meanonly
    local subjective = r(mean)
    summ pct_income if income == `i', meanonly
    local income = r(mean)
    frame post results (`i') (`subjective') (`income')
}

frame change results
run `labeler'
label values class_num X045
gen percent_diff = income - subjective
gen result = "Overrepresentation" if percent_diff > 0
replace result = "Underrepresentation" if percent_diff < 0
list, noobs clean
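
Because the results live in their own frame, the original data are still in memory. A minimal sketch of getting back to them, assuming the original data are in the frame called default (the name Stata gives the starting frame unless you have renamed it):

```stata
* Requires Stata 16 or later, like the code above
frame change default     // make the original individual-level data current again
frames dir               // list all frames in memory; results should still be there
```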


Alfonso Toribio

Join Date: May 2022

Posts: 15
#5

09 May 2022, 15:33

Sorry for not mentioning this earlier, Clyde Schechter: I have Stata version 14, so the frame command does not work for me. I'm still learning how this works.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#6

09 May 2022, 16:44

I can't help with the code at the moment, but can advise that the best way to learn how Statalist works is to spend a few moments reading through the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post.

As you've probably realized, in future topics your first post should include the fact that you are working with Stata 14. Unless you get lucky and score an upgrade to the latest version.

Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#7

09 May 2022, 17:12

So, before there were frames, we used -postfile-s for this kind of situation.

Code:

count if !missing(subjective)
by subjective, sort: gen pct_subjective = 100*_N/`r(N)'
count if !missing(income)
by income, sort: gen pct_income = 100*_N/`r(N)'

tempfile labeler
label save X045 using `labeler'

capture postutil clear
tempfile no_frames
postfile results int class_num float (subjective income) using `no_frames'

forvalues i = 1/5 {
    summ pct_subjective if subjective == `i', meanonly
    local subjective = r(mean)
    summ pct_income if income == `i', meanonly
    local income = r(mean)
    post results (`i') (`subjective') (`income')
}
postclose results

use `no_frames', clear
run `labeler'
label values class_num X045
gen percent_diff = income - subjective
gen result = "Overrepresentation" if percent_diff > 0
replace result = "Underrepresentation" if percent_diff < 0
list, noobs clean

One important difference from the frames-based code is that this code removes the original data from memory. If you need to keep it around for further use you should either save it before you get to the -use `no_frames'- command or -preserve- it at that point and then -restore- it at the end.
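
The first of those options can be sketched as follows. The tempfile name original is just a placeholder chosen for illustration, and the sketch assumes the individual-level data are in memory when it starts. Run everything in one go from a single do-file, because local macros such as `original' disappear once the do-file finishes.

```stata
tempfile original
save `original'            // copy of the individual-level data, taken before any results are built

* ... the -postfile- code shown above goes here, ending with -list, noobs clean- ...

use `original', clear      // reload the individual-level data once you are done with the results
```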


Alfonso Toribio

Join Date: May 2022

Posts: 15
#8

10 May 2022, 05:12

Many thanks!
I've run the code and got my results. However, I tried both saving the data before running the -use `no_frames'- command and using preserve/restore, and in neither case am I able to bring the new variable percent_diff from the results data set into my full data set (the original one, with all the other variables).
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#9

10 May 2022, 09:17

OK. The thing to remember is that in version 14 (or any version before frames were introduced) you can only have a single data set open in memory. So, in order to bring the variable percent_diff back into the original data set, you have to also save those intermediate results before bringing the original data back into memory.

But apart from those mechanics, it is unclear in what way you want to join the percent_diff variable to the original data. The percent_diff variable takes on five different values, and none of those values really corresponds to any particular observation in the original data. Rather, each of those values represents group properties of the original data set as a whole. It seems that linking any of the newly created observations to any of the observations in the original data set is arbitrary. So it isn't clear to me how you want to link these things up.

I suppose the least bizarre way to do this would be to attach each value of percent_diff to every observation with the corresponding value of the subjective variable in the data set, but the purpose of doing even this eludes me. Anyway, on the assumption that this is what you want to do:

Code:

count if !missing(subjective)
by subjective, sort: gen pct_subjective = 100*_N/`r(N)'
count if !missing(income)
by income, sort: gen pct_income = 100*_N/`r(N)'

tempfile labeler
label save X045 using `labeler'

capture postutil clear
tempfile no_frames
postfile results int class_num float (subjective income) using `no_frames'

forvalues i = 1/5 {
    summ pct_subjective if subjective == `i', meanonly
    local subjective = r(mean)
    summ pct_income if income == `i', meanonly
    local income = r(mean)
    post results (`i') (`subjective') (`income')
}
postclose results

preserve
use `no_frames', clear
run `labeler'
label values class_num X045
gen percent_diff = income - subjective
gen result = "Overrepresentation" if percent_diff > 0
replace result = "Underrepresentation" if percent_diff < 0
list, noobs clean
save `"`no_frames'"', replace
restore

clonevar class_num = subjective
merge m:1 class_num using `no_frames', assert(match) nogenerate ///
    keepusing(percent_diff result)
drop class_num
browse
Alfonso Toribio

Join Date: May 2022

Posts: 15
#10

10 May 2022, 10:49

I'm currently working on it, I'll let you know if I can achieve my objective. The differences between the variables, which are reflected in "percent_diff", are a measure of representation for each class. I want to have the level of representation as a dependent variable in order to see what explains the incongruency between income and subjective social class (the two variables from which we create percent_diff). Let me know if I'm being clear enough, and whether I'm asking too many questions. Thank you so much!
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#11

10 May 2022, 10:54

I want to have the level of representation as a dependent variable in order to see what explains the incongruency between income and subjective social class (the two variables from which we create percent_diff).

But this percent_diff variable is not defined as an individual-level attribute. It is only defined at the full-population level. So I don't see any possibility to "explain" it with individual level features, which is why I don't understand why you would want to incorporate it into the individual-level data set you started with.
Alfonso Toribio

Join Date: May 2022

Posts: 15
#12

10 May 2022, 10:58

What can I do then?
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#13

10 May 2022, 12:19

Well, I don't know the overall context and what your goals are. But it seems to me you have two, mutually exclusive, ways to go.

1. You can redefine the "difference" variable at the individual level. It is no longer a percent difference. Rather, for each individual you can calculate the discrepancy between his/her income group and subjective class identification, and then you can study individual-level factors that are associated with the magnitude and direction of the discrepancy variable.

2. You can stick with the population-level difference variable. But you would then need to gather data and perform separate assessments for different populations. (I'm not sure in what way the populations would differ; could be different nations, or different ethnic groups, different occupational groups, males vs females. The choice would clearly depend on some theoretical understanding in your discipline as to what may be relevant and worth investigating. All of that is way beyond my expertise.) Then you could study population-level factors associated with this population-level difference variable.

You can't really do both, at least not with the same data design.
Alfonso Toribio

Join Date: May 2022

Posts: 15
#14

10 May 2022, 15:37

I understand, you are completely right. I would like to use the first method, identifying the discrepancy for each individual. Is it done the same way as the aggregated version, or do I have to proceed in another way? Once more, thank you very much!
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#15

10 May 2022, 17:13

In the original data, you can just do -gen discrepancy = income - subjective-.

Given that each of income and subjective ranges from 1 to 5, you will get a discrepancy variable that ranges from -4 to 4. That said, this kind of difference variable has some undesirable statistical properties that make it unsuitable for use as a dependent variable. It will probably exhibit floor and ceiling effects. And the range of possible discrepancy scores varies depending on the income category. See https://www.fharrell.com/post/errmed/#change for more about the limits of using difference scores (referred to as change scores there, but the same rules apply). So, particularly if you are interested in regression modeling of the discrepancy, you might be better off simply using the subjective score as the outcome variable and making income a covariate instead.
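
To see the floor and ceiling problem in your own data, and to try the alternative, something along these lines could be run on the original data set. This is a sketch under assumptions: -ologit- is only one common choice for a five-category ordinal outcome and is an assumption here, not something established above, and you would add whatever individual-level predictors your theory calls for.

```stata
* Individual-level discrepancy, as described above
* (skip this line if you have already created the variable)
gen discrepancy = income - subjective

* Each income category allows a different range of discrepancy values:
* income == 1 can only give values from -4 to 0, income == 5 only 0 to 4
tabulate income discrepancy

* Alternative: model the subjective score directly, with income as a covariate
ologit subjective i.income
```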