  • Creating new variable by subtraction of two existing ones.

    Hi, there!

    I'm new to the Stata forum. I've searched for a similar question, but the procedures I found gave different results than I expected.

    I have two variables: "subjective" (a 5-point scale of subjective social class) and "income" (also a 5-point scale, standardized like the first one). I want to create a new variable (for example, "mismatch") that is the result of subtracting one from the other at each value. The new variable should therefore also have 5 values, so that the fifth step of "mismatch" is the difference between the fifth steps of "subjective" and "income". I expect both negative and positive values: negative values should indicate underrepresentation of that social class, and positive values the opposite.

    In the image I've attached, the row shaded in black shows (in percentages) what I've calculated manually and what the results for the new variable should look like (they are the differences between the first and second rows).

    I'm really sorry if I'm repeating a question; I searched for something similar but could find nothing. I hope you can help me. Thanks in advance!
    [Attached image: Picture1.png]

  • #2
    What you show in your screenshot, even apart from the final two columns of that table, is clearly not Stata data. It is some kind of summary table you presumably created from Stata data, and then fed into some pretty-print application. While this illustrates what you want to end up with, nobody can get you from here to there, as we have no information about what "here" is. So please post back showing an example from the Stata data set that underlies these calculations. Use the -dataex- command to do that. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    If you do that, I think you will get a timely and helpful response.
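As a rough illustration of the advice above, here is a minimal sketch of a -dataex- call. It uses the auto data set that ships with Stata purely as a stand-in (an assumption for the example, not the poster's data); with the real data you would name your own variables instead.

```stata
* Stand-in data: the auto data set bundled with Stata (assumption for illustration)
sysuse auto, clear

* Only needed on installations where -dataex- is not already part of official Stata
* ssc install dataex

* Print a copy-and-paste example of three variables for the first five observations
dataex price mpg foreign in 1/5
```

The output is a block of Stata code (-clear-, -input-, the data rows, -end-, plus any value labels) that can be pasted straight into a forum post.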



    • #3
      Thanks, Clyde! Let me know if I've done it right.






      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input byte(subjective income)
      3 3
      4 4
      3 3
      4 4
      3 3
      3 3
      3 3
      2 2
      4 4
      4 4
      3 2
      4 4
      4 4
      4 4
      5 4
      3 3
      4 4
      4 4
      4 3
      4 3
      2 1
      5 5
      3 2
      3 2
      3 3
      3 3
      3 3
      5 5
      5 4
      3 2
      2 2
      5 4
      3 3
      3 3
      2 2
      5 5
      3 3
      5 5
      3 3
      2 2
      2 2
      4 4
      3 3
      4 4
      1 1
      4 3
      4 4
      4 4
      3 3
      4 4
      2 1
      4 4
      3 3
      3 3
      4 4
      3 3
      4 4
      4 4
      4 4
      4 4
      4 4
      4 4
      3 3
      4 4
      3 3
      4 4
      4 4
      3 3
      3 3
      5 4
      4 4
      5 5
      4 4
      5 5
      3 3
      4 4
      5 5
      3 3
      4 4
      3 3
      4 4
      5 5
      4 4
      4 4
      3 2
      5 4
      5 4
      5 4
      4 4
      5 4
      4 4
      4 4
      4 4
      4 4
      5 4
      5 4
      5 4
      5 4
      4 4
      4 4
      end
      label values subjective X045
      label def X045 1 "Upper class", modify
      label def X045 2 "Upper middle class", modify
      label def X045 3 "Lower middle class", modify
      label def X045 4 "Working class", modify
      label def X045 5 "Lower class", modify
      label values income income
      label def income 1 "Fifth step", modify
      label def income 2 "Forth step", modify
      label def income 3 "Third step", modify
      label def income 4 "Second step", modify
      label def income 5 "Lower step", modify



      • #4
        Your use of -dataex- was perfect, thank you!

        The following will create a data set in frame results that contains the figures you want, laid out as you want. You can tinker with the details of the appearance.

        Code:
        count if !missing(subjective)
        by subjective, sort: gen pct_subjective = 100*_N/`r(N)'
        count if !missing(income)
        by income, sort: gen pct_income = 100*_N/`r(N)'
        
        tempfile labeler
        label save X045 using `labeler'
        
        frame create results int class_num float(subjective income)
        forvalues i = 1/5 {
            summ pct_subjective if subjective == `i', meanonly
            local subjective = r(mean)
            summ pct_income if income == `i', meanonly
            local income = r(mean)
            frame post results (`i') (`subjective') (`income')
        }
        
        frame change results
        run `labeler'
        label values class_num X045
        gen percent_diff = income - subjective
        gen result = "Overrepresentation" if percent_diff > 0
        replace result = "Underrepresentation" if percent_diff < 0
        list, noobs clean



        • #5
          Sorry for not mentioning this earlier, Clyde Schechter: I have Stata version 14, so the frame command does not work for me. I'm still learning how this works.



          • #6
            I can't help with the code at the moment, but can advise that the best way to learn how Statalist works is to spend a few moments reading through the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post.

            As you've probably realized, in future topics your first post should include the fact that you are working with Stata 14. Unless you get lucky and score an upgrade to the latest version.



            • #7
              So, before there were frames, we used -postfile-s for this kind of situation.

              Code:
              count if !missing(subjective)
              by subjective, sort: gen pct_subjective = 100*_N/`r(N)'
              count if !missing(income)
              by income, sort: gen pct_income = 100*_N/`r(N)'
              
              tempfile labeler
              label save X045 using `labeler'
              
              capture postutil clear
              tempfile no_frames
              postfile results int class_num float (subjective income) using `no_frames'
              
              forvalues i = 1/5 {
                  summ pct_subjective if subjective == `i', meanonly
                  local subjective = r(mean)
                  summ pct_income if income == `i', meanonly
                  local income = r(mean)
                  post results (`i') (`subjective') (`income')
              }
              postclose results
              
              use `no_frames', clear
              run `labeler'
              label values class_num X045
              gen percent_diff = income - subjective
              gen result = "Overrepresentation" if percent_diff > 0
              replace result = "Underrepresentation" if percent_diff < 0
              list, noobs clean
              One important difference from the frames-based code is that this code removes the original data from memory. If you need to keep it around for further use you should either save it before you get to the -use `no_frames'- command or -preserve- it at that point and then -restore- it at the end.
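The -preserve-/-restore- pattern mentioned above can be sketched in isolation as follows. This is only an illustration using the auto data set bundled with Stata (an assumption for the example); the -collapse- step stands in for any operation that replaces the data in memory.

```stata
* Stand-in data: the auto data set bundled with Stata (assumption for illustration)
sysuse auto, clear
local n_before = _N

* Take a snapshot of the data currently in memory
preserve

* Any step that replaces the data in memory, here a small summary table
collapse (mean) price, by(foreign)
list, noobs clean

* Bring the original data back exactly as it was at -preserve-
restore
assert _N == `n_before'
```

Anything created between -preserve- and -restore- is discarded on -restore- unless it is saved to a file first.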



              • #8
                Many thanks!
                I've run the code and got my results. However, although I tried both saving the data before running the -use `no_frames'- command and using preserve/restore, I'm not able to get the new variable percent_diff from the new data set into my full data set (the original one, with all the other variables).



                • #9
                  OK. The thing to remember is that in version 14 (or any version before frames were introduced) you can only have a single data set open in memory. So, in order to bring the variable percent_diff back into the original data set, you have to also save those intermediate results before bringing the original data back into memory.

                  But apart from those mechanics, it is unclear in what way you want to join the percent_diff variable to the original data. The percent_diff variable takes on five different values, and none of those values really corresponds to any particular observation in the original data. Rather, each of those values represents group properties of the original data set as a whole. It seems that linking any of the newly created observations to any of the observations in the original data set is arbitrary. So it isn't clear to me how you want to link these things up.

                  I suppose the least bizarre way to do this would be to attach each value of percent_diff to every observation with the corresponding value of the subjective variable in the data set, but the purpose of doing even this eludes me. Anyway, on the assumption that this is what you want to do:

                  Code:
                  count if !missing(subjective)
                  by subjective, sort: gen pct_subjective = 100*_N/`r(N)'
                  count if !missing(income)
                  by income, sort: gen pct_income = 100*_N/`r(N)'
                  
                  tempfile labeler
                  label save X045 using `labeler'
                  
                  capture postutil clear
                  tempfile no_frames
                  postfile results int class_num float (subjective income) using `no_frames'
                  
                  forvalues i = 1/5 {
                      summ pct_subjective if subjective == `i', meanonly
                      local subjective = r(mean)
                      summ pct_income if income == `i', meanonly
                      local income = r(mean)
                      post results (`i') (`subjective') (`income')
                  }
                  postclose results
                  
                  preserve
                  use `no_frames', clear
                  run `labeler'
                  label values class_num X045
                  gen percent_diff = income - subjective
                  gen result = "Overrepresentation" if percent_diff > 0
                  replace result = "Underrepresentation" if percent_diff < 0
                  list, noobs clean
                  save `"`no_frames'"', replace
                  
                  restore
                  clonevar class_num = subjective
                  merge m:1 class_num using `no_frames', assert(match) nogenerate ///
                      keepusing(percent_diff result)
                  drop class_num
                  browse



                  • #10
                    I'm currently working on it; I'll let you know if I can achieve my objective. The differences between the variables, which are reflected in "percent_diff", are a measure of representation for each class. I want to have the level of representation as a dependent variable in order to see what explains the incongruency between income and subjective social class (the two variables from which we create percent_diff). Let me know if I'm clear enough and if I'm asking too many questions. Thank you so much!



                    • #11
                      I want to have the level of representation as a dependent variable in order to see what explains the incongruency between income and subjective social class (the two variables from which we create percent_diff).
                      But this percent_diff variable is not defined as an individual-level attribute. It is only defined at the full-population level. So I don't see any possibility to "explain" it with individual-level features, which is why I don't understand why you would want to incorporate it into the individual-level data set you started with.
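To make the point concrete, here is a small sketch on made-up toy data (six invented observations and only three classes, both assumptions for illustration). It attaches the class-level percent difference to each observation and then checks that the value never varies within a class, so there is nothing at the individual level left to explain.

```stata
* Toy data, invented for illustration only
clear
input byte(subjective income)
1 1
2 1
2 2
3 2
3 3
3 3
end

count
local N = r(N)

* Attach the class-level figure (pct of income minus pct of subjective) to every member of the class
gen percent_diff = .
forvalues k = 1/3 {
    count if income == `k'
    local p_inc = 100 * r(N) / `N'
    count if subjective == `k'
    local p_sub = 100 * r(N) / `N'
    replace percent_diff = `p_inc' - `p_sub' if subjective == `k'
}

* Within each class the value is constant: its standard deviation is zero
bysort subjective (percent_diff): assert percent_diff[1] == percent_diff[_N]
tabulate subjective, summarize(percent_diff)
```

With only as many distinct outcome values as there are classes, and each value fully determined by class membership, individual-level predictors cannot account for any variation beyond the class itself.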



                      • #12
                        What can I do then?



                        • #13
                          Well, I don't know the overall context and what your goals are. But it seems to me you have two, mutually exclusive, ways to go.

                          1. You can redefine the "difference" variable at the individual level. It is no longer a percent difference. Rather, for each individual you can calculate the discrepancy between his/her income group and subjective class identification, and then you can study individual-level factors that are associated with the magnitude and direction of the discrepancy variable.

                          2. You can stick with the population-level difference variable. But you would then need to gather data and perform separate assessments for different populations. (I'm not sure in what way the populations would differ; could be different nations, or different ethnic groups, different occupational groups, males vs females. The choice would clearly depend on some theoretical understanding in your discipline as to what may be relevant and worth investigating. All of that is way beyond my expertise.) Then you could study population-level factors associated with this population-level difference variable.

                          You can't really do both, at least not with the same data design.



                          • #14
                            I understand; you are completely right. I would like to use the first method and identify the discrepancy for each individual. Is it done the same way as the aggregated version, or do I have to proceed in another way? Once more, thank you very much!



                            • #15
                              In the original data, you can just do -gen discrepancy = income - subjective-.

                              Given that income and subjective each range from 1 to 5, you will get a discrepancy variable that ranges from -4 to 4. That said, this kind of difference variable has some undesirable statistical properties that make it unsuitable for use as a dependent variable. It will probably exhibit floor and ceiling effects. And the range of possible discrepancy scores varies depending on the income category. See https://www.fharrell.com/post/errmed/#change for more about the limits of using difference scores (referred to as change scores there, but the same rules apply). So, particularly if you are interested in regression modeling of the discrepancy, you might be better off simply using the subjective score as the outcome variable and making income a covariate instead.
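The dependence of the possible range on the income category can be seen in a small sketch. The data here are artificial (one invented observation for each of the 25 combinations of the two scales, an assumption for illustration), and min_d and max_d are hypothetical helper variables.

```stata
* Artificial data: one row for every combination of subjective (1-5) and income (1-5)
clear
set obs 25
gen byte subjective = ceil(_n/5)
gen byte income = mod(_n - 1, 5) + 1

* Individual-level discrepancy, as suggested above
gen discrepancy = income - subjective

* Smallest and largest discrepancy attainable within each income category
bysort income: egen min_d = min(discrepancy)
bysort income: egen max_d = max(discrepancy)

tabulate income discrepancy
tabstat discrepancy, by(income) statistics(min max)
```

In this grid someone in income category 1 can only score between -4 and 0, while someone in category 5 can only score between 0 and 4, which is the floor and ceiling problem described above. The thread does not name a specific model for the suggested alternative; one possibility on the real data, offered only as an assumption, would be an ordinal model such as -ologit subjective i.income- with further covariates added.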

