Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • working out expressions(sum and multiply) in stata with some missing observations

    Hi Nick Cox


    I want to add and multiply certain variables in stata in a new variable VaR. My expression is:

    gen VaR=constant_term+ (coeff_L_Tbills* L_Tbills)+ (coeff_L_Inf* L_Inf)

    Now the issue is that, for some rows where when L_Inf is missing or 0, the whole outcome VaR of the equation becomes zero. which shouldn't be otherwise if calculated manually.

    How can I amend the above code, to get the correct equation.

    Best Regards


  • #2
    Where did the coefficients come from? Probably, the easiest solution is to use the command predict
    ---------------------------------
    Maarten L. Buis
    University of Konstanz
    Department of history and sociology
    box 40
    78457 Konstanz
    Germany
    http://www.maartenbuis.nl
    ---------------------------------

    Comment


    • #3
      Dear Maarten Buis

      Thank you for your asking. The coeff came from qreg and are generated as separate variables in stata, thus I believe that should not matter. we can ignore that and consider them to be the variables that need to enter into a simple arithmetic equation. but missing value in any of the variable is a problem, which is distorting the actual outcome. besides coeffs there are two other variables as mentioned in the equation in #1 ( L_Tbills), L_Inf))

      Best Regards

      Comment


      • #4
        Where the coefficients come from is vitally important, as it will determine what better options are available to you in Stata. Doing these computations manually is a completely unnecessary bug waiting to happen. In principle you can do the math yourself.

        Code:
        // load some example data and do some preparations
        sysuse nlsw88, clear
        
        gen byte urban = c_city + smsa if !missing(c_city,smsa)
        label define urban 2 "central city" ///
                           1 "suburban"     ///
                           0 "rural"
        label value urban urban
        label variable urban "urbanicity"
        
        // estimate the model
        qreg wage i.urban i.race ttl_exp i.union
        
        // find out how the coefficients are called
        qreg, coeflegend
        
        // manually create the prediction (not recommended)
        gen pred = _b[_cons] + ttl_exp*_b[ttl_exp] + ///
                   _b[1.urban]*1.urban + ///
                   _b[2.urban]*2.urban + ///
                   _b[2.race]*2.race + ///
                   _b[3.race]*3.race + ///
                   _b[1.union]*1.union
        So, in this case I directly accessed the coefficients without first storing them as variables. However, you can also see how easy it will be to make a typo and introduce a bug, especially if the model becomes bigger. However, the beauty of programming is that in programming being lazy is a virtue, not a sin. To be specific, in programming you should never do something yourself if someone else has already done it for you. So, be virtuous and use predict instead: it will do it all for you and take appropriate care of missing values. To continue the example:

        Code:
        // or just use predict (recommended)
        predict pred2           
        
        // check that the results are the same
        // (no output means the statement is true for all observations)
        assert pred == pred2
        
        // check that pred2 is missing when union is missing
        assert pred2 == . if missing(union)
        So predict takes care of missing values for you, and is much easier to use, which means it is much harder to introduce bugs (which is good).
        ---------------------------------
        Maarten L. Buis
        University of Konstanz
        Department of history and sociology
        box 40
        78457 Konstanz
        Germany
        http://www.maartenbuis.nl
        ---------------------------------

        Comment


        • #5
          Dear Maarten Buis

          Thanks again for an elaborate note. I have practiced the code above with the example dataset and the actual problem is still not resolved. while usind the code I gen pred2 using predict as:

          predict pred2
          (option xb assumed; fitted values)
          (368 missing values generated)

          still you see the missing values are generated, which should not be (logically) there in my example. I will try to explain again. I have few variables in a data set and some of them have missing values(they are all numeric, none is dummy).

          I want to add and multiply them in an equation which I mentioned in #1. due to missing observations in some of the rows in one variable L_Inf, all the sum and multiply function becomes zero, which shouldn't be otherwise if calculated manually or in simple excel program. somehow stata is interpreting my equation or the missing value in a different manner.

          Thus I want to know a code where simple arithmetic equation can be calculated. As an illustration of the data, please see the table below:
          Time_Period FI_n L_TEDSp L_Inf coeff_L_TEDSp coeff_L_Inf constant_term
          2019m1 1 -7.54237 -5.07692 0.0200048 -0.0001752 -0.6206399
          2019m2 1 -7.6125 16.5316 0.0200048 -0.0001752 -0.6206399
          2019m3 1 -7.88487 14.18637 0.0200048 -0.0001752 -0.6206399
          2019m4 1 -8.30025 0.0200048 -0.0001752 -0.6206399
          2019m5 1 -8.27437 0.0200048 -0.0001752 -0.6206399
          2019m6 1 -10.1975 0.0200048 -0.0001752 -0.6206399
          2019m7 1 -10.4301 0.0200048 -0.0001752 -0.6206399
          2019m8 1 -11.4844 0.0200048 -0.0001752 -0.6206399
          2007m5 2 5.355 4.271845 204.1784 5.448221 1724.55
          2007m6 2 5.36 -0.55866 204.1784 5.448221 1724.55
          2007m7 2 5.36 -18.3521 204.1784 5.448221 1724.55
          2007m8 2 5.35866 14.22018 204.1784 5.448221 1724.55
          2007m9 2 5.62125 -3.41366 204.1784 5.448221 1724.55
          2007m10 2 5.22875 28.48233 204.1784 5.448221 1724.55
          2007m11 2 4.89375 17.47573 204.1784 5.448221 1724.55
          2007m12 2 5.13125 -7.98898 204.1784 5.448221 1724.55
          2008m1 2 4.7025 12.87425 204.1784 5.448221 1724.55
          2008m2 2 3.11188 26.39257 204.1784 5.448221 1724.55
          2008m3 2 3.0575 6.400839 204.1784 5.448221 1724.55
          2008m4 2 2.68813 0.788955 204.1784 5.448221 1724.55
          Now I want to generate a newvar= constant_term+ (coeff_L_TEDSp* L_TEDSp)+ (coeff_L_Inf* L_Inf)

          As you see the variable L_Inf has some missing data, which distorts the actual computation of the newvar for rows 2019m4 to 2019m8.

          I hope I tried to clarify this time.

          Comment


          • #6
            What is 3*(missing value) + 5? Without further assumptions the answer is (missing value). A missing value indicates that we do not know that number, so 3 times some unknown number can only be an unknown number, and adding 5 to some unknown number still results in an unknown number. So Stata does this exactly right; If any of the explanatory variables is missing, the resulting prediction must also be missing.

            Apparently, you have some assumptions you want to apply to the missing values in this case. Can you tell us what rules you want Stata to follow for your missing values?
            ---------------------------------
            Maarten L. Buis
            University of Konstanz
            Department of history and sociology
            box 40
            78457 Konstanz
            Germany
            http://www.maartenbuis.nl
            ---------------------------------

            Comment


            • #7
              Dear Maarten Buis
              I understand that, but my point is that if one variable is missing and it is a simple arithmetic equation then, 3*(missing value) + 5 should be equal to 5. ((3*0)+5= 0+5=5). simple mathematics.

              I further don't know how the analysis would change if I replace these missing with a zero, as they are not zero but simply unknown numbers.

              Best Regards

              Comment


              • #8
                Ok, so you seem to think that a missing value should equal 0. Is that correct? And if that is correct, why?
                ---------------------------------
                Maarten L. Buis
                University of Konstanz
                Department of history and sociology
                box 40
                78457 Konstanz
                Germany
                http://www.maartenbuis.nl
                ---------------------------------

                Comment


                • #9
                  No certainly I would not want to mark them zero here as that would change the interpretation of my data, as they are simply unknown data, not zero. I don't know why stata can't make a simple mathematical computation as in any software like excel, with the same concept as we believe: 3*(missing)+5 should be equal to 5.

                  Comment


                  • #10
                    You make two statements:

                    1) (missing) is not 0
                    2) 3*(missing)+5=5

                    How can those two statements both be true?

                    So, no, I don't believe 3*(missing) + 5 equals 5. The only way I could justify that computation is by assuming that missing equals zero. If I am not willing to make that assumption, and I am not, then that answer can only be missing.

                    Let's do the math again:

                    Multiplication comes before addition, so first we solve 3*(missing). Three times an unknown number is itself unknown, i.e. missing.

                    Second we solve (missing) + 5. An unknown number plus five is still unknown, i.e. missing.

                    So simple maths tells us that 3*(missing) + 5 = (missing) and not 5.

                    If excel really thinks that the answer is 5, then that is very worrying. But the problem is that excel got the computation seriously wrong, not how to let Stata make the same error.
                    ---------------------------------
                    Maarten L. Buis
                    University of Konstanz
                    Department of history and sociology
                    box 40
                    78457 Konstanz
                    Germany
                    http://www.maartenbuis.nl
                    ---------------------------------

                    Comment


                    • #11
                      Dear Maarten Buis

                      I think I get your point which is valid.

                      Any solution to get around these missing values in the data as I don't want to lose my analysis observations. I see the missing values are one or two at a time in a series of data. for example, I have the data till Aug 2019, but Sept 2019 is missing in some of the panels. In other may be the observations are missing from Jan 2000 to Mar 2000, while in the rest of the panels this data is there and therefore I would not want to exclude from my analysis.

                      Best Regards

                      Comment

                      Working...
                      X