Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Optimization with cumulative distribution functions

    Hi! I want to minimize the square distance between two normal distributions of residuals for two different cross-sections (t and t'). The first is the empirical one (in t'), and the second is constructed based on a CDF, which is a linear combination of two normal variables times a normal density. This is how one of the normal variable looks like:

    gen g_t_prime = p*normalden(u_m, miu_1, sigma_1) + (1-p)*normalden(u_m, miu_2, sigma_2)

    My set of parameters to be optimized are their respective means (miu_1 and miu_2) and standard deviations (sigma_1 and sigma_2) plus p, which is the probability of being one or the other normal density.

    In order to calculate my parameters I have tried to use the optimize function in mata as follows:

    mata:
    mata set matastrict off
    void eval0(todo, p, v, g, H)
    {
    v = sum(normalden(x_k) - sum((x_k-u_mean)/rho_final)*(p[1]*normalden(u_m, miu[1], sigma[1]) +
    (1-p[1])*normalden(u_m, miu[2], sigma[2]) - p[1]*normalden(u_m_1, miu[1], sigma[1]) + (1-p[1])*normalden(u_m_1, miu[2], sigma[2]))/(u_m - u_m-1))^2

    }
    S = optimize_init()
    optimize_init_which(S, "min")
    optimize_init_evaluator(S, &eval0())
    optimize_init_params(S, 0)
    v = optimize(S)
    }

    Where u_m, u_m_1, u_mean and x_k are vectors of random variables with size 180000x1.

    However, when I run the code I receive the following message from STATA: "3204 matrix found where scalar required"

    I've alredy tried to perform a similar exercise with moptimize but I am not pretty sure which one to use and how to display my optimization problem.

    I'll appreciate some help, thanks!!

  • #2
    Consider the following example, which corrects your typographical error of "(u_m - u_m-1)" to "(u_m - u_m_1)"
    Code:
    : u_m = (1 \ 2)
    
    : u_m_1 = (3 \ 4)
    
    : u_m - u_m_1
            1
        +------+
      1 |  -2  |
      2 |  -2  |
        +------+
    
    : x = (6 \ 8)
    
    : x / (u_m - u_m_1)
                           /:  3204  matrix found where scalar required
                     <istmt>:     -  function returned error
    r(3204);
    
    : x :/ (u_m - u_m_1)
            1
        +------+
      1 |  -3  |
      2 |  -4  |
        +------+
    
    :

    Comment


    • #3
      Thank you William Lisowski ! However, I still get the same response from STATA. Do you think my optimization problem is correct? Or should I use moptimize() instead?

      Comment


      • #4
        If you still get the same response in Stata, having replaced "/" with ":/", then you likely have another similar syntax error.

        Perhaps another member can better advise you on what you are attempting. Other than to note that "/" requires a scalar denominator while ":/" works element-by-element I'm over my head.

        Comment


        • #5
          Originally posted by Nicolas Larrea View Post
          Where u_m, u_m_1, u_mean and x_k are vectors of random variables with size 180000x1.
          I don't see any optimize_init_argument() among your set up code for optimize(). How does your evaluator function get to see these scalars and vectors?

          Comment


          • #6
            Hi Joseph Coveney, thank you very much for your help. Indeed, I was missing it. I corrected as follows:

            optimize_init_argument(S, 1, x_k)
            optimize_init_argument(S, 2, u_mean)
            optimize_init_argument(S, 3, u_m)
            optimize_init_argument(S, 4, u_m_1)

            However, once I run the code STATA tells me: " <istmt>: 3499 x_k not found". I created this variables right before the optimization code. For instance:

            gen x_k = runiform(min_bound_e, max_bound_e), where the bounds refer to a residual term previously estimated.

            Am I doing something wrong? Or perhaps I need to introduced these equations (where I create those variables) inside the optimization code?

            Comment


            • #7
              The arguments need to appear in the definition of your evaluator

              Code:
              void eval0(todo, p, x_k, u_mean, u_m, u_m_1 v, g, H)
              {
              
              ...
              
              }
              If you have too many arguments consider passing a structure (struct) or a vector of pointers to your evaluator.

              This is one difference with moptimize(). Extra arguments do need to be specified in the evaluator and are dealt with differently by moptimize.

              Comment


              • #8
                Thanks Christophe Kolodziejczyk ! In fact I did it. This is my piece of code so far:

                gen u_m = runiform(min_bound, max_bound)
                gen u_m_1 = u_m[_n-1]
                gen u_mean = (u_m - u_m_1)/2
                gen x_k = runiform(min_bound_e, max_bound_e)
                mata: mata clear
                mata:
                mata set matastrict off
                void eval0(todo, p, x_k, u_mean, u_m, u_m_1, v, g, H)
                {
                v = sum(normalden(x_k) - sum((x_k-u_mean)/rho_final)*(p[1]*normalden(u_m, miu[1], sigma[1]) + (1-p[1])*normalden(u_m, miu[2], sigma[2]) - p[1]*normalden(u_m_1, miu[1], sigma[1]) + (1-p[1])*normalden(u_m_1, miu[2], sigma[2])):/(u_m - u_m_1))^2
                }
                S = optimize_init()
                optimize_init_which(S, "min")
                optimize_init_evaluator(S, &eval0())
                optimize_init_params(S, (0, 0, 0, 0, 0))
                optimize_init_argument(S, 1, x_k)
                optimize_init_argument(S, 2, u_mean)
                optimize_init_argument(S, 3, u_m)
                optimize_init_argument(S, 4, u_m_1)
                v = optimize(S)
                }
                end

                And once I run it, I still got the same response: "<istmt>: 3499 x_k not found"

                Comment


                • #9
                  Since it's been four hours, let me add my limited knowledge here in hopes that someone with Mata expertise will expand on the answer.

                  The problem appears to be that you have leapt into Mata programming without a full understanding of the relationship of Mata to Stata.

                  In brief, the arguments to Mata functions - such as the eval0() function you have written - are Mata objects such as Mata matrices, Mata scalars, Mata pointers, and the like. They are not the names of variables in a Stata dataset as you have provided.

                  Mata includes functions that copy variables from the Stata dataset in memory into Mata objects, and vice-versa.

                  I am not a frequent user of Mata so I hesitate to suggest how your code should be revised.

                  Comment


                  • #10
                    Thanks William Lisowski, I really appreciate your comment. It's true what you just said, I am initiating in these types of functions and thus my knowledge is absolutely limited. In fact, I was trying to solve a maximization problem for vectors (in my case variables) instead of scalars. My problem is that so far, despite all the effort in the search for examples, I have not found something similar to what I am trying to do.

                    If anybody could just share with me some link regarding a set of examples that would be really helpful!

                    Comment


                    • #11
                      As William Lisowski suggested you refer to Stata obejcts in your Mata code which Mata has no knowledge of. I would access these variables from Mata with st_data() or st_view() like shown in the code below. Another option is to define these variables in Mata instead of Stata just for the purpose of simulating data. There is an equivalent of runiform in Mata and you can index vectors to the create the lagged values.
                      So far I can see two issues with your code:

                      1. the (Mata) variables rho_final, miu, and sigma are not defined in your code. So Mata will return an error. I guess these variables are the elements from the vector p and should be referred to accordingly. As far as understand your code there are 6 parameters? rho_final and p[1] are scalars, wheres miu and sigma are 1x2 vectors. But you provide only 5 initial values.
                      2. The other problem is the fact that u_m_1 is a vector of lagged values of u_m and its first in element includes a missing value. You can solve this by selecting the rows where u_m_1 is not missing with st_data(). You can also index the vectors and select the elements from the 2nd row.

                      A minor issue is also that min_bound and max_bound. I just assumed that they are respectively 0 and 1.

                      I hope this will help.

                      Code:
                      clear 
                      set obs 1000
                      gen u_m = runiform() // min_bound, max_bound)
                      gen u_m_1 = u_m[_n-1]
                      gen u_mean = (u_m - u_m_1)/2
                      gen x_k = runiform() // min_bound_e, max_bound_e)
                      gen touse = !missing(u_m_1)
                      mata: mata clear
                      mata:
                      u_mean = st_data(.,"u_mean","touse")
                      u_m = st_data(.,"u_m","touse")
                      u_m_1 = st_data(.,"u_m_1","touse")
                      x_k = st_data(.,"x_k","touse")
                      
                      mata set matastrict off
                      void eval0(todo, p, x_k, u_mean, u_m, u_m_1, v, g, H)
                      {
                      // Define rho_final, miu and sigma. 
                      v = sum(normalden(x_k) - sum((x_k-u_mean)/rho_final)*(p[1]*normalden(u_m, miu[1], sigma[1]) +
                      (1-p[1])*normalden(u_m, miu[2], sigma[2]) - p[1]*normalden(u_m_1, miu[1], sigma[1]) +
                      (1-p[1])*normalden(u_m_1, miu[2], sigma[2])):/(u_m - u_m_1))^2 // you can write expression on several lines
                      }
                      S = optimize_init()
                      optimize_init_which(S, "min")
                      optimize_init_evaluator(S, &eval0())
                      optimize_init_params(S, (0, 0, 0, 0, 0))
                      optimize_init_argument(S, 1, x_k)
                      optimize_init_argument(S, 2, u_mean)
                      optimize_init_argument(S, 3, u_m)
                      optimize_init_argument(S, 4, u_m_1)
                      v = optimize(S)
                      }
                      end

                      Comment


                      • #12
                        Dear Christophe Kolodziejczyk thank you very much for this insightful response. Truly I am really discovering this new Mata code. My reply/quesiton to your suggestions:

                        1. I have 5 parameters to be estimated and optimize: the two miu, the two sigma and p (the probability of being one normal distribution or the other). "Rho_final" is a scalar previously estimated, following the same logic I tried to incorporate it into the code.
                        2. Thank you for that correction, it's crystal clear right now!
                        3. The bounds for "u_m" and "x_k" are previously obtained from the variance of other variables and are not between 0 and 1. Does it imply a problem of any sort?

                        This is my updated code:

                        gen rho_final_1 = rho_final
                        gen u_m = runiform(min_bound, max_bound) //
                        gen u_m_1 = u_m[_n-1]
                        gen u_mean = (u_m - u_m_1)/2
                        gen x_k = runiform(min_bound_e, max_bound_e) //
                        gen touse = !missing(u_m_1)
                        mata: mata clear
                        mata:
                        u_mean = st_data(.,"u_mean","touse")
                        u_m = st_data(.,"u_m","touse")
                        u_m_1 = st_data(.,"u_m_1","touse")
                        x_k = st_data(.,"x_k","touse")
                        rho_final = st_data(1,"rho_final_1") // not sure if here I am right: I tried to select the first observation of my previously defined variable so as to have a scalar
                        mata set matastrict off
                        void eval0(todo, p, x_k, u_mean, u_m, u_m_1, v, g, H)
                        {
                        v = sum(normalden(x_k) - sum((x_k-u_mean)/rho_final)*(p[1]*normalden(u_m, miu[1], sigma[1]) +
                        (1-p[1])*normalden(u_m, miu[2], sigma[2]) - p[1]*normalden(u_m_1, miu[1], sigma[1]) +
                        (1-p[1])*normalden(u_m_1, miu[2], sigma[2])):/(u_m - u_m_1))^2
                        }
                        S = optimize_init()
                        optimize_init_which(S, "min")
                        optimize_init_evaluator(S, &eval0())
                        optimize_init_params(S, (0, 0, 0, 0, 0))
                        optimize_init_argument(S, 1, x_k)
                        optimize_init_argument(S, 2, u_mean)
                        optimize_init_argument(S, 3, u_m)
                        optimize_init_argument(S, 4, u_m_1)
                        v = optimize(S)
                        }
                        end

                        This time, the message from STATA is: "/: 3204 matrix found where scalar required"

                        Thank you very much for you support!

                        Comment


                        • #13
                          You can use st_numscalar() to fetch Stata scalar. Note that you have to pass this variable to your evaluator as well. An alternative is to define these variables as external variables. External variables are defined outside a function and can be referred to directly from your function. In the modified code below rho_final has been defined as external where the other variables are passed to the evaluator. You could also define the other variables as external as well. Using external variables is considered bad style though, since it can be hard to decode especially if you have a lot of code. I would recomment to choose one of the approaches.

                          miu and sigma are now defined as subsets of p, so Mata will find values.
                          As sigma are parameters representing variance they should be positive. I have therefore reparametrize them by taking the exponential.

                          I have to say that the code might be syntaxically correct, but that does not guarantee that you will get a meaningful solution. I am not an expert and it is difficult to give you any insight into your statistical problem is well defined or not.

                          Code:
                          rho_final = st_numscalar("rho_final")
                          void eval0(todo, p, x_k, u_mean, u_m, u_m_1, v, g, H)
                          {
                              
                              external rho_final // , x_k, u_mean, u_m, u_m_1
                          
                              miu = p[2..3]
                              sigma = exp(p[4..5]) 
                          
                              v = sum(normalden(x_k) - sum((x_k-u_mean)/rho_final):*(p[1]*normalden(u_m, miu[1], sigma[1]) +
                              (1-p[1])*normalden(u_m, miu[2], sigma[2]) - p[1]*normalden(u_m_1, miu[1], sigma[1]) +
                              (1-p[1])*normalden(u_m_1, miu[2], sigma[2])):/(u_m - u_m_1))^2
                          }
                          
                          S = optimize_init()
                          optimize_init_which(S, "min")
                          optimize_init_evaluator(S, &eval0())
                          optimize_init_params(S, (0.5, 0, 0, 0, 0))
                          optimize_init_argument(S, 1, x_k)
                          optimize_init_argument(S, 2, u_mean)
                          optimize_init_argument(S, 3, u_m)
                          optimize_init_argument(S, 4, u_m_1)
                          v = optimize(S)
                          }

                          Comment


                          • #14
                            Ok, I understand. Thanks! This time I think it almost finished running everything, but in the end I got:

                            Iteration 0: f(p) = 3.473e+18 (not concave)
                            Iteration 1: f(p) = 2.851e+18 (not concave)
                            Iteration 2: f(p) = 2.333e+18 (not concave)
                            Iteration 3: f(p) = 2.247e+18 (not concave)
                            Iteration 4: f(p) = 2.100e+18 (not concave)
                            Iteration 5: f(p) = 4.513e+08 (not concave)
                            numerical derivatives are approximate
                            flat or discontinuous region encountered
                            numerical derivatives are approximate
                            flat or discontinuous region encountered
                            Iteration 6: f(p) = 76769113 (not concave)
                            could not calculate numerical derivatives -- discontinuous region with missing values encountered
                            could not calculate numerical derivatives -- discontinuous region with missing values encountered

                            (1 line skipped)

                            To my understanding, with your correction of missing values, there was not any other problem related to that. I double checked and neither of the arguments has missing values. Since it is a parameter as well, shouldn't we define the p[1] (the probability of being from one normal distribution or the other)? Could that be the problem?

                            Finally, in my optimization problem, there is one restriction:

                            p[1]*miu[1] + (1-p[1])*miu[2] = 0

                            Would you know how to include it?

                            Thank you very much Christophe Kolodziejczyk, I would have been stuck for decades without your help.

                            Comment


                            • #15
                              I rerun the code and this time it kept iterating endlessly. I guess this in fact has something to do with the constraint (restriction) that has not been yet imposed.

                              Comment

                              Working...
                              X