
  • Merge a Mata matrix to a Stata data file

    I'd like to merge (by a key variable) a Mata column vector to a Stata data file. I'd like to do this without using the frames of v. 16 so as to keep backward compatibility. I thought this task would be something common and already solved by someone else, but I'm not finding anything that seems on target to me. (Perhaps my oversight.) The situation that motivates this occurs if some calculation (perhaps an estimation) is done in Mata that produces a column vector that is a new variable to be put back onto the Stata file for some subset of observations.

    I could do this by preserving and clearing the Stata file, storing the Mata vector in Stata and saving it to a temp file, restoring the original file, and then -merge- ing in Stata. That's pretty kludgy and slow, so I'm thinking there must be a more direct way that has been or can be implemented. Any suggestions?
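For reference, the preserve/save/restore/merge workaround described above might look like the following sketch. It assumes the auto data with the same id and touse setup used elsewhere in this thread. The tempfile name frommata and the use of st_addvar()/st_addobs() to build the temporary dataset are illustrative choices, not part of the original post.

```stata
// Sketch of the roundabout route: Mata vector -> temp file -> merge by key.
clear
set seed 1884
sysuse auto
gen int id = _n
gen byte touse = runiform() > 0.2
mata: X = st_data(., ("id", "weight", "length"), "touse")
mata: newvar = X[.,2] :/ X[.,3]            // weight/length

tempfile frommata                           // illustrative name
preserve
    drop _all                               // empty dataset to hold the Mata results
    mata: (void) st_addvar("int", "id")
    mata: (void) st_addvar("double", "newvar")
    mata: st_addobs(rows(X))
    mata: st_store(., "id", X[.,1])
    mata: st_store(., "newvar", newvar)
    save "`frommata'"
restore
merge 1:1 id using "`frommata'", nogenerate // unmatched rows keep newvar missing
```

This works on any Stata version that has Mata, but it writes a file to disk and clears the data in memory, which is the overhead the question is trying to avoid.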

    Here's some example data to work with:
    Code:
    //Data file
    clear
    set seed 1884
    sysuse auto
    gen int id = _n
    //
    // Give a selection of the observations/vars to Mata and create a silly new variable
    gen byte touse = runiform() > 0.2   // not all observations happen to be relevant
    mata: X =st_data(., ("id", "weight", "length"), "touse")
    mata: newvar = X[.,2] :/ X[.,3] // weight/length    
    //
    //  Now merge newvar back to Stata dataset with id as a key.   How?
    Last edited by Mike Lacy; 07 Oct 2019, 10:55.

  • #2
    My preferred way of doing this would be to create the new variable filled with missings first, and then to use a mata view to change its content, i.e. replace your last line as follows:
    Code:
    gen newvar = .
    mata: st_view(newvar, ., "newvar", "touse")
    mata: newvar[., .] = X[.,2] :/ X[.,3]
    You do not even need to use the id variable in that case.
    Last edited by Sebastian Kripfganz; 07 Oct 2019, 11:48.
    https://twitter.com/Kripfganz



    • #3
      Sebastian's approach is more elegant than mine. But both our approaches depend on being able to "merge 1:1" by (effectively) the observation number. Is it perhaps the case that your statement oversimplified your problem?
      Code:
      gen float frommata = .
      mata: st_store(X[.,1],"frommata",newvar)



      • #4
        Mike: The other solutions are fine. My personal preference is to use getmata with the id option, e.g.
        Code:
        . //Data file
        . clear
        
        . set seed 1884
        
        . sysuse auto
        (1978 Automobile Data)
        
        . gen int id = _n
        
        . //
        . // Give a selection of the observations/vars to Mata and create a silly new variable
        . gen byte touse = runiform() > 0.2   // not all observations happen to be relevant
        
        . mata: X =st_data(., ("id", "weight", "length"), "touse")
        
        . mata: newvar = X[.,2] :/ X[.,3] // weight/length    
        
        . mata: id=X[.,1]
        
        . getmata newvar, id(id)
        
        . li id touse newvar in 1/10
        
             +------------------------+
             | id   touse      newvar |
             |------------------------|
          1. |  1       0           . |
          2. |  2       1   19.364162 |
          3. |  3       0           . |
          4. |  4       1   16.581633 |
          5. |  5       0           . |
             |------------------------|
          6. |  6       1   16.834862 |
          7. |  7       1   13.117647 |
          8. |  8       0           . |
          9. |  9       1   18.743961 |
         10. | 10       1          17 |
             +------------------------+
        
        . //
        . //  Now merge newvar back to Stata dataset with id as a key.   How?
        .
        end of do-file



        • #5
          Thanks to all of you. Yes, I was thinking of a 1:1 merge. I have some questions and comments for each of you separately.

          Sebastian:
          Question: I'm not a user of views and encountered a misunderstanding on my part that I can't clear up by reading the Fine manual. When I used:
          Code:
          mata: st_view(newvar, ., "newvar", "touse")
          Stata complains <istmt> 3499 newvar not found

          From this I figured that st_view() seems to require that the destination matrix, newvar, must exist before it is referenced in st_view(). So, I did:
          Code:
          mata: newvar = J(0,0,.); st_view(newvar, ., "newvar", "touse")
          which worked fine. Is this the right kind of thing to do?

          This confused me, as the manual says "st_view() and st_sview() create a matrix that is ... " As a user of views, can you clarify what's going on here? Does the usage entail a rule like "A Mata matrix must have been defined before being referenced in an st_view() command"?
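A small sketch of the point in question: st_view() needs its first argument to name something that already exists, and any existing value will do because it is replaced. The names v1, v2, v3 and the mpg-based touse rule below are illustrative assumptions, not taken from the posts.

```stata
clear all
sysuse auto
gen byte touse = mpg > 20                  // illustrative selection rule
gen newvar = .

// Each form gives st_view() an existing name to overwrite.
mata: v1 = .
mata: st_view(v1, ., "newvar", "touse")
mata: v2 = J(0, 0, .)
mata: st_view(v2, ., "newvar", "touse")
mata: st_view(v3 = ., ., "newvar", "touse") // assignment inside the call

// Writing through one view changes the Stata variable itself.
mata: v1[., .] = J(rows(v1), 1, 1)
```

After this runs, newvar is 1 in the selected observations and still missing elsewhere, and v2 and v3 show the same values because all three are views onto the same variable.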

          Anyway, per Bill's comment, yes, I oversimplified, as in my desired application (and perhaps others) the ordering of the data rows in X might get changed by some operation in Mata, which would break your approach, I think.

          -----------

          Bill:
          I was puzzled by your approach until I realized that it works because X[.,1] (the id) *happens* to hold the original observation numbers in my example. I had not used the observation numbers feature of st_store before so hadn't thought of this. Anyway, my understanding here is that st_store() would have to be done with the original data set maintained in or restored to its original order, which makes me a bit nervous. Do I understand correctly? Not a big problem, but something to be paid attention to.
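The ordering concern can be illustrated with a minimal sketch. It assumes that id was created as _n and that the data are re-sorted by something else (mpg here, purely as an example) before the results are stored. Restoring the id order and checking it before st_store() is one possible safeguard.

```stata
clear all
set seed 1884
sysuse auto
gen int id = _n                            // id equals the observation number now
gen byte touse = runiform() > 0.2
gen double frommata = .
mata: X = st_data(., ("id", "weight", "length"), "touse")
mata: newvar = X[.,2] :/ X[.,3]

sort mpg                                   // any re-sort breaks id == _n
// Calling st_store(X[.,1], ...) at this point would hit the wrong observations.
sort id                                    // restore the order in which id was created
assert id == _n                            // guard: ids are observation numbers again
mata: st_store(X[.,1], "frommata", newvar)
```

The assert line is the important part: it stops the do-file if the observation numbers and the stored ids have drifted apart.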

          ----------


          John:

          I like this approach best. It seems robust to any ordering etc. of the data set, and is most like a conventional join operation, in which one simply tells the joining procedure what the key is, while the details of how it operates are not the user's problem. Now that I read the documentation of -getmata- again, I see that the id() option (and the example given there) fits exactly what I am thinking of. I like to use -getmata- for various purposes, but I understand it to have been deprecated by StataCorp (maybe in the Stata Blog??) as not suitable for serious use, for reasons I don't understand. I am a bit concerned about using -getmata- here only on that ground. Do you have any thoughts about that?

          ---------

          In general *my* difficulties here arise, it seems, from my relative confusion and ignorance about st_data, st_view, and st_store. I must say that the documentation for these is unnecessarily obscure relative to the rest of the Fine manual, which seems odd to me. For me, and perhaps others, Fine is not the f-word that comes to mind when trying to use the manual to clarify use of these functions. Given the crucial nature of the interface with the data, I think that something better is in order. Perhaps by making these functions so powerful they have been made almost impossible to document as well as use.



          • #6
            I understand it to have been deprecated by StataCorp (maybe in the Stata Blog??) as not suitable for serious use
            I wasn't aware of this, Mike, but if you unearth something that warns about this I'd be grateful if you'd post it.



            • #7
              John--Best I can find right now is from -help getmata-:

              "putmata and getmata are designed to work interactively and in do-files. The commands are not designed to work with ado-files. ... Ado-file programmers should use the Mata functions st_data() and st_view() ... and if necessary, use st_store() (see [M-5] st_store()) to post the contents of those vectors and matrices back to Stata." [my emphasis, M.L.]

              I'd like to use this Mata to Stata "merge" in an ado file, so I wonder what StataCorp thinks the problem with -getmata- is? I suppose it's slower, but -getmata- is not likely to get used in a loop, so who cares about speed?

              I *do* like the robustness of the way -getmata- handles the id. I guess something like that could be built with what Bill suggests and some temporary variables to keep the order. From my perspective, I think that a -matamerge- command, with similar features to -merge- is called for, but I'm not the one who can write a good version.

              The irony here is that I bet many of the more sophisticated user-contributed programs have had to solve this problem one way or another.
              Last edited by Mike Lacy; 07 Oct 2019, 16:38.
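As a rough sketch of what the help-file advice quoted above could look like in practice, the following defines a small command that passes a touse marker to a Mata function, which reads with st_data() and writes back through st_view(). The names ratiovar and ratio_fill() are hypothetical, and in a real ado-file the Mata function would sit at the end of the .ado file rather than in a do-file.

```stata
clear all                                   // also clears earlier program/Mata definitions

program define ratiovar
    version 11
    syntax varlist(min=2 max=2 numeric) [if] [in], GENerate(name)
    marksample touse                        // 1 for observations to use
    quietly generate double `generate' = .  // result variable must exist first
    mata: ratio_fill("`varlist'", "`generate'", "`touse'")
end

mata:
void ratio_fill(string scalar vars, string scalar newvar, string scalar touse)
{
    real matrix X, V

    X = st_data(., tokens(vars), touse)     // copy of the two input variables
    st_view(V, ., newvar, touse)            // view onto the result variable
    V[., .] = X[., 1] :/ X[., 2]            // writes straight into the dataset
}
end

sysuse auto, clear
ratiovar weight length if foreign, generate(ratio)
```

Because X and V are both selected by the same touse marker with no sorting in between, their rows line up and no key variable is needed.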



              • #8
                Thanks Mike. Interesting and helpful information.



                • #9
                  I'd find my info even more useful <grin> if we knew why -getmata- should not be used in ado files, and whether there is some way mortals like us can easily implement something like its id() option.
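One way something like an id()-style match might be built from documented Mata functions is sketched below. This is not how -getmata- works internally, which the thread does not establish. The helper name store_by_key() is hypothetical, and the sketch assumes the id variable identifies observations uniquely. It looks up each key's current observation number in an associative array and then calls st_store() on those observation numbers.

```stata
clear all
set seed 1884
sysuse auto
gen int id = _n
gen byte touse = runiform() > 0.2
gen double ratio = .

mata:
// Hypothetical helper: store val into Stata variable varname, matching rows
// by key against Stata variable idvar (assumed unique across observations).
void store_by_key(real colvector key, real colvector val, string scalar idvar, string scalar varname)
{
    real colvector sid, obs, found
    real scalar    i
    transmorphic   A

    sid = st_data(., idvar)                 // ids in the current sort order
    A   = asarray_create("real", 1)         // map: id value -> observation number
    for (i = 1; i <= rows(sid); i++) asarray(A, sid[i], i)

    obs = J(rows(key), 1, .)
    for (i = 1; i <= rows(key); i++) {
        if (asarray_contains(A, key[i])) obs[i] = asarray(A, key[i])
    }
    found = obs :< .                        // keys that exist in the dataset
    if (any(found)) st_store(select(obs, found), varname, select(val, found))
}

X = st_data(., ("id", "weight", "length"), "touse")
_jumble(X)                                  // scramble the Mata rows
end

sort mpg                                    // and scramble the Stata rows
mata: store_by_key(X[.,1], X[.,2] :/ X[.,3], "id", "ratio")
```

Because the lookup is done at the moment of storing, neither the order of the Mata rows nor the sort order of the dataset matters, and keys with no match in the dataset are skipped.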



                  • #10
                    Mike -

                    I'm going to abandon my approach in favor of Sebastian's, for which I give the following fully-worked-out example.
                    Code:
                    clear all
                    set seed 1884
                    sysuse auto
                    gen byte touse = runiform() > 0.2   // not all observations happen to be relevant
                    generate ratio = . // all variables need to exist
                    
                    mata:
                    data = . // viewname argument to st_view needs to exist but is irrelevant
                    st_view(data, ., "weight length ratio", "touse")
                    data[.,3] = data[.,1] :/ data[.,2] // weight/length    
                    end
                    
                    list make weight length touse ratio in 1/10, clean
                    Code:
                    . clear all
                    
                    . set seed 1884
                    
                    . sysuse auto
                    (1978 Automobile Data)
                    
                    . gen byte touse = runiform() > 0.2   // not all observations happen to be relevant
                    
                    . generate ratio = . // all variables need to exist
                    (74 missing values generated)
                    
                    . 
                    . mata:
                    ------------------------------------------------- mata (type end to exit) ----------------------
                    : data = . // viewname argument to st_view needs to exist but is irrelevant
                    
                    : st_view(data, ., "weight length ratio", "touse")
                    
                    : data[.,3] = data[.,1] :/ data[.,2] // weight/length    
                    
                    : end
                    ------------------------------------------------------------------------------------------------
                    
                    . 
                    . list make weight length touse ratio in 1/10, clean
                    
                           make            weight   length   touse      ratio  
                      1.   AMC Concord      2,930      186       0          .  
                      2.   AMC Pacer        3,350      173       1   19.36416  
                      3.   AMC Spirit       2,640      168       0          .  
                      4.   Buick Century    3,250      196       1   16.58163  
                      5.   Buick Electra    4,080      222       0          .  
                      6.   Buick LeSabre    3,670      218       1   16.83486  
                      7.   Buick Opel       2,230      170       1   13.11765  
                      8.   Buick Regal      3,280      200       0          .  
                      9.   Buick Riviera    3,880      207       1   18.74396  
                     10.   Buick Skylark    3,400      200       1         17
                    But with that example to point the way for using st_view, here's my approach using st_data and st_store, shown to be robust to reordering the rows of X.
                    Code:
                    clear all
                    set seed 1884
                    sysuse auto
                    gen byte touse = runiform() > 0.2   // not all observations happen to be relevant
                    generate ratio = . // all variables need to exist
                    gen int id = _n
                    
                    mata:
                    X = st_data(., "id weight length", "touse")
                    _jumble(X) // shuffle the deck
                    newvar = X[.,2] :/ X[.,3] // weight/length
                    X = X , newvar
                    _jumble(X) // shuffle the deck again
                    st_store(X[.,1],"ratio",X[.,4])
                    end
                    
                    list make weight length touse ratio in 1/10, clean
                    Code:
                    . clear all
                    
                    . set seed 1884
                    
                    . sysuse auto
                    (1978 Automobile Data)
                    
                    . gen byte touse = runiform() > 0.2   // not all observations happen to be relevant
                    
                    . generate ratio = . // all variables need to exist
                    (74 missing values generated)
                    
                    . gen int id = _n
                    
                    . 
                    . mata:
                    ------------------------------------------------- mata (type end to exit) ----------------------
                    : X = st_data(., "id weight length", "touse")
                    
                    : _jumble(X) // shuffle the deck
                    
                    : newvar = X[.,2] :/ X[.,3] // weight/length
                    
                    : X = X , newvar
                    
                    : _jumble(X) // shuffle the deck again
                    
                    : st_store(X[.,1],"ratio",X[.,4])
                    
                    : end
                    ------------------------------------------------------------------------------------------------
                    
                    . 
                    . list make weight length touse ratio in 1/10, clean
                    
                           make            weight   length   touse      ratio  
                      1.   AMC Concord      2,930      186       0          .  
                      2.   AMC Pacer        3,350      173       1   19.36416  
                      3.   AMC Spirit       2,640      168       0          .  
                      4.   Buick Century    3,250      196       1   16.58163  
                      5.   Buick Electra    4,080      222       0          .  
                      6.   Buick LeSabre    3,670      218       1   16.83486  
                      7.   Buick Opel       2,230      170       1   13.11765  
                      8.   Buick Regal      3,280      200       0          .  
                      9.   Buick Riviera    3,880      207       1   18.74396  
                     10.   Buick Skylark    3,400      200       1         17
                    With regard to the Fine Manual, I've blamed my problems on not having purchased and studied the definitive

                    Gould, W. W. 2018. The Mata Book: A Book for Serious Programmers and Those Who Want to Be
                    because I think I'm asking more of the Stata Manual format than it can deliver for Mata.



                    • #11
                      Thanks William Lisowski , particularly for the clarification on the need for the view name argument to exist prior. I'll put these examples into my files of worked alternatives. As it happens, I did just last week re-read The Mata Book, and while the coverage of st_data() etc. is better than in the manual, I still found it not so great in that area. (I'll have to look back and see if it covered that arcane point about pre-declaring the argument for st_view().) I still like your original kind of approach here, as I was taught to favor call by value unless there is a really substantial speed/memory penalty. And your approach is more robust here.

                      I am really leaning toward the idea that these interface procedures should have been multiple different procedures, rather than heavily overloaded as they are.



                      • #12
                        Mike Lacy
                        You are right. We need to create the new Mata variable first. For a moment, I was wondering why the code in my comment above nevertheless worked on my computer at the time I posted it. The reason is that I apparently created that Mata variable already before when I was trying a couple of things interactively. Then running the do-file with the clear statement only (instead of clear all) did not erase that variable from memory.
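The difference between clear and clear all that caused this can be seen in a short sketch. The Mata variable name leftover is made up for the illustration.

```stata
mata: leftover = 42
clear                         // drops data and value labels only
mata: leftover                // still defined; displays 42
clear all                     // also clears Mata
capture mata: leftover        // now an error: the name no longer exists
local rc_after = _rc
display `rc_after'            // nonzero return code
```

Starting a do-file with clear all (or mata: mata clear) therefore gives a more honest test of whether the code is self-contained.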

                        A shorter way of using st_view with the variable creation in a single line would be:
                        Code:
                        mata: st_view(newvar = ., ., "ratio", "touse")
                        Regarding sort order, I prefer to create my own index in Mata and then use this in the final view assignment. In the following example, I replaced the id variable in Mata with a column vector that contains a running index:
                        Code:
                        //Data file
                        clear all
                        set seed 1884
                        sysuse auto
                        gen int id = _n
                        //
                        // Give a selection of the observations/vars to Mata and create a silly new variable
                        gen byte touse = runiform() > 0.2   // not all observations happen to be relevant
                        gen ratio = .
                        mata: X = st_data(., ("weight", "length"), "touse")
                        mata: X = ((1::rows(X)), X)
                        mata: _jumble(X)
                        mata: st_view(newvar = ., ., "ratio", "touse")
                        mata: newvar[X[.,1], .] = X[.,2] :/ X[.,3]
                        Code:
                        . list make weight length touse ratio in 1/10, clean
                        
                               make            weight   length   touse      ratio  
                          1.   AMC Concord      2,930      186       0          .  
                          2.   AMC Pacer        3,350      173       1   19.36416  
                          3.   AMC Spirit       2,640      168       0          .  
                          4.   Buick Century    3,250      196       1   16.58163  
                          5.   Buick Electra    4,080      222       0          .  
                          6.   Buick LeSabre    3,670      218       1   16.83486  
                          7.   Buick Opel       2,230      170       1   13.11765  
                          8.   Buick Regal      3,280      200       0          .  
                          9.   Buick Riviera    3,880      207       1   18.74396  
                         10.   Buick Skylark    3,400      200       1         17
                        https://twitter.com/Kripfganz



                        • #13
                          Random stuff in the help mata tree tells me in help mata declarations

                          functions are called by address, not by value, and so may change the caller's arguments
                          and in help mata st_view

                          void st_view(V, real matrix i, rowvector j)
                          ...
                          The type of V does not matter; it is replaced.
                          From this I take it that in general all arguments to functions must exist at the time the function is called, so that there is a corresponding address to pass into the function.

                          The problem with st_view is that it is creating an object which does not have a corresponding type, so the created view cannot be passed back as the result of the function, which would be a more natural approach.

                          Code:
                          data = st_view(., "weight length ratio", "touse")  // hypothetical syntax; not valid Mata
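The call-by-address behaviour quoted from the help files can be seen in a minimal sketch. The function name make_ones() is hypothetical. It overwrites whatever the caller passes in, which is the same pattern st_view() uses for its first argument.

```stata
mata: mata clear
mata:
// Hypothetical function: replaces the caller's first argument.
void make_ones(transmorphic M, real scalar n)
{
    M = J(n, 1, 1)
}

A = .              // the name must exist before the call
make_ones(A, 3)    // A is replaced by a 3 x 1 column of ones
A
end
```

The initial value of A is irrelevant, which matches the help-file remark that the type of V does not matter because it is replaced.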



                          • #14
                            Thanks to both of you. I can sort of understand the address problem; it sure would be nice if the documentation simply said that the view variable has to exist prior to assigning a view.

                            I like Sebastian's solution, which to my understanding will always work so long as the result variable stays "lined up" with the input data matrix in Mata, which I should think would most always be true. I've been a frequent though inexpert Mata user for several years, and it took me some time of reading his code to understand more or less how everything worked. That's not a complaint about his code, but rather something needed because the parts of Mata used here are not very intuitive or similar to other languages I've used. As a possible help to others, I have put below a heavily commented and slightly altered version of Sebastian's code. This is what I saved for myself. Corrections/additions to my comments would be welcome.
                            Code:
                            // How to create a new Stata variable via calculations in Mata.
                            // with detailed explanatory comments.
                            // Example data
                            clear all
                            set seed 1884
                            sysuse auto
                            // Typical situation where only a selection of the observations are used.
                            gen byte touse = runiform() > 0.2  
                            //
                            // Create a space for the result variable in Stata first.
                            gen ratio = .
                            //
                            // Copy input data to Mata.  (Values only, Mata won't alter it.)
                            mata: X = st_data(., ("weight", "length"), "touse")
                            //
                            // Capture order of observations in the Stata dataset in
                            // an index, and attach it to the input data in Mata.
                            mata: X = ((1::rows(X)), X)
                            //
                            mata: _jumble(X)  // just to demo robustness; not actually to be used.
                            // Give Mata access to the Stata result variable's space (newvar == ratio)
                            // by using a view into the Stata dataset. Use only observations that meet
                            // the touse selection.
                            mata: st_view(newvar = ., ., "ratio", "touse")
                            mata: _jumble(X)  // just to demo robustness; not actually to be used
                            //
                            // The list of indices stored in X[., 1] ensures that the
                            // newvar result lines up with original data set order.
                            mata: newvar[X[.,1], .] = X[.,2] :/ X[.,3]
                            list make weight length touse ratio in 1/10, clean
