  • Returning the last non missing observation

    I am using the following loop to return the last nonmissing observation:
    Code:
    foreach var of varlist _all {
            local N = _N
            // stop at observation 0 so an all-missing variable cannot loop forever
            while `N' > 0 & missing(`var'[`N']) {
                local N = `N' - 1
            }
    }
    However, this is really slow on large datasets. Is there any way to replicate this in Mata?

    P.S. My knowledge of Mata is near zero.

    Thanks in advance.

  • #2
    I would do this in Stata without a loop:

    Code:
    // open example data
    sysuse auto, clear
    
    // sort so there are some missing values
    // at the end to make this example a bit
    // more interesting
    sort rep78
    
    // create a count of nonmissing values
    gen nonmiss = !missing(rep78)
    replace nonmiss = sum(nonmiss)
    
    // to preserve the current sort order
    gen sort = _n
    
    // find the max of nonmissing
    sum nonmiss, meanonly
    
    // create an indicator variable identifying the
    // last nonmissing value
    bys nonmiss (sort) : gen byte last = _n == 1 if nonmiss == r(max)
    
    // admire the result
    list sort nonmiss rep78 last in -10/l
    ---------------------------------
    Maarten L. Buis
    University of Konstanz
    Department of history and sociology
    box 40
    78457 Konstanz
    Germany
    http://www.maartenbuis.nl
    ---------------------------------



    • #3
      Can be even simpler in Stata

      Code:
      sysuse auto, clear
      sort rep78
      
      gen obs = _n
      sum obs if !mi(rep78), meanonly
      
      local last = r(max)
      list make price mpg rep78 in `last'/l



      • #4
        Robert: That is true in this case, as the missing values are guaranteed to be all at the end. I assumed that not to be true in Gerald's case.


        • #5
          Thank you both. Maarten, I believe that because Robert is sorting, the missing values will always be at the end. Anyway, here is the final version, which you both made possible: a little utility to format variables.
          Code:
          *! _formatvar -- Format variables to a standard format
          *! version 1.0        Gerald Gjini        March 2015
          capture program drop _formatvar
          program define _formatvar, rclass
          version 13.1
          local progname _formatvar
          
          
          foreach var of varlist _all {
                  if inlist("`var'", "date") continue
                  
                  sort `var'
                  tempvar obs
                  gen `obs' = _n
                  sum `obs' if !missing(`var'), meanonly
                  local N = r(max)
              
                  if abs(`var'[`N']) < 0.01 {
                      format `var' %9.2fc 
                  }    
                  if 0.01<= abs(`var'[`N']) & abs(`var'[`N']) < 10 {
                      format `var' %9.1fc 
                  }
                  if 10<= abs(`var'[`N']) | abs(`var'[`N']) == 0 {
                      format `var' %9.0fc 
                  }
              
              capture order date
              capture sort date
          }
          end
          exit



          • #6
            Maarten and Gerald,

            I don't get it: there is no need to sort to find or use the last nonmissing observation. (I sorted in my example only to follow Maarten's example, which puts some missing values at the end.)

            Code:
            sysuse auto, clear
            gen obs = _n
            sum obs if !mi(rep78), meanonly
            local last = r(max)
            list make price mpg rep78 in `last'/l
            
            dis rep78[`last']



            • #7
              Still thinking about this...

              The _formatvar program appears to simply find the maximum of each variable and apply a formatting. In that case, all you need is

              Code:
              sum `var', meanonly
              and use r(max) to decide on the format.
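
              A minimal sketch of that idea, assuming numeric variables and the same three cutoffs that _formatvar uses (the varlist here is only an illustration, not part of the original program):

              Code:
              sysuse auto, clear
              foreach var of varlist price mpg gear_ratio {
                  // r(max) is the value that would be last after sorting on `var'
                  quietly summarize `var', meanonly
                  local m = abs(r(max))
                  if `m' < 0.01 & `m' != 0 {
                      local fmt %9.2fc
                  }
                  else if `m' < 10 & `m' != 0 {
                      local fmt %9.1fc
                  }
                  else {
                      // zero, 10 or larger, or all missing
                      local fmt %9.0fc
                  }
                  format `var' `fmt'
              }
              This leaves the sort order of the data untouched, so nothing needs to be re-sorted afterwards.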



              • #8
                In addition to Robert's comments, note that _formatvar is puzzling in various respects.

                The tacit assumption seems to be that all variables are numeric. It's choosing between a rather narrow range of display formats.

                It is declared as r class but returns no r-class results.

                It behaves differently if there is a variable date in the dataset.

                I think this is a case in which Gerald has written something to do what he wants, which is fine, but its generality is limited.
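
                For comparison, here is a rough sketch of what an r-class program that does return something could look like. The program name lastnm and the result name r(last) are made up for illustration and are not part of Gerald's code:

                Code:
                capture program drop lastnm
                program define lastnm, rclass
                    version 13.1
                    syntax varname
                    tempvar obs
                    quietly generate long `obs' = _n
                    // largest observation number with a nonmissing value
                    summarize `obs' if !missing(`varlist'), meanonly
                    return scalar last = r(max)
                end
                
                sysuse auto, clear
                lastnm rep78
                display r(last)
                Because missing() accepts strings as well as numbers, this version does not have to assume a numeric variable.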



                • #9
                  In the end the question was how to do things in Mata.
                  And even though things can be done in Stata it might be good to see how things can be done in Mata.

                  Take a look at the following code.
                  The key problem is to identify the last row containing a nonmissing value. For simplicity in output I've reversed the problem to finding the last row id with a missing value in a column.
                  I usually go into Mata asap:
                  Code:
                  . mata
                  ------------------------------------------------- mata (type end to exit) -------------------------------------------------------
                  : stata("sysuse auto, clear")
                  (1978 Automobile Data)
                  
                  : data = st_data(.,.)
                  
                  :
                  Now I have the auto data in the variable data.
                  If I wanted a report of missing values, I could do something like this (a little diversion, sorry):
                  Code:
                  : st_varname(1..st_nvar())', strofreal(colmissing(data))'
                                     1              2
                       +-------------------------------+
                     1 |          make             74  |
                     2 |         price              0  |
                     3 |           mpg              0  |
                     4 |         rep78              5  |
                     5 |      headroom              0  |
                     6 |         trunk              0  |
                     7 |        weight              0  |
                     8 |        length              0  |
                     9 |          turn              0  |
                    10 |  displacement              0  |
                    11 |    gear_ratio              0  |
                    12 |       foreign              0  |
                       +-------------------------------+
                  
                  :
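                  Now we see that missing values are reported for the variables make and rep78 (make is a string variable, so st_data() reads all 74 of its values as missing).
                  Let us find the row ids of the missing values in rep78: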
                  Code:
                  : slct = st_data(., "rep78") :== .
                  
                  : select((1..rows(data))', slct)
                          1
                      +------+
                    1 |   3  |
                    2 |   7  |
                    3 |  45  |
                    4 |  51  |
                    5 |  64  |
                      +------+
                  
                  :
                  slct is a boolean column vector indicating whether rep78 is missing (1) or not (0).
                  One can of course get the maximum row id by:
                  Code:
                  : max(select((1..rows(data))', slct))
                    64
                  
                  :
                  This is my suggestion on how you can get the needed information using very little Mata code.
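
                  Turning the selection around gives the last nonmissing row, which is what was originally asked for. A minimal sketch, assuming the auto data and that max() of an empty selection comes back as missing:

                  Code:
                  sysuse auto, clear
                  mata:
                  x = st_data(., "rep78")
                  // x :< . is 1 for nonmissing values; keep those row numbers and take the largest
                  last = max(select((1::rows(x)), x :< .))
                  last
                  end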
                  Kind regards

                  nhb



                  • #10
                    Per Nick's suggestion, I beefed up this program to assume a "date" variable, or to let the user pass one or specify the option "nodate". Previously I wanted to return the last nonmissing observation, but since we are using sum and the max to format, I removed the r class. Thanks, Niels, for the Mata code.
                    Code:
                    *! _formatvar -- Format variables to a standard format
                    *! version 1.0        Gerald Gjini        March 2015
                    *! version 2.0        Gerald Gjini        April 2015
                    capture program drop _formatvar
                    program define _formatvar
                    version 13.1
                    local progname _formatvar
                    
                        local syntax = "[ , date(varname numeric) nodate]"
                        capture syntax `syntax'
                        if _rc {
                            noisily display as error `"`progname': syntax is "`progname' `syntax'""'
                            error _rc
                        }
                    
                    // The user may pass in a date varname. If none is passed, assume the variable is named "date"; otherwise the user must explicitly specify "nodate"
                        if missing("`date'") & missing("`nodate'") {
                            capture confirm numeric variable date
                                if _rc {
                                    noisily display as error `"`progname': User must either pass in date-varname or a variable named "date" must be present or specify "nodate""'
                                    noisily display as error `"`progname': syntax is "`progname' `syntax'""'
                                    error _rc
                                }
                                local date date
                        }
                    
                        foreach var of varlist _all {
                            if inlist("`var'", "`date'") continue
                            
                            sort `var'
                            tempvar obs
                            gen `obs' = _n
                            sum `obs' if !missing(`var'), meanonly
                        
                                if abs(`var'[r(max)]) < 0.01 {
                                    format `var' %9.2fc 
                                }    
                                if 0.01<= abs(`var'[r(max)]) & abs(`var'[r(max)]) < 10 {
                                    format `var' %9.1fc 
                                }
                                if 10<= abs(`var'[r(max)]) | abs(`var'[r(max)]) == 0 {
                                    format `var' %9.0fc 
                                }
                        
                            capture order `date'
                            capture sort `date'
                        }
                    end
                    exit



                    • #11
                      Based on Niels's Mata code (rearranged a bit), I wrote a little Mata program that returns the last nonmissing observation.
                      Thanks a lot, Niels.

                      Code:
                      *! varlast -- Returns the last non missing observation
                      *! version 1.0        Gerald Gjini        April 2015
                      
                      capture program drop varlast
                      program varlast
                          version 13
                          syntax varname
                          quietly mata: lastobs("`varlist'")
                          display as txt " last obs = " as res r(last)
                      end
                      
                      
                      mata:
                      mata set matastrict on
                      
                      void lastobs(string scalar varname)
                      {
                      st_varname(1..st_nvar())', strofreal(colmissing(st_data(.,.)))'
                      st_numscalar("r(last)", max(select((1..rows(st_data(.,.)))', st_data(., varname) :!= .)))
                      }
                      end
                      exit



                      • #12
                        So the finished product uses the last nonmissing value to format variables, and I also added a check so that string variables are bypassed.

                        Code:
                        *! _formatvar -- Format variables to a standard format based on the last value
                        *! version 1.0        Gerald Gjini        March 2015
                        *! version 2.0        Gerald Gjini        April 2015
                        capture program drop _formatvar
                        program define _formatvar
                        version 13.1
                        local progname _formatvar
                        
                            local syntax = "[ , date(varname numeric) nodate]"
                            capture syntax `syntax'
                            if _rc {
                                noisily display as error `"`progname': syntax is "`progname' `syntax'""'
                                error _rc
                            }
                        
                        // The user may pass in a date varname. If none is passed, assume the variable is named "date"; otherwise the user must explicitly specify "nodate"
                            if missing("`date'") & missing("`nodate'") {
                                capture confirm numeric variable date
                                    if _rc {
                                        noisily display as error `"`progname': User must either pass in date-varname or a variable named "date" must be present or specify "nodate""'
                                        noisily display as error `"`progname': syntax is "`progname' `syntax'""'
                                        error _rc
                                    }
                                    local date date
                            }
                        
                            foreach var of varlist _all {
                                if inlist("`var'", "`date'") continue
                                capture confirm numeric variable `var'
                                if !_rc {
                                    quietly varlast `var'
                                    
                                    if abs(`var'[r(last)]) < 0.01 {
                                        format `var' %9.2fc 
                                    }    
                                    if 0.01<= abs(`var'[r(last)]) & abs(`var'[r(last)]) < 10 {
                                        format `var' %9.1fc 
                                    }
                                    if 10<= abs(`var'[r(last)]) | abs(`var'[r(last)]) == 0 {
                                        format `var' %9.0fc 
                                    }
                                }
                                else {
                                    display as txt "`var' is a string"
                                    continue
                                }
                                capture order `date'
                                capture sort `date'
                            }
                        end
                        exit



                        • #13
                          Your program feeds to Mata a single variable name, but it will only find the last non-missing value if the variable is numeric.

                          The reason is that string variables are read in by st_data() as all missing values. So, your second line will never find a non-missing value with a string variable.

                          The line


                          Code:
                          st_varname(1..st_nvar())', strofreal(colmissing(st_data(.,.)))'
                          is redundant as its output (a table of variables and the count of missing values) is suppressed by the quietly. With a large data set this line would entail a fair amount of work, so given your goal it would be better omitted.
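
                          If string variables ever need to be handled, one possible sketch is below. The function name lastnonmissing is hypothetical; it assumes that st_sdata() is used for string variables, with "" counted as missing, and st_data() for numeric ones:

                          Code:
                          mata:
                          real scalar lastnonmissing(string scalar varname)
                          {
                              real colvector ok
                          
                              // flag nonmissing rows: strings via st_sdata(), numerics via st_data()
                              if (st_isstrvar(varname)) ok = st_sdata(., varname) :!= ""
                              else                      ok = st_data(., varname) :< .
                          
                              if (rows(ok) == 0) return(.)    // no observations at all
                              return(max(select((1::rows(ok)), ok)))
                          }
                          end
                          
                          sysuse auto, clear
                          mata: lastnonmissing("make")
                          mata: lastnonmissing("rep78")
                          Only the one requested variable is read, so the cost does not grow with the number of variables in the dataset.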
                          Last edited by Nick Cox; 07 Apr 2015, 09:13.



                          • #14
                            Thanks Nick, I had a suspicion that was the case (but never tested it)



                            • #15
                              Being a newbie to Mata, I was tempted to write the code more in Mata and less in Stata. I am not sure I would use the code myself.
                              The result is in the code block below.

                              I think Nick's comment
                              "Your program feeds to Mata a single variable name, but it will only find the last non-missing value if the variable is numeric."
                              is correct but irrelevant, since you (and later I) only accept numeric variables.
                              And the syntax command in Stata is one of the good things there.

                              I've chosen to let syntax command do all the error messaging.

                              What I like about Mata is that it is clearer what you have and what you get.
                              And it seems to be shorter, too

                              I hope you find it usable.

                              Code:
                              cls
                              mata
                                  mata clear
                                  //mata set matalnum on
                                  //mata set matastrict on
                                  
                                  void set_format(string scalar varnames)
                                  {
                                      string colvector vec_names
                                      real colvector slct, data
                                      real scalar i, last_id, last_val, decimals
                                      string scalar varname
                                      
                                      // tokens() returns a row vector; transpose to one name per row
                                      vec_names = tokens(varnames)'
                                      for (i=1; i <= rows(vec_names); i++) {
                                          varname = vec_names[i]
                                          st_view(data, ., varname)
                                          // 1 where the value is nonmissing
                                          slct = data :< .
                                          // last nonmissing row; fall back to the last row if all values are missing
                                          last_id = min((max(select((1..rows(slct))', slct)), rows(slct)))
                                          last_val = abs(data[last_id])
                                          decimals = 2 * (last_val<0.01) + (0.01<=last_val) * (last_val<10)
                                          st_varformat(varname, sprintf(`"%%9.%ffc"', decimals))
                                      }
                                  }
                              end
                              
                              capture program drop test
                              program define test
                                  syntax [varlist(numeric)]
                                  mata set_format("`varlist'")
                              end
                              
                              
                              sysuse auto, clear
                              test price-foreign
                              Kind regards

                              nhb
