  • PSA: Mata's mean() performs row-wise deletion

    This is just a public service announcement that Mata's mean() function might not work as you think it does when the data contain missing values. The behaviour is actually documented twice in the help file, but that only helps if you thought to check the Mata help file for mean() to see whether your program was working as intended.

    In particular, when calculating the mean per column, Mata only considers rows that have no missing values. The easiest way to see what this means is perhaps the example below, where all the column means are missing even though every column has non-missing values. In other words, if row 1 has a missing value in column 1, then row 1 is discarded for all columns.

    I just hope this saves someone from spending hours debugging their program when the cause could be this simple (as I did over the past weeks).


    Code:
    : A = (1, 2, . \ 3, ., 5 \ ., 1, 2)
    
    : A
     
        +-------------+
        |  1   2   .  |
        |  3   .   5  |
        |  .   1   2  |
        +-------------+
    
    : mean(A)
        +-------------+
        |  .   .   .  |
        +-------------+
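    The effect is easier to see when only some rows are incomplete. The following is a minimal sketch with a made-up matrix B in which only row 2 has a missing value; it compares mean(B) with the mean over complete rows built by hand with select() and rowmissing().

    Code:
    mata:
    // B is a hypothetical example: only row 2 contains a missing value
    B = (1, 2, 3 \ 10, ., 6 \ 7, 8, 9)

    // mean() drops row 2 entirely, so this averages rows 1 and 3 only
    mean(B)

    // The same calculation spelled out: keep rows with zero missing values
    mean(select(B, rowmissing(B) :== 0))
    end

    Both calls should give (4, 5, 6), even though column 1 has no missing values and its plain average is 6.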
    As a related question: is there a function in Mata that calculates the mean without row-wise deletion?
    Last edited by Jesse Wursten; 21 Nov 2016, 10:43.

  • #2
    I think the issue is that a matrix is defined by its number of rows and columns, so it is problematic to implement a function that uses a different number of entries from each column. I am therefore not aware of any such function. However, with equal weighting the mean is simply the sum divided by the number of non-missing elements, so even without a function that ignores missing values you can combine other Mata functions. colsum() treats missing values as zero by default, and colnonmissing() counts the non-missing entries in each column. For example:


    Code:
    : A = (1, 2, . \ 3, ., 5 \ ., 1, 2)
    
    : A
           1   2   3
        +-------------+
      1 |  1   2   .  |
      2 |  3   .   5  |
      3 |  .   1   2  |
        +-------------+
    
    : colsum(A) :/ colnonmissing(A)
             1     2     3
        +-------------------+
      1 |    2   1.5   3.5  |
        +-------------------+
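
    If this comes up often, the expression can be wrapped in a small function. The sketch below is only an illustration: the name colmean_nomiss() is made up, and it uses quadcolsum(), the quad-precision counterpart of colsum(), which also treats missing values as zero by default.

    Code:
    mata:
    // Hypothetical helper (not a built-in): column means that skip
    // only the missing cells instead of deleting whole rows
    real rowvector colmean_nomiss(real matrix X)
    {
        // quadcolsum() treats missing values as zero, so dividing by the
        // per-column count of non-missing values gives the column mean
        return(quadcolsum(X) :/ colnonmissing(X))
    }

    A = (1, 2, . \ 3, ., 5 \ ., 1, 2)
    colmean_nomiss(A)
    end

    For the matrix A above this should reproduce (2, 1.5, 3.5). A column with no non-missing values gives 0/0, which Mata evaluates to missing.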



    • #3
      That actually makes a lot of sense.

      For anyone wondering, colsum() :/ colnonmissing() is roughly 3x faster than calling mean() on each column in a loop, at least in my very ramshackle test.

      Code:
      . mata
      ------------------------------------------------- mata (type end to exit) -------------------------------------------------
      :         timer_clear()
      
      :         
      :         for(m=1; m<=100; m++){  
      >                 A = runiformint(1000,1000, 1, 5)
      >                 _editvalue(A, 4, .)
      >                 _editvalue(A, 3, .)
      >                 
      >                 // Per column
      >                 timer_on(1)
      >                 E1 = J(1, cols(A), .)
      >                 for(i=1;i<=cols(A);i++){
      >                         E1[1,i] = mean(A[.,i])
      >                 }
      >                 timer_off(1)
      >                 
      >                 // Mata functions
      >                 timer_on(2)
      >                 E2 = colsum(A):/ colnonmissing(A)
      >                 timer_off(2)
      >         }
      
      :         timer()
      
      ---------------------------------------------------------------------------------------------------------------------------
      timer report
        1.       4.97 /      100 =    .04969
        2.       1.59 /      100 =    .01587
      ---------------------------------------------------------------------------------------------------------------------------



      • #4
        1. mean() uses quadcross(), so try comparing with quadcolsum() instead of colsum()
        2. You're looping over the columns of A when taking the mean. For loops are slow in Mata
        3. You allocate a 1 x 1000 vector in each simulation
        4. You pass the subscripted matrix A[.,i] to mean(), which takes time


        I think 2, 3 and 4 are the main sources of the discrepancy. 1 is unlikely to matter for speed, but it should matter for precision.
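
        The points above suggest a slightly fairer comparison. The following is only a sketch, with no timings claimed: it allocates the result vector once, outside the timed region (point 3), and uses quadcolsum() so that both sides accumulate in quad precision (point 1). The loop and the subscripting (points 2 and 4) remain, since mean() has to be called column by column to avoid row-wise deletion.

        Code:
        mata:
        timer_clear()

        // Same test data as in #3: integers 1 to 5, with 3s and 4s set to missing
        A = runiformint(1000, 1000, 1, 5)
        _editvalue(A, 4, .)
        _editvalue(A, 3, .)

        // Allocate the result vector once, outside the timed region
        E1 = J(1, cols(A), .)

        // Timer 1: mean() column by column
        timer_on(1)
        for (i=1; i<=cols(A); i++) {
            E1[i] = mean(A[., i])
        }
        timer_off(1)

        // Timer 2: quad-precision column sums over non-missing counts
        timer_on(2)
        E2 = quadcolsum(A) :/ colnonmissing(A)
        timer_off(2)

        timer()

        // Largest relative difference between the two results
        mreldif(E1, E2)
        end

        The final mreldif() call is there to check that the two approaches agree before reading anything into the timer report.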
