  • entropy and missing values

    The number of non-missing values varies across cases. For example, in the table below, a has two values and d has all five. I want to know how this (m in the entropy formula below) can be handled in Stata.
    [Image: entropy.JPG]


    x y z w v
    a 2 3 . . .
    b 4 5 6 . .
    c 2 . . . .
    d 4 5 6 7 2

    [QUOTE]
    gen entropy = 0
    foreach v in a b c d e {
    replace entropy = entropy + cond(`v' == 0, 0, `v' * ln(1/`v'))
    }
    sum entropy
    [/QUOTE]

  • #2
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str1 x int(y z w v)
    "a" 2 3 . .
    "b" 4 5 6 .
    "c" 2 . . .
    "d" 4 5 6 7
    end
    
    foreach var of varlist y-v {
        tempvar _`var'
        gen `_`var'' = `var' * ln(1/`var')
    }
    
    egen entropy = rowtotal(`_y'-`_v')
    Finally, "entropy" stores the formula's output for each row. For example, in the first row, 2*ln(1/2) + 3*ln(1/3) = -4.682131. I'm not sure whether this is what you wanted.

    Code:
         +-------------------------------+
         | x   y   z   w   v     entropy |
         |-------------------------------|
      1. | a   2   3   .   .   -4.682131 |
      2. | b   4   5   6   .   -24.34292 |
      3. | c   2   .   .   .   -1.386294 |
      4. | d   4   5   6   7   -37.96429 |
         +-------------------------------+
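
If m in the question stands for the number of non-missing values in each row (an assumption, since the formula image is not reproduced here), it can be counted directly with egen's rownonmiss() function. A minimal sketch on the same example data; the variable name m is chosen here for illustration only:

```stata
clear
input str1 x y z w v
a 2 3 . .
b 4 5 6 .
c 2 . . .
d 4 5 6 7
end

* m = number of non-missing values among y z w v in each row
egen m = rownonmiss(y z w v)

list x y z w v m
```

For these four rows m comes out as 2, 3, 1 and 4, so it can be used in any later expression that needs the per-row count.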


    • #3
      There is some confusion here. The entropy formula, as is standard, requires that p be a probability (proportion, fraction), so the probabilities must add to 1.


      Here I use natural logarithms. Some people use logarithms to base 2 and yet others logarithms to base 10. Also, my code assumes that the inputs are counts or amounts. If they are category codes the calculation is quite different.
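
To make the role of the proportions and of the logarithm base concrete, here is the calculation written out for row a (values 2 and 3), assuming the standard Shannon form. The symbols x_i, p_i, H, H_2 and H_10 are notation introduced here for illustration; missing values are simply left out of the sums.

```latex
p_i = \frac{x_i}{\sum_j x_j}, \qquad
H = \sum_{i:\,p_i > 0} p_i \ln\frac{1}{p_i}

\text{Row a: } p = \left(\tfrac{2}{5},\ \tfrac{3}{5}\right), \qquad
H = 0.4 \ln 2.5 + 0.6 \ln\tfrac{5}{3} \approx 0.6730

H_{2} = \frac{H}{\ln 2} \approx 0.9710, \qquad
H_{10} = \frac{H}{\ln 10} \approx 0.2923
```

Changing the base only rescales the result by a constant, so rankings of rows by entropy are unaffected.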


      Code:
      clear 
      input str1 x y z w v
      a 2 3 . . .
      b 4 5 6 . .
      c 2 . . . .
      d 4 5 6 7 2
      end 
      
      egen total = rowtotal(y z w v)
      
      gen entropy = 0 
      gen prob = . 
      
      foreach v in y z w v {
          replace prob = cond(missing(`v'), 0, `v' / total) 
          replace entropy = entropy + prob * ln(1 / prob)) if prob > 0  
      } 
      
      list 
      
           +-------------------------------------------------+
           | x   y   z   w   v   total    entropy       prob |
           |-------------------------------------------------|
        1. | a   2   3   .   .       5   .6730117          0 |
        2. | b   4   5   6   .      15   1.085189          0 |
        3. | c   2   .   .   .       2          0          0 |
        4. | d   4   5   6   7      22   1.365393   .3181818 |
           +-------------------------------------------------+
      



      • #4
        Fei Wang, thank you.
        Nick Cox, thank you, but something in your code is incorrect and it does not run. The error messages are as follows:
        too many ')' or ']'
        /total invalid name



        • #5
          You’re right. I simplified the code from what I had earlier but introduced a typo. Delete the last ) before the if qualifier.
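
For reference, a sketch of the code from #3 with that parenthesis removed. One assumption is labeled in the comments: each data row is trimmed to four values so that it matches the four numeric variables declared on the input line and the listing shown in #3.

```stata
clear
* Assumption: four values per row, matching y z w v and the listing in #3
input str1 x y z w v
a 2 3 . .
b 4 5 6 .
c 2 . . .
d 4 5 6 7
end

* Row total over non-missing values (rowtotal() treats missing as 0)
egen total = rowtotal(y z w v)

gen entropy = 0
gen prob = .

foreach v in y z w v {
    * Proportion of the row total; missing values contribute 0
    replace prob = cond(missing(`v'), 0, `v' / total)
    * Terms with prob == 0 are skipped, so ln(1/0) is never used
    replace entropy = entropy + prob * ln(1 / prob) if prob > 0
}

list
```

Within the loop, `v' is the local macro holding the current variable name, which is why a variable that is itself called v causes no clash.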



          • #6
            Nick Cox Thank you. It works well.
