
  • Promotion Problems

    The summary of this post is that (a) replace, nopromote functions differently on string variables than on numeric variables, (b) with multibyte unicode characters, replace, nopromote functions poorly on string variables, and (c) because replace, nopromote is infrequently used in Stata code, the choice to have the st_store() and especially st_sstore() Mata functions not trigger promotion in Stata (and not have any means of triggering it) leads to unexpected results.

    I think this behavior is at a minimum not well documented, and arguably a bug rather than a feature, especially with respect to the handling of strings.

    Here is some sample code with its results; the discussion follows.

    Code:
    clear
    set obs 5
    generate byte b1 = 1
    generate byte b2 = 1
    generate byte b3 = 1
    generate str1 s1 = "-"
    generate str1 s2 = "-"
    generate str1 s3 = "-"
    generate str1 u1 = "-"
    generate str1 u2 = "-"
    generate str1 u3 = "-"
    describe b* s* u*
    replace b1 = 666
    replace s1 = "abc"
    display "unicode character " ustrunescape("\u2022") " = 3 bytes " tobytes(ustrunescape("\u2022"),1)
    replace u1 = ustrunescape("\u2022")
    replace b2 = 666, nopromote
    replace s2 = "abc", nopromote
    replace u2 = ustrunescape("\u2022"), nopromote
    mata: bb = J(5,1,666)
    mata: st_store(.,"b3",bb)
    mata: ss = J(5,1,"abc")
    mata: st_sstore(.,"s3",ss)
    mata: uu = J(5,1,ustrunescape("\u2022"))
    mata: st_sstore(.,"u3",uu)
    generate byte bb = 1
    generate str1 ss = "-"
    generate str1 uu = "-"
    getmata bb, replace
    getmata ss, replace
    getmata uu, replace
    describe b* s* u*
    list b* s* u*, clean noobs
    Code:
    . describe b* s* u*
    
                  storage   display    value
    variable name   type    format     label      variable label
    --------------------------------------------------------------------------------------------------
    b1              byte    %8.0g                
    b2              byte    %8.0g                
    b3              byte    %8.0g                
    s1              str1    %9s                  
    s2              str1    %9s                  
    s3              str1    %9s                  
    u1              str1    %9s                  
    u2              str1    %9s                  
    u3              str1    %9s                  
    
    . replace b1 = 666
    variable b1 was byte now int
    (5 real changes made)
    
    . replace s1 = "abc"
    variable s1 was str1 now str3
    (5 real changes made)
    
    . display "unicode character " ustrunescape("\u2022") " = 3 bytes " tobytes(ustrunescape("\u2022"),1)
    unicode character • = 3 bytes \xe2\x80\xa2
    
    . replace u1 = ustrunescape("\u2022")
    variable u1 was str1 now str3
    (5 real changes made)
    
    . replace b2 = 666, nopromote
    (5 real changes made, 5 to missing)
    (5 values changed to missing because of storage type)
    
    . replace s2 = "abc", nopromote
    (5 real changes made)
    (5 values truncated because of storage type)
    
    . replace u2 = ustrunescape("\u2022"), nopromote
    (5 real changes made)
    (5 values truncated because of storage type)
    
    . mata: bb = J(5,1,666)
    
    . mata: st_store(.,"b3",bb)
    
    . mata: ss = J(5,1,"abc")
    
    . mata: st_sstore(.,"s3",ss)
    
    . mata: uu = J(5,1,ustrunescape("\u2022"))
    
    . mata: st_sstore(.,"u3",uu)
    
    . generate byte bb = 1
    
    . generate str1 ss = "-"
    
    . generate str1 uu = "-"
    
    . getmata bb, replace
    
    . getmata ss, replace
    
    . getmata uu, replace
    
    . describe b* s* u*
    
                  storage   display    value
    variable name   type    format     label      variable label
    --------------------------------------------------------------------------------------------------
    b1              int     %8.0g                
    b2              byte    %8.0g                
    b3              byte    %8.0g                
    bb              int     %8.0g                
    s1              str3    %9s                  
    s2              str1    %9s                  
    s3              str1    %9s                  
    ss              str1    %9s                  
    u1              str3    %9s                  
    u2              str1    %9s                  
    u3              str1    %9s                  
    uu              str1    %9s                  
    
    . list b* s* u*, clean noobs
    
         b1   b2   b3    bb    s1   s2   s3   ss   u1   u2   u3   uu  
        666    .    .   666   abc    a    a    a    •    �    �    �  
        666    .    .   666   abc    a    a    a    •    �    �    �  
        666    .    .   666   abc    a    a    a    •    �    �    �  
        666    .    .   666   abc    a    a    a    •    �    �    �  
        666    .    .   666   abc    a    a    a    •    �    �    �  
    
    .
    From the "b" series, we see that replace and getmata, replace promote the byte variable to int to accommodate the new value, while replace, nopromote and st_store replace the data with a missing value.

    From the "s" series, we see that only replace promotes the str1 to str3 to accommodate the longer value. The other three (replace, nopromote; st_sstore; and getmata, replace) store just what will fit from the replacement value rather than a missing value to signal failure, as was the case for numeric data. Of these, only replace, nopromote reports the truncation; st_sstore and getmata, replace truncate silently.

    From the "u" series, we again see that only replace promotes the str1 to str3 to accommodate the three-byte Unicode character. The other three (replace, nopromote; st_sstore; and getmata, replace) store just what will fit from the replacement value, which in this case is only the first byte of a three-byte character, and that single byte is not a valid Unicode character.
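
    A minimal sketch of one possible workaround for the truncation shown above, assuming the receiving variable is a fixed-length str# variable: widen it with recast before calling st_sstore(). The names need and have are illustrative only, and the sketch does not handle a vector of all-empty strings, since str0 is not a valid storage type.

    Code:
    clear
    set obs 5
    generate str1 u3 = "-"
    mata:
    uu = J(5, 1, ustrunescape("\u2022"))
    // strlen() counts bytes, the unit in which str# widths are measured
    need = max(strlen(uu))
    // st_vartype() returns e.g. "str1"; for strL the result is missing, so no recast is attempted
    have = strtoreal(substr(st_vartype("u3"), 4, .))
    if (need > have) stata("recast str" + strofreal(need) + " u3")
    st_sstore(., "u3", uu)
    end
    describe u3
    list u3, clean noobs

    Measuring in bytes rather than characters is the point here: the bullet is one character but three bytes, so a character count would still leave the variable too narrow.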

    I will add that in code not demonstrated here, getmata (with no replace option) with character data functioned as you would expect it to, creating ss and uu as str3 variables.

    IMHO, ideally st_store and st_sstore would trigger promotion of the receiving variable. In any event, commands that do not trigger promotion should, for string variables, replace values that will not fit with missing values, as is done for numeric variables. Also, ideally st_store, st_sstore, and getmata, replace should provide diagnostics similar to those from replace.
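
    A minimal sketch of how such a diagnostic could be approximated in user code, assuming the variable s3 and vector ss from the example above: read the stored values back and compare them with what was sent. The name back and the wording of the message are illustrative, not anything Stata itself produces.

    Code:
    clear
    set obs 5
    generate str1 s3 = "-"
    mata:
    ss = J(5, 1, "abc")
    st_sstore(., "s3", ss)
    // read back what Stata actually kept and count the mismatches
    back = st_sdata(., "s3")
    printf("(%g values altered because of storage type)\n", sum(back :!= ss))
    end

    The same comparison should work for numeric data with st_data() in place of st_sdata(), since a value stored as missing no longer equals the value that was sent.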

    This post was precipitated by the following topic on the Mata forum.

    https://www.statalist.org/forums/for...sing-st_sstore
    Last edited by William Lisowski; 16 Dec 2018, 09:41.