
  • Stata data types

    The Stata manual lists the range of each storage type in its entry on data types; for example, the range for byte is [-127; 100]. The interval is asymmetric because the largest values are reserved for the missing value and the extended missing values.

    So the total number of values that can be represented by the byte type is: 127 (negatives) + 1 (zero) + 100 (positives) + 1 (missing value) + 26 (extended missing values) = 255.

    But one byte can hold 256 different values! On careful inspection, we can see that the value -128 (which a signed byte storage type can in theory represent) is blacklisted. We can't create a byte variable holding that value in Stata:

    Code:
    clear
    set obs 1
    generate byte x=-128
    results in a missing value in variable x.
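
    The counting argument above can be checked with a short sketch. It is written in Python rather than Stata and takes the ranges quoted in this post as given; placing the 27 missing codes at 101..127 is an assumption based on the statement that the largest values are reserved for them.

```python
# Count how the 256 values of a signed byte are used, taking the ranges
# quoted in the post as given (this is not checked against Stata itself).
ALL_VALUES = range(-128, 128)      # every value a two's complement byte can hold
NONMISSING = range(-127, 101)      # documented byte range [-127; 100]
MISSING_CODES = range(101, 128)    # assumed: ., .a, ..., .z occupy the top 27 values

used = set(NONMISSING) | set(MISSING_CODES)
unused = sorted(set(ALL_VALUES) - used)

print(len(ALL_VALUES))     # 256
print(len(NONMISSING))     # 228 = 127 + 1 + 100
print(len(MISSING_CODES))  # 27 = 1 + 26
print(unused)              # [-128]
```

    Under these assumptions the only value left without a role is -128.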

    The problem is that Stata doesn't perform the same validation when opening datasets, and happily accepts the value -128. This results in rather strange behavior later.

    Code:
    clear
    use "http://www.radyakin.org/statalist/2017/1371324-stata-data-types.dta"
    clonevar V1=V0
    summarize
    Produces the following output:
    Code:
        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
              V0 |          1        -128           .       -128       -128
              V1 |          0
    while identical stats are expected for clones and originals.
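
    For illustration only, the kind of range check that appears to be skipped on load could look like the sketch below. It is Python, not Stata, and it works on a raw buffer of signed bytes rather than on a real .dta file; the function name `invalid_byte_values` and the sample buffer are hypothetical, and the upper bound of 127 assumes that the missing codes occupy 101..127.

```python
import struct

# Assumed limits for a Stata byte: -127 is the smallest nonmissing value,
# 127 the largest missing code. Only -128 falls outside this interval.
BYTE_MIN, BYTE_MAX_CODE = -127, 127

def invalid_byte_values(raw: bytes) -> list[int]:
    """Return the positions whose signed value falls outside [-127, 127]."""
    values = struct.unpack(f"{len(raw)}b", raw)  # 'b' = signed 8-bit integer
    return [i for i, v in enumerate(values) if not BYTE_MIN <= v <= BYTE_MAX_CODE]

sample = bytes([0x00, 0x64, 0x81, 0x80])   # hypothetical column of four bytes
print(struct.unpack("4b", sample))         # (0, 100, -127, -128)
print(invalid_byte_values(sample))         # [3]
```

    A reader with such a check could reject the file or map the offending value to missing; which of the two would be appropriate is part of the question below.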
    1. What is the purpose of blacklisting values like -128, -32768, etc. in their respective storage types? For example, C# fits -128 nicely into its signed byte type (sbyte).
    2. Would it be possible to add validation that prohibits such values in input files, given that they are subsequently processed incorrectly in Stata (as the clonevar/summarize example above demonstrates)?
    Thank you, Sergiy Radyakin

  • #2
    I note that in ones' complement implementations of signed integers, -128 cannot be represented in one byte, nor -32,768 in two bytes (instead, there are two representations of zero). I'm led to suspect that some consideration along these lines is what led the developers of Stata to the choice they made.
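
    The difference between the two encodings can be made concrete with a small sketch (Python, purely illustrative; it shows how the bit patterns decode and says nothing about how Stata is actually implemented):

```python
def twos_complement(b: int) -> int:
    """Decode an 8-bit pattern (0..255) as a two's complement integer."""
    return b - 256 if b >= 128 else b

def ones_complement(b: int) -> int:
    """Decode an 8-bit pattern (0..255) as a ones' complement integer."""
    return -(~b & 0xFF) if b >= 128 else b

twos = {twos_complement(b) for b in range(256)}
ones = {ones_complement(b) for b in range(256)}

print(min(twos), max(twos), len(twos))                 # -128 127 256
print(min(ones), max(ones), len(ones))                 # -127 127 255
print(ones_complement(0x00), ones_complement(0xFF))    # 0 0
```

    Ones' complement yields 255 distinct values because both 0x00 and 0xFF decode to zero, which matches the count of 255 in the first post.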
