  • A bug with SPSS imports and 81 byte variable labels

    Stata limits variable labels to a maximum of 80 characters. SPSS, on the other hand, limits variable labels to a maximum of 256 bytes. Stata's documentation states:

    If an SPSS variable label is too long, it will be truncated to 80 characters, and the original variable label will be stored as a variable characteristic.
    This is wrong on two counts. The first is that after importing an .sav file, variable labels are truncated to 80 bytes, not 80 characters. If your labels are purely ASCII characters, you will not notice the difference. But if your labels are written in a script where each character is multiple bytes, like Arabic, you'll notice quite quickly. Your label will be half as long or shorter than what Stata can actually store, and truncation will frequently occur partway through a character, leaving an invalid Unicode character at the end (appearing as �).
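To make the byte-versus-character distinction concrete, here is a minimal Python sketch. It is only an illustrative model, not Stata's actual import code; the function names and the sample label are made up for the example, and it assumes the label is encoded as UTF-8.

```python
# Illustrative model only: this is not Stata's import code.
LIMIT = 80  # Stata's variable label limit discussed above


def truncate_bytes(label: str, limit: int = LIMIT) -> str:
    """Cut the UTF-8 encoding at `limit` bytes, then decode.

    A character split partway through is decoded as U+FFFD.
    """
    return label.encode("utf-8")[:limit].decode("utf-8", errors="replace")


def truncate_chars(label: str, limit: int = LIMIT) -> str:
    """Keep the first `limit` Unicode characters."""
    return label[:limit]


# Hypothetical label: one ASCII letter followed by 50 Arabic letters
# (U+0628, two bytes each in UTF-8), i.e. 51 characters and 101 bytes.
label = "a" + "\u0628" * 50

byte_cut = truncate_bytes(label)
char_cut = truncate_chars(label)

print(len(label), len(label.encode("utf-8")))    # characters, bytes
print(len(byte_cut), byte_cut.endswith("\ufffd"))  # shortened, ends in U+FFFD
print(char_cut == label)                          # the whole label fits in 80 characters
```

In this sample the 80-byte cut lands in the middle of a two-byte character, so the truncated label ends in the replacement character, while an 80-character limit would have kept the entire label.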

    There's a chance there is some esoteric reason for doing it this way, and that this is not a bug but rather a mistake in the documentation. But what is almost surely a bug is that the original variable label is only stored as a variable characteristic (named spss_variable_label) if it is 82 bytes or longer. If you import a variable with an 81 byte label, the last byte is simply lost and not recoverable in the Stata data, existing neither in the 80 byte label nor in a variable characteristic.

    If any of this behavior is fixed or otherwise changed in a future update, I would really appreciate it if StataCorp could reply letting me know which version has changed it. I have written a command for internal use at my company that in one step has to match up variables in Stata with variables in a different file format, and it explicitly takes all of this odd behavior into account.

    (I don't believe this behavior is version dependent, but just in case, I am running the latest version of Stata 19.5, born date 18 Feb 2026, compile number 195038, on MacOS 14.7.4)

  • #2
    Jackie,

    I’m currently looking into this issue.

    Kevin



    • #3
      Jackie,

      This is a bug in import spss. The fix will be released in a future update to Stata 18/Stata 19, and Stata 19.5.



      • #4
        Thanks for the update Kevin!

        If the bug will be simultaneously fixed in Stata 18, 19, and 19.5, and this label behavior is not predictable by simply checking the version number, I've just realized that I'll have to retool my aforementioned program to check for both possibilities at once.
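One way to check for both possibilities at once might look like the Python sketch below. It is not the program mentioned in this thread, which is not shown here; the function name `matches_imported_label` is hypothetical, and it assumes both labels are available as UTF-8 and that the imported Stata label can be read as raw bytes.

```python
def matches_imported_label(spss_label: str, stata_label_bytes: bytes,
                           limit: int = 80) -> bool:
    """Return True if the imported label matches either truncation rule.

    Hypothetical helper: accepts the older 80-byte cut and the newer
    80-character cut, so it does not depend on the Stata version.
    """
    raw = spss_label.encode("utf-8")
    byte_cut = raw[:limit]                          # older behavior: first 80 bytes
    char_cut = spss_label[:limit].encode("utf-8")   # newer behavior: first 80 characters
    return stata_label_bytes in (byte_cut, char_cut)


# Usage with a made-up label of 51 characters and 101 bytes.
spss_label = "a" + "\u0628" * 50
old_style = spss_label.encode("utf-8")[:80]   # may end partway through a character
new_style = spss_label.encode("utf-8")        # whole label fits in 80 characters

print(matches_imported_label(spss_label, old_style))
print(matches_imported_label(spss_label, new_style))
print(matches_imported_label(spss_label, b"something else"))
```

Comparing bytes rather than decoded text avoids having to guess how a character cut in half is displayed.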

        Everyone at StataCorp whose life I have made more difficult today by giving them more work to do can rest assured that I've made my own life more difficult as well, haha.



        • #5
          Kevin Crow (StataCorp)

          The newest Stata update from 15 April 2026 fixed one of the aforementioned issues. After importing an SPSS file, the Stata label will now be the first 80 characters of the SPSS label, instead of the first 80 bytes.

          However, at least on my copy of Stata, variable characteristics storing full labels aren't working at all! Any variable with a label too long for Stata is meant to have a characteristic named spss_variable_label that stores the entire label. This worked in previous versions (except for labels exactly 81 bytes long), but now I don't see any such characteristics saved at all. In case the issue is version dependent, I'm running Stata for MacOS (compile number 195044).

          The Stata documentation has been updated to read:

          If an SPSS variable label is too long, it will be truncated to 256 characters, and the original variable label will be stored as a variable characteristic.
          The same as before, but with the number 80 changed to 256. This makes no sense, because SPSS labels cannot be longer than 256 bytes, and therefore not longer than 256 characters, so they could never be truncated to the first 256 characters. Did someone set the condition for the creation of a spss_variable_label characteristic to be a label greater than 256 characters instead of 80 characters, an impossible condition that never gets triggered?
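The reasoning that a 256-byte label can never exceed 256 characters can be checked with a short Python sketch. It assumes UTF-8, where every character takes at least one byte; the sample labels and the helper name `exceeds_256_chars` are made up for illustration.

```python
SPSS_MAX_BYTES = 256  # SPSS label limit stated in this thread


def exceeds_256_chars(label: str) -> bool:
    """The condition the updated documentation seems to describe."""
    return len(label) > 256


# Made-up labels that each fill the 256-byte limit exactly,
# using 1-byte, 2-byte, and 4-byte characters respectively.
samples = ["a" * 256, "\u0628" * 128, "\U0001F600" * 64]

for s in samples:
    n_bytes = len(s.encode("utf-8"))
    n_chars = len(s)
    # A character never takes less than one byte, so chars <= bytes <= 256.
    print(n_chars, n_bytes, exceeds_256_chars(s))
```

Even the all-ASCII label, the case with the most characters per byte, reaches only 256 characters, so a "longer than 256 characters" condition is never met.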



          • #6
            To be completely clear, below is an outline of what is meant to happen versus what happened before and what happens now.

            Intended behavior:
            *SPSS labels 80 characters or shorter become Stata labels
            *SPSS labels 81 characters or longer have the first 80 characters become Stata labels, and the full label is stored in a variable characteristic named spss_variable_label

            Incorrect behavior (before 15 April 2026):
            *SPSS labels 80 bytes or shorter become Stata labels
            *SPSS labels 81 bytes long have the first 80 bytes become Stata labels, and no variable characteristic is created
            *SPSS labels 82 bytes or longer have the first 80 bytes become Stata labels, and the full label is stored in a variable characteristic named spss_variable_label

            Incorrect behavior (15 April 2026 to today):
            *SPSS labels 80 characters or shorter become Stata labels
            *SPSS labels 81 characters to 256 characters have the first 80 characters become Stata labels, and no variable characteristic is created
            *(Seemingly, from documentation) SPSS labels longer than 256 characters have the first 80 characters become Stata labels, and the full label is stored in a variable characteristic named spss_variable_label, but since this is impossible, it never occurs
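The three behaviors outlined above can be written down as a small Python model. This is only a sketch of the rules as described in this thread, not Stata's implementation; the function names are invented, labels are assumed to be UTF-8, and each function returns the Stata label together with the value of the spss_variable_label characteristic (None when no characteristic is created).

```python
from typing import Optional, Tuple

LABEL_LIMIT = 80
Result = Tuple[str, Optional[str]]  # (Stata label, characteristic or None)


def intended(label: str) -> Result:
    """Intended: cut at 80 characters, keep the full label if it was cut."""
    if len(label) <= LABEL_LIMIT:
        return label, None
    return label[:LABEL_LIMIT], label


def before_15_april(label: str) -> Result:
    """Reported old behavior: cut at 80 bytes, characteristic only from 82 bytes."""
    raw = label.encode("utf-8")
    # A character split by the cut is shown here as U+FFFD.
    cut = raw[:LABEL_LIMIT].decode("utf-8", errors="replace")
    return cut, (label if len(raw) >= 82 else None)


def after_15_april(label: str) -> Result:
    """Reported newer behavior: cut at 80 characters, characteristic condition
    (seemingly more than 256 characters) that no valid SPSS label can meet."""
    return label[:LABEL_LIMIT], (label if len(label) > 256 else None)


label_81 = "a" * 81  # made-up ASCII label: 81 characters, 81 bytes
for rule in (intended, before_15_april, after_15_april):
    stata_label, characteristic = rule(label_81)
    print(rule.__name__, len(stata_label), characteristic is not None)
```

For the 81-byte label only the intended rule preserves the full text; both reported behaviors keep 80 characters and create no characteristic.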



            • #7
              Jackie,

              You are correct. The bug fix on 15 April 2026 fixed the issue with storing up to 80 Unicode characters, but broke the code for storing the entire variable label as a variable characteristic when the label exceeds 80 Unicode characters. I have fixed this, and it will be out in the next update. Thanks for reporting this.

              Kevin
