  • Can strL be disabled? It is breaking all my pre-16 code.

    The strL data type is clearly useful, but the fact that it can't be used for a merge is breaking
    backward compatibility with Stata 15 and earlier. I don't know if strL is new to Stata 16, or if
    Stata 16 is just more aggressive about using it as a data type.

    My prior code is breaking because variables that were formerly `str##` are now being read by Stata
    as `strL`, so they cannot be used as merge keys. My research team has written about 200,000 lines of
    Stata code over the last 10 years, and these breakages are happening left and right. In many cases,
    the efficiency gain is not remotely worth it: today some code broke while running a string
    replacement using a replacement file that was only 10 strings in length.

    In this specific case, the 10-line string file was imported with `import delimited`, and Stata
    defaulted the key string to strL, breaking the merge.

    I know I can recast these to the `str##` format, but I would like to avoid updating our codebase in
    thousands of places.

    Ideally, I would like to ask Stata to avoid using the `strL` format except when I specifically
    request it. Alternatively, I would like it to avoid `strL` when the string length is less than some
    character length, like 100.

    Can this be done?

    Thank you!

    -p

  • #2
    strL as a variable or storage type (not a format) was introduced in Stata 13. It's not new. See e.g.

    Code:
    help whatsnew12to13
    Can you give a 10-line data file which behaves as you report? Naturally, you can and should remove any confidential or sensitive details.

    Otherwise I think you may need to take this up with StataCorp technical services.

    Please note our request to use full real names, as at e.g. https://www.statalist.org/forums/help#realnames



    • #3
      There are a couple of issues here on top of what Nick has pointed out.

      1) Merging on a strL is unwise, and is impossible in Stata (tested in 15 and 16).

      2) The "older" str formats can only hold up to 2045 bytes, so if your code somehow worked before, then it was doing so either by silent (to you) truncation of data or because you had never encountered longer string data until now. Presumably that entire string is important data.

      3) The 10-line string file has an import error caused by missing delimiters (say).

      #1 makes me think that #2 or #3 is true. In the long run, you would be better off redesigning your code to use a better merge key.
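
      One way to tell these cases apart is to look at how long the key values actually are after import. This is only a rough sketch: keyvar is a hypothetical name standing in for whatever string variable is used as the merge key, and 2045 is the widest str# storage type.

      ```stata
      * keyvar is a hypothetical name for the string merge key
      generate long keybytes = strlen(keyvar)   // length of each value in bytes
      summarize keybytes                        // compare the mean with the maximum
      count if keybytes > 2045                  // values too long for any str# type
      ```

      A maximum far above what the data should contain would point to an import problem; a nonzero count means those values cannot be held in a str# variable at all.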



      • #4
        Thank you for the response.

        Here is a link to a 28-row file with two string columns:
        https://www.dropbox.com/s/bjikrslzqb...hicle.csv?dl=0.

        When I use import delimited to read this file in Stata 16, v1 is str28 and v2 is strL. When I import
        the same file in Stata 14, v1 is str28 and v2 is str321.

        This is a translation file where the Telugu string is a merge key. As a result, our code which works
        in Stata 14 does not work in Stata 16, which defaults the field to strL.

        I should have mentioned in my initial post that these are foreign language strings with Unicode;
        perhaps that is related to why Stata 16 treats them differently.

        Nevertheless, my question remains: is there an option or way I can request that Stata refrain from
        converting these strings to strL upon import?

        p.s. I sent a support request to replace my pseudonym with my real name.



        • #5
          I think this would be a good question to direct to Stata Technical Services. The output of help whatsnew15to16 tells us import delimited was enhanced, but not in ways that seem to have anything to do with this. If you can attach a small dataset that demonstrates the problem in Stata 16 after successful import in Stata 14, that will add weight to your question.

          Added in edit: I don't think this will help, but try, in Stata 16,
          Code:
          version 14: import delimited ...
          Last edited by William Lisowski; 14 Aug 2020, 14:09.



          • #6
            The behavior of import delimited was changed in Stata 16 to use a strL if using a strL would result in a smaller dataset. This might be what is happening in your case. However using the dataset that you provided along with the proper encoding, the results are as follows:

            Code:
             . import delimited "telugu_vehicle.csv", varnames(1) encoding(UTF-8) clear 
            (2 vars, 27 obs)
            
            . d
            
            Contains data
              obs:            27                          
             vars:             2                          
            ---------------------------------------------------------------------------------------------------------------------------------------
                          storage   display    value
            variable name   type    format     label      variable label
            ---------------------------------------------------------------------------------------------------------------------------------------
            master          str28   %28s                  
            vehicle         str168  %168s                 
            ---------------------------------------------------------------------------------------------------------------------------------------
            Sorted by: 
                 Note: Dataset has changed since last saved.
            Using a strL vs a long str can dramatically reduce the size of the dataset when there is a large difference between the average width of a string and the longest string. The change was done under version control so you could prefix the import delimited command with a version 15: statement, or simply recast str varname before you merge. Example:

            Code:
            version 15: import delimited "telugu_vehicle.csv", varnames(1) encoding(UTF-8) clear
            or
            Code:
            recast str vehicle



            • #7
              Thanks William and James, it's helpful to understand how and when this happens.

              As James describes the situation, a data file with some strings could default to str and merge correctly, but then cause errors if the file is marginally changed such that Stata decides it should use a strL instead. As noted, this feature definitely causes code that works in Stata 15 and earlier to break in Stata 16.

              It seems like the only reliable coding approach here is to recast all potential merge strings to str, whether it seems necessary at the time or not, since it is hard to predict which format Stata will use when reading a file in. Is this now the recommended approach for merges with strings?

              I humbly submit that it might be nice to have a setting to disable the automatic selection of strL vs str. Or else to permit merges on strL in a future version.
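
              A minimal sketch of the recast-everything approach discussed above, assuming the telugu_vehicle.csv file and the vehicle key from #6. The program name keys_to_str is hypothetical, not a built-in command; it only wraps the recast str command that James suggested.

              ```stata
              * Hypothetical helper (not built in): recast any strL merge keys to str#
              capture program drop keys_to_str
              program define keys_to_str
                  syntax varlist
                  foreach v of local varlist {
                      local t : type `v'          // storage type, e.g. str28 or strL
                      if "`t'" == "strL" {
                          recast str `v'          // same command as suggested in #6
                      }
                  }
              end

              import delimited "telugu_vehicle.csv", varnames(1) encoding(UTF-8) clear
              keys_to_str vehicle
              describe vehicle                    // confirm the key is now a str# type
              ```

              Calling the helper after every import that feeds a merge leaves variables that are already str# untouched, so it should be harmless where Stata did not choose strL.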



              • #8
                Originally posted by Paul Novosad
                . . .code that works prior to Stata15 to break in Stata16.

                It seems like the only reliable coding approach here is to
                instruct your team to put the version number at the top of all do-files.

                .
                . version 15.1 //  this

                .
                . local line_size `c(linesize)'

                . set linesize 79

                .
                . clear *

                .
                . import delimited telugu_vehicle.csv, varnames(1)
                (2 vars, 27 obs)

                .
                . describe

                Contains data
                  obs:            27
                 vars:             2
                -------------------------------------------------------------------------------
                              storage   display    value
                variable name   type    format     label      variable label
                -------------------------------------------------------------------------------
                master          str28   %28s
                vehicle         str321  %321s
                -------------------------------------------------------------------------------
                Sorted by:
                     Note: Dataset has changed since last saved.

                .
                . set linesize `line_size'

                .
                . exit

                end of do-file


                .
