  • User support for Stata increasing the variable label length for the next update.

    I thought I would contact you because I am looking for user support. I think it is time for Stata to rethink the variable label 80 character policy in light of the following factors:

    - Advancements in search platforms which use metadata stored in variable labels in their search functions. Many terms need to be stored in order to create visibility.
    - Advancements in data capture technology, which is increasing the volume of data we are able to collect and store.
    - Increase worldwide in complex data-generating lifecourse studies, with complex and precise metadata.
    - Increased use of multi-study search platforms, which require more clarity to distinguish studies (and particular waves and subsets of studies) from the variable labels.
    - Advancements in understandings of the use of metadata.
    - Stata's competitors’ decisions to increase their character lengths.

    I think these factors may outweigh functional concerns, such as long variable labels supplied by research data providers not displaying clearly on a Stata-produced graph. Researchers working with a small subset of a very complex data project can change their own labels, but publishing researchers easily make mistakes if the correct label is not attached to the variable provided by the data generator.

    I know about Stata's -char- and -notes- commands: they are great, but their metadata is lost when converting to other widely used packages, whereas variable labels are not.

    What do you think?

  • #2
    How would longer variable labels ever be shown? used? manipulated?

    If data includes long strings, fine, and you can already use string variables to hold and process them.


    • #3
      Nick, I'm sensitive to your points, but I have to agree with the OP. I routinely get SAS datasets where the originator puts the entire case record form's question (prompt) in the SAS variable label (SAS allows much longer variable labels than Stata does). When converting from SAS to Stata, Stat/Transfer puts the excess characters into a -note-, but it would be more convenient, when I export from Stata to worksheets (first row) for my colleagues to work up, if I didn't need to work around Stata's variable label length limit. I've encountered the same thing with SPSS datasets where the entire text (the prompt) of each questionnaire item is put into the variable label. I would guess that other Stata users who work with survey data experience similar annoyances daily.


      • #4
        Naturally I accept that people really want to import longer variable labels.

        So, suppose that StataCorp are willing to change the maximum length. How big does it need to be?

        And what else needs to be done?

        My life is fairly simple in this respect. Already I often find variable labels too long to show well on graph or table margins, but I just need to trim them down or write my own text.

        People with much, much longer labels that they want to import need to spell out what else they need, for example:

        1. Stata won't have more space than before in graphs and tables and listings. So, variable labels will just be truncated or unreadable. Bad news, and so what is the advantage?

        2. How is the information in variable labels to be easily visible and extracted? Stata does have string functions and macro manipulation commands. Anything else?


        • #5
          I believe that longer variable labels are probably not a good place to store meta-data and additional information; Stata's notes and characteristics are much better suited for such purposes. From a conceptual point of view, Stata's limit on variable label length is not flawed; the deficiency here lies in the apparent absence of notes and characteristics, as proper places to store meta-data, from SAS, SPSS and other packages.

          From a practical point of view, I would ask about the price of longer variable labels. I suspect you would, once again, have to change the dta-format. If this is the case, I opt against it, as I suspect that (many) more everyday problems already arise from frequent changes in dta-formats over the past couple of releases. I often receive (or save) Stata datasets that do not open in older versions of the software because others (and I) do not routinely use saveold for various reasons. If I were StataCorp, I would worry more about users of older versions of my software being able to collaborate smoothly than about compensating for the flaws of other packages so users can import datasets (with yet another third-party program) more conveniently.

          Best
          Daniel


          • #6
            I think as an end-user of data, one tends to recode the metadata to make sense to the project. Data producers who are visible on multiple platforms need to ensure that their data is completely identifiable even when it stands alone. To do this it is necessary to include not just the actual label (which can be long, esp. with questionnaires) but also the wave of collection marker, the dataset marker, the section marker, the derivative marker etc. So basically, we use it in different ways. For me, it means I provide the Stata copy with truncated metadata or tailor it specifically for the package. Joseph, you must have a newer version of Stat/Transfer than me! Or maybe it just does that for SAS.

            I am putting together a list of the competitors' varlabel lengths to look at how long is good enough. It was easy for varname (where Stata came out as the shortest @ 32 char, far shorter than competitors and miles away from what is needed for gene-code data), bit more work for varlabel... so far SAS has 256...

            Stata are not keen as in order to do it they have to change their dataset format, which they try not to do.


            • #7
              StataCorp (not Stata) will make the decisions here but allowing a longer variable label won't, so far as I can see, break any users' previous code or create problems with any existing datasets. The implications are all for how the variable labels will and can be used once inside Stata.

              Adding detail on why you want this is interesting and I am manifestly not negotiating here on the company's behalf. But it's easy to guess that they too seek answers to the kinds of questions I posed in #4. I don't see any answers as yet.


              • #8
                Originally posted by Miss Amy Dillon in #6

                I don't think the current limits are prohibitive/problematic, and I definitely wouldn't want them lengthened if the cost is something like what Daniel is describing above.

                I agree with what's been said above about the impracticality of longer variable labels in a lot of (but not all) types of output and how it's good practice to store longer labels (well, what you consider labels; to me this sounds like documentation) like full survey question text in characteristics and notes. As characteristics, these meta data can be easily piped into graphs and tables (you reference them as you would a macro, though getting them to fit in your output is another matter entirely) and you can have multiple attributes per variable (e.g., your tags/markers for sections, data source etc. can be stored as separate characteristics). Characteristics can be up to 67,784 characters in length and macros can be much longer (determined by your flavor of Stata). A more minor point: I wonder what would happen to the -describe- output if variable labels were lengthened and not truncated in the display (I hate to think of the wrapping and window sizing struggles that would ensue).
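A minimal sketch of the approach described above, assuming a hypothetical variable q17 and hypothetical characteristic names (question, wave, section). It only illustrates the -char- syntax and is not taken from any poster's actual data:

```stata
clear
set obs 5
* q17 and its contents are made up for illustration
generate q17 = mod(_n, 5) + 1

* Short label for graphs and tables
label variable q17 "Job satisfaction"

* Full question text and markers kept as separate characteristics
char q17[question] "Taking everything into account, how satisfied or dissatisfied are you with your present job overall, on a scale from 1 (completely dissatisfied) to 5 (completely satisfied)?"
char q17[wave] "3"
char q17[section] "Employment"

* Reference a characteristic as you would a macro
display "`q17[question]'"
display "Wave `q17[wave]', section `q17[section]'"

* List everything stored for the variable
char list q17[]
```

The short label still fits on a graph axis, while the long text travels with the variable in the .dta file and can be pulled into a title or note only where there is room for it.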

                I am having trouble understanding why 32 characters is not enough for variable naming. The number of possible variable names far outstrips the number of variables that Stata datasets (or most of its competitors) can hold. I don't have a sound footing in combinatorics, but if you multiply the number of possible characters (26 upper, 26 lower, underscore, 10 digits; except for the first character, which cannot be a digit) then you have on the order of 63^32 = 3.792e+57 combinations [ this is probably closer but likely still wrong: 2 x (27! * (37! * 31) ) = 9.2920461e+72 ] .

                I get that these are not necessarily as human-readable as here_is_my_very_long_and_descriptive_variable_name_in_SAS_2_final_redo_again might be but (1) that's what codebooks and systematic variable naming conventions are for (2) that seems like it would bloat the .dta overhead meta data and make searches or manipulation of meta data via -ds-, -rename-, etc. slower (and, again, necessitate new .dta structures).
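As a rough check on the count above: a name of length k has 53 choices for its first character (letters or underscore) and 63 for each of the remaining k-1, so summing over lengths 1 to 32 gives 53*(63^32 - 1)/62. The sketch below simply evaluates that expression. It ignores the handful of reserved words, so it is an upper bound rather than an exact figure:

```stata
* Count of syntactically valid names of length 1 to 32
* (first character: 53 choices; each later character: 63 choices)
display %9.3e 53*(63^32 - 1)/62
```

The result is a little below 63^32, roughly 3.2e+57, which does not change the argument: the name space is vastly larger than any dataset's variable limit.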
                Last edited by eric_a_booth; 16 May 2018, 14:35.
                Eric A. Booth | Senior Director of Research | Far Harbor | Austin TX


                • #9
                  Originally posted by daniel klein in #5
                  Although many SPSS users may not know about it, SPSS does have an ADD DOCUMENT command that can be used to add "a block of text of any length in the active dataset".
                  --
                  Bruce Weaver
                  Web: http://sites.google.com/a/lakeheadu.ca/bweaver/
                  Version: Stata/MP 18.0 (Windows)


                  • #10
                    Originally posted by Bruce Weaver in #9
                    Reminds me of the filewrite() and fwrite() functions, e.g.,

                    Code:
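                    clear
                    * Mata: open a new text file, write the string, then close the file
                    mata: fh = fopen("myfile.txt", "w")
                    mata: fwrite(fh, "really long string goes here")
                    mata: fclose(fh)
                    type myfile.txt

                    * Stata: filewrite() returns the number of bytes written
                    set obs 1
                    g x = filewrite("myfile2.txt", "more really long strings here")
                    type myfile2.txt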
                    (Though you'd need to write a program to get the syntax down to something similar to the SPSS approach)


                    In terms of reading from meta data like characteristics (as discussed above), I guess you could store this meta information in separate text files and extract strings (and images!) from a sequence of files with something like:

                    https://www.stata.com/stata-news/news31-4/spotlight/


                    • #11
                      One more thing that the SPSS ADD DOCUMENT command reminds me of is Nick Cox's command -filei- from SSC. You can use

                      Code:
                      filei + "text string here"  myfile.txt
                      to add text immediately to a document.

                      • #12
                        Thanks for the mention of filei. It's more a demonstration of concept than a generally good way to work with text files. At least, if anyone finds it useful, that's good but I wouldn't defend it more broadly.


                        • #13
                          I thought I would provide an example of a situation in which a longer variable label would help. I have a dataset that includes 82 nine-character dummy variables for a person's job type. This is way too many: we need to consolidate them into maybe 10 groups, though each group will contain a different number of the originals. Of course we need to keep track of which original jobs are being compressed into which new groups. We have just written code to do so, consolidating based on various criteria and labelling with the original job names, and bumped up against the 80 character limit, since if a group has more than 7 of the original jobs they won't all fit into the new label. So we need the labels to store complex information. If you have a workaround in the meantime it would be much appreciated. Thanks!


                          • #14
                            On #13: As before in this thread: notes, more generally characteristics, and string variables are other ways to store the information. The practical question for you is not whether you've bumped against the 80 character limit -- because you have -- it is how longer variable labels are compatible with civilised display of tables, graphs and other output.
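For the job-group case in #13, the notes and characteristics route might look something like the sketch below. The variable jobgrp1, the characteristic name members, and the job_* names are all hypothetical stand-ins, not the poster's actual variables:

```stata
clear
set obs 3
* jobgrp1 stands in for one consolidated group dummy
generate byte jobgrp1 = 0

* Keep the label short enough for tables and graphs
label variable jobgrp1 "Job group 1: skilled trades"

* Option 1: one note per original dummy folded into the group
notes jobgrp1: includes job_carp
notes jobgrp1: includes job_elec
notes jobgrp1: includes job_plmb
notes list jobgrp1

* Option 2: the whole member list in a single characteristic
char jobgrp1[members] "job_carp job_elec job_plmb job_weld job_mach job_roof job_masn job_pntr"
display "`jobgrp1[members]'"
```

Either way the full list of original jobs is saved with the dataset, and the 80-character label only has to carry a readable group name.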


                            • #15
                              Nick, Thank you for the super quick reply. I will experiment with characteristics and see if they solve the problem. From the earlier posts I had not been able to locate the relevant documentation but I tried harder after your reply - here it is for future forum readers: http://www.stata.com/manuals13/u12.p...haracteristics.
