Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Feature Request : export delimited , firstrow(varlabels)

    Not sure if anyone else has figured out any elegant solutions to export data to a delimited text file using the variable labels as column headers, but this would definitely be a good feature to have. Currently running into issues because some files exceed the maximum number of records that can be included in an MS Excel file (creating the need for a delimited flat file in its place instead). The only potential work around I can think of is to cast all the data as string, insert a new record containing the variable labels, add a new variable to define the sort order to place this newly inserted row in the first record, and then exporting the file as delimited text using the novarnames option.

    With the exception of functionality that is specific to MS Excel (e.g., worksheet names, etc...), it would be nice if the -export excel- and -export delimited- commands used the same option set.

  • #2
    I feel ambivalent about this. I can see where it can be useful in the situation you describe. But it is potentially dangerous here. The content of variable labels is almost entirely unconstrained. They can (and at least in my work often do) contain commas or other characters that might be confused with the delimiter. The resulting exported file might then be difficult or impossible to use subsequently because the first row would mislead the importing application about the number of variables.

    I think that if we are forced to do cumbersome things to get those variable labels in place as the column heads, perhaps that will also force us to think about whether they are suitable for that purpose. I worry that if it is too easy to do this, it will be done carelessly and, from time to time, wreak havoc.

    Added: I suppose if I were writing a program that wrapped -export delimited- and had a -firstrow(varlabels)- option, I would build in a check to be sure that the labels didn't contain the delimiter character. That would be safe.

    Comment


    • #3
      Clyde Schechter
      While I completely understand your reservations, it doesn't prevent users from embedding commas in string variables which could lead to the same problematic behavior. For my specific use case, I need the variable labels to be present so they can help others working on development tasks for some public facing visualizations Kentucky School Report Cards that my shop put together to make things easier for end users that need to include this information in subsequent administrative reporting tasks. In this case, all of the UI widgets are being labeled with the variable labels (not really much different from creating graphs within Stata in that sense) using language that is commonly used already to make it a bit more user friendly.

      The existing infrastructure that end users were previously provided access to from the state Kentucky Department of Education School Report Cards has a much clunkier user interface, doesn't provide any means for viewing the data longitudinally, and doesn't provide much in the way of comparative capabilities (e.g., to provide some normative context about the school relative to the district relative to the state). If we had more time and staff that could do other types of web development I could potentially have exported everything to JSON and used D3, but there would have been some rather nasty performance drawbacks; even after optimizing storage and normalizing the files the total amount of data is about 667MB, which is still a huge improvement over the 989.4MB of files from the state.

      I'm not sure if most/all parsers will do this, but it seems that quoting strings that contain the delimiters might help to avoid them being interpreted incorrectly in other systems downstream.

      Comment


      • #4
        it doesn't prevent users from embedding commas in string variables which could lead to the same problematic behavior.
        Very true, and I have encountered this frequently.

        I'm not sure if most/all parsers will do this, but it seems that quoting strings that contain the delimiters might help to avoid them being interpreted incorrectly in other systems downstream.
        Some parsers will correctly handle quoted strings with embedded commas, and some won't--it really depends on the application that reads the CSV file.

        The CSV format is, for these reasons, problematic. I suppose if there were standards for it that dealt with these issues and all application developers adhered to those it would work better. (For all I know there may be standards, but they just aren't followed.) The popularity of the CSV format mystifies me, given how difficult it can be to work with. And it can be improved on with fairly simple tweaks: the use of a pipe (|) delimiter works better because it is truly rare for the pipe character to be part of the data itself. A tab-delimited layout also works better for the same reason. I suppose these are deprecated because they don't display nicely on screens when just typed out (and fewer applications accept a pipe delimeter). But for machine-reading, they are orders of magnitude better than CSV.

        Well, just getting my gripe off my chest. No reflection on what you're doing.

        Comment


        • #5
          The problem of embedded delimited characters is already addressed by the -export delimited- command. Strings containing the delimiter are quoted. This is the correct solution, and perfectly standard. Strings with embedded quote characters are still a problem, but not likely to be desired as part of a variable name.

          That said, why isn't the OP satisfied to use -rename- and allow the Stata variable names to be put to the first line of the csv file? Is there a desire to have names with embedded spaces? That isn't do my taste and will just annoy subsequent users of the data.

          Comment


          • #6
            The problem of embedded delimited characters is already addressed by the -export delimited- command. Strings containing the delimiter are quoted. This is the correct solution, and perfectly standard.
            Yes and no. I wish I had a nickel for every csv data set I've seen that doesn't bind the strings in quotes when they contain the delimiter. And I can also attest that not all applications that claim to read csv files can read them correctly even when the strings that contain delimiters are bound in quotes. Maybe there is some official standard that governs this, but it is far from universally honored in implementation.

            Comment


            • #7
              [email protected]
              The link I posted previously shows the specific use case that I have. The variable labels are used to construct elements in the UI and are much more user friendly than the variable names that are only visible to folks on my team. If some of the files didn’t contain as many observations as they do, I’d Ben able to handle this by exporting to excel but several of the files contain too many observations to be written to an excel file.

              Comment

              Working...
              X