  • #16
    wbuchanan, some of my colleagues and I are definitely interested in your JSON project. I will try it and let you know if I have any feedback worth mentioning.


    • #17
      Hua Peng (StataCorp) Awesome. Any input is definitely welcome. My longer-term goal is to fold the JSON serialization/deserialization processes into an interface with the D3.js library, Prefuse, or other visualization libraries, to make it easier to build data visualizations that are interactive and/or offer additional color options (e.g., alpha transparency).


      • #18
        wbuchanan

        I also don't see how Stata's storage format isn't designed for transfer since "help dta" seems to imply that the .dta file is an XML type format where the data are stored in binary format.


        I am not a computer scientist, so my familiarity with the intricacies of data storage is pretty minimal. For social scientists like me, reproducibility is important. Stata's format changes from version to version (though less radically than SAS's, I gather):


        http://www.stata.com/support/faqs/da...vious-version/

        XML and JSON are similarly evolving standards, I assume (not that there's anything wrong with that).

        Even taken at a snapshot in time, dta is a format designed for storing and loading data for use in a Stata environment. To this end, extra metadata is included and storage size is reduced in various ways, I'm sure. In R, tables of data may have attributes like "grouped by" or "sorted by", depending on the package; and string storage is reduced through the use of a "cache" of unique string values to which strings point. Also, Stata may use row-major order in storing data (does it?), while the destination environment uses column-major order. Whatever idiosyncrasies Stata has in storing data for its own use will not naturally translate elsewhere, so I would be surprised if dta is well suited to data transfer. Ditto for R's.


        • #19
          wbuchanan
          Regarding the XML: Stata saves the data as pure-text XML (not the binary .dta format) when you use the xmlsave command:
          Code:
          sysuse auto, clear
          xmlsave "C:\data\auto.xml", doctype(dta)
          Although clear, open, and compatible, the format has one big disadvantage: the standard file commands (use, append, merge) do not understand it, so users have every incentive to save in the binary dta format to avoid the need to convert later.
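
          To read the file back into Stata, the companion command is xmluse; a minimal sketch, using the path from the example above:
          Code:
          xmluse "C:\data\auto.xml", doctype(dta) clear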

          Best, Sergiy Radyakin


          • #20
            Hey Sergiy Radyakin I thought you'd be chiming in. I wonder if there'd be a way of developing an XML Schema file that could be used to generate Java objects to handle these types of operations? The program I've been working on recently stores things in a fairly similar way to your program (e.g., https://github.com/wbuchanan/StataJS...ster/test.json). The difference is mostly in the handling of value labels, which I am thinking about reorganizing anyway; in the Java program I wrote, all of the metadata is stored by variable index in the hopes that it would make it easier to iterate over the data once it is exported.

            Frank Erickson R doesn't actually have a "true" definition of what a "table" is, strictly speaking. Because of R's underlying architecture, data tables (e.g., tbl_df, data.table, data.frame, etc.) are objects with a two-dimensional representation that we would associate with a table of data. R has pretty horrible memory management (currently at least), but Stata, SAS, and SPSS are all thinking fairly critically about how to relieve some I/O pressure via optimization (I've not looked too hard, but the -compress- command in Stata is magical when it comes to reducing file size when the larger storage types aren't necessary; see the quick sketch at the end of this post).

            While I don't think any major analytic platform would make significant breaking changes to its major data storage objects/formats, a big concern should be the degree to which the platform developers care about backwards compatibility with their data formats; StataCorp does an admirable job of trying to maintain backwards compatibility. It is entirely possible to store the metadata in a purely text-based format, but then using the file would become a huge task for folks without stronger programming chops. It would be nice if there were some standard (e.g., from the Open Data Foundation) that allowed better interchange across platforms without losing data.

            And the string-values thing you were referencing is, I think, specific to factor variables (I could be wrong) and how they are handled as a distinctly different type (which was the reason behind the stringsAsFactors = TRUE defaults for most packages).
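
            As a quick illustration of -compress- (a standard Stata command; the dataset is just an example):
            Code:
            sysuse auto, clear
            compress    // demotes each variable to the smallest storage type that preserves its values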


            • #21
              wbuchanan note that xmlsave is a standard Stata command.

              xmlsave has an option to include a DTD in the XML file that it produces (add the dtd option). I assume you can reconstruct the JSON structure by investigating it with a tool like this one:
              https://github.com/ncbi/DtdAnalyzer/...onversion-XSLT
              though I didn't try it out.
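
              For illustration, a minimal sketch that writes the DTD into the file (reusing the auto example from above; the path is just a placeholder):
              Code:
              sysuse auto, clear
              xmlsave "C:\data\auto.xml", doctype(dta) dtd replace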

              You may also want to look at performance at some point. All these serialization/deserialization text operations can be very slow.

              Reading your response to Frank: if you think about preserving your data for ages to come, make sure you have a plain old tab-delimited ASCII data dump somewhere.
              A preservation format is not necessarily the same format that is most convenient for processing. The benefit of plain-text XML is that you can print it, after all, and retype it a hundred years later, without worrying about whether you can still read in that 16mm tape somewhere.
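
              A minimal sketch of such a dump (assuming Stata 13 or newer for export delimited; older versions can use outsheet):
              Code:
              sysuse auto, clear
              export delimited using "auto.txt", delimiter(tab) replace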

              As long as there are sufficiently many tools that allow conversion from the preservation (data archival) format to other formats, it should be OK, even if not super efficient. Zip archives are the de facto standard, even though there are formats that are more space-efficient, resource-efficient, and equipped with more features.

              Over the last few years Stata has overcome two major limitations of its previous data formats: long strings and Unicode. A few inconveniences remain that prevent accommodating some data from other formats, but those features are rare from my perspective.

              Best, Sergiy Radyakin


              • #22
                Sergiy Radyakin I've not tested it with a large dataset and know that it will have some issues as the dataset size increases (at least in cases where the user wants to convert the entire dataset to a JSON object/file). When I tried to parallelize operations over the entire dataset previously, it caused some really strange issues when printing to the console and also affected the structure of the data itself (e.g., it wasn't writing the data as an array of objects but as distinct - and in many cases malformed - objects). If I'm able to figure out how to get the DataSet object to insert the beginning/ending braces for an array, it would likely help give performance a boost and avoid potential issues with running out of memory later. I'll definitely check out the link you mentioned above as well.


                • #23
                  Originally posted by Sergiy Radyakin View Post
                  Zip archives are the de facto standard, even though there are formats that are more space-efficient, resource-efficient, and equipped with more features.
                  Apologies for getting off-topic, but does anyone know whether there is a zip-file standard that can archive files with multibyte (Unicode) file names? All of the zip software that I've run across can handle only single-byte (ANSI, Shift-JIS, etc.) file names. I've resorted to RAR and have been happy with it, but some correspondents only know zip files and are chary of self-extracting RAR archives as e-mail attachments.


                  • #24
                    Here is a very ambitious wish: I wish there were more built-in estimation functions in Mata. That is, Mata versions of Stata estimation commands, for example a qreg() function.

                    My recent experience with Stata/Mata was writing an estimation command. Most of the code is written in Mata, because it is very convenient to use a matrix language to write estimation code. But this can bring great inconvenience, especially when one wants to call a basic estimation command from Stata. Switching between the two environments requires extra time and effort. For example, I want to call qreg 300 times in a for loop and store the results in a matrix. Fulfilling this task in either environment can be troublesome (there may be a better way that I don't know about).
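
                    For what it's worth, a minimal sketch of one way to do this today, using Mata's stata() to run the command and st_matrix() to copy e(b) back into Mata (the auto variables and quantiles are just placeholders):
                    Code:
                    sysuse auto, clear

                    mata:
                    // run -qreg- at several quantiles and stack the coefficient rows in B
                    quantiles = (0.25, 0.50, 0.75)
                    B = J(0, 3, .)                  // each row will hold (weight, length, _cons)
                    for (i = 1; i <= cols(quantiles); i++) {
                        stata("quietly qreg price weight length, quantile(" + strofreal(quantiles[i]) + ")")
                        B = B \ st_matrix("e(b)")   // copy e(b) from Stata into Mata
                    }
                    B
                    end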

                    In the end, I wrote a Mata version of the MATLAB function for quantile regression. Although the code works, this solution has many disadvantages. First, the code is not efficient, because it was not optimized by professional programmers. Second, it took a lot of time to get the expected outcome.

                    Now Stata has many embedded language platforms, including Mata, C, R, Python, and so on. Among these, Mata should be the official, most popular, and most mature one. If this is true, then I think it is worth StataCorp's effort to write built-in estimation functions in Mata.

                    By the way, this may be a little bit rude, but I still want to share my own feelings. We can see that Stata has many official and unofficial embedded languages, which is not a good signal. It means that users think Stata's own ability is not strong enough to fulfill their goals, especially for developers. I think StataCorp should think carefully about how to design Stata in the coming days so that it is user-friendly to both applied economists and command developers.

                    My understanding could be wrong due to my limited knowledge; please correct me if I have things wrong. I also hope that Stata becomes more user-friendly and popular.


                    • #25
                      Ingrid Qiu Stata has only two embedded languages - Stata and Mata. There are C and Java APIs, which expose parts of the Stata infrastructure to other language interfaces. The Python plugin is a Pythonic interface around the C API contributed by user James Fiedler, and the R interface is another user-written program. So, officially, Stata has two embedded languages: Mata and a higher-level abstraction called Stata (many Stata functions are written in Mata).

                      Abstraction of lower-level programming tools occurs all the time and says nothing about whether something is more or less mature or popular than another, nor does it convey a "not good" signal. By that logic, C should be the only language anyone programs in (since programs may be wrappers around underlying C libraries and binaries). Lastly, I wouldn't say that users think Stata's abilities aren't high enough to fulfill their goals, but rather that Stata user-programmers are probably more likely to think about the best tool for the job. For example, while several people have written great commands for working with geospatial data, I would tend towards using a combination of PostGIS functions and QGIS, because those tools are highly specialized and optimized for those applications.

                      What is it that you find not to be user-friendly? Rather than the generic statement that you hope Stata could be more user-friendly, providing specific details would be more valuable and may prompt others to chime in with existing solutions where they exist.


                      • #26
                        Originally posted by Ingrid Qiu View Post
                        We can see that Stata has many official and unofficial embedded languages, which is not a good signal. It means that users think Stata's own ability is not strong enough to fulfill their goals, especially for developers. I think StataCorp should think carefully about how to design Stata in the coming days so that it is user-friendly to both applied economists and command developers.
                        I think Stata has always been very friendly to applied researchers and practitioners with little need for novel statistical methods.

                        A small subset has been active in developing packages (whom you call developers, but really they're almost all practitioners, unless they work at StataCorp). With Mata, that group can now expand. Its syntax is similar to that of other programming languages, it is integrated into the Stata language, and it is rather powerful. I see no downsides to giving users access to Mata and look forward to further improvements.

                        Both of these languages are official. As wbuchanan mentioned, Stata may have official or unofficial interfaces with other statistical software, but why would you be against that? It would be a red flag if no one was working on such functionality; it's pretty standard with stats software, in my experience.


                        • #27
                          It's StataCorp jargon that those people who work on extending Stata at the company are called "developers". (Other people at StataCorp work in marketing, etc., etc.) This matches common terminology in software development, naturally.

                          There are no rules forbidding any term you like, but people not working for StataCorp who contribute extra programs to the community (by publication in the Stata Journal, via SSC, GitHub, or their own websites, or just informally by posting on Statalist) have usually been called "user-programmers" in the Stata community.

                          My own suggestion is that using the term developers to include user-programmers could only lead to ambiguity at best and confusion at worst.


                          • #28
                            Originally posted by Joseph Coveney View Post
                            Apologies for getting off-topic, but does anyone know whether there is a zip-file standard that can archive files with multibyte (Unicode) file names? All of the zip software that I've run across can handle only single-byte (ANSI, Shift-JIS, etc.) file names. I've resorted to RAR and have been happy with it, but some correspondents only know zip files and are chary of self-extracting RAR archives as e-mail attachments.
                            Unicode file name storage has been part of the ZIP file format specification since 2006, and UTF-8 file names have been supported since 2007 (source, see section 2.2). Have you tried 7-Zip?


                            • #29
                              Thank you, Friedrich. I did not know that. I will look into 7-Zip.


                              • #30
                                wbuchanan Frank Erickson
                                I agree that Stata is user-friendly. Stata commands are user-friendly for data management and estimation. That is why I said I hope Stata becomes 'more user-friendly', not that it should 'be user-friendly'.

                                The least user-friendly part is data transfer between Stata and Mata, especially when one wants to use a Stata command from within a Mata environment. Please see paragraph 2 of #24; that is where I think Stata should make some improvements.
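
                                For readers following along, a minimal sketch of the two standard mechanisms for moving data between Stata and Mata (the auto variables are just placeholders):
                                Code:
                                sysuse auto, clear

                                mata:
                                // st_data() copies values into a Mata matrix; edits to X leave the dataset untouched
                                X = st_data(., ("price", "mpg"))
                                // st_view() creates a view; edits to V change the dataset itself
                                st_view(V = ., ., ("price", "mpg"))
                                mean(X)    // column means of price and mpg
                                end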

                                wbuchanan Is it really natural for statistical software to have many embedded languages? I do feel puzzled about this phenomenon. Take Stata as an example: it has a Python plugin and a C plugin. But if a user-programmer uses the Python plugin to write an estimation command, it would be hard for others to apply that command, especially for those who don't have any experience with Python. They have to install Python first and load the Python plugin into Stata by following a complicated series of steps. Even for user-programmers, it takes a long time to figure out the strengths and weaknesses of the different plugins, because there is too little material explaining the differences between them. This can create ambiguity. Those are my feelings, but I would like to know what you all think about this.
