
  • Wishlist for Stata 15

    On Christophe's advice, I created this new post for Statalisters' wishes for Stata 15. Feel free to drop your wishes here. I hope our discussions can help make Stata/Mata more user-friendly.

    Ingrid

  • #2
    I wish Mata were more compatible with MATLAB, especially for some built-in functions: for example, support for higher-dimensional arrays and an ndgrid() function.
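For readers unfamiliar with ndgrid(), here is a minimal Python sketch of the idea: given several axes, produce the coordinates of every grid point. The flat-list output and the ordering (last axis varies fastest) are assumptions of this sketch; MATLAB's ndgrid returns full n-dimensional arrays instead.

```python
from itertools import product

def ndgrid(*axes):
    """Return one flat coordinate list per axis, covering every combination.

    Simplification (assumption of this sketch): results are flat lists and
    the last axis varies fastest, unlike MATLAB's n-dimensional output.
    """
    points = list(product(*axes))
    return [[point[k] for point in points] for k in range(len(axes))]

x, y = ndgrid([1, 2, 3], [10, 20])
print(x)  # [1, 1, 2, 2, 3, 3]
print(y)  # [10, 20, 10, 20, 10, 20]
```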



    • #3
      As a user of both Stata and the R IDE RStudio, I would like to see more standardization of convenience features. R is expanding rapidly, and software developers, including StataCorp, can help users switch back and forth between the two packages.
      • Ctrl-L to clear result window
      • Ctrl-C to comment out lines (insert *)
      • Shift or Ctrl+Shift to select/deselect multiple variables in data browser/editor
      • Arrow up/down instead of Page-up/down to cycle through command history (under Win)
      • Standardize interface schemes (esp. editor) according to common editor schemes (e.g. Merbivore, Mono Industrial, see: https://docs.c9.io/docs/syntax-highlighting-themes)
      • A couple more official-Stata graph schemes, especially one that makes full use of ColorBrewer palettes
      • Import/Export R data *.Rdata (currently, the options are using a third party middleware like Stat/Transfer, or export to a middle common format with loss of metadata. For R, there are no plans to update the excellent foreign package which can only read Stata 12)



      • #4
        Thomas Speidel, have you tried the R Haven package? https://github.com/hadley/haven, which claims to work with Stata 13 and 14 files. Disclaimer: I have not tried it myself.



        • #5
          @Hua Peng Yes. I only tested write_dta and had a number of problems, especially around storage types. It also produced strange behaviour in Stata's data browser. The foreign package remains the most flexible (even better than Stat-Transfer when going from dta to RData), but it does require Stata to saveold [...], version(12). For instance, I have a habit of using underscores when naming my variables in Stata; in R, I prefer to use periods (underscores can cause problems, especially with knitr/Sweave). The foreign package can convert those on the fly for me.



          • #6
            Thanks for sharing the experience.



            • #7
              It would be great to have a variable format for percents similar to what is available for commas.
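To make the request concrete, here is a minimal Python sketch of what a percent display format might do to stored proportions. The function name `format_percent` and its `decimals` argument are hypothetical and do not correspond to any existing Stata format.

```python
def format_percent(value, decimals=1):
    """Render a stored proportion (0.1234) as a percent string ('12.3%').

    None stands in for a missing value and is shown as '.', mirroring how
    Stata displays numeric missing values.
    """
    if value is None:
        return "."
    # The comma flag adds thousands separators, like a comma format would.
    return f"{value * 100:,.{decimals}f}%"

print(format_percent(0.1234))
print(format_percent(12.5, decimals=0))
```

The stored value stays a proportion; only the display changes, which is the same division of labor the existing comma formats use.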

              I would find it helpful to have a straightforward way to add information to graphs that isn't necessarily what is being shown in the graph itself (see this link).

              Also, Stata's graphics are great in general, but having graphic schemes that are cleaner and a bit more modern-looking would be nice.

              Finally, taking output from Stata and putting it into Word or other programs requires quite a bit of formatting. Making it easier to drop results into other programs (without having to do a ton of formatting) would be very helpful.



              • #8
                I use the Haven package for R almost daily and have not had any issues reading or writing data. I'd recommend that Thomas Speidel not use periods in variable names, since the period is a semi-reserved character that could cause issues with some of the object-oriented paradigms implemented in R. People typically use underscores or periods to delimit distinct words in variable names, so lowerCamelCase might work for you as a way to keep the names easy to read.

                Hua Peng (StataCorp), I've also put in a ticket with the author of the Haven package about using the headers from the Stata plugins page to make the data types more consistent with those already defined by Stata, but I don't remember what reason he had for not doing so.

                Also, Thomas Speidel and Hua Peng (StataCorp), although the foreign package is "flexible", there are long-standing bugs in the way it manages and converts data. Fairly recently, I won a bet with a colleague because they were using a wrapper for the foreign package to read SPSS data, which ended up writing a bunch of binary zeros into the data set due to encoding issues with the source file. I would assume the package has similar issues if the encoding of the Stata file differs across the platforms on which the function is called. I can't agree with the statement that it is more flexible or better than StatTransfer, since StatTransfer typically performs fast and optimizes type casting to minimize the memory footprint of the dataset (which is even more important in a language like R, which does a horrible job with memory management).

                Erika Kociolek, have you tried the - brewscheme - package available on SSC? I'm still making some final adjustments before submitting an update, but the program is designed to address exactly that problem (i.e., to make it easier to build schemes that suit personal tastes and are based on research in data visualization). There are also several user-written programs (e.g., rtfutil, estout, tabout, outreg, outreg2, sutex, listtex, etc.), all available on SSC, with varying capabilities for pushing output to a Word-friendly format (or to Word directly). If you want standardized formatting, the best suggestion would be to write a wrapper around one of these programs with your preferred formatting arguments as defaults, but this only saves a bit of typing.



                • #9
                  RAM usually isn't a constraint for me, so I'd like to keep multiple data sets on hand in a single instance of Stata, especially when exploring interactively. Maybe secondary data sets could be kept in the "background" without breaking too much of Stata..? Here's some syntax I'm imagining:

                  Code:
                      use event.dta, clear
                      use person.dta, background("person") 
                      ** ^ leaves event.dta as the "current" data, but reads in person.dta in the "background", storing it with label "person"
                  
                      ds
                      ** event_id date location
                      ds, using("person")
                      ** person_id age gender location
                  
                      tab gender, using("person")
                      br, using("person")
                  
                      merge m:m location using_bk("person"), gen_bk("person_event")
                      ** ^ creates "person_event" table in memory in the "background", leaving currently loaded data alone
                      use_bk "person", switch("event")
                      ** ^ takes table "person" from the background as the current data, puts current data into background as "event"
                  With this syntax, all data sets in memory would be treated symmetrically, except that the "current" data (not in the "background") would not require any "using" option to be passed to commands. (And presumably not every command would be extended to accept the new using options.) Anyway, the ability to juggle multiple tables is what I miss most when using Stata.
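The current/background semantics imagined above can be modelled in a few lines of Python. This is only a sketch of the proposal's bookkeeping: the `Workspace` class and its method names are hypothetical, and plain dictionaries stand in for data sets.

```python
class Workspace:
    """Hypothetical model: one current data set plus named background data sets."""

    def __init__(self):
        self.current = None
        self.background = {}

    def use(self, data, background=None):
        """Load data as current, or store it in the background under a label."""
        if background is None:
            self.current = data
        else:
            self.background[background] = data

    def use_bk(self, name, switch):
        """Make background table `name` current; park the old current data as `switch`."""
        incoming = self.background.pop(name)
        self.background[switch] = self.current
        self.current = incoming

ws = Workspace()
ws.use({"vars": ["event_id", "date", "location"]})
ws.use({"vars": ["person_id", "age", "gender", "location"]}, background="person")
ws.use_bk("person", switch="event")
print(ws.current["vars"])      # the person table is now current
print(sorted(ws.background))   # the event table waits in the background
```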

                  This thread is posted in the Mata forum, so: It would also be nice to get the set operation functions implemented (union, intersection, Cartesian product). I've made some of my own, though, so it's not a big deal. And, as Ingrid mentioned, higher dimensional arrays (I mean analogous to matrices, not to associative arrays) would be a big plus.
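As a reference for what such functions might return, here is a minimal Python sketch of the three set operations on plain lists. The function names and the choice to preserve first-seen order are assumptions of this sketch, not a description of any Mata implementation.

```python
from itertools import product

def union(a, b):
    """Distinct elements of a followed by those of b, in first-seen order."""
    seen, out = set(), []
    for item in list(a) + list(b):
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out

def intersection(a, b):
    """Distinct elements of a that also occur in b, in a's order."""
    in_b, seen, out = set(b), set(), []
    for item in a:
        if item in in_b and item not in seen:
            seen.add(item)
            out.append(item)
    return out

def cartesian(a, b):
    """Every (a_i, b_j) pair as a two-element row, with b varying fastest."""
    return [list(pair) for pair in product(a, b)]

print(union([1, 2, 3], [3, 4]))
print(intersection([1, 2, 3, 3], [3, 2]))
print(cartesian([1, 2], ["a", "b"]))
```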



                  • #10
                    Thanks @wbuchanan for the detailed overview. After reading your comments I gave Haven another try, but I'm still running into some issues. A few weeks ago I commented on these bugs on the author's site or GitHub, and I recall someone else was having similar issues. I also tested the R package readstata13, which uses syntax similar to foreign's. I would be curious to hear your thoughts.



                    • #11
                      Seems both are making some fundamental changes to the underlying data structures. Since there's only one way to read in a Stata 14 file I used:

                      Code:
                      # R object containing reference to the auto data set for comparisons
                      autoin <- "/Applications/Stata/ado/base/a/auto.dta"
                      
                      # Read with Haven
                      autoHaven <- haven::read_dta(autoin)
                      
                      # Write out separate versions of the files to further test
                      foreign::write.dta(autoHaven, "/Applications/Stata/ado/base/a/autoforeign.dta")
                      haven::write_dta(autoHaven, "/Applications/Stata/ado/base/a/autohaven.dta")
                      This reads the auto data set into R and writes it back to disk using the foreign and haven packages. I also took the same data set, transferred it to an R workspace image with StatTransfer, and then converted the workspace image back to a Stata dataset. Then I used - hexdump - to look for any differences in the data at a fairly low level.

                      Code:
                      foreach i in auto.dta autoStatTransfer.dta autoforeign.dta autohaven.dta {
                          di `"File `i' Hexadecimal Dump: "'
                          hexdump `i', analyze
                      } 
                      File auto.dta Hexadecimal Dump: 
                         
                          Line-end characters                        Line length (tab=1)
                            \r\n         (Windows)              0      minimum                        2
                            \r by itself (Mac)                 13      maximum                    3,180
                            \n by itself (Unix)                19
                          Space/separator characters                 Number of lines                 32
                            [blank]                           132      EOL at EOF?                  yes
                            [tab]                               8
                            [comma] (,)                         5    Length of first 5 lines
                          Control characters                           Line 1                     3,180
                            binary 0                        2,816      Line 2                        42
                            CTL excl. \r, \n, \t              531      Line 3                       168
                            DEL                                17      Line 4                       129
                            Extended (128-159,255)            133      Line 5                         3
                          ASCII printable
                            A-Z                               306
                            a-z                             1,464    File format                 BINARY
                            0-9                               143
                            Special (!@#$ etc.)               565
                            Extended (160-254)                304
                                                  ---------------
                          Total                             6,443
                         
                          Observed were:
                             \0 ^A ^B ^C ^D ^E ^F ^G ^H \t \n ^K ^L \r ^N ^O ^P ^Q ^R ^S ^T ^U ^V ^W
                             ^X ^Y ^Z Esc 28 29 30 31 blank ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5
                             6 7 8 9 : ; < = > ? @ A B C D E F G H I J K L M N O P R S T U V W X Y Z
                             [ \ ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z { } DEL
                             128 E^B E^C E^D E^E E^F E^H E^J E^L E^M E^N E^O E^P E^R E^S E^T E^U E^V
                             E^W E^X E^Y E^Z 155 156 157 159 160 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
                             ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
                             ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? 255
                        File autoStatTransfer.dta Hexadecimal Dump: 
                         
                          Line-end characters                        Line length (tab=1)
                            \r\n         (Windows)              0      minimum                        1
                            \r by itself (Mac)                 20      maximum                    3,090
                            \n by itself (Unix)                29
                          Space/separator characters                 Number of lines                 50
                            [blank]                           120      EOL at EOF?                   no
                            [tab]                              15
                            [comma] (,)                         5    Length of first 5 lines
                          Control characters                           Line 1                        71
                            binary 0                        2,996      Line 2                     3,090
                            CTL excl. \r, \n, \t              508      Line 3                        43
                            DEL                                 4      Line 4                       141
                            Extended (128-159,255)            192      Line 5                        31
                          ASCII printable
                            A-Z                               257
                            a-z                             1,150    File format                 BINARY
                            0-9                               308
                            Special (!@#$ etc.)               539
                            Extended (160-254)                283
                                                  ---------------
                          Total                             6,406
                         
                          Observed were:
                             \0 ^A ^B ^C ^D ^E ^F ^G ^H \t \n ^K ^L \r ^N ^O ^P ^Q ^R ^S ^T ^U ^V ^W
                             ^X ^Y ^Z Esc 28 29 30 31 blank ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5
                             6 7 8 9 : < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
                             [ \ ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z } DEL 128
                             E^B E^C E^D E^E E^F E^H E^J E^L E^M E^N E^O E^P E^R E^S E^T E^U E^W E^X
                             E^Y E^Z 155 156 157 159 160 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
                             ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
                             ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? 255
                        File autoforeign.dta Hexadecimal Dump: 
                         
                          Line-end characters                        Line length (tab=1)
                            \r\n         (Windows)              0      minimum                        2
                            \r by itself (Mac)                 19      maximum                    2,172
                            \n by itself (Unix)                29
                          Space/separator characters                 Number of lines                 49
                            [blank]                           103      EOL at EOF?                   no
                            [tab]                              11
                            [comma] (,)                         5    Length of first 5 lines
                          Control characters                           Line 1                     2,172
                            binary 0                        4,759      Line 2                        68
                            CTL excl. \r, \n, \t              519      Line 3                       225
                            DEL                                 5      Line 4                        44
                            Extended (128-159,255)             84      Line 5                       207
                          ASCII printable
                            A-Z                               196
                            a-z                               807    File format                 BINARY
                            0-9                                70
                            Special (!@#$ etc.)               344
                            Extended (160-254)                234
                                                  ---------------
                          Total                             7,166
                         
                          Observed were:
                             \0 ^A ^B ^C ^D ^E ^F ^G ^H \t \n ^K ^L \r ^N ^O ^P ^Q ^R ^S ^T ^U ^V ^W
                             ^X ^Y ^Z Esc 28 29 30 31 blank ! " # $ % & ' ( ) * + , - . 0 1 2 3 4 5 6
                             7 8 9 < = > ? @ A B C D E F G H I J L M N O P Q R S T U V W X Y Z [ \ ^
                             _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z } DEL 128 E^B
                             E^C E^D E^E E^F E^H E^J E^L E^M E^N E^O E^P E^R E^S E^T E^U E^W E^X E^Y
                             E^Z 155 156 157 159 160 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
                             ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
                             ? ? ? ? ? ? ? 255
                        File autohaven.dta Hexadecimal Dump: 
                         
                          Line-end characters                        Line length (tab=1)
                            \r\n         (Windows)              0      minimum                        2
                            \r by itself (Mac)                 19      maximum                    2,172
                            \n by itself (Unix)                29
                          Space/separator characters                 Number of lines                 49
                            [blank]                           108      EOL at EOF?                   no
                            [tab]                              12
                            [comma] (,)                         5    Length of first 5 lines
                          Control characters                           Line 1                     2,172
                            binary 0                        4,782      Line 2                        68
                            CTL excl. \r, \n, \t              523      Line 3                       225
                            DEL                                 5      Line 4                        44
                            Extended (128-159,255)             80      Line 5                       207
                          ASCII printable
                            A-Z                               213
                            a-z                               841    File format                 BINARY
                            0-9                                58
                            Special (!@#$ etc.)               343
                            Extended (160-254)                248
                                                  ---------------
                          Total                             7,247
                         
                          Observed were:
                             \0 ^A ^B ^C ^D ^E ^F ^G ^H \t \n ^K ^L \r ^N ^O ^P ^Q ^R ^S ^T ^U ^V ^W
                             ^X ^Y ^Z Esc 28 29 30 31 blank ! " # $ % & ' ( ) * + , - . 0 1 2 3 4 5 6
                             7 8 9 : < = > ? @ A B C D E F G H I J L M N O P Q R S T U V W X Y Z [ \
                             ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z } DEL 128 E^B
                             E^C E^D E^E E^F E^H E^J E^L E^M E^N E^O E^P E^R E^S E^T E^U E^W E^X E^Y
                             E^Z 155 156 157 159 160 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
                             ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
                             ? ? ? ? ? ? ? ? ? 255
                      So there is something inherently different about the structure of the data in R that results in fairly substantial differences. Although the files written by foreign and haven are fairly similar to each other, it's clear that both contain a ton of binary zeros that did not previously exist, and even the StatTransfer translation is not identical to the original. No solution is perfect, but the StatTransfer output does appear to be more similar to the original Stata dataset than the results of reading and writing with the R packages.
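The same kind of low-level comparison can be approximated outside Stata. The following is a minimal Python sketch that tallies bytes into a few coarse categories; the categories are an assumption of this sketch and are deliberately simpler than the ones reported by hexdump, analyze.

```python
def byte_categories(data: bytes) -> dict:
    """Tally bytes into coarse categories (coarser than Stata's hexdump output)."""
    counts = {"binary 0": 0, "control": 0, "printable ASCII": 0, "DEL": 0, "extended": 0}
    for b in data:
        if b == 0:
            counts["binary 0"] += 1
        elif b < 32:                 # includes \t, \n and \r
            counts["control"] += 1
        elif b < 127:                # blank through ~
            counts["printable ASCII"] += 1
        elif b == 127:
            counts["DEL"] += 1
        else:                        # 128-255
            counts["extended"] += 1
    return counts

def summarize_file(path):
    """Read a file in binary mode and return its byte-category counts."""
    with open(path, "rb") as handle:
        return byte_categories(handle.read())

sample = byte_categories(b"\x00\x00abc\n\x7f\xff")
print(sample)
```

Calling `summarize_file` on each of the four .dta files and comparing the "binary 0" counts would give a rough cross-check of the pattern described above.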



                      • #12
                        Neither Stata's nor R's data-storage format is designed for data transfer. Simply writing to and reading from CSV files (or similar) would be safer and more robust than trying to catch "bugs" in the transfer packages. You can also run an R script from Stata with the shell command, which is pretty handy. Anyway, so much for a discussion of a wishlist...



                        • #13
                          I find this a very interesting discussion. The problem with a "universal" format like CSV is the loss of metadata: in the case of Stata, data types, formats, value labels, characteristics, and sort order are all lost. Also, with the newer dataset format, a strL can hold binary data, which is simply not suitable for text files.
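The metadata loss is easy to see in a small round trip. This Python sketch writes two illustrative rows to CSV in memory and reads them back; the rows and the value-label mapping are made up for the example.

```python
import csv
import io

# Illustrative rows and a value-label mapping (made up for this sketch).
rows = [{"foreign": 0, "price": 4099.0}, {"foreign": 1, "price": 4749.0}]
value_labels = {"foreign": {0: "Domestic", 1: "Foreign"}}

# Write the rows to an in-memory CSV; the value labels have nowhere to go.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["foreign", "price"])
writer.writeheader()
writer.writerows(rows)

# Read the CSV back: every field now arrives as a plain string.
buffer.seek(0)
round_tripped = list(csv.DictReader(buffer))
print(round_tripped[0])
```

After the round trip the numeric types are gone and nothing in the file records that 0 meant "Domestic", so the reader has to restore both from some other source.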



                          • #14
                            I've not tested it with strLs, but I've been working on a JSON serializer/deserializer that provides options to write a JSON object with all of the relevant metadata to disk or to print it to the Stata console (in each case, as much of the data as possible is also returned in a local macro). The differences with StatTransfer seem to make a decent amount of sense, since it tries to optimize how the data are stored. I also don't see how Stata's storage format isn't designed for transfer since
                            Code:
                            help dta
                            seems to imply that the .dta file is an XML type format where the data are stored in binary format. Maybe future versions of Stata will leverage Java a bit more and .dta files will essentially be a binary POJO? Or, the Java API could allow Stata to make use of POJOs that would make data access/objects a bit more similar to R's OOP-based model?
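To illustrate how a JSON document can carry metadata alongside the data, here is a minimal Python sketch. The layout (the "variables", "value_labels" and "data" keys) and the sample values are hypothetical and are not the structure used by the serializer mentioned above.

```python
import json

# Hypothetical layout: variable metadata, value labels, then row-wise data.
dataset = {
    "variables": [
        {"name": "foreign", "type": "byte", "format": "%8.0g",
         "label": "Car type", "value_label": "origin"},
        {"name": "price", "type": "int", "format": "%8.0gc", "label": "Price"},
    ],
    # JSON object keys are always strings, so labelled values are keyed as text.
    "value_labels": {"origin": {"0": "Domestic", "1": "Foreign"}},
    "data": [[0, 4099], [1, 4749]],
}

text = json.dumps(dataset, indent=2)
restored = json.loads(text)
print(restored["value_labels"]["origin"]["1"])
```

Unlike the CSV route, the types, display formats and value labels survive the round trip because they are stored explicitly next to the data.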



                            • #15
                              Hua Peng (StataCorp) Just finished pushing some changes to the repository. Any thoughts you and/or any of the other developers at StataCorp had regarding the serialization/deserialization would definitely be welcome. Still haven't tested anything with a strL yet, but it seems to work reasonably well in other cases. The biggest change I'm thinking about at the moment is how I am storing value label objects to make it a bit easier to identify the connection between value labels and variables, but for the moment it still does the trick and wouldn't require too much effort to manage using the value labels afterwards:

                              https://github.com/wbuchanan/StataJSON

                              There are also examples of calling the ado program that wraps the calls to javacall and what the output looks like in the README.
