  • Phil Clayton
    replied
    Originally posted by László View Post
Oh, and please allow merging/joining on variables having different names in the two datasets. It is a bit dangerous, but caveat emptor. Again, it is messy (equally dangerous?) to temporarily rename variables in either the master or the using dataset, a rename one can forget to undo.
I agree this would be nice, but in the meantime you can do this with the user-written mmerge package (from SSC) via its umatch() option.
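A minimal sketch, with hypothetical file and variable names:

Code:
* match -id- in the master against -person_id- in the using dataset
mmerge id using persons.dta, umatch(person_id)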

  • László Sándor
    replied
And here's one more: I already wrote that (e.g.) MonetDB's column-based storage and smarter indexing allow orders-of-magnitude faster selection of the data, and some calculations too. Even if Stata does not tap into this (free) resource the way MonetDB.R does for R, it could at least allow somewhat smarter use via -odbc-: please let us -odbc merge-. Again, it is very inefficient to -preserve-, then -odbc load-, -save-, and then -merge-.
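For concreteness, the current dance looks something like this (DSN, query, and key variable hypothetical):

Code:
preserve
odbc load, exec("SELECT id, wage FROM payroll") dsn("mydsn") clear
tempfile payroll
save `payroll'
restore
merge 1:1 id using `payroll'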

  • László Sándor
    replied
Also, here is a thought, something different: as I know many people struggling with Stata on big data, and memory is *still* not unlimited, I could imagine an option for -do-, or at least a tool in the do-file editor, for the cases when the code ends with -keep- before a save (or a new -keep- option for -save-). For these cases, Stata would offer to quickly analyze which variables would be loaded and used (-describe using- is very fast, one of the very few things Stata's limited indexing can do without holding the entire dataset in memory), then generated and used, and finally kept. Even for varlists this is doable quickly. Everything not kept in the end could be dropped after its last use, or, if never used, not loaded (or merged, etc.). This might be a bit lazy as garbage collection goes, but it would save tremendous resources for teams working on big datasets who just cannot (or do not) keep track of specifying variables for each use and merge, or meticulously -drop- all the time to free memory for the ensuing operations, especially since those -drop-s must be re-edited every time the code changes.
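A minimal sketch of the manual discipline this would automate, with hypothetical file and variable names:

Code:
* inspect what is on disk without loading anything (fast)
describe using big.dta
* load only the variables actually needed downstream
use id year wage educ using big.dta, clear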

Oh, while I'm at it, it is crazy that -save- cannot just save a subset of observations, even in version 14, 30 years after the first release. The -preserve-/-keep-/-save-/-restore- dance is needlessly costly with big data. Preparing clean extracts can be a nightmare.
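The costly idiom in question, assuming a hypothetical year variable:

Code:
* copying the full dataset in memory just to save a subset
preserve
keep if year == 2012
save extract2012, replace
restore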

Oh, and please allow merging/joining on variables having different names in the two datasets. It is a bit dangerous, but caveat emptor. Again, it is messy (equally dangerous?) to temporarily rename variables in either the master or the using dataset, a rename one can forget to undo.

  • Alan Acock
    replied
Richard Williams wrote: "* Also, I understand that some other programs (e.g. MPLUS) let you specify auxiliary variables that help improve the handling of missing data."

To add auxiliary variables, all you need to do is get them into the covariance matrix you are analyzing. One way to do this is to add a line to the -sem- command, as shown below:

Code:
sem (bmi <- age numberchildren incomeln educ quickfood) ///
    (<- gender minority alienation), ///
    method(mlmv) standardized
estat eqgof

The (<- gender minority alienation) line adds the three auxiliary variables.

  • Michael Anbar
    replied
    I would like to see consistent application of date literals. For example, to import data between certain dates from a Haver database (an external data source), the "fin" option requires date literals in the desired frequency format, like this:

    Code:
    import haver gdph@usna, fin(2000q4, 2010q4) tvar(date)
    I can't pass in date literals using "tq", like this:

    Code:
    import haver gdph@usna, fin(tq(2000q4), tq(2010q4)) tvar(date)
which would greatly simplify do-files that use local macros with date literals as parameters.

    To work with a subset of the data in an "if" expression, I can use either of these expressions:

    Code:
    list if tin(2002q4, 2005q4)
    list if date >= tq(2002q4) & date <= tq(2005q4)
    but not this:

    Code:
    local start_date = yq(2002, 4)
    local end_date   = yq(2005, 4)
    list if tin(`start_date', `end_date')
    or this:

    Code:
    list if date >= 2002q4 & date <= 2005q4
I don't understand the discrepancy here, especially the limitations on "tin". -tin()- already reads the name of the time variable from -tsset-, so ideally it would also read the frequency and interpret integers passed to it in the context of that frequency. For the Haver command, a separate option could be added that takes a date frequency, as in the case of -tsset-, so that numeric values passed in could be interpreted in the context of that frequency (%tq, %tm, etc.).
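In the meantime, the explicit-comparison form does accept macros; a sketch, assuming the quarterly date variable above:

Code:
local start = tq(2002q4)
local end   = tq(2005q4)
list if date >= `start' & date <= `end'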

    Thank you,
    Michael Anbar
    Last edited by Michael Anbar; 15 Aug 2014, 12:20.

  • László Sándor
    replied
And one more: if StataCorp does not feel its comparative advantage is in reinventing the wheel and writing up regexes in C, etc., it could invest in fast, easy, efficient back-and-forth with other environments. Python comes to mind, as it is free, and much of it is a complement to Stata rather than a substitute (unlike R or Matlab or SAS, etc.). It would be great to manipulate variables essentially with Python, especially strings (and now with strL…).

  • László Sándor
    replied
One other thing: could -margins- and (some) other factor-variable magic work with predefined interaction terms? Stata is lovely in that it takes care of interactions for us (though the numerical derivatives are painfully slow sometimes), but repeated use of interaction terms in various models needlessly repeats the generation of those temporary variables, unless I am missing something. Why can't I save/define the interaction term, use it repeatedly, and still not fool -margins- into thinking it's a separate variable?
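For concreteness, the repetition looks something like this (model and variable names hypothetical):

Code:
* Stata rebuilds the i.group##c.age temporary variables for each model
regress y i.group##c.age x1
margins group, dydx(age)
logit z i.group##c.age x1
margins group, dydx(age)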

  • Clyde Schechter
    replied
+1 for László's suggestion of a case() function!

  • Adrian Sayers
    replied
I would really like a few simple things.
1. A vertical (column) select in the do-file editor.
2. A do-file editor that is divorced from the main package, so that when Stata takes a nosedive, it doesn't take your code with it.

I basically want Crimson Editor to work with Stata.

  • László Sándor
    replied
And one simpler thing, much less frustrating for StataCorp than my previous suggestions: a new case(,,…) function, or anything to replace nested cond(,,) functions. This should also replace the everyday use of repeated -replace if- (or the -recode- command!), which is painfully slow, especially on bigger data.
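The contrast, with a hypothetical income variable (the case() syntax being, of course, the requested feature, not actual Stata):

Code:
* today: nested cond() calls
generate bracket = cond(inc < 20000, 1, ///
                   cond(inc < 50000, 2, ///
                   cond(inc < 100000, 3, 4)))
* requested case() equivalent
generate bracket = case(inc < 20000, 1, inc < 50000, 2, inc < 100000, 3, 4)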

  • wbuchanan
    replied
Native JDBC support analogous to ODBC. Although the Java API and/or C plugin capabilities for end-user development are great, some end users (myself in particular) have little to no experience in those languages, and depending on the data system, JDBC can also perform a bit better than ODBC.

Support for multiple datasets residing in memory simultaneously. With sufficient ODBC/JDBC support, this could potentially be a matter of interfacing with whichever flavor of SQL end users prefer (maybe a set sql [mysql, sqlite, postgre, oracle, sqlserver] [, permanently]), making integration of Stata within the "big data" space a bit easier on the development team at Stata while retaining flexibility.

Using graph names as a method for creating multilayered graphs. For example, we can already use:

Code:
tw scatter x y || scatter abc z
But it would help to have a way to use something like:

Code:
tw scatter x y, name(gr1)
tw scatter abc z, name(gr2)
tw scatter varx vary, name(gr3)
// use gr1 as base, layer gr3 on top of that, and gr2 on top of all of them
gr layer gr1 gr3 gr2
It doesn't seem as though alpha transparency is coming in the immediate future, but making it easier for end users to layer graphs without rewriting code would be a good stopgap for creating complex graphs.

Better documentation of the graphics engine, to make it easier for the user/programmer community to add to the graphics capabilities.

Better documentation/NetCourses for programming Mata.

Latent class/transition analysis (categorical latent variables more generally) modeling.

Native support for 3PL IRT (allowing a single fixed or individually constrained/estimated c parameter in the model).

  • László Sándor
    replied
    Originally posted by Jeph Herrin View Post
Large data is typically maintained in relational databases, accessed via some flavor of SQL. Stata has some support for SQL, in that one can submit an SQL statement and retrieve the output, but this is not helpful if one wants to, e.g., use an index file to match IDs against the SQL database. Instead (and this often leads to problems of type 1 above) one must retrieve every record and do the matching locally in Stata, which means working with even larger datasets than one needs otherwise.
I would even add that my impression is that other tools like MonetDB (especially linked to R) have shocking performance advantages for data retrieval, selection, manipulation, and aggregation, let alone "MapReduce" operations on many cores (by the way, even Hadoop will be 10 years old when Stata 14 comes out). I understand that Stata is fantastic for -regress- and -ivregress- and -teffects- and clustered standard errors and whatnot, but it is all the sadder if we waste time with -by- looping over the data all the time, spend time on disk I/O for preserving (rarely transparent to the user), do needless dances for merges, and wait for the data to be re-sorted again and again because there is no better index for the data.

I am obviously no database scientist, but I do see other tools (not only SQL, but also pandas in Python, e.g.) doing "merges, egens, collapses, ifs, bys" much, much more efficiently. I really wonder why StataCorp does not invest in this.

  • Richard Williams
    replied
    Originally posted by Nicholas Winter View Post
    I'm coming late to the new forum, but: just a quick note that my combomarginsplot does provide some functionality to combine the multiple calls to marginsplot.
Combomarginsplot is a great command. I do hope StataCorp doesn't think that, just because somebody has written a command, it no longer has to worry about the issue. For a user, it would be easier to add an option to -margins- than to learn a whole new command.

  • Sergiy Radyakin
    replied
    Originally posted by Nick Cox View Post

    You are presumably referring to ==, >, <, >=, <=, !=.

    The problem you identify isn't (to me) at all clear....
I guess he is referring to precision control along the lines of http://docwiki.embarcadero.com/RADSt...s_%28Delphi%29 , meaning in short that the system should know what it is comparing (the types of the LHS and RHS) and know its own limitations (what it can and cannot compare).

    Best, Sergiy Radyakin

  • Jeph Herrin
    replied
The single most important "feature" Stata 14 could add is better support for so-called Big Data(c). This means two things in particular:

1. More and more I find myself working with very large databases - where analytic files are on the order of 100 GB, say - and in this environment Stata does not shine. In the long history of Little Data, the memory model employed by Stata had a clear and evident advantage. But no more: I find myself not only using SAS for data management, but (reluctantly) suggesting it to others working in the same space.

2. Large data is typically maintained in relational databases, accessed via some flavor of SQL. Stata has some support for SQL, in that one can submit an SQL statement and retrieve the output, but this is not helpful if one wants to, e.g., use an index file to match IDs against the SQL database. Instead (and this often leads to problems of type 1 above) one must retrieve every record and do the matching locally in Stata, which means working with even larger datasets than one needs otherwise.

    J
    Last edited by Jeph Herrin; 07 Aug 2014, 08:52.
