Wishlist for Stata 18

wbuchanan replied

12 Apr 2022, 06:36
William Lisowski
The SAS "SQL" implementation is not what I would consider to be anything but an example of how not to approach the problem. There are packages in several languages that implement various types of SQL functionality, particularly with regards to joins, but they do so using their own native approaches that fit within their current programming paradigms. Leonardo Guizzetti is correct that SAS is implementing something like their own version of a subset of SQL that is pretty far removed from the full ANSI specification.
Leave a comment:
Bjarte Aagnes replied

12 Apr 2022, 04:23
Stata Installation Qualification Tool (IQT): adding API, or command line arguments, to integrate IQT in automated deployment processes.
2 likes
Leave a comment:
Leonardo Guizzetti replied

11 Apr 2022, 15:55
I don't know if SAS implemented their own version of SQL with PROC SQL (I suspect so) or if they mask some backend to some other database. Whatever it is, they are their own "dialect", adding on features that only make sense in a SAS context and not quite allowing full ANSI-specification SQL. I would really be happy to see something like an SQL interface since there are clear advantages for some needs. But, I think this is where Stata seems to philosophically follow the idea of embracing existing frameworks -- by allowing ODBC/JDBC interfaces to these databases -- which means users who desire those tools and functionality must implement the setup steps themselves.
2 likes
Leave a comment:
William Lisowski replied

11 Apr 2022, 14:10
I support the intent of #339 but not the compromise implicit in it.

My preference would be not to add SQL-ish features to existing commands, but rather to support SQL operations on native SAS datasets within a Mata-like coding syntax. It would recognize SQL syntax in one-line SQL commands and in SQL programs, but allow direct access to Stata datasets both on disk and in other frames, without the push-pull of data structure conversions. Those who come to Stata from SQL will have access to SQL functionality in the form they are familiar with, and those who want to learn and adopt SQL functionality will be able to reference copious commercial documentation.

My experience - some years past now, thankfully, but perhaps outdated - working in a SAS environment back-ended by an Oracle database led me to appreciate the differing strengths of a relational database and its SQL commands and the "flat file" data structures used by conventional statistical programming with sorting and merging. What I suggest is no different than how SAS approached the problem back in the day, by creating PROC SQL to understand SQL commands (and interface directly with SQL databases) rather than shoehorn SQL capabilities into existing DATA step commands.

To be clear, there are tasks that I've undertaken in Stata that would have been more straightforward with access to SQL syntax, and I have missed access to that tool, as I had earlier missed access to the full power of contemporary regular expression syntax now embedded in the Unicode regular expression functions. What I suggest is intended to mimic the recent integration of Python functionality into Stata to extend Stata's capabilities to tools written in Python, and before that, the separation of Mata into a purpose-built language rather than gluing it onto the existing Stata matrix commands.

Where it makes sense, Stata should embrace existing programming frameworks rather than strive to become the Esperanto of statistical programming languages.
3 likes
Leave a comment:
wbuchanan replied

11 Apr 2022, 13:02
Clyde Schechter
That was definitely what I was after. Ideally, any expression that resolves to a boolean should be acceptable to define a join operation on two sets.
1 like
Leave a comment:
Leonardo Guizzetti replied

11 Apr 2022, 11:54
Agreed, it's not a drop-in replacement for the breadth of an SQL query syntax, but very helpful for an important subset of that type of work.
Leave a comment:
Clyde Schechter replied

11 Apr 2022, 10:20
Re #341. It is true that -rangejoin- accommodates the specific functionality of that example. And, don't get me wrong--I'm a huge fan of -rangejoin-. It literally revolutionized my workflow when Robert Picard released it, enabling me to do work on Stata that I previously had to do in other (compiled) frameworks. But I think the functionality wbuchanan is referring to is more general: being able to restrict the match based on any kind of Boolean condition, not just ones based on ordering relationships. While I don't think my own work would make much use of that, I can imagine that others might benefit greatly.
2 likes
Leave a comment:
Leonardo Guizzetti replied

11 Apr 2022, 07:12
Originally posted by wbuchanan View Post

Along the lines of other requests related to merge (and recognizing my previous asks to adopt something analogous to more standard SQL type capabilities), it’d be nice to be able to use expressions that resolve to booleans in merge commands. For example, being able to specify that you want to merge based on a common ID but only for records at a later/earlier date. The syntax might look something like:

merge 1:1 id == otherid & using.newdatefield > master.date

Using and master are intended to indicate which frame/file the field would be found in.

Alternatively, -rangejoin- (SSC by Robert Picard) supports this concept, but requires a bit of setup for the task.
Leave a comment:
Jesse Wursten replied

11 Apr 2022, 06:12
Originally posted by wbuchanan View Post

Along the lines of other requests related to merge (and recognizing my previous asks to adopt something analogous to more standard SQL type capabilities), it’d be nice to be able to use expressions that resolve to booleans in merge commands. For example, being able to specify that you want to merge based on a common ID but only for records at a later/earlier date. The syntax might look something like:

merge 1:1 id == otherid & using.newdatefield > master.date

Using and master are intended to indicate which frame/file the field would be found in.

For inspiration, I recently discovered Pandas (Python) supports this through the merge_asof method: https://pandas.pydata.org/pandas-doc...erge_asof.html
Leave a comment:
wbuchanan replied

10 Apr 2022, 07:28
Along the lines of other requests related to merge (and recognizing my previous asks to adopt something analogous to more standard SQL type capabilities), it’d be nice to be able to use expressions that resolve to booleans in merge commands. For example, being able to specify that you want to merge based on a common ID but only for records at a later/earlier date. The syntax might look something like:

merge 1:1 id == otherid & using.newdatefield > master.date

Using and master are intended to indicate which frame/file the field would be found in.
2 likes
Leave a comment:
John Mullahy replied

09 Apr 2022, 15:12
The current v17 documentation file STATA FUNCTIONS REFERENCE MANUAL RELEASE 17 (fn.pdf) shows the functional forms of some but not all of the probability distributions it covers (see pp. 109 and beyond in the pdf file). For example, compare betaden and cauchyden.

Might future editions of the documentation show the functional forms for all distributions?
4 likes
Leave a comment:
Justus Franz-Christoph Meyer replied

08 Apr 2022, 12:39
I think one could add further distributions to the standard set of distributions available in the statistical distribution functions, even if only to make it more comfortable to use them (e.g. various beta-type distributions, skewed normal etc.).
1 like
Leave a comment:
Zachary Brown replied

05 Apr 2022, 09:02
Two items for the built-in mixed logit choice models command:
An internal implementation of the user- 'mixlogitwtp' by Hole, i.e. within 'cmmixlogit' and 'cmxtmixlogit'. This would facilitate additional postestimation work with these models, -for example, routines that rely on scores outputted from standard ML commands.

With the panel command 'cmxtmixlogit', an option to specify that a given choice is replicated multiple times within decisionmaker. With conditional logit, this option is unnecessary because the replicated tasks can simply be implemented as weights. But with panel mixed logit, each decisionmaker's likelihood is integrated over the random coefficients before taking logs, and so weights can't vary within decisionmaker. The use of replications would essentially a way to reduce the memory burden of executing the following:

expand choicereps
bysort DMid taskid: gen crep = _n
egen taskid_expanded = group(taskid crep)
cmset DMid taskid_expanded altvar
cmxtmixlogit chosen x

Besides being cumbersome to code the above, it also can dramatically and unnecessarily increase the memory burden when decisionmakers are making the same choice across many identical choice occasions. (An example would be recreational site choice data or transportation mode choices. In the dataset I'm working with, the number of rows in the data goes from ~150K to 2M+.)

Instead, with task replicates it would be nice if we could implement something like:
cmxtmixlogit chosen x, nchoices(choicereps)

Computationally, this would be easy to implement, since we can just compute the DM's conditional likelihood as by taking the probability of each distinct choice made, raised to a power equal to the number of replicates that choice was made on identical choice occasions, and then taking product of each of these distinct choice probabilities.
Leave a comment:
Jesse Wursten replied

04 Apr 2022, 02:42
Originally posted by Chris Stefancik View Post

How about a merge command syntax where we can specify which variable to link between the master and the using instead of requiring both variables to have the same name.

Code:

merge 1:1 patientid == patient_num using "demo_data.dta", keep(master match)

You might like the -join- command includes in the -ftools- package.
Leave a comment:
Leonardo Guizzetti replied

01 Apr 2022, 18:54
Originally posted by Chris Stefancik View Post

How about a merge command syntax where we can specify which variable to link between the master and the using instead of requiring both variables to have the same name.

Code:

merge 1:1 patientid == patient_num using "demo_data.dta", keep(master match)

This is what frames partially solves. I say partially because frlink only supports m:1 or 1:1 linking, and sometimes the 1:m situation is more cumbersome than using merge.
2 likes
Leave a comment:

Announcement

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment: