This is a sticky topic.

  • Being able to integrate better with LLMs. I don't think students will want to study Stata anymore if Claude + Python is easier to use than Claude + Stata.


    • #121 overlaps with #118 to which #119 is a very personal answer.

      I don't see StataCorp as aiming Stata at anyone who wants to run Python code and needs AI assistance to write it. Is that a foundation for good quality original research leading to advanced degrees or publications in reputable journals?

      Or -- on another level -- what precisely would "integrate better" mean for Stata developers?


      • Or -- on another level -- what precisely would "integrate better" mean for Stata developers?
        Just playing devil's advocate, I can see two ways Stata could be better integrated with an LLM. First, LLMs could be better at Stata. That is ultimately limited by the availability of training data, and there is simply more training data for Python, but it could be augmented with some fine-tuning/transfer-learning techniques. One could also build an interface between an LLM and the interpreter, allowing the interpreter to check the correctness of the syntax before the predicted code is served to the user. It's a bit like handling the fact that LLMs are bad at arithmetic by giving one access to a calculator behind the scenes. This points to a subtle hidden cost of relying on LLMs: the LLM will never be up to date on the latest features and technologies, because the body of code to train it on doesn't exist yet. One might think that a combination of emergence and retrieving the relevant documentation to include in the prompt would solve that problem. Setting aside the magical thinking implicit in "emergence," all of that works fairly well for simple problems or syntax questions, but it doesn't scale well to more complex and synthetic code, where the LLM will likely default to older solutions. The problem is even worse when you are working in a proprietary codebase, as many professionals do. It seems like modern AI companies are getting around these issues for now by throwing resources at the problem, by training larger and larger models, and by focusing on building for languages where a ton of training data already exists, like Python. The problem is that training these models is extremely expensive, and training entirely new models as an update strategy doesn't scale well given the costs involved.
        I'm not saying issues of scale will prevent these companies from continually updating their models, but they will place some practical limits on how they do it, which may mean that LLMs continue to be fairly good for basic Python but less good for Julia or the latest version of Stata. If that prevents people from learning Stata, it is to their own detriment, because I would take Stata's library of statistical models over Python's statsmodels package for essentially every regression-related task. Stata simply has a much more extensive library of models than Python for those kinds of modeling tasks.
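The idea of letting an interpreter check syntax before generated code reaches the user could look roughly like the sketch below. It is only an illustration under stated assumptions: Python's built-in `compile` stands in for the interpreter (Stata's parser is not used here), and the `drafts` list is a hypothetical stand-in for model output.

```python
from typing import Callable, Iterable, Optional


def python_syntax_ok(code: str) -> bool:
    """Stand-in checker: asks Python's own compiler, not Stata's parser."""
    try:
        compile(code, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False


def first_valid(candidates: Iterable[str],
                is_valid: Callable[[str], bool]) -> Optional[str]:
    """Return the first candidate that passes the syntax check, else None."""
    for code in candidates:
        if is_valid(code):
            return code
    return None


# Hypothetical model output: the first draft has an unclosed parenthesis.
drafts = ["print('hello'", "print('hello')"]
chosen = first_valid(drafts, python_syntax_ok)
```

A check like this only catches syntax errors; code that parses can still be statistically or logically wrong.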

        In addition to improving the LLM output, you could literally integrate an LLM into the do file editor. Let the LLM try to predict the next token as you type and hit tab when the LLM has made the correct guess to fill in the next piece of code. Then create a way to prompt the LLM from the do file editor if you have a question about the code or if you'd like it to write some template code, with the response inserted directly into the do file editor as a comment or as code. The first feature especially already exists in a primitive form. The do file editor will try to guess the token you are typing in based on the tokens that already exist in the do file. Next-token-prediction from an LLM would be a much more sophisticated version of that. Of course, you would essentially need to repeatedly and automatically query a huge data structure at a massive datacenter. That'll likely cost you (or someone) at least a few cents per token plus some per-query overhead, and you'd be contributing to all the negative externalities from LLMs, but I suppose that is the cost of outsourcing your intelligence to a tech company.
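The "primitive form" of completion described above, guessing from tokens already present in the file, can be sketched as follows. This is an assumed, simplified version and not how the do-file editor is actually implemented; the function name `complete` and the ranking rule (frequency, then alphabetical) are illustrative choices.

```python
import re
from collections import Counter
from typing import List


def complete(prefix: str, buffer: str, limit: int = 3) -> List[str]:
    """Suggest identifiers already in `buffer` that start with `prefix`."""
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", buffer)
    counts = Counter(t for t in tokens if t.startswith(prefix) and t != prefix)
    # Most frequent first; ties broken alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tok for tok, _ in ranked[:limit]]


do_file = """
regress price mpg weight
predict price_hat
summarize price price_hat
"""
suggestions = complete("pr", do_file)
```

An LLM-based completer would replace the frequency count with a model's predicted continuation, which is where the per-query cost comes in.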

        As I've said in the past, I don't support the introduction of these features right now because I think we'd be better served by simpler extensions to the do file editor that have existed in other software for decades. An obvious start would be more informative syntax errors. Stata often returns "syntax error" at runtime, but I'd love to see the closest possible approximate location of the syntax error highlighted in the text of the do file and an informative error message presented in a tool tip after a second or so on mouseover. Such a feature would also be useful to augment an LLM when the time comes. Moreover, students may not want to learn Stata, but they should still do so for a number of reasons. The biggest issue is that LLMs don't scale well to complex tasks. Anyone who is literate can prompt an LLM to get code it's seen a million times before in its training set, but once you get sophisticated enough that stops working, and you'll need real expertise to move forward.
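As a rough illustration of what "closest possible approximate location" could mean, here is a toy checker that reports the line and column of the first bracket problem in a piece of text. It is a sketch under heavy assumptions: it only tracks (), [] and {}, it ignores quoting, comments and macro syntax, and it does not reflect how Stata parses commands.

```python
from typing import Optional, Tuple

PAIRS = {")": "(", "]": "[", "}": "{"}


def first_unbalanced(source: str) -> Optional[Tuple[int, int, str]]:
    """Return (line, column, message) for a bracket problem, or None.

    A stray closer is reported where it occurs; otherwise the most
    recently opened bracket that was never closed is reported.
    """
    stack = []  # entries: (opening character, line, column)
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in PAIRS.values():
                stack.append((ch, line_no, col))
            elif ch in PAIRS:
                if not stack or stack[-1][0] != PAIRS[ch]:
                    return (line_no, col, f"unexpected '{ch}'")
                stack.pop()
    if stack:
        ch, line_no, col = stack[-1]
        return (line_no, col, f"'{ch}' is never closed")
    return None


snippet = "generate x = (a + b\nsummarize x"
problem = first_unbalanced(snippet)
```

An editor could use the returned line and column to place the highlight and show the message in a tooltip.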


        • I wish graph matrix had options to include regression lines and confidence intervals for rho. Here is an example of what I have in mind (created via JASP).

          [Image: JASP_scatterplot_matrix.png]
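For reference, one common way to compute a confidence interval for a correlation is the Fisher z-transformation. The sketch below assumes that approach (it is not confirmed to be what JASP uses for the plot), along with approximate bivariate normality and a 95% level via the default critical value; the function name is hypothetical.

```python
import math
from typing import Tuple


def rho_confidence_interval(r: float, n: int,
                            z_crit: float = 1.959964) -> Tuple[float, float]:
    """Approximate CI for a correlation using the Fisher z-transformation."""
    if n <= 3:
        raise ValueError("need n > 3 observations")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    z = math.atanh(r)                # Fisher transform of the sample correlation
    se = 1.0 / math.sqrt(n - 3)      # approximate standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


low, high = rho_confidence_interval(0.5, 50)
```

Because the interval is built on the z scale and transformed back, it is asymmetric around r and always stays inside (-1, 1).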
          --
          Bruce Weaver
          Email: [email protected]
          Version: Stata/MP 19.5 (Windows)


          • Moreover, students may not want to learn Stata, but they should still do so for a number of reasons.
            Daniel Schaefer I hope some of what you say is true, but my industry is very much embracing AI and tools like Claude Code, which work well with Python and R for many programming tasks. Is the purpose to be at the bleeding edge of coding complexity using Claude Code? In many cases it is just to accelerate productivity with an excellent coding assistant that also has excellent general knowledge. Do we know how best to use Claude Code? No, but the push is to learn immediately. I do think it is possible that LLMs like Claude and Claude Code will lead to more users of R and Python and fewer of Stata and SAS, even if both Stata and SAS have excellent and deserved market positions in several fields and are well liked by patrons.


            • Dave Airey Don't misunderstand, I very much like Claude Code for many tasks, and I think agentic AI looks very plausible for large-scale rapid-prototyping tasks. It's undeniable that industry is embracing these tools, though I think it is still sometimes hard to separate the genuine use cases from the hype. I've also used these tools enough to see them make mistakes, and when they do, the output is often confident and plausible-sounding. I guarantee that right now people in industry are using LLMs to produce dashboards with plausible-looking graphics and a clear story that are nonetheless straightforwardly wrong, and no one is bothering to check. My point isn't that LLMs aren't useful; it is that students need to develop enough background knowledge and expertise to use an LLM effectively and to avoid being fooled by incorrect but compelling output.


              • Daniel Schaefer Well said. Validation (responsible use) of all AI artifacts and assistance should be required for sure.


                • When using the navigator in the do-file editor, I would appreciate highlighting of the current "chapter" in the navigator window. The user could then quickly see which part of a larger do-file they are in at the moment.


                  • Two quality of life improvements for etable:

                    1) A 'wide' option that puts stats next to each other (in separate columns) rather than below each other in one column
                    2) Option to remove leading zeros from p values
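To make the second request concrete, the formatting rule could be sketched like this. It is a hypothetical helper, not an existing etable option, and the "<.001" floor for very small values is an assumed convention.

```python
def format_p(p: float, digits: int = 3) -> str:
    """Format a p-value without its leading zero, e.g. 0.042 -> '.042'."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    threshold = 10 ** -digits
    if p < threshold:
        # Report very small values as below the smallest printable value.
        return "<" + f"{threshold:.{digits}f}".lstrip("0")
    text = f"{p:.{digits}f}"
    # Drop the leading zero only; a value that rounds to 1 keeps its "1".
    return text[1:] if text.startswith("0.") else text


examples = [format_p(0.042), format_p(0.0004), format_p(1.0)]
```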


                    • mi estimate should fully return theta and the related statistics when shared() is used with stcox.


                      • If you try to print a Help file, the text often does not fit properly on the page. The only way to fix this is to resize the Help window. Please fix this issue in Stata 20. In pretty much any other software, hitting a "print" button fits the content to the page.


                        • Radion Svynarenko (#131): My take on this is: What fits "properly" on a page for you personally may be different for others. I think it would be problematic to set a specific format for everyone—besides, there are different paper sizes (Legal, Letter, A4, etc.). The easiest way to adjust this is actually to resize the viewer window. Also note that help files for user-written programs use SMCL encoding differently and to varying degrees of quality, so a "proper" fit may look different depending on the help file.

                          For me personally, the help files in the viewer are perfectly sufficient for a quick overview—and the help from the manuals (PDF) can be printed (at least for me) in a neat format without any problems.


                          • https://www.statalist.org/forums/for...-word-retained gives further (or rather previous) discussion around the issue raised by Radion Svynarenko in #131. I tend to agree with Dirk Enzmann -- but the question is as usual aimed primarily at StataCorp developers.


                            • It would be nice to add an option for a transparent background in graphs. That would make it easy to use graphs in presentations, where one can paste the visual on top of a custom background.
