  • Using large language model with Stata to help in coding

    I just spent a couple of hours trying to figure out how to get a large language model (Claude 3.5 Sonnet, in my case) to do Stata. I did not find any useful tutorials, so I just tried to figure it out myself using Visual Studio Code as an editor. This is mind-blowingly good, so I thought it was worth sharing.

    Here is a screencast that first shows how the system works and then how it is set up. The first 2-3 minutes are a general demo and a small complaint about Stata not offering this itself. The Stata demo starts at around 3 minutes.

    https://youtu.be/cOwmYXYkxWA

    Claude can also competently convert Stata code to R and, to some extent, the other way around, but it is a bit less capable with Stata than with R.


    (My YouTube channel does not run ads and I hold no commercial interest in this content.)

  • #2
    Thanks, Mikko. I've been using Claude.ai (at the suggestion of Scott Cunningham) and find it helpful, unlike ChatGPT (at least, the earlier versions). I have also used it (and ChatGPT) for R with some success, even though I don't know much about R.



    • #3
      I may be coming to this conversation late. I agree with Mikko Rönkkö that Claude is very good; I would say better than GPT. That said, I have a private GPT that I have trained on the Stata manuals, and it handles Stata well. I am also experimenting with running local LLMs on my machine for code that cannot be shared to the cloud. I think that some day more research institutions will host private local LLMs on secure research servers to help with coding. For the most part, almost everyone would be happy with Claude; I just like experimenting with new tech.
      Owner of StataTutor.com



      • #4
        I wanted to share my experience using Claude—first through the chat interface and more recently through Claude Code—for Stata work, in case it's useful to others on this thread.
        Like many here, I was initially told to keep expectations in check given the relatively limited Stata training data available to LLMs compared to, say, Python or R. My first use case was modest: tweaking error messages in complex twoway graph calls of various types. Sometimes the end result was to modify an .ado file previously downloaded from SSC so that I could generate a graph in just the format I wanted.
        Over the past two months, I have used the chat interface extensively for coding and have been impressed with the results on the whole. Never has my code been so well documented, and coding with Claude has taken a lot of the drudgery out of the process.
        Here is a more substantive example. In any empirical project, there are typically layers upon layers of data construction choices. When one is finally ready to conduct analysis, it has usually proved very onerous to retrace the steps and think about the impact of those various alternative, equally sensible choices (at least from an ex ante perspective) on the results. My analysis do-files now contain a sequence of nested loops that allow me to explore in a very flexible way the impact of any particular data construction choice on the final estimates. Of course I could have done this by myself before, but it was error-prone. I often didn't get it right, and the result was that I was always exploring from a much narrower menu of potential data construction choices than I should have been.
        Over the past two weeks, I have shifted most of my coding work to Claude Code, which has been a nice incremental improvement. It can see my entire hierarchy of folders, subfolders, and analysis files for a given project and is very skilled at making targeted, incremental edits as opposed to needing to regenerate an entire do-file that might be several hundred or even several thousand lines long.
        The main limitation I'm running into is that I can't get Claude Code to run Stata itself and correct errors on its own. Has anyone here found a way to close that loop? Any pointers would be greatly appreciated.
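        One building block for closing that loop, independent of any editor extension, is to run the do-file in batch mode and hand the resulting log back to the agent. The sketch below is in Python and is only illustrative: the executable name `stata-mp`, the file names, and both function names are assumptions, not part of any tool mentioned in this thread. It relies on two Stata behaviors: on Unix-style systems `stata -b do file.do` writes `file.log` to the working directory (Windows uses a different flag), and a failed command ends with a return-code line such as `r(601);`.

```python
import re
import subprocess
from pathlib import Path

# A failed Stata command is followed by a return-code line such as "r(601);".
ERROR_LINE = re.compile(r"^r\((\d+)\);$")


def find_stata_errors(log_text, context=3):
    """Return (return_code, preceding_lines) for every error line in a Stata log."""
    lines = log_text.splitlines()
    errors = []
    for i, line in enumerate(lines):
        match = ERROR_LINE.match(line.strip())
        if match:
            start = max(0, i - context)
            errors.append((int(match.group(1)), lines[start:i]))
    return errors


def run_do_file(do_file, stata_exe="stata-mp", workdir="."):
    """Run a do-file in batch mode and return the text of the log it writes.

    Assumption: stata_exe is on the PATH and accepts the Unix-style -b flag.
    The log, not the process exit status, is inspected for errors.
    """
    subprocess.run([stata_exe, "-b", "do", do_file], cwd=workdir, check=False)
    log_path = Path(workdir) / (Path(do_file).stem + ".log")
    return log_path.read_text(errors="replace")
```

        An agent could call `run_do_file("analysis.do")`, pass the returned text to `find_stata_errors`, and revise the do-file whenever the list is non-empty.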
        Separately, I hope the gears are turning at StataCorp. I would venture the proposition that anything making it easier for Stata to leverage agentic coding tools—not just Claude Code, but others that will surely follow—is going to have a very large impact on the entire community, perhaps more so than adding any particular new feature to the next release.



        • #5
          VS Code + Claude Code + the Stata Workbench extension allows for seamless integration of Claude Code and Stata. Alternatively, you can use the Stata Run extension. The key insight is to move your coding from Stata's Do-file Editor to VS Code. I will post a screencast about this setup on my YouTube channel when I have time to record it.



          • #6
            That would be much appreciated, Mikko. This is the first time I have heard of the Stata Workbench extension, so clearly I'm missing out.



            • #7
              Originally posted by Pierre Azoulay
              Over the past two months, I have used the chat interface extensively for coding and have been impressed with the results on the whole. Never has my code been so well documented, and coding with Claude has taken a lot of the drudgery out of the process.
              Here is a more substantive example. In any empirical project, there are typically layers upon layers of data construction choices. When one is finally ready to conduct analysis, it has usually proved very onerous to retrace the steps and think about the impact of those various alternative, equally sensible choices (at least from an ex ante perspective) on the results. My analysis do-files now contain a sequence of nested loops that allow me to explore in a very flexible way the impact of any particular data construction choice on the final estimates. Of course I could have done this by myself before, but it was error-prone. I often didn't get it right, and the result was that I was always exploring from a much narrower menu of potential data construction choices than I should have been.
              While I can see the appeal of this approach, it leaves the question whether this is just making p-hacking easier.
              https://www.kripfganz.de/stata/



              • #8
                While I can see the appeal of this approach, it leaves the question whether this is just making p-hacking easier.
                While I agree that this can be readily misused for p-hacking, deep-sixing the "non-significant" results and publishing the "significant" ones, this can also serve a positive purpose.

                Despite our best intentions to pre-specify a data management and analysis protocol and adhere to it, experience suggests that in many studies the data we end up with presents surprising situations that simply were not anticipated ex ante but need to be dealt with in some way. All such surprises combined can lead to a large number of possible data management/analysis pathways. One honest approach, when feasible, is to do them all and to present, at least in summary if not in detail, the results of all of them. This would be, in effect, a sensitivity analysis: how much do the ex post decisions influence the conclusions of the study?
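                To make "do them all" concrete, here is a minimal Python sketch that enumerates every combination of a set of data construction choices, so that each pathway can be run and reported rather than a hand-picked subset. The choice names and options are hypothetical placeholders, and the function name is an assumption; each resulting dictionary could, for instance, be passed as arguments to a do-file.

```python
from itertools import product


def enumerate_pathways(choices):
    """Return one dict per combination of options, covering the full grid."""
    names = list(choices)
    option_lists = [choices[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*option_lists)]


# Hypothetical data construction choices, for illustration only.
choices = {
    "outlier_rule": ["none", "winsorize_1pct"],
    "sample": ["full", "balanced_panel"],
    "controls": ["baseline", "extended"],
}

pathways = enumerate_pathways(choices)
print(len(pathways))  # 8 pathways: 2 x 2 x 2
```

                Because the grid is generated rather than typed by hand, no pathway can be silently dropped, and the number of specifications reported can be checked against `len(pathways)`.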
