
  • Anyone running benchmarks across LLMs for Stata skill?

    Curious if anyone has already tried comparing how well different LLMs perform on Stata coding tasks—not just casually, but using formal benchmarks. I’m considering whether it’s worth building one myself, but wanted to check if something already exists in practice or in the early stages.

    Here's my wishlist of things I'd like to know whether an LLM can do consistently:
    • Reshaping and aggregating data (e.g. reshape, collapse, egen, merge), handling the syntax correctly and using by() options logically
    • A wide range of regression tasks, including thinking clearly about standard errors and applying the syntax correctly
    • Post-estimation commands (margins, estimates, predict, lincom), including extracting and interpreting results
    • Looping or macro-driven routines (foreach, forvalues, local macros)
    • Creating formatted tables for publication (table and collect)
    • A wide range of plotting commands/techniques, including with community contributed commands and suites of commands (e.g. schemes, palettes, stata-schemepack, coefplot, etc.)
    • Writing functional, clean, reusable .do files—more than one-offs; modular coding design
    • Correct data manipulation with string functions, date handling, and factor-level processing
    • Knowledge and use of community contributed commands for newer quasi-experimental design techniques (e.g., rdrobust, csdid, ivreg2, etc.)
    But I'm sure there are a lot of ways to design a general-purpose test of Stata coding ability that isn't skewed toward my applied micro needs. Anyway, just wondering if there's anything out there, or whether this is something people would like to have and have thoughts on how to design.
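
A concrete way to make that wishlist testable is to represent each benchmark task as a structured item tagged by skill category, so results can be broken down by the areas above. This is just an illustrative sketch in Python; the class and field names are hypothetical, not from any existing benchmark:

```python
# Hypothetical format for one benchmark item: a Stata task with a prompt,
# candidate answers, and a category tag so accuracy can be reported per skill.
from dataclasses import dataclass

@dataclass
class BenchItem:
    category: str       # e.g. "reshape", "regression", "post-estimation"
    prompt: str         # the task shown to the LLM
    choices: list[str]  # candidate answers (multiple-choice scoring)
    answer: int         # index of the correct choice

item = BenchItem(
    category="reshape",
    prompt="Which command converts a wide dataset with variables "
           "inc1980-inc1982 (one row per id) to long form?",
    choices=[
        "reshape long inc, i(id) j(year)",
        "reshape wide inc, i(id) j(year)",
        "collapse (mean) inc, by(id)",
    ],
    answer=0,
)
```

Tagging by category is what lets a benchmark say "model X is fine at regressions but weak on community-contributed graphics" rather than reporting a single opaque score.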

  • #2
    Have you checked this?

    https://www.statalist.org/forums/for...-dedicated-gpt



    • #3
      I did see that, thanks! But it struck me as a tool to be benchmarked, rather than a set of benchmarks itself.



      • #4
        I had a go; see what you think, and contributions are very welcome: https://github.com/dbann/statabench

        Motivated by 1) not finding an existing benchmark and 2) wanting to explore the capability of local LLMs for routine data analysis tasks (see below for results across a range of open-weight models).

        Evaluating LLMs is complex. This implementation currently uses multiple-choice questions (quick to evaluate, but performance may not translate entirely to real tasks, e.g. generating code).
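
For anyone curious what multiple-choice scoring amounts to, here is a minimal sketch in Python. It assumes each item records the index of its correct choice and the model returns one chosen index per item; the names (`items`, `model_answers`, `score`) are illustrative, not taken from the statabench repo:

```python
# Minimal multiple-choice scoring sketch: accuracy is the fraction of items
# where the model's chosen index matches the stored correct index.

def score(items: list[dict], model_answers: list[int]) -> float:
    """Return the fraction of items answered correctly (0.0 for no items)."""
    correct = sum(
        1 for item, pick in zip(items, model_answers)
        if pick == item["answer"]
    )
    return correct / len(items) if items else 0.0

# Two toy items: the model gets the first right and the second wrong.
items = [
    {"question": "Command to stack a second dataset under the current one?", "answer": 1},
    {"question": "Option for heteroskedasticity-robust SEs in regress?", "answer": 0},
]
print(score(items, [1, 2]))  # -> 0.5
```

The appeal is that this needs no parsing or execution of generated code, which is exactly why it is quick to run but only a proxy for real code-writing ability.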

        [Attachment: plot_overall_accuracy.png (overall accuracy plot)]

