
  • Anyone running benchmarks across LLMs for Stata skill?

    Curious if anyone has already tried comparing how well different LLMs perform on Stata coding tasks—not just casually, but using formal benchmarks. I’m considering whether it’s worth building one myself, but wanted to check if something already exists in practice or in the early stages.

    Here's my wishlist of things I'd like to know whether an LLM can do consistently:
    • Reshaping and aggregating data (e.g. reshape, collapse, egen, merge), handling the syntax correctly and using by() options logically
    • A wide range of regression tasks, including thinking clearly about standard errors and applying the syntax correctly
    • Post-estimation commands (margins, estimates, predict, lincom), including extracting and interpreting results
    • Looping or macro-driven routines (foreach, forvalues, local macros)
    • Creating formatted tables for publication (table and collect)
    • A wide range of plotting commands/techniques, including with community contributed commands and suites of commands (e.g. schemes, palettes, stata-schemepack, coefplot, etc.)
    • Writing functional, clean, reusable .do files—more than one-offs; modular coding design
    • Correct data manipulation with string functions, date handling, and factor-level processing
    • Knowledge and use of community contributed commands for newer quasi-experimental design techniques (e.g., rdrobust, csdid, ivreg2, etc.)
    But I'm sure there are a lot of ways to design a general-purpose test of Stata coding ability that isn't skewed toward my applied micro needs. Anyway, just wondering if there's anything out there, or whether this is something people would like to have and have thoughts on how to design.
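
A concrete way to make that wishlist testable is to represent each benchmark task as a structured item tagged by skill category, so results can be broken down by the areas above. This is just an illustrative sketch in Python; the class and field names are hypothetical, not from any existing benchmark:

```python
# Hypothetical format for one benchmark item: a Stata task with a prompt,
# candidate answers, and a category tag so accuracy can be reported per skill.
from dataclasses import dataclass

@dataclass
class BenchItem:
    category: str       # e.g. "reshape", "regression", "post-estimation"
    prompt: str         # the task shown to the LLM
    choices: list[str]  # candidate answers (multiple-choice scoring)
    answer: int         # index of the correct choice

item = BenchItem(
    category="reshape",
    prompt="Which command converts a wide dataset with variables "
           "inc1980-inc1982 (one row per id) to long form?",
    choices=[
        "reshape long inc, i(id) j(year)",
        "reshape wide inc, i(id) j(year)",
        "collapse (mean) inc, by(id)",
    ],
    answer=0,
)
```

Tagging by category is what lets a benchmark say "model X is fine at regressions but weak on community-contributed graphics" rather than reporting a single opaque score.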

  • #2
    Have you checked this?

    https://www.statalist.org/forums/for...-dedicated-gpt



    • #3
      I did see that, thanks! But it struck me as a tool to be benchmarked, rather than a set of benchmarks itself.



      • #4
        I had a go; see what you think, and contributions are very welcome: https://github.com/dbann/statabench

        Motivated by 1) not finding an existing benchmark and 2) wanting to explore the capability of local LLMs for routine data analysis tasks (see below for results across a range of open-weight models).

        Evaluating LLMs is complex. This implementation currently uses multiple-choice questions (quick to evaluate, but performance may not translate entirely to real tasks, e.g. generating code).
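
For anyone curious what multiple-choice scoring amounts to, here is a minimal sketch in Python. It assumes each item records the index of its correct choice and the model returns one chosen index per item; the names (`items`, `model_answers`, `score`) are illustrative, not taken from the statabench repo:

```python
# Minimal multiple-choice scoring sketch: accuracy is the fraction of items
# where the model's chosen index matches the stored correct index.

def score(items: list[dict], model_answers: list[int]) -> float:
    """Return the fraction of items answered correctly (0.0 for no items)."""
    correct = sum(
        1 for item, pick in zip(items, model_answers)
        if pick == item["answer"]
    )
    return correct / len(items) if items else 0.0

# Two toy items: the model gets the first right and the second wrong.
items = [
    {"question": "Command to stack a second dataset under the current one?", "answer": 1},
    {"question": "Option for heteroskedasticity-robust SEs in regress?", "answer": 0},
]
print(score(items, [1, 2]))  # -> 0.5
```

The appeal is that this needs no parsing or execution of generated code, which is exactly why it is quick to run but only a proxy for real code-writing ability.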

        [Attachment: plot_overall_accuracy.png (overall accuracy plot)]

