The term reproducible research refers to the idea that scientific results should be documented in such a way that their derivation is fully transparent. This requires a detailed description of the methods used to obtain the data, and making the full dataset and the code used to calculate the results easily accessible. This is an essential part of open science.
This has nothing to do with the software that you are using.
If you make a mess that, at worst, even you cannot reproduce, or if you fake your data and methods on purpose, no software can save you.
So it is not clear at all what you want from any software.
Stata is a standard scripting language, like TSP, RATS, R, etc. If you have not lost track of the data file that you input, and if you have not faked your data file, any time you run the same Stata script (.do file) on the same input file, you will get the same results. This is how it has always been since the creation of Stata in the 1980s, I think.
I am not going to participate further in a discussion where the opening statement is "This has nothing to do with the software that you are using." I believe anyone who has worked with SPSS will acknowledge that reproducibility is indeed related to software.
I also believe that those who have worked extensively with notebooks (mixing code and narrative text, for instance in Jupyter or RStudio) will question the statement. For complex projects with lots of code, I would certainly prefer to have all the code in one document rather than in several do-files. For me, having one file rather than many increases reproducibility. I'm fine with people having other views.
BTW: Stata has a particularly good feature for reproducibility: the version command. Again, an example that software matters for reproducibility.
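A minimal sketch of how that works (the version number and dataset here are just illustrative):

```stata
* Pin the interpreter to Stata 14 behavior, so this script
* produces the same results even when run under a newer release.
version 14

* Example analysis on a shipped dataset
sysuse auto, clear
regress price mpg weight
```

Because the version statement is recorded in the do-file itself, anyone rerunning the script years later gets the behavior of the release you pinned, not whatever release they happen to have installed.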
Reproducibility is not about the researcher themselves getting the same result. It's about transparency.
I also believe that those who have worked extensively with notebooks (mixing code and narrative text, for instance in Jupyter or RStudio) will question the statement.
I have in fact worked extensively with Jupyter notebooks as part of my day job. I do not like them. That is being polite. When I use R, I use plain R with a good text editor. I have had employees use RStudio, but with the way I work and debug I find it an annoyance.
I ensure reproducibility with my own projects by writing clear, commented do-files using a good ASCII text editor, storing all necessary input datasets, ado-files, help files, etc. in the same folder (with subfolders if necessary), and saving plain-text log files of all final output. At the top of the folder I'll have a file named README or master.do or something that shows the sequence of do-files that need to be run to reproduce all of the output.
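A sketch of what such a top-level file might look like (the file names below are hypothetical, for illustration only):

```stata
* master.do -- runs the project's do-files in order.
* Each step reads from and writes to files in this folder.
clear all
version 16

do 01_import.do     // read raw input data, save the analysis dataset
do 02_clean.do      // data management and variable construction
do 03_analysis.do   // models and tables
do 04_figures.do    // graphs for the final output
```

Running master.do from a clean session then reproduces every final result from the stored inputs.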
As recently as six months ago, I shared code that I wrote as part of my thesis more than 20 years ago, and the results remain identical.
If you like notebook-style workflows and find them useful, then there's no issue there. However, I find this discussion has conflated two distinct concepts, which I think are a bit at odds. Those are reproducibility and literate programming.
Reproducibility follows from the scientific method: are the methods and data described clearly and completely enough that anyone can do the same things with the same input data and get the same results? Good documentation practice suggests that each component be separated and described distinctly. Literate programming, a concept described by Donald Knuth, takes reproducibility to an extreme by blending documentation, data, results, and code into a single file. The end goal is one common narrative that is primarily intended for humans to read, but that is not necessarily as easy for humans to inspect at the level of the underlying code. This concept seems to have spurred the notion of various notebooks.
I dislike the literate programming paradigm because in any real project I've worked on, it is much easier to borrow the computer science concept of separation of concerns, tackling distinct steps of a project in stages, and later assembling the relevant results and documentation together, either in the form of a manuscript, a study report, analysis and results datasets, a book of tables and figures, or a mix of these, which could easily amount to tens or hundreds of output pages. This way, each component can be independently developed and tested, and it is much easier to collaborate with others, if applicable. Of course, a clever programmer can always keep a modular structure in mind that will allow for automating the assembly, storage, reporting, or collating of outputs along the way.
With that said, I find Stata an excellent platform for reproducibility, in just about any output flavour you might want, along with clean output logs that aid in documenting your code. Closer Python integration may also bring better support for literate programming, but the foundational aspect is first and foremost reproducibility.
[...] it is much easier to borrow the computer science concept of separation of concerns, tackling distinct steps of a project in stages, and [...] assembling the relevant results and documentation together, either in the form of a manuscript, a study report, analysis and results datasets, a book of tables and figures, or a mix of these [...].
Basically, this is very similar to how I use a notebook (currently in RStudio, I am not able to achieve a similar flow in Jupyter Notebook or Jupyter Lab). I believe an important distinction is whether one prefers to have all the code in one file, even in a complex project. I definitely prefer to have all code in a single file, a file that mixes narrative text (written with markdown) and chunks of code, and some output from the code. I prefer a single file as a coder, and certainly as a reader!
But the code can be hidden, only shown if the reader requests to see it:
My current work uses narrative text and code to produce supplemental material and analyses reported in an article. That supplemental material comes as a single HTML file. Within that HTML file are plots and tables in addition to text, but no code: I have decided to hide all code. But at the top of the HTML document is a button. Push that button and you download the file that produced the supplemental material you are reading, including all the code for data management and all analyses. For me this is the ultimate solution for reproducibility.
I have a hard time understanding why anyone who has learned what advanced notebooks can do would still prefer coding long, complex projects in separate files. But we are different. Whatever approach one uses, we should encourage reproducibility and transparency.
In recent projects I've adopted a hybrid method. All code is in one do-file, but every piece of code is wrapped in a module block. This way, I can easily instruct the file to run a particular section and collapse any code blocks that are currently irrelevant. Indentation and naming allow for nesting of modules. I wrote a little helper program that automatically generates the bare-bones code structure (bookmark, if statement, log opening and closing). I am not sure yet whether the new Stata 17 bookmark functionality will be useful; so far, results are mixed.
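A minimal sketch of one such module block, assuming a local macro is used as the switch (names are illustrative, not the poster's actual helper program):

```stata
* One do-file, several collapsible "module" blocks.
* Set the local below to choose which section runs.
local module "analysis"    // e.g. "import", "clean", or "analysis"

if "`module'" == "analysis" {
    log using analysis.log, replace text
    * ... analysis code for this section goes here ...
    log close
}
```

Each block opens and closes its own log, so a single do-file still yields a separate, self-documenting output log per section.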
Sounds like I should review Statamarkdown once I get a copy of Stata 17. I also need to figure out how to work more gracefully with RStudio's notebook mode. I developed this mainly to write workshop documentation, and how-to's for our clients. Of the available tools, this is the one I use the most ... but I also spend about 40% of my time in R.
If you are mainly writing papers for publication, or developing slides for presentations, markstat (mentioned above) is probably the more fully developed tool. Or maybe the simpler tool, since you do not have to wade through all the knitr documentation that is relevant only to R.
If Stata now offers a kernel for Jupyter Notebooks, and you like to work in notebooks, this might actually be a greatly improved solution for many people.
I also wrote stmd for use directly from Stata, to make dyndoc easier to use. These are probably the least fully realized solutions. Does dyndoc have any enhancements for working with Python?
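For readers unfamiliar with dyndoc, a minimal sketch of a source file it can process (report text and numbers shown are illustrative):

```stata
Mean price in the auto data
===========================

<<dd_do: quietly>>
sysuse auto, clear
summarize price
<</dd_do>>

The mean price is <<dd_display: %6.1f r(mean)>>.
```

Running `dyndoc report.txt, replace` from Stata executes the code inside the dynamic tags and weaves the results into an HTML document, which is the same code-plus-narrative idea the notebook tools pursue.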
On a more general note, I think the advantage of separate files over all-in-one-file is the focus it enables, and the way it encourages you to set up clear waypoints to navigate from one file to the next. I hate it when clients come into my office with a file that is five screenfuls long and want help interpreting or debugging something in the middle. With demographers and Census data, it can take 10-15 minutes just to run the code up to where they have their question. But if it's an RA working for a professor who wants everything in one file ....
On the other hand, I find it really annoying to work with software like MPlus, where I can only have one problem per file.
It seems to me the sweet spot is often 3-5 files per research project or classroom session. But that's me.
Thanks, Doug Hemken. I gave up on Statamarkdown because it doesn't keep results in memory from chunk to chunk.
I think that without an outline that lets you easily jump between sections (RStudio style), notebooks quickly become incomprehensible. In a Stata do-file I need to keep things short.
I assume you know this, but I just wanted to point out that most of the issues with Mplus are resolved by using -runmplus- in Stata or MplusAutomation in R.