  • Python, Mata, and Stata v16

    Colleagues: With the release of v16 and its Python-interface capabilities I'm wondering if I should now consider learning some Python.

    However, as a dedicated Mata user/programmer for many years I'm also wondering what might be some of the main advantages of Python over Mata.

    If you are experienced in both languages, would you be willing to suggest what might be some such advantages? Or perhaps point me to any relevant links that might provide such information?

    Thanks in advance.

  • #2
    John Mullahy
    There are some fairly extensive Python libraries for all sorts of scientific computing tasks. I would say one of the biggest benefits would probably be leveraging those existing libraries. There are also some lower-level interfaces to C, Java, and/or other languages that may be a bit easier to work with for some developers.



    • #3
      Disclaimer: I have used Python since around 2001, but I have only used numpy/scipy/pandas/matplotlib on a few occasions. I started my professional statistical work at the French statistical institute in 2009, working mainly with SAS, now mainly with R and, since 2017, Stata. I use Python for data pre- and post-processing, data cleansing and file utilities. All the "true" statistical stuff (data analysis, modeling, inference) is currently done either in R or Stata, preferably Stata. All the following is my point of view. Younger Python users likely started with pandas for data science, and would probably have another point of view.

      A few things that come to mind. Note there is some overlap with Stata/Mata, but also many capabilities not found, or not easily done, in Mata.

      Python is a general purpose programming language: its primary goal was not statistics or data science, but easy teaching and learning. However, it's used for web programming, data science, as a scripting language (for QGIS and ArcGIS), as "glue code" for larger projects that include many parts in C or Fortran (GUI in Python, computations in C/F90)...

      Some features (some obscure terms can be found in tutorials or simply with Google):

      * a matrix language provided by the numpy package, similar to Mata, but with multidimensional arrays (numpy uses LAPACK, like Mata, Matlab or R). With some additional packages it provides dataframe manipulation (pandas), modeling (statsmodels), machine learning (scikit-learn). There is no builtin dataframe file storage format (there is pickle, to store almost anything, but it's not widespread for this purpose). However pandas can read/write Stata dta, among others.
      * 2D and 3D graphics with matplotlib.

      The next 6 are the main reasons I like Python:
      * list, dictionary and set data structures, with "comprehensions" (like for sets in mathematics), which allow for very clean and short code.
      * generators: a way to generate data on the fly, usually from a large (implicit) data structure. For instance, a generator that returns all permutations of a set, one at a time.
      * functional programming: a function g can be defined inside a function f, depending on the parameters of f, and returned. Python also has closures (basically such an "inner" function has its own namespace each time you produce one instance by calling f).
      * exception handling (try/except)
      * namespaces: each module has its own namespace, which reduces the risk of name clash. A module is simply a .py file, with class or functions definitions or other declarations. Thus it's very easy to write a reusable library.
      * very good string handling, encoding conversion, regular expressions. Python 3 uses Unicode for all strings.
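
      To make a few of these features concrete, here is a minimal sketch using only the standard library. The city names and the helper names (permutations_of, make_scaler, to_float) are made up for illustration; they are not from any particular project.

```python
import re
from itertools import permutations

# Comprehensions: build a list, a set and a dictionary in one line each.
cities = ["st denis", "saint denis", "st malo", "paris"]
lengths = [len(name) for name in cities]
first_words = {name.split()[0] for name in cities}
# Regular expression: expand a leading "st" into "saint".
normalized = {name: re.sub(r"^st\b", "saint", name) for name in cities}


# Generator: yields one permutation at a time instead of building them all.
def permutations_of(items):
    for p in permutations(items):
        yield p


# Closure: each call to make_scaler returns a new function
# that remembers its own value of `factor`.
def make_scaler(factor):
    def scale(x):
        return factor * x
    return scale


# Exception handling: convert text to a number, with a fallback on bad input.
def to_float(text, default=None):
    try:
        return float(text)
    except ValueError:
        return default


double = make_scaler(2)
triple = make_scaler(3)
first_perm = next(permutations_of([1, 2, 3]))
```

      Here double(5) gives 10 and triple(5) gives 15, since each closure keeps its own factor, and to_float("n/a") falls back to None instead of stopping the program.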

      * several modules to read and write text data (json, yaml, avro...)
      * several modules to easily read and write arbitrary binary data and decode basic types and arrays of basic types (integers, floats)
      * read/write Excel files: see http://www.python-excel.org/
      * big integers, and a package for big floats (mpmath), and some computer algebra (sympy). Note that all Python integers are big integers. However numpy and pandas provide fixed-width types (8/16/32/64 bits) for optimized storage.
      * graphical user interfaces with either Tk (builtin) or WxWidgets or Qt
      * threads, but with some drawbacks: the standard interpreter (CPython) has a so-called "global interpreter lock", which lets only one thread execute Python bytecode at a time.
      * easy to call a DLL function with the ctypes module, or to build plugins in C or Fortran (with f2py). There is also SWIG to create plugins almost automatically from C code. A plugin usually provides a set of compiled functions.
      * the pywin32 project offers among others a COM client, which means you can connect to Stata, SAS and Excel, and many others, on Windows, and control them.
      * networking capabilities (either client or server, with sockets, and builtin libraries for ftp or http connections)
      * graph theory: https://networkx.github.io/
      * pip to download and upgrade packages: similar to ssc and cran, but like them not everything is in pip.
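
      As an illustration of two items in this list (big integers and decoding binary data), here is a short sketch with the standard struct module. The record layout (two 32-bit integers followed by one 64-bit float, little-endian) is an arbitrary choice for the example.

```python
import struct

# Python integers have arbitrary precision: no overflow past 64 bits.
big = 2 ** 100
factorial_30 = 1
for k in range(1, 31):
    factorial_30 *= k

# Pack two 32-bit integers and one 64-bit float into bytes,
# little-endian ("<", no padding), then decode them again.
record = struct.pack("<iid", 7, -1, 2.5)
decoded = struct.unpack("<iid", record)
record_size = len(record)  # 4 + 4 + 8 bytes
```

      The same format string can be used with struct.unpack on bytes read from a file opened in binary mode, which is usually enough to decode simple fixed-length records.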

      There are also a few drawbacks
      * lack of true missing values. There is NaN (used especially with numpy), but representing missing data is not its primary purpose.
      * I don't think pandas allows labels
      * very limited modeling capabilities (far behind Stata and R)
      * Python is free, which means: support only via forums, and additional packages have no globally consistent structure. However, in my experience it's less of a problem than with R, as Python has widely accepted conventions: https://www.python.org/dev/peps/pep-0008/
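
      A small sketch of the missing value issue, using only the standard library and assuming NaN is used as the stand-in for a missing observation. The sample values are made up.

```python
import math

values = [2.0, math.nan, 4.0]

# NaN propagates silently through arithmetic...
total_with_nan = sum(values)

# ...and is not equal to anything, including itself.
nan_equals_itself = (math.nan == math.nan)

# So missing observations have to be filtered out explicitly.
observed = [v for v in values if not math.isnan(v)]
mean_observed = sum(observed) / len(observed)
```

      Here total_with_nan is itself NaN, while mean_observed is 3.0. numpy and pandas offer helpers for this (for instance numpy.nanmean, or the skipna argument of pandas reductions), but the point remains that NaN is a floating-point value rather than a dedicated missing code.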

      Regarding speed, my experience on "pure" loops (without underlying vectorized computations with array syntax) is that Python and Mata are roughly equivalent, with Python a bit faster (not more than 2x). With array syntax they should be equivalent, unless you use a numpy built with Intel MKL, which should be much faster (as it's parallelized).

      My main use of Python related to statistical work, to give an idea:

      * utilities: find and delete duplicate files, compute and check file hashes, look for string or other data in files (similar to the well-known Linux 'grep' utility), find differences between files, hex dump, etc.
      * read and transform text data: usually if it fits in memory I prefer Stata, but if it doesn't I use Python to read line by line, and for instance split the file.
      * producing complex "tabulate" output on many files, with html or Excel output (when there are weird subtotals patterns or overlapping data between result cells - such as data from overlapping geographic regions).
      * automatic reporting: reading data from Excel, producing several Excel files, including formatting and graphs (using COM).
      * preparing data for Stata: medical emergency data, with many duplicates, badly encoded dates, etc. (heavy use of dictionaries)
      * cleaning city names (st/saint problems and others): heavy use of dictionaries together with manual corrections
      * extracting data from Excel files, sometimes with some alignment or naming problems (on one occasion the Python program produced a Stata program which imported the files).
      * scanning a CSV-like file to check for problems (remove bad characters, check separator in fields, etc.), and to extract bad rows in order to deal with them by hand.
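
      For the "read line by line and split the file" case mentioned in this list, a minimal sketch could look like the following. The function name split_text_file, the chunk file naming and the UTF-8 encoding are assumptions for the example, and the temporary file only stands in for a real large file.

```python
import os
import tempfile


def split_text_file(path, lines_per_chunk, out_dir):
    """Split a text file into pieces of at most lines_per_chunk lines.

    Reads one line at a time, so the whole file never has to fit in memory.
    Returns the list of chunk paths, in order.
    """
    chunk_paths = []
    out = None
    with open(path, "r", encoding="utf-8") as src:
        for line_number, line in enumerate(src):
            # Start a new chunk every lines_per_chunk lines.
            if line_number % lines_per_chunk == 0:
                if out is not None:
                    out.close()
                name = f"chunk_{len(chunk_paths):04d}.txt"
                chunk_path = os.path.join(out_dir, name)
                out = open(chunk_path, "w", encoding="utf-8")
                chunk_paths.append(chunk_path)
            out.write(line)
    if out is not None:
        out.close()
    return chunk_paths


def count_lines(path):
    """Count lines without loading the file into memory."""
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for _ in f)


# Small demonstration on a 10-line temporary file, split into chunks of 4.
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "big.txt")
    with open(source, "w", encoding="utf-8") as f:
        f.writelines(f"row {i}\n" for i in range(10))
    chunks = split_text_file(source, 4, tmp)
    chunk_sizes = [count_lines(p) for p in chunks]
```

      Each resulting chunk can then be imported into Stata separately. A file with a header row would need the header copied into every chunk, which this sketch does not do.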

      There have been more "casual" uses: renaming a bunch of files, looking up files on a disk to find a specific one hidden among hundreds of gigabytes. For these it's often "one shot" code less than 10 lines.
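
      To give an idea of what such "one shot" code looks like, here is a sketch that searches a directory tree for files matching a name pattern. The helper name find_files and the example file names are hypothetical, and the temporary directory only stands in for a real disk.

```python
import tempfile
from pathlib import Path


def find_files(root, pattern):
    """Return all files under root whose name matches a glob pattern, sorted."""
    return sorted(p for p in Path(root).rglob(pattern) if p.is_file())


# Demonstration on a small temporary directory tree.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a" / "b").mkdir(parents=True)
    (Path(tmp) / "a" / "b" / "survey_2019.dta").write_text("x")
    (Path(tmp) / "a" / "notes.txt").write_text("y")
    found_names = [p.name for p in find_files(tmp, "*.dta")]
```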

      A few resources to learn Python (just a suggestion, there are many tutorials online)

      The bible of Python builtin modules: https://docs.python.org/3/py-modindex.html
      https://www.tutorialspoint.com/python/
      Stack Overflow: https://stackoverflow.com/questions/tagged/python (the community is said to be a bit rude to newcomers, but the answers are in general of high quality).
      Rosetta Code (tasks written in many languages, including Python and Stata - disclaimer: so far I wrote all the Stata solutions, and I'm the only Stata user there). Useful to see how to do in a language what you understand in another one. There are only a handful of statistical tasks, it's mainly about "programming in general". See https://rosettacode.org/wiki/Category:Python and https://rosettacode.org/wiki/Category:Stata

      I hope this wasn't too long
      Last edited by Jean-Claude Arbaut; 28 Jun 2019, 14:11.



      • #4
        Thank you very much, William and Jean-Claude.
