Python Code in Stata Loop

Jared Greathouse

Join Date: Sep 2021

Posts: 2172
#1

Python Code in Stata Loop

20 Sep 2022, 12:15

I need Stata to recognize that it's using Python code in a loop. Precisely, I want it to simply print "moo" and "oink"

Code:

cls loc sounds moo oink foreach i of loc sounds { python: from sfi import Macro ansound = Macro.getLocal('i') print(ansound) end }

When I try this though, Stata simply reports the error code "Break". Why? How might I solve this? The below, however, works

Code:

loc i moo python: from sfi import Macro ansound = Macro.getLocal('i') print(ansound) end
Tags: None
wbuchanan

Join Date: Mar 2014

Posts: 1362
#2

20 Sep 2022, 14:42

Jared Greathouse
Can I ask why you would want to do this? Consider the overhead of repeatedly calling out to the API and trying to reload a module that may still be loaded in the underlying Python interpreter. You could just as easily use a list comprehension in Python to do it all and call it once from Stata.
Comment
Jared Greathouse

Join Date: Sep 2021

Posts: 2172
#3

20 Sep 2022, 15:07

wbuchanan Certainly, the explanation will be a little involved, but I'll explain fully when I'm home. Long story short for now though, I'm adapting the python syntax of this command to Stata, and I might need the user to specify their treated unit as an option, e.g.

Code:

unit(California)

in the ado-code. I know it's sort of unclear at the moment, but it'll make, I hope, much more sense when I'm back home and can give a full explanation.

To give a very short reason though, in synthetic control research, we oftentimes want to do placebo tests where we iteratively apply the estimation to all units and check and see if the one treated unit is the most extreme relative to its donors. This'll involve looping over the donors, where I'd internally call levelsof and feed each unit (e.g., California, Alabama) to the Python algorithm.
Comment
daniel klein

Join Date: Mar 2014

Posts: 3911
#4

20 Sep 2022, 23:31

I think wbuchanan has a point: build the list of arguments first, then pass that to Python. Something like

Code:

levelsof ... , local(the_levels) python : code_fetching_the_levels_from_Stata

Concerning the error that you are getting, you cannot use end inside a loop. This has been discussed on several occasions, typically when calling Mata inside loops or when trying to define a program inside a loop. Typically, there are better options to tackle the underlying problem, but we have covered that point already.

Edit: Aside from doing the looping (or using lists) in Python, better solutions involve defining the Python function(s) once, below your ado code inside the ado-file. Alternatively, store the Python code/script as a separate py-file that you distribute with your ado-file. In both cases, you call one (main) Python function or one Python scrip from inside the loop.

Last edited by daniel klein; 20 Sep 2022, 23:41.
1 like
Comment
Jared Greathouse

Join Date: Sep 2021

Posts: 2172
#5

21 Sep 2022, 08:56

daniel klein I agree, I thought of this solution last night before bed. Levelsof seems to be the sensible solution, and then passing that macro to a python list.
Comment

Bjarte Aagnes

Join Date: Apr 2014
Posts: 786

22 Sep 2022, 00:51

Test using auto.dta:

Code:

sysuse auto, clear
levelsof make, clean sep("|") local(levels)

python : 

from sfi import Macro
levels = Macro.getLocal("levels").split("|")
print(levels)

end

Comment

wbuchanan

Join Date: Mar 2014

Posts: 1362
#7

22 Sep 2022, 12:28

As an alternative to what Bjarte Aagnes suggested you could also:

Code:

sysuse auto.dta, clear python: from sfi import Data levels = set(Data.get(var = "make")) print(levels) end
Comment

Bjarte Aagnes

Join Date: Apr 2014
Posts: 786

24 Sep 2022, 05:01

Ref #7: Nice to see a solution not using levelsof which can be slow for "large" data (the set can be made into a list). Below I add saving a Stata local resembling r(levels) from Python, and time four alternatives for two sizes of data (reporting mean time in seconds after 10 repetitions):

Code:

-------------------------------------------------- 
      740,000     7,400,000
--------------------------------------------------
  10:  0.2358        2.6908  sfi Data.get
  11:  0.1026        1.1268  sfi Frame.connect 
  21:  2.5417       42.9631  levelsof
  22:  0.1412        1.5253  frame

Code:

timer on 10

python:
from sfi import Data, Macro
levels = sorted(list(set(Data.get(var = "make"))))
Macro.setLocal("levels", ' '.join(f'"{w}"' for w in levels) ) 
end

timer off 10

Code:

timer on 11

sort make 
frame put make if make != make[_n-1] , into(make) 

python:
from sfi import Frame 
make=Frame.connect("make")
levels = sorted(list(set(make.get(var = "make"))))
Macro.setLocal("levels", ' '.join(f'"{w}"' for w in levels) )  
end

timer off 11

Code:

timer on 21
qui levelsof make
timer off 21

Code:

timer on 22
sort make 
frame put make if make != make[_n-1] , into(make)   
frame make : levelsof make
timer off 22

Comment

wbuchanan

Join Date: Mar 2014

Posts: 1362
#9

24 Sep 2022, 07:04

Bjarte Aagnes this is really interesting. Out of curiosity, was the Python environment already initialized when you started running the timings? I’m asking mostly because there should be more overhead with creating a frame, moving data into the frame, and then reading from it versus reading the data directly. The last solution is especially surprising given how slow sort can be at times. Also for timer 11 above, you could just pass sorted(make.get(var = “make”)) as an argument for the list comprehension at the end since you had already addressed duplicate records when storing in the frame and the get method should return a list.
Comment

Jared Greathouse

Join Date: Sep 2021
Posts: 2172

#10

24 Sep 2022, 07:27

Much of this is quite interesting. To give the code I promised about why we'd want this, there exists a synthetic control algorithm that I may want to make available in Stata, by running its estimation in Python under the hood. The github source code is here, if you'd like to play with it yourself. A minimum worked example can be found here by one of the coauthors.

Code:

cd "E:\Robust SCM"
clear *

loc timevar year


import delim "https://raw.githubusercontent.com/jehangiramjad/tslib/master/tests/testdata/basque.csv", clear

drop if region == "Spain (Espana)"

drop v1

qui su `timevar'

loc mintime = r(min)

loc maxtime = r(max)
cls


loc int_time = 1975

g rel = `timevar' - `int_time'
recast int rel

qui su rel

loc min = `r(min)'

loc max = `r(max)'
di `max'

sa rsc, replace

cls

loc treated "Basque Country (Pais Vasco)"

loc outcome gdpcap

loc panel regionname
cls

// Beginning Python

python:
import sys, os
from sfi import Macro
import numpy as np
import pandas as pd
import copy

from tslib.src import tsUtils
from tslib.src.synthcontrol.syntheticControl import RobustSyntheticControl
from tslib.tests import testdata

filename = 'rsc.dta' #Maybe just read the current dataframe as the dataset of interest?

df = pd.read_stata(filename)

yvar = Macro.getLocal('outcome')

panelvar = Macro.getLocal('panel')

minrel = df['rel'].min()
maxrel = df['rel'].max()+1

pivot = df.pivot_table(values=yvar, index=panelvar, columns=['rel'])

polframe = pd.DataFrame(pivot.to_records())

allColumns = polframe.columns.values

pd.set_option('display.max_columns', 10)

states = list(np.unique(polframe[panelvar])) #Defines our panel

years = np.delete(allColumns, [0]) #Defines our time, presented as time-to-treatment

treat = Macro.getLocal('treated') #Our Treated Unit of Interest

states.remove(treat)

donors = states #Our 16 donors

yearStart = minrel
yearTrainEnd = 0
yearTestEnd = maxrel

p = 1


trainingYears = []

for i in range(yearStart, yearTrainEnd, 1):

 trainingYears.append(str(i)) # Defines our pre-intervention period

testYears = []

for i in range(yearTrainEnd, yearTestEnd, 1):

 testYears.append(str(i)) # Defines our post-intervention period

trainDataMasterDict = {}

trainDataDict = {}

testDataDict = {}

for key in donors:
 series = polframe.loc[polframe[panelvar] == key] #Indexes our donors to their time periods

 trainDataMasterDict.update({key: series[trainingYears].values[0]})

 # randomly hide training data
 (trainData, pObservation) = tsUtils.randomlyHideValues(copy.deepcopy(trainDataMasterDict[key]), p)

 trainDataDict.update({key: trainData})
 testDataDict.update({key: series[testYears].values[0]})

series = polframe[polframe[panelvar] == treat]
trainDataMasterDict.update({treat: series[trainingYears].values[0]})

trainDataDict.update({treat: series[trainingYears].values[0]})
testDataDict.update({treat: series[testYears].values[0]})

trainMasterDF = pd.DataFrame(data=trainDataMasterDict)

# Makes dataframes from our two arrays, keyed to the dictionaries defined above, pre/post

trainDF = pd.DataFrame(data=trainDataDict)

testDF = pd.DataFrame(data=testDataDict)


singvals = 2

# !! The all important estimation
rscModel = RobustSyntheticControl(treat, singvals, len(trainDF), probObservation=1.0, modelType='svd', svdMethod='numpy', otherSeriesKeysArray=donors)
rscModel.fit(trainDF)
denoisedDF = rscModel.model.denoisedDF()

predictions = []
predictions = np.dot(testDF[donors], rscModel.model.weights)
actual = polframe.loc[polframe[panelvar] == treat]
actual = actual.drop(panelvar, axis=1)
actual = actual.iloc[0]
model_fit = np.dot(trainDF[donors][:], rscModel.model.weights)

#rscModel.model.weights

combined = np.concatenate((model_fit, predictions)) # Combines our pre/post model predctions

data = np.vstack([actual,combined])

data = np.swapaxes(data,0,1)

np.savetxt(treat+".csv", data, delimiter=",") #Saves this to a csv file


end
// ends Python

// Back to Stata

erase rsc.dta

import delimited "E:\Robust SCM\\`treated'.csv", clear

egen `timevar' = seq(), f(`mintime') t(`maxtime')

g relative = `timevar' -`int_time'

rename (v1 v2) (real rsc_cf)

tsset `timevar', y

g diff = real-rsc

su diff if rel >=0, mean

loc ATT: di %6.4g r(mean)

cls

twoway (tsline real, lcolor(black)) (tsline rsc_cf, lcolor(red)), ///
legend(order(1 "`treated'" 2 "RSC `treated'") ///
ring(0) pos(9)) xli(`int_time', lcol(black) lpat(solid)) caption("ATT = `ATT'")

The difficulty of course would come when under the hood I loop over all 17 units to do in-space placebo studies... But considering that I can do this in Python, or pass the macro from levelsof to Python below, this would make the Stata loop unnecessary since Python's list comprehension can easily do this.

Comment

Bjarte Aagnes

Join Date: Apr 2014

Posts: 786
#11

24 Sep 2022, 07:58

Re #9: In #8 the Python environment was initialized once at start of first timing of timer "10". Below timings of modified code are reported running -python clear- before Python code to be timed. Also, a preserve/restore wraps each part to be timed, and timer 11 is modified according to suggestion in #9.

Code:

-------------------------------------------------- 740,000 7,400,000 -------------------------------------------------- 10: 0.1466 1.4561 sfi Data.get 11: 0.2648 3.8284 sfi Frame.connect 21: 1.6249 27.4899 levelsof 22: 0.2674 3.8361 frame

code

(Stata/MP 17.0 for Windows (64-bit x86-64) MP16 running on 8 processors, Intel64 Family 6 Model 142 Stepping 10 GenuineIntel ~1600 Mhz)

Last edited by Bjarte Aagnes; 24 Sep 2022, 08:08.
Comment

Announcement