Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Python Code in Stata Loop

    I need Stata to recognize that it's using Python code in a loop. Precisely, I want it to simply print "moo" and "oink"
    Code:
    cls
    
    
    loc sounds moo oink
    
    foreach i of loc sounds {
    
    python:
    from sfi import Macro
    
    
    ansound = Macro.getLocal('i')
    
    print(ansound)
    
    end
    }
    When I try this though, Stata simply reports the error code "Break". Why? How might I solve this? The below, however, works
    Code:
    loc i moo
    
    
    python:
    from sfi import Macro
    
    
    ansound = Macro.getLocal('i')
    
    print(ansound)
    
    end

  • #2
    Jared Greathouse
    Can I ask why you would want to do this? Consider the overhead of repeatedly calling out to the API and trying to reload a module that may still be loaded in the underlying Python interpreter. You could just as easily use a list comprehension in Python to do it all and call it once from Stata.

    Comment


    • #3
      wbuchanan Certainly, the explanation will be a little involved, but I'll explain fully when I'm home. Long story short for now though, I'm adapting the python syntax of this command to Stata, and I might need the user to specify their treated unit as an option, e.g.
      Code:
      unit(California)
      in the ado-code. I know it's sort of unclear at the moment, but it'll make, I hope, much more sense when I'm back home and can give a full explanation.

      To give a very short reason though, in synthetic control research, we oftentimes want to do placebo tests where we iteratively apply the estimation to all units and check and see if the one treated unit is the most extreme relative to its donors. This'll involve looping over the donors, where I'd internally call levelsof and feed each unit (e.g., California, Alabama) to the Python algorithm.

      Comment


      • #4
        I think wbuchanan has a point: build the list of arguments first, then pass that to Python. Something like

        Code:
        levelsof ... , local(the_levels)
        python : code_fetching_the_levels_from_Stata

        Concerning the error that you are getting, you cannot use end inside a loop. This has been discussed on several occasions, typically when calling Mata inside loops or when trying to define a program inside a loop. Typically, there are better options to tackle the underlying problem, but we have covered that point already.


        Edit: Aside from doing the looping (or using lists) in Python, better solutions involve defining the Python function(s) once, below your ado code inside the ado-file. Alternatively, store the Python code/script as a separate py-file that you distribute with your ado-file. In both cases, you call one (main) Python function or one Python scrip from inside the loop.
        Last edited by daniel klein; 20 Sep 2022, 23:41.

        Comment


        • #5
          daniel klein I agree, I thought of this solution last night before bed. Levelsof seems to be the sensible solution, and then passing that macro to a python list.

          Comment


          • #6
            Test using auto.dta:
            Code:
            sysuse auto, clear
            levelsof make, clean sep("|") local(levels)
            
            python : 
            
            from sfi import Macro
            levels = Macro.getLocal("levels").split("|")
            print(levels)
            
            end

            Comment


            • #7
              As an alternative to what Bjarte Aagnes suggested you could also:

              Code:
              sysuse auto.dta, clear
              python:
              from sfi import Data
              levels = set(Data.get(var = "make"))
              print(levels)
              end

              Comment


              • #8
                Ref #7: Nice to see a solution not using levelsof which can be slow for "large" data (the set can be made into a list). Below I add saving a Stata local resembling r(levels) from Python, and time four alternatives for two sizes of data (reporting mean time in seconds after 10 repetitions):
                Code:
                -------------------------------------------------- 
                      740,000     7,400,000
                --------------------------------------------------
                  10:  0.2358        2.6908  sfi Data.get
                  11:  0.1026        1.1268  sfi Frame.connect 
                  21:  2.5417       42.9631  levelsof
                  22:  0.1412        1.5253  frame
                Code:
                timer on 10
                
                python:
                from sfi import Data, Macro
                levels = sorted(list(set(Data.get(var = "make"))))
                Macro.setLocal("levels", ' '.join(f'"{w}"' for w in levels) ) 
                end
                
                timer off 10
                Code:
                timer on 11
                
                sort make 
                frame put make if make != make[_n-1] , into(make) 
                
                python:
                from sfi import Frame 
                make=Frame.connect("make")
                levels = sorted(list(set(make.get(var = "make"))))
                Macro.setLocal("levels", ' '.join(f'"{w}"' for w in levels) )  
                end
                
                timer off 11
                Code:
                timer on 21
                qui levelsof make
                timer off 21
                Code:
                timer on 22
                sort make 
                frame put make if make != make[_n-1] , into(make)   
                frame make : levelsof make
                timer off 22

                Comment


                • #9
                  Bjarte Aagnes this is really interesting. Out of curiosity, was the Python environment already initialized when you started running the timings? I’m asking mostly because there should be more overhead with creating a frame, moving data into the frame, and then reading from it versus reading the data directly. The last solution is especially surprising given how slow sort can be at times. Also for timer 11 above, you could just pass sorted(make.get(var = “make”)) as an argument for the list comprehension at the end since you had already addressed duplicate records when storing in the frame and the get method should return a list.

                  Comment


                  • #10
                    Much of this is quite interesting. To give the code I promised about why we'd want this, there exists a synthetic control algorithm that I may want to make available in Stata, by running its estimation in Python under the hood. The github source code is here, if you'd like to play with it yourself. A minimum worked example can be found here by one of the coauthors.
                    Code:
                    cd "E:\Robust SCM"
                    clear *
                    
                    loc timevar year
                    
                    
                    import delim "https://raw.githubusercontent.com/jehangiramjad/tslib/master/tests/testdata/basque.csv", clear
                    
                    drop if region == "Spain (Espana)"
                    
                    drop v1
                    
                    qui su `timevar'
                    
                    loc mintime = r(min)
                    
                    loc maxtime = r(max)
                    cls
                    
                    
                    loc int_time = 1975
                    
                    g rel = `timevar' - `int_time'
                    recast int rel
                    
                    qui su rel
                    
                    loc min = `r(min)'
                    
                    loc max = `r(max)'
                    di `max'
                    
                    sa rsc, replace
                    
                    cls
                    
                    loc treated "Basque Country (Pais Vasco)"
                    
                    loc outcome gdpcap
                    
                    loc panel regionname
                    cls
                    
                    // Beginning Python
                    
                    python:
                    import sys, os
                    from sfi import Macro
                    import numpy as np
                    import pandas as pd
                    import copy
                    
                    from tslib.src import tsUtils
                    from tslib.src.synthcontrol.syntheticControl import RobustSyntheticControl
                    from tslib.tests import testdata
                    
                    filename = 'rsc.dta' #Maybe just read the current dataframe as the dataset of interest?
                    
                    df = pd.read_stata(filename)
                    
                    yvar = Macro.getLocal('outcome')
                    
                    panelvar = Macro.getLocal('panel')
                    
                    minrel = df['rel'].min()
                    maxrel = df['rel'].max()+1
                    
                    pivot = df.pivot_table(values=yvar, index=panelvar, columns=['rel'])
                    
                    polframe = pd.DataFrame(pivot.to_records())
                    
                    allColumns = polframe.columns.values
                    
                    pd.set_option('display.max_columns', 10)
                    
                    states = list(np.unique(polframe[panelvar])) #Defines our panel
                    
                    years = np.delete(allColumns, [0]) #Defines our time, presented as time-to-treatment
                    
                    treat = Macro.getLocal('treated') #Our Treated Unit of Interest
                    
                    states.remove(treat)
                    
                    donors = states #Our 16 donors
                    
                    yearStart = minrel
                    yearTrainEnd = 0
                    yearTestEnd = maxrel
                    
                    p = 1
                    
                    
                    trainingYears = []
                    
                    for i in range(yearStart, yearTrainEnd, 1):
                    
                     trainingYears.append(str(i)) # Defines our pre-intervention period
                    
                    testYears = []
                    
                    for i in range(yearTrainEnd, yearTestEnd, 1):
                    
                     testYears.append(str(i)) # Defines our post-intervention period
                    
                    trainDataMasterDict = {}
                    
                    trainDataDict = {}
                    
                    testDataDict = {}
                    
                    for key in donors:
                     series = polframe.loc[polframe[panelvar] == key] #Indexes our donors to their time periods
                    
                     trainDataMasterDict.update({key: series[trainingYears].values[0]})
                    
                     # randomly hide training data
                     (trainData, pObservation) = tsUtils.randomlyHideValues(copy.deepcopy(trainDataMasterDict[key]), p)
                    
                     trainDataDict.update({key: trainData})
                     testDataDict.update({key: series[testYears].values[0]})
                    
                    series = polframe[polframe[panelvar] == treat]
                    trainDataMasterDict.update({treat: series[trainingYears].values[0]})
                    
                    trainDataDict.update({treat: series[trainingYears].values[0]})
                    testDataDict.update({treat: series[testYears].values[0]})
                    
                    trainMasterDF = pd.DataFrame(data=trainDataMasterDict)
                    
                    # Makes dataframes from our two arrays, keyed to the dictionaries defined above, pre/post
                    
                    trainDF = pd.DataFrame(data=trainDataDict)
                    
                    testDF = pd.DataFrame(data=testDataDict)
                    
                    
                    singvals = 2
                    
                    # !! The all important estimation
                    rscModel = RobustSyntheticControl(treat, singvals, len(trainDF), probObservation=1.0, modelType='svd', svdMethod='numpy', otherSeriesKeysArray=donors)
                    rscModel.fit(trainDF)
                    denoisedDF = rscModel.model.denoisedDF()
                    
                    predictions = []
                    predictions = np.dot(testDF[donors], rscModel.model.weights)
                    actual = polframe.loc[polframe[panelvar] == treat]
                    actual = actual.drop(panelvar, axis=1)
                    actual = actual.iloc[0]
                    model_fit = np.dot(trainDF[donors][:], rscModel.model.weights)
                    
                    #rscModel.model.weights
                    
                    combined = np.concatenate((model_fit, predictions)) # Combines our pre/post model predctions
                    
                    data = np.vstack([actual,combined])
                    
                    data = np.swapaxes(data,0,1)
                    
                    np.savetxt(treat+".csv", data, delimiter=",") #Saves this to a csv file
                    
                    
                    end
                    // ends Python
                    
                    // Back to Stata
                    
                    erase rsc.dta
                    
                    import delimited "E:\Robust SCM\\`treated'.csv", clear
                    
                    egen `timevar' = seq(), f(`mintime') t(`maxtime')
                    
                    g relative = `timevar' -`int_time'
                    
                    rename (v1 v2) (real rsc_cf)
                    
                    tsset `timevar', y
                    
                    g diff = real-rsc
                    
                    su diff if rel >=0, mean
                    
                    loc ATT: di %6.4g r(mean)
                    
                    cls
                    
                    twoway (tsline real, lcolor(black)) (tsline rsc_cf, lcolor(red)), ///
                    legend(order(1 "`treated'" 2 "RSC `treated'") ///
                    ring(0) pos(9)) xli(`int_time', lcol(black) lpat(solid)) caption("ATT = `ATT'")
                    The difficulty of course would come when under the hood I loop over all 17 units to do in-space placebo studies... But considering that I can do this in Python, or pass the macro from levelsof to Python below, this would make the Stata loop unnecessary since Python's list comprehension can easily do this.

                    Comment


                    • #11
                      Re #9: In #8 the Python environment was initialized once at start of first timing of timer "10". Below timings of modified code are reported running -python clear- before Python code to be timed. Also, a preserve/restore wraps each part to be timed, and timer 11 is modified according to suggestion in #9.
                      Code:
                      --------------------------------------------------
                            740,000     7,400,000
                      --------------------------------------------------
                        10:   0.1466    1.4561    sfi Data.get
                        11:   0.2648    3.8284    sfi Frame.connect
                        21:   1.6249   27.4899    levelsof
                        22:   0.2674    3.8361    frame
                      code

                      (Stata/MP 17.0 for Windows (64-bit x86-64) MP16 running on 8 processors, Intel64 Family 6 Model 142 Stepping 10 GenuineIntel ~1600 Mhz)
                      Last edited by Bjarte Aagnes; 24 Sep 2022, 08:08.

                      Comment

                      Working...
                      X