Dear forum, I want to replicate a loop using Python within Stata. Why? Because I have more than 7 million obs and the loop in Stata is endless, so, I really hope that Python would be faster. I am using Stata 16 and have downloaded Python 3.10. Notice that I have never worked in Python, so I do not know nothing about it (reason why I want to do it from Stata).
Let’s keep it simple. I want to read the data and generate a new variable, to keep it in the data when the Python code ends. But I get errors I don`t understand (see below). Can you please help me to figure out how at least start Python from Stata and create something to know it is working? And then I plan to start playing to see if I manage to replicate the loop.
I will put the loop and a small dataex below just for if any of you think it can be easily replicated in Python code and want to help me.
Many thanks in advance!
Loop
Let’s keep it simple. I want to read the data and generate a new variable, to keep it in the data when the Python code ends. But I get errors I don`t understand (see below). Can you please help me to figure out how at least start Python from Stata and create something to know it is working? And then I plan to start playing to see if I manage to replicate the loop.
I will put the loop and a small dataex below just for if any of you think it can be easily replicated in Python code and want to help me.
Many thanks in advance!
Code:
. sum _all Variable | Obs Mean Std. Dev. Min Max -------------+--------------------------------------------------------- citing_app~d | 27 3.07e+08 1.51e+08 208669 5.25e+08 cited_appl~d | 27 646.2963 262.2381 522 1186 citing_fir~1 | 27 1692219 1183887 140411 2931813 citing_fir~2 | 5 2114763 1767572 238661 4073560 citing_fir~3 | 1 3096471 . 3096471 3096471 -------------+--------------------------------------------------------- citing_fir~4 | 0 citing_fir~5 | 0 citing_fir~6 | 0 citing_fir~7 | 0 citing_fir~8 | 0 -------------+--------------------------------------------------------- citing_fir~9 | 0 citing_fi~10 | 0 cited_firm~1 | 27 2219700 1227168 140411 2931813 citing_year | 27 2011.889 4.43182 2003 2020 ccc | 27 8.444444 6.216191 1 20 . . python ----------------------------------------------- python (type end to exit) ----------------------------- >>> import pandas as pd >>> from pandas import DataFrame as df >>> df['vvv'] = df['citing_year'] + 7 Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: 'type' object is not subscriptable r(7102); >>> end ------------------------------------------------------------------------------------------------------- .
Code:
clear input double(citing_appln_id cited_appln_id) float(citing_firm_id1 citing_firm_id2 citing_firm_id3 citing_firm_id4 citing_firm_id5 citing_firm_id6 citing_firm_id7 citing_firm_id8 citing_firm_id9 citing_firm_id10 cited_firm_id1 citing_year ccc) 273682949 522 2931813 . . . . . . . . . 2931813 2010 1 315557296 522 1709046 2931813 . . . . . . . . 2931813 2010 2 274384595 522 2931813 . . . . . . . . . 2931813 2011 3 333687126 522 2931813 . . . . . . . . . 2931813 2012 4 334378122 522 2931813 . . . . . . . . . 2931813 2012 5 336160680 522 2931813 4073560 . . . . . . . . 2931813 2013 6 337102271 522 2931813 . . . . . . . . . 2931813 2013 7 337102281 522 2931813 . . . . . . . . . 2931813 2013 8 352267359 522 2931813 . . . . . . . . . 2931813 2013 9 379237343 522 1394838 . . . . . . . . . 2931813 2013 10 380717448 522 296197 . . . . . . . . . 2931813 2014 11 381189080 522 2931813 . . . . . . . . . 2931813 2014 12 412500492 522 908975 . . . . . . . . . 2931813 2015 13 415658469 522 908975 . . . . . . . . . 2931813 2015 14 418707239 522 2931813 . . . . . . . . . 2931813 2015 15 419016602 522 2931813 . . . . . . . . . 2931813 2015 16 438831616 522 908975 . . . . . . . . . 2931813 2016 17 439633449 522 908975 . . . . . . . . . 2931813 2016 18 474527945 522 918273 . . . . . . . . . 2931813 2018 19 524645129 522 2931813 . . . . . . . . . 2931813 2020 20 56840315 540 296791 3091119 3096471 . . . . . . . 296791 2009 1 273300774 540 296791 . . . . . . . . . 296791 2010 2 208669 1186 140411 . . . . . . . . . 140411 2003 1 15909980 1186 557078 . . . . . . . . . 140411 2003 2 15999323 1186 140411 238661 . . . . . . . . 140411 2003 3 16031254 1186 981997 . . . . . . . . . 140411 2004 4 336876707 1186 140411 238661 . . . . . . . . 140411 2011 5 end
Code:
ds citing_firm_id* local nwords : word count `r(varlist)' display `nwords' display wordcount("`r(varlist)'") sum ccc if cited_firm_id1!=. & citing_firm_id1!=. local max_k = r(max) forvalues k = 1/`max_k' { forvalues x = 1/`nwords' { forvalues y = 1/`nwords' { sort cited_appln_id citing_year citing_appln_id bys cited_appln_id: gen count4_`x'_`y'_`k' = 1 if citing_firm_id`x'==citing_firm_id`y'[_n+`k'] & !missing(citing_firm_id`x',citing_firm_id`y'[_n+`k']) } } egen self_cit_v_k`k' = rowtotal(count4_*), missing drop count4_* }