Dealing and struggling with Python within Stata

Doris Rivera

Join Date: Feb 2020
Posts: 172

Dealing and struggling with Python within Stata

20 Dec 2022, 06:12

Dear forum, I want to replicate a loop using Python within Stata. Why? Because I have more than 7 million obs and the loop in Stata is endless, so, I really hope that Python would be faster. I am using Stata 16 and have downloaded Python 3.10. Notice that I have never worked in Python, so I do not know nothing about it (reason why I want to do it from Stata).

Let’s keep it simple. I want to read the data and generate a new variable, to keep it in the data when the Python code ends. But I get errors I don`t understand (see below). Can you please help me to figure out how at least start Python from Stata and create something to know it is working? And then I plan to start playing to see if I manage to replicate the loop.

I will put the loop and a small dataex below just for if any of you think it can be easily replicated in Python code and want to help me.

Many thanks in advance!

Code:

. sum _all

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
citing_app~d |         27    3.07e+08    1.51e+08     208669   5.25e+08
cited_appl~d |         27    646.2963    262.2381        522       1186
citing_fir~1 |         27     1692219     1183887     140411    2931813
citing_fir~2 |          5     2114763     1767572     238661    4073560
citing_fir~3 |          1     3096471           .    3096471    3096471
-------------+---------------------------------------------------------
citing_fir~4 |          0
citing_fir~5 |          0
citing_fir~6 |          0
citing_fir~7 |          0
citing_fir~8 |          0
-------------+---------------------------------------------------------
citing_fir~9 |          0
citing_fi~10 |          0
cited_firm~1 |         27     2219700     1227168     140411    2931813
 citing_year |         27    2011.889     4.43182       2003       2020
         ccc |         27    8.444444    6.216191          1         20

. 
. python
----------------------------------------------- python (type end to exit) -----------------------------
>>> import pandas as pd
>>> from pandas import DataFrame as df
>>> df['vvv'] = df['citing_year'] + 7
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'type' object is not subscriptable
r(7102);
>>> end
-------------------------------------------------------------------------------------------------------

.

Code:

clear
input double(citing_appln_id cited_appln_id) float(citing_firm_id1 citing_firm_id2 citing_firm_id3 citing_firm_id4 citing_firm_id5 citing_firm_id6 citing_firm_id7 citing_firm_id8 citing_firm_id9 citing_firm_id10 cited_firm_id1 citing_year ccc)
273682949  522 2931813       .       . . . . . . . . 2931813 2010  1
315557296  522 1709046 2931813       . . . . . . . . 2931813 2010  2
274384595  522 2931813       .       . . . . . . . . 2931813 2011  3
333687126  522 2931813       .       . . . . . . . . 2931813 2012  4
334378122  522 2931813       .       . . . . . . . . 2931813 2012  5
336160680  522 2931813 4073560       . . . . . . . . 2931813 2013  6
337102271  522 2931813       .       . . . . . . . . 2931813 2013  7
337102281  522 2931813       .       . . . . . . . . 2931813 2013  8
352267359  522 2931813       .       . . . . . . . . 2931813 2013  9
379237343  522 1394838       .       . . . . . . . . 2931813 2013 10
380717448  522  296197       .       . . . . . . . . 2931813 2014 11
381189080  522 2931813       .       . . . . . . . . 2931813 2014 12
412500492  522  908975       .       . . . . . . . . 2931813 2015 13
415658469  522  908975       .       . . . . . . . . 2931813 2015 14
418707239  522 2931813       .       . . . . . . . . 2931813 2015 15
419016602  522 2931813       .       . . . . . . . . 2931813 2015 16
438831616  522  908975       .       . . . . . . . . 2931813 2016 17
439633449  522  908975       .       . . . . . . . . 2931813 2016 18
474527945  522  918273       .       . . . . . . . . 2931813 2018 19
524645129  522 2931813       .       . . . . . . . . 2931813 2020 20
 56840315  540  296791 3091119 3096471 . . . . . . .  296791 2009  1
273300774  540  296791       .       . . . . . . . .  296791 2010  2
   208669 1186  140411       .       . . . . . . . .  140411 2003  1
 15909980 1186  557078       .       . . . . . . . .  140411 2003  2
 15999323 1186  140411  238661       . . . . . . . .  140411 2003  3
 16031254 1186  981997       .       . . . . . . . .  140411 2004  4
336876707 1186  140411  238661       . . . . . . . .  140411 2011  5
end

Loop

Code:

ds citing_firm_id*
local nwords :  word count `r(varlist)'
display `nwords'
display wordcount("`r(varlist)'")

sum ccc if cited_firm_id1!=. & citing_firm_id1!=.
local max_k = r(max)
   
forvalues k = 1/`max_k'  { 
forvalues x = 1/`nwords' { 
forvalues y = 1/`nwords' {   

sort cited_appln_id citing_year citing_appln_id
bys cited_appln_id: gen count4_`x'_`y'_`k' = 1 if citing_firm_id`x'==citing_firm_id`y'[_n+`k'] & !missing(citing_firm_id`x',citing_firm_id`y'[_n+`k'])
}
}
egen self_cit_v_k`k' = rowtotal(count4_*), missing
drop count4_*
}

Tags: None

Announcement

Dealing and struggling with Python within Stata