Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Dealing and struggling with Python within Stata

    Dear forum, I want to replicate a loop using Python within Stata. Why? Because I have more than 7 million obs and the loop in Stata is endless, so, I really hope that Python would be faster. I am using Stata 16 and have downloaded Python 3.10. Notice that I have never worked in Python, so I do not know nothing about it (reason why I want to do it from Stata).

    Let’s keep it simple. I want to read the data and generate a new variable, to keep it in the data when the Python code ends. But I get errors I don`t understand (see below). Can you please help me to figure out how at least start Python from Stata and create something to know it is working? And then I plan to start playing to see if I manage to replicate the loop.

    I will put the loop and a small dataex below just for if any of you think it can be easily replicated in Python code and want to help me.

    Many thanks in advance!

    Code:
    . sum _all
    
        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
    citing_app~d |         27    3.07e+08    1.51e+08     208669   5.25e+08
    cited_appl~d |         27    646.2963    262.2381        522       1186
    citing_fir~1 |         27     1692219     1183887     140411    2931813
    citing_fir~2 |          5     2114763     1767572     238661    4073560
    citing_fir~3 |          1     3096471           .    3096471    3096471
    -------------+---------------------------------------------------------
    citing_fir~4 |          0
    citing_fir~5 |          0
    citing_fir~6 |          0
    citing_fir~7 |          0
    citing_fir~8 |          0
    -------------+---------------------------------------------------------
    citing_fir~9 |          0
    citing_fi~10 |          0
    cited_firm~1 |         27     2219700     1227168     140411    2931813
     citing_year |         27    2011.889     4.43182       2003       2020
             ccc |         27    8.444444    6.216191          1         20
    
    . 
    . python
    ----------------------------------------------- python (type end to exit) -----------------------------
    >>> import pandas as pd
    >>> from pandas import DataFrame as df
    >>> df['vvv'] = df['citing_year'] + 7
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: 'type' object is not subscriptable
    r(7102);
    >>> end
    -------------------------------------------------------------------------------------------------------
    
    .
    Code:
    clear
    input double(citing_appln_id cited_appln_id) float(citing_firm_id1 citing_firm_id2 citing_firm_id3 citing_firm_id4 citing_firm_id5 citing_firm_id6 citing_firm_id7 citing_firm_id8 citing_firm_id9 citing_firm_id10 cited_firm_id1 citing_year ccc)
    273682949  522 2931813       .       . . . . . . . . 2931813 2010  1
    315557296  522 1709046 2931813       . . . . . . . . 2931813 2010  2
    274384595  522 2931813       .       . . . . . . . . 2931813 2011  3
    333687126  522 2931813       .       . . . . . . . . 2931813 2012  4
    334378122  522 2931813       .       . . . . . . . . 2931813 2012  5
    336160680  522 2931813 4073560       . . . . . . . . 2931813 2013  6
    337102271  522 2931813       .       . . . . . . . . 2931813 2013  7
    337102281  522 2931813       .       . . . . . . . . 2931813 2013  8
    352267359  522 2931813       .       . . . . . . . . 2931813 2013  9
    379237343  522 1394838       .       . . . . . . . . 2931813 2013 10
    380717448  522  296197       .       . . . . . . . . 2931813 2014 11
    381189080  522 2931813       .       . . . . . . . . 2931813 2014 12
    412500492  522  908975       .       . . . . . . . . 2931813 2015 13
    415658469  522  908975       .       . . . . . . . . 2931813 2015 14
    418707239  522 2931813       .       . . . . . . . . 2931813 2015 15
    419016602  522 2931813       .       . . . . . . . . 2931813 2015 16
    438831616  522  908975       .       . . . . . . . . 2931813 2016 17
    439633449  522  908975       .       . . . . . . . . 2931813 2016 18
    474527945  522  918273       .       . . . . . . . . 2931813 2018 19
    524645129  522 2931813       .       . . . . . . . . 2931813 2020 20
     56840315  540  296791 3091119 3096471 . . . . . . .  296791 2009  1
    273300774  540  296791       .       . . . . . . . .  296791 2010  2
       208669 1186  140411       .       . . . . . . . .  140411 2003  1
     15909980 1186  557078       .       . . . . . . . .  140411 2003  2
     15999323 1186  140411  238661       . . . . . . . .  140411 2003  3
     16031254 1186  981997       .       . . . . . . . .  140411 2004  4
    336876707 1186  140411  238661       . . . . . . . .  140411 2011  5
    end
    Loop
    Code:
    ds citing_firm_id*
    local nwords :  word count `r(varlist)'
    display `nwords'
    display wordcount("`r(varlist)'")
    
    sum ccc if cited_firm_id1!=. & citing_firm_id1!=.
    local max_k = r(max)
       
    forvalues k = 1/`max_k'  { 
    forvalues x = 1/`nwords' { 
    forvalues y = 1/`nwords' {   
    
    sort cited_appln_id citing_year citing_appln_id
    bys cited_appln_id: gen count4_`x'_`y'_`k' = 1 if citing_firm_id`x'==citing_firm_id`y'[_n+`k'] & !missing(citing_firm_id`x',citing_firm_id`y'[_n+`k'])
    }
    }
    egen self_cit_v_k`k' = rowtotal(count4_*), missing
    drop count4_*
    }
Working...
X