  • nwset command - macro substitution results in line that is too long - Social Networking Analysis

    Greetings.

    I have been a long-time follower of this group, and the solutions given here have saved me from Stata-related trouble numerous times!

    Very recently, I have stepped into the world of Social Network Analysis. I have downloaded and am now using the nwcommands package for my analysis in Stata. My ultimate goal is to compute centrality measures (degree, eigenvector, betweenness, etc.) for my dataset.

    I am facing the following difficulty while declaring my dataset as "network data". I have arranged my data in edgelist (from, to) format. However, whenever I use the nwset command to declare my dataset as an edgelist, the following error occurs, stating that macro substitution results in a line that is too long.




    nwset from to, edgelist
    macro substitution results in line that is too long
    The line resulting from substituting macros would be longer than allowed. The maximum allowed length is 645,216 characters, which is calculated on the
    basis of set maxvar.

    You can change that in Stata/SE and Stata/MP. What follows is relevant only if you are using Stata/SE or Stata/MP.

    The maximum line length is defined as 16 more than the maximum macro length, which is currently 645,200 characters. Each unit increase in set maxvar
    increases the length maximums by 129. The maximum value of set maxvar is 32,767. Thus, the maximum line length may be set up to 4,227,159 characters if
    you set maxvar to its largest value.
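
    The figures in that message are internally consistent. Assuming the Stata/SE default of set maxvar 5000 (my assumption; the message does not state the current maxvar setting), the limits can be reproduced with a quick back-of-the-envelope check:

```python
def max_line_length(maxvar: int) -> int:
    # From the error text: each unit increase in maxvar adds 129
    # characters, the base offset works out to 200, and the maximum
    # line length is 16 more than the maximum macro length
    macro_limit = 129 * maxvar + 200
    return macro_limit + 16

print(max_line_length(5000))   # 645216, matching the reported limit
print(max_line_length(32767))  # 4227159, the documented maximum
```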



    I understand that the error might be occurring because I have too many observations in the edgelist format. What can I do to resolve the issue?
    A sample dataset is below. My original dataset has 8,894,287 observations.



    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input double(from to) float year
      204108357  16012019302 1975
      222589793   6082512704 1987
      403354690  11352773522 2008
      403404706  13282936484 1999
      403994789  10620041263 2016
      406515137 185652011391 2001
      406515137    439548594 1986
      409605538   8891308267 2000
      411165735 104643712683 1995
      413365994  29815110847 2004
      415556243 176628010694 1997
      716212921  52054011899 2009
      423827129  15408118739 2015
      425577311  15982009274 2000
      435888267 185655811391 2008
      437858447 185467811375 2002
      447119219 170498010188 1997
      450529492  15815239117 2001
      456979992  13709286990 2008
     4640910522    413766047 1994
     4988112851  13546786801 1963
      522695552 171451110273 2000
      523815693   8966208977 2014
      551428525  13884287186 2010
     5704210078  30074111047 1978
     5805710831  40483711023 2010
     5906011537   3384113972 2006
     5958911899  11021882719 1991
     5974512003  13664056944 2012
      617702774   7854348186 2005
      624534125  14529237875 1998
      624644144 169084110078 2016
      630044982   5381791951 1995
      636815873    639526192 1988
      842003607 172002710315 2010
      976327060  11782134345 1998
     1589927969  14069657389 1997
    20188011521  13627026896 2013
    20275111582   3724748267 2000
    20436511696  94291912619 2001
    20471711719  12485615436 1980
      222568378  13915317220 1969
     2232791076  93471012039 1993
     2236401263   9030059545 1966
     2244001664  15914529210 1992
     2254562062 192378111891 2014
     2258772201  11265793321 2010
     2263352333  63102312404 2011
     2269862518  11643044087 1995
     2283202880  13036746179 1980
     2298933251  30197011134 2004
     2302773321  20552511779 2016
     3512485927 172109510323 1997
     3512615927  10638031374 1994
     3565076571    516744740 2005
     3601386990  94114612497 2012
     3642357433  11733654255 2014
     3694167969  13850437152 1998
     3722978246   6658796371 1992
     3759408594  93772912259 2012
     3796278940    433198032 1996
    40183610791   4888379401 1984
     4764338267   5743767695 1998
     4790898516   8516093132 2013
     4838448958  15855149154 1992
     4845189024 169969510146 1989
     4908249572  11351893522 2012
     4916349643  15079818427 2008
    51637411598 179528710920 1981
    51643011598    429707705 1976
    52876612483   7732986896 1996
     5371371526  14065077389 1992
     5401872606 188090911575 2016
     5436083434  14124797455 2013
     5480934291  16070369347 1979
     5703167266  13292496497 1981
     5715347400  15995959283 2000
     5715417400  81709210928 2005
     5715517400 181389211063 2016
     5791238175 168686310044 2001
     5869008911  81779610983 2005
     6437062635 194518812047 2005
     6438302691  29743610791 2008
     6439352719 178188710815 2016
     6656106346 200191412454 2017
     6837958317  94500412760 2000
    80660710086 169196210086 2009
    80838210239 195665412128 2009
    94351912661  11391943607 1994
    10993562663 103481111861 2012
    11222143228 170925310230 1989
    11267963321   2275732691 1981
    11298133389  92475611305 1998
    11434613691    769598226 1996
    11464763753 188866411635 1990
    13214096409  16009239292 2003
    13598806861   5818010920 1995
    13841517141 171396210264 2005
    13882377186   5788298145 1985
    13976547288  52119311943 2011
    end
    ------------------ copy up to and including the previous line ------------------

    Listed 100 out of 8894287 observations

  • Mike Lacy
    replied
    I'm not an expert here, having only played around with -nwcommands-, but my understanding is that compute time for centrality measures can grow steeply with N, so I don't know if what you want is possible with 5k nodes. Again, posting on the -nwcommands- forum mentioned previously is your best bet, as I've had helpful responses from its author before. It would be nice if you did that and reported back to Statalist.



  • Ahmed Ameya Prapan
    replied
    Dear Mike,

    Thanks a lot for the reply. After cleaning my dataset, I have about 5.7 million edges with 5,219 nodes. As my ultimate goal is to calculate the centrality measures, I have also checked the netsis and netsummarize commands. Just like nwset, netsis and netsummarize also report that I have too many observations. So my revised question is: how can I calculate centrality measures for a large dataset in Stata?

    I have checked with a small number of observations that both commands, netsis and nwset, work on a subsample of the observations. But I am worried that, by taking a subsample, I might not get the correct centrality measures (degree, eigenvector, betweenness, and closeness), as the dataset would only be partial.

    Any idea what I can do regarding this issue ?

    Thanks again.







  • Mike Lacy
    replied
    I have some thoughts here, but in the end, I suspect you will need to post your question at the -nwcommands- forum at http://nwcommands.org, as this is a package not commonly discussed on Statalist. (For other readers: -nwcommands- is a network analysis package available at that URL.)

    1) From a little investigating, I discovered that -nwset- creates a Mata matrix of size 8 * N^2 bytes, where N is the number of distinct nodes in your dataset. So, if your approximately 9e6 edges arose from (say) 9e5 nodes with an average of 10 connections per node, -nwset- would require a Mata matrix of 8 * (9e5)^2 = 6.48e12 bytes (about 6.5 TB), so it's easy to have a network much too large for your machine's capacity. Knowing how many distinct nodes you have in your dataset would tell a lot about where your problem arises.
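
    As a back-of-the-envelope illustration (my own sketch in Python, not part of -nwcommands-), the 8 * N^2 storage rule separates the feasible from the infeasible cases in this thread:

```python
def nwset_matrix_bytes(n_nodes: int) -> int:
    # -nwset- reportedly builds a dense N x N Mata matrix
    # of 8-byte doubles
    return 8 * n_nodes ** 2

# ~9e5 distinct nodes: far beyond ordinary RAM
print(nwset_matrix_bytes(900_000))  # 6480000000000 bytes, ~6.5 TB

# 5,219 distinct nodes (the cleaned dataset mentioned in this thread):
print(nwset_matrix_bytes(5219))     # 217903688 bytes, ~218 MB
```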

    2) Using doubles for your "from, to" identifiers is not a good idea, as they use much more space than the int or long that is likely large enough to identify your nodes. It's conceivable to me that having huge identifiers for nodes, as you do, might impose a burden on some aspect of -nwcommands-. You could use -egen long idfrom = group(from)- (and similarly for to) to create smaller node identifiers. From what you cite above, it sounds like -nwcommands- might be constructing some macro with a list of node identifiers, which, even with short node identifiers, could blow up before you hit the memory problem described in 1).
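
    For readers outside Stata, the effect of -egen, group()- (replacing huge identifiers with consecutive small integers, assigned in ascending order of the original values) can be sketched in Python; the function name here is mine, not part of any package:

```python
def compress_ids(values):
    # Assign 1..N by ascending value, mimicking
    # Stata's -egen newvar = group(var)-
    ranks = {v: i + 1 for i, v in enumerate(sorted(set(values)))}
    return [ranks[v] for v in values]

# Two nodes from the sample data above, one repeated:
print(compress_ids([204108357, 16012019302, 204108357]))  # [1, 2, 1]
```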

    3) You might iteratively experiment with subsets of your data, say 10%, then 20%, ..., to see if there is a certain size at which your problem starts occurring. If indeed there is some bug in -nwcommands-, that information would be helpful. Again, given my point about the number of nodes being important, I'd suggest you start by using subsets of your nodes rather than subsets of edges within nodes. If you had your node identifiers numbered from 1 to N, you could do this with something like:
    keep if (idfrom <= 0.10 * `N') & (idto <= 0.10 * `N')
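
    As an aside (my own sketch, not an -nwcommands- feature): degree centrality in particular never needs the dense N x N matrix. It can be accumulated in a single pass over the edge list, which is one way around the memory wall for at least that one measure:

```python
from collections import Counter

def degree(edges):
    # One pass over an undirected edge list: O(E) time, O(N) memory,
    # no N x N adjacency matrix required
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
deg = degree(edges)
print(deg[3], deg[1], deg[4])  # 3 2 1
```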

