  • I would like to turn my imperfect edgelist into a perfect one (with all possible pairs of nodes of a network)

    Hi everyone. To put it very briefly, I have data on different individuals and their connections with other individuals in a given network. In my data, an individual can either:

    1) Be connected to one or more individuals, in which case there will be an observation for each connection
    2) Remain without any connection, in which case there will be one observation, but with missing data for the different variables var1, var2, var3... that give information on the connections.

    My request is quite simple, but I can't figure it out with my code. I would like a dataset with n*n observations, containing every possible pair of individuals and a dummy variable equal to 1 if the two individuals are connected. Here is a sample of my data:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str3(source target) str2(freq1 freq_2) str4 time
    "A1"  ""    ""  ""  ""  
    "A10" ""    ""  ""  ""  
    "A11" ""    ""  ""  ""  
    "A12" ""    ""  ""  ""  
    "A13" "A4"  "1" "2" "12"
    "A13" "A14" "1" "3" "12"
    "A13" "A3"  "1" "1" "12"
    "A13" "A8"  "5" "1" "12"
    "A13" "A15" "5" "1" "12"
    "A13" "A18" "2" "5" "18"
    end
    Individuals A1, A10, A11 and A12 have no connections. A13 does have connections, and each individual it is connected to occupies one line. Assume there are 18 individuals in my dataset. Basically, I would like 18*18 observations: 18 lines with missing data for A1, 18 for A10, 12 for A13 (which has 6 connections), etc.

    Can someone help me figure this problem out?

  • #2
    Code:
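    * Add an observation for every source-target combination not already in the data
    fillin source target
    * connection = 1 only when freq1, freq_2 and time are all nonmissing
    gen byte connection = !missing(freq1, freq_2, time)
    drop _fillin
    * Verify that each source-target pair is unique (missok permits the blank targets)
    isid source target, sort missok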


    • #3
      Thanks Clyde! I did not know about the fillin command. By any chance, is there a way I could use it separately for different groups of my database? Notice that my individuals have letters in front of their numbers, and each letter symbolizes a different network. So using fillin without any regard to the network wouldn't make a lot of sense. I forgot to mention this part, apologies!

      Edit: I guess that in this context, dropping all pairs of individuals that do not share the same letter works just fine. drop if substr(source, 1, 1) != substr(target, 1, 1) seems appropriate!
      Last edited by Adam Sadi; 15 Dec 2022, 17:11.


      • #4
        Edit : I guess that in this context, dropping all pairs of individual observations which do not share the same letter works just fine. drop if substr(source, 1, 1) != substr(target,1,1) seems appropriate!
        Well, that will get you the end result you want. The concern is that if your data set is large enough, you may run out of working memory to form the intermediate data set that precedes the -drop if substr(source, 1, 1) != substr(target, 1, 1)- command. If you had, say, 30 individuals, in 3 groups of 10 each, doing it this way requires first creating a data set of 30x30 = 900 pairs, whereas if each were done separately you would need only 3x10x10 = 300 pairs. Now, obviously, at that scale, it isn't a problem. But if you are dealing with hundreds of thousands of individuals and hundreds of groups, it could easily make the difference between success and crash!

        If you run into a memory problem (or even just a really excessive amount of execution time), post back and I'll show you other code that does it group by group, very efficiently.
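
        For readers who do hit that limit, one possible group-by-group approach is sketched below. It is not the code referred to above, and it has not been run on the full data. It assumes that the first character of each ID identifies the network, that each source-target pair appears at most once, and that the whole block is run in one go from a do-file (the tempfiles are local macros). The names network, connection, targets, nodes and pairs are illustrative.

        ```stata
        * Sketch only. Assumption: first character of the ID = network.

        * 1. List every individual that appears as source or target.
        preserve
        keep target
        rename target source
        drop if missing(source)
        tempfile targets
        save `targets'
        restore, preserve
        keep source
        append using `targets'
        duplicates drop
        gen str1 network = substr(source, 1, 1)
        tempfile nodes
        save `nodes'

        * 2. Pair each individual with every individual of the same network.
        rename source target
        joinby network using `nodes'
        tempfile pairs
        save `pairs'
        restore

        * 3. Flag the observed connections and attach them to the full set of pairs.
        drop if missing(target)          // drop placeholder rows of unconnected individuals
        gen byte connection = 1
        merge 1:1 source target using `pairs', nogenerate
        replace connection = 0 if missing(connection)
        isid source target, sort
        ```

        Because joinby only pairs individuals within the same network, the 30-individual example would form the 300 within-group pairs rather than all 900. Self-pairs such as (A13, A13) are kept; drop if source == target removes them if they are not wanted.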


        • #5
          Thank you for providing useful remarks, Clyde. Fortunately I'm not concerned with these memory issues as I'm dealing with a relatively small sample. It only took a second to drop the observations!
