  • Count the number of inner ids

    Hi all,

    I have a dataset structured as follows: it has an id variable (docdb_family_id) and, for each id, the list of ids it cites (cited_ids):

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long strL docdb_family_id cited_ids
     569328                              [1239483.0, 340820.0, 1340488.0, 19383012.0] 
     574660                              [1239483.0, 563839.0] 
    1187498                             [679028.0, 1334478.0, 1239277.0, 3801039130.0, 73193891.0, 1187498.0] 
    1226468                             [1334478.0, 569328.0] 
    1236571                             []
    1239098                             [39201329.0, 8281.0, 3993093.0, 3793247.0, 37818738.0, 38913793.0, 38239238.0, 218173923.0, 13893701.0] 
    1239277                             [1239622.0] 
    1239483                             []
    1239622                             [574660.0, 1226468.0, 19383012.0] 
    1239624                             [1239749.0,1187498.0, 230983290.0, 11039932.0, 33298230.0, 329083.0] 
    1239749                             [1226468.0] 
    1334478                             []
    end
    What I would like to obtain, for each observation, is the number of ids in its cited_ids list that also appear in the docdb_family_id variable. In other words, the output variable (nr_green) should be:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long strL int docdb_family_id cited_ids nr_green
     569328                              [1239483.0, 340820.0, 1340488.0, 19383012.0] 1
     574660                              [1239483.0, 563839.0] 1
    1187498                             [679028.0, 1334478.0, 1239277.0, 3801039130.0, 73193891.0, 1187498.0] 3
    1226468                             [1334478.0, 569328.0] 2
    1236571                             []
    1239098                             [39201329.0, 8281.0, 3993093.0, 3793247.0, 37818738.0, 38913793.0, 38239238.0, 218173923.0, 13893701.0] 0
    1239277                             [1239622.0] 1
    1239483                             []
    1239622                             [574660.0, 1226468.0, 19383012.0] 2
    1239624                             [1239749.0,1187498.0, 230983290.0, 11039932.0, 33298230.0, 329083.0] 2
    1239749                             [1226468.0] 1
    1334478                             []
    end
    where, for instance, the nr_green associated with docdb_family_id 1187498 is 3 because its cited_ids list contains three ids that are present in docdb_family_id, namely 1334478, 1239277 and 1187498.

    Thank you

  • #2
    Here is some code that should do it. I had to clean up your data example since you seem to have tried to manually fix it and introduced errors in the process.

    Code:
    clear
    input long docdb_family_id strL cited_ids
     569328 "[1239483.0, 340820.0, 1340488.0, 19383012.0]"
     574660 "[1239483.0, 563839.0]"
    1187498 "[679028.0, 1334478.0, 1239277.0, 3801039130.0, 73193891.0, 1187498.0]"
    1226468 "[1334478.0, 569328.0]"
    1236571 "[]"
    1239098 "[39201329.0, 8281.0, 3993093.0, 3793247.0, 37818738.0, 38913793.0, 38239238.0, 218173923.0, 13893701.0]"
    1239277 "[1239622.0]"
    1239483 "[]"
    1239622 "[574660.0, 1226468.0, 19383012.0]"
    1239624 "[1239749.0,1187498.0, 230983290.0, 11039932.0, 33298230.0, 329083.0]"
    1239749 "[1226468.0]"
    1334478 "[]"
    end
    
    isid docdb_family_id // confirm that docdb_family_id uniquely identifies an observation
    
    * Save the list of known ids under the name of the merge key
    preserve
        keep docdb_family_id
        rename docdb_family_id _cited_id
        tempfile ids
        save `ids'
    restore
    
    * Strip the ".0" suffixes and the brackets, then split the list on commas
    gen strL _cited_id = cited_ids
    
    replace _cited_id = subinstr(_cited_id,".0","",.)
    replace _cited_id = subinstr(_cited_id,"[","",1)
    replace _cited_id = subinstr(_cited_id,"]","",1)
    split _cited_id, parse(,)
    destring _cited_id*, replace
    
    drop _cited_id
    
    * One observation per (citing id, cited id) pair; count the cited ids that are known ids
    reshape long _cited_id, i(docdb_family_id cited_ids) j(num)
    merge m:1 _cited_id using `ids', keep(1 3)
    bys docdb_family_id: egen nr_green = total(_merge == 3)
    drop _merge num _cited_id
    
    * Back to one observation per citing id; an empty list gets a missing count
    duplicates drop docdb_family_id, force
    replace nr_green = . if cited_ids == "[]"
    which produces:
    Code:
    . li, noobs ab(20) string(20) sep(0)
    
      +-----------------------------------------------------+
      | docdb_family_id   cited_ids                nr_green |
      |-----------------------------------------------------|
      |          569328   [1239483.0, 340820.0..          1 |
      |          574660   [1239483.0, 563839.0]           1 |
      |         1187498   [679028.0, 1334478.0..          3 |
      |         1226468   [1334478.0, 569328.0]           2 |
      |         1236571   []                              . |
      |         1239098   [39201329.0, 8281.0,..          0 |
      |         1239277   [1239622.0]                     1 |
      |         1239483   []                              . |
      |         1239622   [574660.0, 1226468.0..          2 |
      |         1239624   [1239749.0,1187498.0..          2 |
      |         1239749   [1226468.0]                     1 |
      |         1334478   []                              . |
      +-----------------------------------------------------+
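    As a cross-check, the same counts can probably be reproduced without the reshape and merge, by searching each list for every known id in turn. This is only a sketch: _list and nr_check are illustrative names that do not appear in the code above, and it assumes every cited id is written with a ".0" suffix, as in the example data. Unlike the merge, it counts an id once per list even if that id is repeated within the list.

    Code:
    * Cross-check sketch; run after the code above (_list and nr_check are illustrative names)
    * Normalize "[a.0, b.0]" to ",a.0,b.0," so that every id has a comma on both sides
    gen strL _list = "," + subinstr(subinstr(subinstr(cited_ids, "[", "", 1), "]", "", 1), " ", "", .) + ","
    
    * For each known id, add 1 to every observation whose list contains that id
    gen long nr_check = 0
    forvalues i = 1/`=_N' {
        local id = docdb_family_id[`i']
        quietly replace nr_check = nr_check + 1 if strpos(_list, ",`id'.0,") > 0
    }
    
    * Match the convention above: an empty list gets a missing count
    replace nr_check = . if cited_ids == "[]"
    
    * Stops with an error if the two approaches disagree
    assert nr_check == nr_green
    drop _list

    The loop runs once per observation, so on a large dataset the reshape and merge approach is likely to be faster.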
    Last edited by Hemanshu Kumar; 22 Dec 2022, 08:22.

    • #3
      Thanks a lot! Sorry for the manual fixing but it was necessary for the MWE.
