Hi all,
my db looks as follows:
In particular, as you can see I have a string called tech_fields_cited which consists of a string composed by many acronyms (e.g.: ['Y02B 30/13' 'Y02B 30/12' 'Y02E 60/14' 'Y02B 10/70'] has 'Y02B 30/13', 'Y02B 30/12', 'Y02E 60/14' and 'Y02B 10/70'). What I would like to do is to create N dummy variables (being N the total number of acronyms in the string tech_fields_cited) taking 1 whenever a docdb has that acronym inside the variable tech_fields_cited.
So, for instance, taking only the first 3 observations for the sake of simplicity, the result should be something like:
Of course it might be that two docdb has the same tech_field in the string. In that case, only one dummy should be created having a 1 for both of them.
Thank you
my db looks as follows:
Code:
* Example generated by -dataex-. For more info, type help dataex clear input long docdb_family_id strL tech_fields_cited int earliest_publn_year 1187498 "['Y02P 20/10']" 1998 1226468 "['Y10T 156/1023']" 1992 1236571 "['Y02B 30/13' 'Y02B 30/12' 'Y02E 60/14' 'Y02B 10/70']" 1991 1239098 "['Y10S 426/805' 'Y02A 40/81']" 1992 1239277 "['Y02E 60/10' 'Y02T 90/12' 'Y02T 10/7072' 'Y02T 90/14' 'Y02T 10/70']" 1993 1239483 "['Y10T 29/49014']" 1996 1239622 "['Y10T 29/49833' 'Y10S 100/913' 'Y10S 52/06' 'Y10T 29/5343']" 1997 1239624 "['Y10S 128/903']" 1994 1239749 "['Y10T 428/31938' 'Y10T 428/31692' 'Y10T 428/24355' 'Y10T 428/31']" 1993 1340405 "['Y10T 137/7759' 'Y10T 137/7761']" 1992 1340418 "['Y02T 10/7072' 'Y02T 90/12' 'Y02T 90/14' 'Y02T 10/70']" 1995 1340462 "['Y10T 156/1052']" 1994 1340471 "['Y10T 83/9317' 'Y10S 76/02']" 1993 1340485 "['Y10S 362/80' 'Y10S 362/806']" 1995 1340488 "['Y10T 428/249981' 'Y10T 442/15' 'Y10T 428/249985' 'Y10T 428/249984' 'Y10T 428/249982']" 1994 1340508 "['Y10T 428/30' 'Y10T 428/31663' 'Y10T 428/249928' 'Y10S 428/902']" 1993 1340519 "['Y10S 297/08' 'Y10T 137/86517' 'Y10T 137/86509' 'Y10T 137/86501']" 1995 1340541 "" 1994 1340647 "['Y10S 435/968']" 1996 1340659 "['Y02T 10/72' 'Y02T 10/64']" 1995 1340673 "['Y10T 428/2998' 'Y10T 428/2991']" 1996 1340826 "['Y02T 10/30' 'Y02T 10/12' 'Y02E 20/14']" 1995 1341006 "['Y02E 10/30']" 1997 1341170 "['Y10S 148/124']" 1996 1341175 "['Y10T 29/49146' 'Y10T 29/49171']" 1998 1578920 "['Y10T 403/472' 'Y10T 403/4602']" 1971 1763337 "['Y10T 428/3154' 'Y10T 428/31663' 'Y10T 428/31794' 'Y10T 428/31507']" 1995 2463680 "['Y10S 526/91' 'Y10S 526/911']" 1995 3411930 "['Y10S 242/03']" 1996 3459918 "['Y10S 428/909' 'Y10T 428/233' 'Y10T 428/24355' 'Y10T 428/24479' 'Y10T 428/249953' 'Y10T 428/24496' 'Y10T 428/24182' 'Y10T 428/239' 'Y10T 428/24174']" 1998 3460237 "['Y02P 70/50']" 1995 3460621 "['Y02E 10/50']" 2008 3460785 "['Y02T 30/00']" 2001 3460796 "['Y02T 10/12']" 2001 3460811 "['Y10T 70/7904' 'Y10T 70/7057' 'Y10T 70/5863']" 2002 3460812 "['Y10S 514/963' 'Y10S 514/965' 'Y10T 428/2985']" 2004 3460829 "['Y02T 10/30']" 2002 3460972 "['Y10S 426/806' 'Y10S 203/18']" 1998 3460979 "['Y02E 60/10']" 2000 3460980 "['Y10T 24/12' 'Y10T 24/44769']" 2000 3460981 "['Y10T 403/42' 'Y10T 403/347' 'Y10T 403/341' 'Y10T 403/342' 'Y10T 403/73' 'Y10T 403/343']" 2000 3461010 "['Y10T 428/24124' 'Y10T 428/26' 'Y10T 428/24' 'Y10T 428/12333']" 2000 3461018 "['Y02P 10/20']" 2002 3461040 "['Y10T 74/1967']" 2001 3461044 "['Y10T 74/1836' 'Y10T 74/2057']" 1999 3461088 "['Y10S 606/907']" 1999 3461099 "['Y10T 137/9464']" 2001 3461127 "['Y10T 428/26' 'Y10T 428/12924' 'Y10T 428/12569']" 2003 3461172 "['Y10S 5/928' 'Y10S 5/90']" 2000 3461192 "['Y02W 30/62']" 2002 3461205 "['Y02T 10/90' 'Y02E 10/728' 'Y02T 10/7072' 'Y02E 10/74']" 2005 3461208 "['Y02E 10/72' 'Y02B 10/30']" 2001 3464855 "['Y10S 516/906']" 1974 3466402 "['Y02T 10/12']" 1974 3467767 "['Y02E 10/30' 'Y02E 10/20']" 1980 3467833 "['Y02E 10/46']" 1980 3467977 "['Y10T 428/218']" 1976 3469171 "['Y10T 24/2593' 'Y10T 24/2598']" 1976 3469272 "['Y10S 623/90']" 1978 3470490 "['Y10T 403/7003']" 1979 3471419 "['Y10T 74/19121']" 1979 3471545 "['Y02T 10/12']" 1978 3472136 "['Y10T 137/3127' 'Y10T 137/86236']" 1980 3472994 "['Y10S 292/57' 'Y10T 292/0934']" 1981 3473428 "['Y10S 623/90']" 1980 3474014 "['Y02B 10/20' 'Y02B 10/70']" 1986 3474470 "['Y10T 29/53783']" 1980 3475793 "['Y10T 403/7015' 'Y10T 403/16']" 1984 3477524 "['Y10S 507/931' 'Y10S 507/93']" 1984 3477979 "['Y02T 10/12']" 1985 3478203 "['Y02E 10/30' 'Y10T 428/1324' 'Y02E 10/20']" 1991 3478301 "['Y10S 165/90' 'Y10S 165/905']" 1989 3478373 "['Y02A 40/90']" 1987 3478451 "['Y02T 70/10']" 1988 3478460 "['Y10T 137/87957']" 1990 3478484 "['Y02T 70/10']" 1989 3478514 "['Y10S 52/10']" 1992 3478520 "['Y10S 55/30' 'Y02A 50/20' 'Y02T 10/12']" 1990 3478534 "['Y02B 20/40']" 1992 3478653 "['Y10S 150/90']" 1990 3478700 "['Y02T 10/12' 'Y02T 10/30']" 1991 3478703 "['Y02A 10/23']" 1991 3478717 "['Y02T 10/12']" 1993 3478720 "['Y10T 137/86445' 'Y10T 137/87161' 'Y10T 137/87716']" 1992 3478738 "['Y10S 424/08' 'Y10S 71/04' 'Y10T 428/2982']" 1993 3478763 "['Y02W 30/80']" 1991 3478796 "['Y02P 10/20']" 1991 3478814 "['Y10T 29/53913']" 1994 3478822 "['Y10S 315/05']" 1994 3478830 "['Y10T 137/86823' 'Y10T 137/5109']" 1994 3478833 "['Y10S 427/105' 'Y10S 427/103']" 1994 3478863 "['Y02E 60/10']" 1994 3478875 "['Y10S 55/17']" 1997 3478921 "['Y10S 5/934']" 1998 3478946 "['Y10S 128/918']" 1996 3478957 "['Y10S 160/90']" 1998 3478981 "['Y10T 16/3543']" 1999 3478991 "['Y02W 30/62']" 1997 3479012 "['Y10S 128/918']" 1998 3479021 "['Y02E 60/10']" 1998 end
So, for instance, taking only the first 3 observations for the sake of simplicity, the result should be something like:
Code:
docdb tech_fields_cited dummy_Y02P_20/10 dummy_Y10T_156/1023 dummy_Y02B_30/13 dummy_Y02B 30/12 dummy_Y02E_60/14 dummy_Y02B_10/70 1187498 "['Y02P 20/10']" 1 0 0 0 0 0 1226468 "['Y10T 156/1023']" 0 1 0 0 0 0 1236571 "['Y02B 30/13' 'Y02B 30/12' 'Y02E 60/14' 'Y02B 10/70']" 0 0 1 1 1 1
Thank you
Comment