counting names?

River Huang

Join Date: Mar 2016

Posts: 1908
#1

19 May 2019, 03:09

Dear All, Suppose that I have the following data

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte id int year str3 name byte(number1 number2)
2 1992 "a,b" 1 1
2 1993 "b,a" 2 2
2 1994 "c,b" 1 3
2 1995 "a,b" 1 4
2 1996 "d,e" 1 1
2 1997 "d,e" 2 2
2 1998 "d,e" 3 3
2 1999 "d,e" 4 4
2 2000 "d,e" 5 5
2 2001 "f,g" 1 1
2 2002 "f,h" 2 1
2 2003 "h,g" 2 1
2 2004 "h,i" 3 1
2 2005 "j,i" 1 2
2 2006 "j,i" 2 3
2 2007 "" . .
2 2008 "i,l" 1 1
2 2009 "m,j" 1 1
2 2010 "j,n" 2 1
2 2011 "j,o" 3 1
2 2012 "j,o" 4 2
2 2013 "j,o" 5 3
2 2014 "p,q" 1 1
2 2015 "p,q" 2 2
end

For each `id', there is a pair of `name's in each `year'. The goal is to construct `number1' and `number2'. For id=2, starting from year=1992, the two names a and b appear for the first time, so number1=1 and number2=1. In the next year (1993), both a and b appear for the second time, so number1=2 and number2=2. In the third year (1994), c appears for the first time, so number1=1, while b appears for the third time, so number2=3. Any suggestions? Thanks.

Ho-Chuan (River) Huang
Stata 19.0, MP(4)
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

19 May 2019, 04:54

There appears to be another unstated assumption: the count restarts for any name that skips a year. Thus in the fourth year (1995) number1 is 1, because a did not appear in the previous year (1994).
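To make the restart rule concrete, here is a minimal sketch for a single name, assuming a hypothetical indicator `present` (1 if the name appears in that year) and a hypothetical result variable `spell`. Neither variable is part of the original data. It only illustrates the counting rule and is not the solution itself.

```stata
* Hypothetical toy data: one name observed over four consecutive years
clear
input int year byte present
1992 1
1993 1
1994 0
1995 1
end
sort year

* Start each year at 1 if the name is present, 0 otherwise
generate byte spell = present

* Add the previous year's running count; -replace- works through the
* observations in order, so spell[_n-1] is already updated
replace spell = spell + spell[_n-1] if spell & _n > 1

list, clean noobs
```

The count reaches 2 in 1993, drops to 0 in 1994 when the name is absent, and starts again at 1 in 1995.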

With that said, the following appears to do what is wanted. You may have to adapt the split command for use with realistic names.

Code:

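* Separate the two names and put one name per observation
split name, parse(,) generate(nm)
reshape long nm, i(id year) j(j)
* Every nonmissing name starts at 1 ...
generate num = .
bysort id nm (year): replace num = 1 if !missing(nm)
* ... and continues the count only if it also appeared in the previous year
bysort id nm (year): replace num = num[_n-1]+1 if year==year[_n-1]+1
* Return to one observation per id and year
drop nm
reshape wide num, i(id year) j(j)
sort id year
assert num1==number1 & num2==number2
list, clean noobs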
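As one possible adaptation for more realistic names, here is a minimal sketch that assumes the names may contain spaces and may have stray blanks around the comma. The data and the variable prefix `nm` are hypothetical. It does not handle names that themselves contain commas.

```stata
* Hypothetical data: full names separated by a comma, with stray blanks
clear
input byte id int year str20 name
1 2001 "Ann Lee, Bo Chen"
1 2002 "Bo Chen , Ann Lee"
end

* Split on the comma only, so blanks inside a name are preserved
split name, parse(",") generate(nm)

* Remove leading and trailing blanks left over from the split
foreach v of varlist nm* {
    replace `v' = strtrim(`v')
}

list, clean noobs
```

After trimming, "Bo Chen" is the same string in both years, so the later by-group counting would treat it as one name.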
River Huang

Join Date: Mar 2016

Posts: 1908
#3

19 May 2019, 04:57

Dear William, My bad. I should have stated the additional assumption that you mentioned. Thanks a lot for your helpful reply.

Ho-Chuan (River) Huang
Stata 19.0, MP(4)

Bjarte Aagnes

Join Date: Apr 2014
Posts: 783
#4

19 May 2019, 14:05

Alternative:

Code:

********************************************************************************
isid id year
sort id year

split name , p(",") gen(name)

levelsof name1, local(name1)
levelsof name2 , local(name2)

local names : list name1 | name2
local names : list uniq names

gen  N1 = .
gen  N2 = .

qui forvalues i = 1/`: list sizeof names' {
    
    tempvar n`i' s`i'
    gen `n`i'' = "`: word `i' of `names' '"
    gen `s`i'' = inlist(`n`i'', name1, name2)  
    by id : replace `s`i'' = `s`i'' + `s`i''[_n-1] if ( `s`i'' & _n > 1 )  
    replace N1 = `s`i'' if ( name1 == `n`i'' )
    replace N2 = `s`i'' if ( name2 == `n`i'' )
}
********************************************************************************
assert number1 == N1 & number2 == N2


William Lisowski

Join Date: Dec 2014

Posts: 10150
#5

19 May 2019, 14:49

I'm reluctant to place somewhat arbitrary strings into local macros that are later expanded in commands. The following dataset

Code:

input byte id int year str10 name byte(number1 number2)
2 1994 "o'leary,b" 1 3
end

produces unexpected results with the code in post #4.
River Huang

Join Date: Mar 2016

Posts: 1908
#6

19 May 2019, 18:03

Dear Bjarte, Thanks for the suggestion.

Ho-Chuan (River) Huang
Stata 19.0, MP(4)

Bjarte Aagnes

Join Date: Apr 2014
Posts: 783
#7

20 May 2019, 04:36

Thanks William, for spotting the unstated assumption, and providing a nice solution. Also, thanks for spotting a limitation in my solution in post #4. Below is a revised version of #4 with changes in bold. A solution avoiding reshapes will run faster.

Code:

* Adding name with '
replace name = subinstr(name,"a","o'leary o'hoy",1)

Code:

********************************************************************************
isid id year
sort id year

* Replacing ' and ` is not necessary for "o'leary o'hoy",
* but is added to avoid possible issues with parsing

gen name0 = ustrregexra(name, "['`]","__")    
split name0 , p(",") gen(name)
levelsof name1, local(name1)
levelsof name2, local(name2)
local names : list name1 | name2
tokenize `"`names'"' 

gen  N1 = .
gen  N2 = .

qui forvalues i = 1/`: list sizeof names' {
    
    tempvar n`i' s`i'
    gen `n`i'' = "``i''"
    gen `s`i'' = inlist(`n`i'', name1, name2)  
    by id : replace `s`i'' = `s`i'' + `s`i''[_n-1] if ( `s`i'' & _n > 1 )  
    replace N1 = `s`i'' if ( name1 == `n`i'' )
    replace N2 = `s`i'' if ( name2 == `n`i'' )
}

drop name?
********************************************************************************
assert number1 == N1 & number2 == N2

Last edited by sladmin; 20 May 2019, 12:48.


Bjarte Aagnes

Join Date: Apr 2014
Posts: 783
#8

21 May 2019, 09:42

A last version, using Mata to avoid problems parsing strings that contain left single quotes (changes in bold):

Code:

replace name = subinstr(name, "a","o'leary o'hoy `Oh No!", 1)

Code:

********************************************************************************
isid id year
sort id year

split name , p(",") gen(name)

mata: names = uniqrows(st_sdata(.,"name1",0) \ st_sdata(.,"name2",0))
mata: st_local("nnames", strofreal(length(names),"%10.0f"))

gen  N1 = .
gen  N2 = .

qui forvalues i = 1/`nnames' {

    tempvar n`i' s`i'
    tempname sc_name`i'
    mata : st_strscalar( "`sc_name`i''" , names[`i'])  
    gen `n`i'' = `sc_name`i''
    gen `s`i'' = inlist(`n`i'', name1, name2)  
    by id : replace `s`i'' = `s`i'' + `s`i''[_n-1] if ( `s`i'' & _n > 1 )  
    replace N1 = `s`i'' if ( name1 == `n`i'' )
    replace N2 = `s`i'' if ( name2 == `n`i'' )
}

drop name?
********************************************************************************
assert number1 == N1 & number2 == N2

Code:

. list id year name N? in 1/5

     +-----------------------------------------------+
     | id   year                      name   N1   N2 |
     |-----------------------------------------------|
  1. |  2   1992   o'leary o'hoy `Oh No!,b    1    1 |
  2. |  2   1993   b,o'leary o'hoy `Oh No!    2    2 |
  3. |  2   1994                       c,b    1    3 |
  4. |  2   1995   o'leary o'hoy `Oh No!,b    1    4 |
  5. |  2   1996                       d,e    1    1 |
     +-----------------------------------------------+


Phil Bromiley

Join Date: Apr 2014

Posts: 4348
#9

24 May 2019, 16:07

I suspect you can do this with egen and group - generate the group counter, and then the max is the number of combinations.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#10

25 May 2019, 10:08

Phil Bromiley Not sure what you mean by "do this" - do you refer to what was accomplished in post #2 or to some part of the code in posts 4, 7, and 8 (which I admit to not fully exploring, since the code in post #2 averages 50,000 observations per second on my system)?

In the former case, I don't see how group helps, because each pair of names is being used to determine two separate spells, and the pair "x,y" are the same names as the pair "y,x". Consider the following example, where num1 and num2 are the results from the code in post #2.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte id int year str3 name byte(num1 num2)
42 1982 "a,b" 1 1
42 1983 "b,c" 2 1
42 1984 "c,a" 2 1
end
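To illustrate the point about pair order, here is a minimal sketch on a hypothetical two-row dataset, with `pair` as an assumed variable name. The group() function of egen codes the distinct values of the string as written, so the same two names in a different order end up in different groups.

```stata
* Hypothetical data: the same two names, listed in opposite order
clear
input byte id int year str3 name
42 1982 "a,b"
42 1983 "b,a"
end

* group() assigns one integer code per distinct string value
egen pair = group(name)

list, clean noobs
```

The two years receive different codes, so a group counter built this way would not carry the count for a or b from 1982 into 1983.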