  • counting names?

    Dear All, Suppose that I have the following data
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte id int year str3 name byte(number1 number2)
    2 1992 "a,b" 1 1
    2 1993 "b,a" 2 2
    2 1994 "c,b" 1 3
    2 1995 "a,b" 1 4
    2 1996 "d,e" 1 1
    2 1997 "d,e" 2 2
    2 1998 "d,e" 3 3
    2 1999 "d,e" 4 4
    2 2000 "d,e" 5 5
    2 2001 "f,g" 1 1
    2 2002 "f,h" 2 1
    2 2003 "h,g" 2 1
    2 2004 "h,i" 3 1
    2 2005 "j,i" 1 2
    2 2006 "j,i" 2 3
    2 2007 ""    . .
    2 2008 "i,l" 1 1
    2 2009 "m,j" 1 1
    2 2010 "j,n" 2 1
    2 2011 "j,o" 3 1
    2 2012 "j,o" 4 2
    2 2013 "j,o" 5 3
    2 2014 "p,q" 1 1
    2 2015 "p,q" 2 2
    end
    For each `id', there is a pair of names (`name') in each `year'. The goal is to construct `number1' and `number2'. For id=2, starting from year=1992, the names a and b each appear for the first time, so number1=1 and number2=1. In the next year (1993), both a and b appear for the second time, so number1=2 and number2=2. In the third year (1994), c appears for the first time, so number1=1, while b appears for the third time, so number2=3. Any suggestions? Thanks.
    Ho-Chuan (River) Huang
    Stata 19.0, MP(4)

  • #2
    There appears to be another unstated assumption: the count restarts for any name that skips a year. Thus in the fourth year (1995) number1 is 1, because a did not appear in the previous year (1994).

    With that said, the following appears to do what is wanted. You may have to adapt the split command to use with realistic names.
    Code:
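    * one variable per name: nm1, nm2
    split name, parse(,) generate(nm)
    * long layout: one row per id, year, and name slot j
    reshape long nm, i(id year) j(j)
    generate num = .
    * every nonmissing name starts a spell at 1
    bysort id nm (year): replace num = 1 if !missing(nm)
    * extend the spell when the same name also appeared in the previous year
    bysort id nm (year): replace num = num[_n-1]+1 if year==year[_n-1]+1
    drop nm
    * back to one row per id and year, with num1 and num2
    reshape wide num, i(id year) j(j)
    sort id year
    * check against the counts supplied in the question
    assert num1==number1 & num2==number2
    list, clean noobs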
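
    As a minimal sketch of adapting split to realistic names, assume hypothetical names that contain blanks and are separated by a comma plus a space (the data below are invented for illustration, not taken from the question). Parsing on the comma alone leaves a leading blank on the second name, so " Bo Chen" and "Bo Chen" would be counted as different names unless the pieces are trimmed first.
    Code:
    * hypothetical data, not from the question
    clear
    input byte id int year str20 name
    1 2001 "Ann Lee, Bo Chen"
    1 2002 "Bo Chen, Ann Lee"
    1 2003 "Cy Diaz, Bo Chen"
    end
    * split on the comma only, so blanks inside a name survive
    split name, parse(",") generate(nm)
    * remove leading and trailing blanks from each piece
    foreach v of varlist nm* {
        replace `v' = strtrim(`v')
    }
    list, clean noobs
    After the trimming step, the reshape and bysort commands above should apply unchanged to nm1 and nm2.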


    • #3
      Dear William, My bad. I should have stated the additional assumption that you mentioned. Thanks a lot for your helpful reply.
      Ho-Chuan (River) Huang
      Stata 19.0, MP(4)


      • #4
        Alternative:
        Code:
        ********************************************************************************
        isid id year
        sort id year
        
        split name , p(",") gen(name)
        
        levelsof name1, local(name1)
        levelsof name2 , local(name2)
        
        local names : list name1 | name2
        local names : list uniq names
        
        gen  N1 = .
        gen  N2 = .
        
        qui forvalues i = 1/`: list sizeof names' {
            
            tempvar n`i' s`i'
            gen `n`i'' = "`: word `i' of `names' '"
            gen `s`i'' = inlist(`n`i'', name1, name2)  
            by id : replace `s`i'' = `s`i'' + `s`i''[_n-1] if ( `s`i'' & _n > 1 )  
            replace N1 = `s`i'' if ( name1 == `n`i'' )
            replace N2 = `s`i'' if ( name2 == `n`i'' )
        }
        ********************************************************************************
        assert number1 == N1 & number2 == N2


        • #5
          I'm reluctant to place somewhat arbitrary strings into local macros that are later expanded in commands. The following dataset
          Code:
          input byte id int year str10 name byte(number1 number2)
          2 1994 "o'leary,b" 1 3
          end
          produces unexpected results with the code in post #4.


          • #6
            Dear Bjarte, Thanks for the suggestion.
            Ho-Chuan (River) Huang
            Stata 19.0, MP(4)


            • #7
              Thanks, William, for spotting the unstated assumption and providing a nice solution, and also for spotting a limitation in my solution in post #4. Below is a revised version of #4. A solution avoiding reshapes will run faster.
              Code:
              * Adding name with '
              replace name = subinstr(name,"a","o'leary o'hoy",1)
              Code:
              ********************************************************************************
              isid id year
              sort id year
              
              * Replacing ' and ` is not necessary for "o'leary o'hoy",
              * but is added to avoid possible issues with parsing
              
              gen name0 = ustrregexra(name, "['`]","__")    
              split name0 , p(",") gen(name)
              levelsof name1, local(name1)
              levelsof name2, local(name2)
              local names : list name1 | name2
              tokenize `"`names'"' 
              
              gen  N1 = .
              gen  N2 = .
              
              qui forvalues i = 1/`: list sizeof names' {
                  
                  tempvar n`i' s`i'
                  gen `n`i'' = "``i''"
                  gen `s`i'' = inlist(`n`i'', name1, name2)  
                  by id : replace `s`i'' = `s`i'' + `s`i''[_n-1] if ( `s`i'' & _n > 1 )  
                  replace N1 = `s`i'' if ( name1 == `n`i'' )
                  replace N2 = `s`i'' if ( name2 == `n`i'' )
              }
              
              drop name?
              ********************************************************************************
              assert number1 == N1 & number2 == N2
              Last edited by sladmin; 20 May 2019, 12:48.
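
              Run time depends on the data size and the machine, so it may be worth measuring rather than assuming. A rough sketch using Stata's timer commands on simulated data follows; the simulated dataset, its size, and the seed are assumptions made for illustration and do not come from this thread. It wraps the reshape-based code from post #2, and the same timer calls could be placed around the loop above for comparison.
              Code:
              * simulated data (assumption): 1,000 ids, 20 years each, names drawn from a-e
              clear
              set seed 12345
              set obs 1000
              generate long id = _n
              expand 20
              bysort id: generate int year = 1990 + _n
              generate name = char(97 + floor(5*runiform())) + "," + char(97 + floor(5*runiform()))
              
              * time the reshape-based approach from post #2
              timer clear 1
              timer on 1
              split name, parse(",") generate(nm)
              reshape long nm, i(id year) j(j)
              generate num = .
              bysort id nm (year): replace num = 1 if !missing(nm)
              bysort id nm (year): replace num = num[_n-1]+1 if year==year[_n-1]+1
              drop nm
              reshape wide num, i(id year) j(j)
              timer off 1
              timer list 1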


              • #8
                A last version, using Mata to avoid parsing problems with strings that contain left single quotes:
                Code:
                replace name = subinstr(name, "a","o'leary o'hoy `Oh No!", 1)
                Code:
                ********************************************************************************
                isid id year
                sort id year
                
                split name , p(",") gen(name)
                
                mata: names = uniqrows(st_sdata(.,"name1",0) \ st_sdata(.,"name2",0))
                mata: st_local("nnames", strofreal(length(names),"%10.0f"))
                
                gen  N1 = .
                gen  N2 = .
                
                qui forvalues i = 1/`nnames' {
                
                    tempvar n`i' s`i'
                    tempname sc_name`i'
                    mata : st_strscalar( "`sc_name`i''" , names[`i'])  
                    gen `n`i'' = `sc_name`i''
                    gen `s`i'' = inlist(`n`i'', name1, name2)  
                    by id : replace `s`i'' = `s`i'' + `s`i''[_n-1] if ( `s`i'' & _n > 1 )  
                    replace N1 = `s`i'' if ( name1 == `n`i'' )
                    replace N2 = `s`i'' if ( name2 == `n`i'' )
                }
                
                drop name?
                ********************************************************************************
                assert number1 == N1 & number2 == N2
                Code:
                . list id year name N? in 1/5
                
                     +-----------------------------------------------+
                     | id   year                      name   N1   N2 |
                     |-----------------------------------------------|
                  1. |  2   1992   o'leary o'hoy `Oh No!,b    1    1 |
                  2. |  2   1993   b,o'leary o'hoy `Oh No!    2    2 |
                  3. |  2   1994                       c,b    1    3 |
                  4. |  2   1995   o'leary o'hoy `Oh No!,b    1    4 |
                  5. |  2   1996                       d,e    1    1 |
                     +-----------------------------------------------+


                • #9
                  I suspect you can do this with egen and its group() function: generate the group counter, and then the max is the number of combinations.


                  • #10
                    Phil Bromiley, I am not sure what you mean by "do this". Do you refer to what was accomplished in post #2, or to some part of the code in posts #4, #7, and #8 (which I admit to not fully exploring, since the code in post #2 averages 50,000 observations per second on my system)?

                    In the former case, I don't see how group helps, because each pair of names determines two separate spells, and the pair "x,y" contains the same names as the pair "y,x". Consider the following example, where num1 and num2 are the results from the code in post #2.
                    Code:
                    * Example generated by -dataex-. To install: ssc install dataex
                    clear
                    input byte id int year str3 name byte(num1 num2)
                    42 1982 "a,b" 1 1
                    42 1983 "b,c" 2 1
                    42 1984 "c,a" 2 1
                    end
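
                    To illustrate the ordering point, here is a small sketch on hypothetical data (invented for this purpose; the variable name pair is also just illustrative). egen's group() function works on the whole string, so "a,b" and "b,a" receive different codes even though they hold the same two names, and a counter built on those codes would treat 1983 as a new pair.
                    Code:
                    * hypothetical data: the same two names, listed in different orders
                    clear
                    input byte id int year str3 name
                    42 1982 "a,b"
                    42 1983 "b,a"
                    42 1984 "a,b"
                    end
                    * group() codes distinct strings, so order within the pair matters
                    egen pair = group(name)
                    list, clean noobs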
