  • noccur vs nss with long strings

    Hello,

    It seems that nss() has problems with long strings, whereas noccur() does not. Both functions are part of "Extensions to generate" (egenmore), and both count the number of occurrences of one string within another. I am using Stata IC 13.1.

    Example:


    CODE:

    version 13
    clear
    set obs 1

    generate str test1 = "br01 br02 br03 br04 br05 br06 br07 br08 br09 br10 br11 br12 br13 br14 br15 br16 br17 br18 br19 br20 bottling bottling " in 1
    generate str test2 = "br01 br02 br03 br04 br05 br06 br07 br08 br09 br10 br11 br12 br13 br14 br15 br16 br17 br18 br19 bottling bottling " in 1
    egen check1_nss=nss(test1), find("bottling")
    egen check2_nss=nss(test2), find("bottling")
    egen check1_noccur=noccur(test1), string("bottling")
    egen check2_noccur=noccur(test2), string("bottling")

    tab check1_nss check1_noccur
    tab check2_nss check2_noccur



    OUTPUT:



    . version 13

    .
    . clear

    .
    .
    .
    . set obs 1
    obs was 0, now 1

    . generate str test1 = "br01 br02 br03 br04 br05 br06 br07 br08 br09 br10 br11 br12 br13 br14 br15 br16 br17 br18
    > br19 br20 bottling bottling " in 1

    . generate str test2 = "br01 br02 br03 br04 br05 br06 br07 br08 br09 br10 br11 br12 br13 br14 br15 br16 br17 br18
    > br19 bottling bottling " in 1

    . egen check1_nss=nss(test1), find("bottling")

    . egen check2_nss=nss(test2), find("bottling")

    . egen check1_noccur=noccur(test1), string("bottling")

    . egen check2_noccur=noccur(test2), string("bottling")

    .
    . tab check1_nss check1_noccur

               | check1_noc
               |    cur
    check1_nss |         2 |     Total
    -----------+-----------+----------
             1 |         1 |         1
    -----------+-----------+----------
         Total |         1 |         1


    . tab check2_nss check2_noccur

               | check2_noc
               |    cur
    check2_nss |         2 |     Total
    -----------+-----------+----------
             2 |         1 |         1
    -----------+-----------+----------
         Total |         1 |         1


    Regards,

    Carlos Eduardo Hernandez

  • #2
    nss() was written for Stata 6 in 2000, so it is not greatly surprising that it does not work with the long strings introduced in Stata 13 in 2013. I will look at the code to try to identify precisely why it doesn't work. At this juncture, it seems more important to keep these functions as possibly useful for people who have old versions of Stata than to keep them bang up to date for users of Stata 13 (and up), who have other functionality available.

    See also http://www.stata-journal.com/sjpdf.h...iclenum=dm0056 for an approach requiring no user-written code.
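
    For readers who cannot follow the truncated link, here is a minimal sketch of one way to count substrings with built-in functions only. That this matches the approach in the cited article is an assumption; the variable name n_target and the local macro target are purely illustrative. The idea is to delete every copy of the target string and divide the resulting drop in length by the length of the target.

    ```stata
    clear
    set obs 1
    generate str test1 = "br01 br02 br03 br04 br05 br06 br07 br08 br09 br10 br11 br12 br13 br14 br15 br16 br17 br18 br19 br20 bottling bottling " in 1

    * illustrative name: the substring to be counted
    local target "bottling"

    * subinstr(..., .) removes all occurrences; the length lost, divided by
    * the length of the target, is the number of non-overlapping occurrences
    generate long n_target = (length(test1) - length(subinstr(test1, "`target'", "", .))) / length("`target'")

    list n_target   // 2 for this string
    ```

    No position is ever stored in a byte or int variable here, so the storage-type limits discussed in this thread do not arise.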



    • #3
      Looking at this in more detail:

      1. My mention of long strings introduced in Stata 13 was quite irrelevant here, indeed wrong, as the strings concerned here are much shorter than even the 244 characters allowed in several recent versions of Stata.

      2. Carlos' example has unearthed a limitation, indeed bug, in the user-written egen function nss(), as the use of a byte variable to hold the position of a substring means that no substring starting after position 100, the largest value a byte can hold, can be recorded as such. (That was my program, written in 2000.)

      3. The user-written egen function noccur() uses an int variable to hold the position of a substring and as such is good up to position 32740. (That was Nick Winter's program, written in 2002.)
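
      The storage limits behind points 2 and 3 can be checked directly. This is a minimal sketch, not code taken from nss() or noccur(), and the variable names are illustrative. It relies on the documented behaviour that generate stores a value outside the range of the requested numeric type as missing.

      ```stata
      clear
      set obs 1

      * generate stores an out-of-range value as missing rather than promoting the type
      generate byte pos_byte = 101      // byte tops out at 100, so this is missing
      generate int  pos_int  = 101      // int holds 101 without trouble
      generate int  big_int  = 32741    // int tops out at 32740, so this is missing
      generate long big_long = 32741    // long holds 32741

      list
      ```

      In Carlos' test1, the first "bottling" starts at position 101, just past the byte limit, whereas in test2 it starts at position 96.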

      As the help for egenmore notes

      The inclusion of noccur() and nss(), two almost identical functions, was an act of sheer inadvertence by the maintainer.

      Thanks to Carlos for what turned out to be a bug report. I'll fix nss() within egenmore in due course.



      • #4
        I know it is years later, but Google keeps bringing me back to this thread. Now that we have entered the era of long strings, does noccur() work past position 32740?



        • #5
          It looks like it does not.

          Code:
          clear
          set obs 1
          generate strL teststring = ""
          local a = "aaaaaaaaaa" // a character repeated 10 times
          
          forv i = 1/3274{ // 3274 times 10 'a's is 32,740 'a's
              replace teststring = teststring + "`a'"
          }
          
          egen num1 = noccur(teststring), string("a")
          replace teststring = teststring + "a" // just one more a.
          egen num2 = noccur(teststring), string("a")
          
          list num1 num2, clean noobs
          Code:
          . list num1 num2, clean noobs
          
               num1   num2  
              32740      .
          Last edited by Daniel Schaefer; 08 Sep 2023, 16:08.



          • #6
            Thanks. I assume I could edit the "gen int" statements in the _gnoccur.ado file to be "gen long"?



            • #7
              If this is the latest source, I don't see any reason why that shouldn't work, but Nick Cox is the real expert here. You might just try and see how it goes. You may also need to reload the program into memory from the file system after you've made these changes.
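
              A rough sketch of the edit-and-reload workflow, assuming egenmore is installed so that _gnoccur.ado is somewhere on the adopath. Whether changing "gen int" to "gen long" is the only change needed is untested here.

              ```stata
              * locate the installed file; findfile leaves its full path in r(fn)
              findfile _gnoccur.ado

              * open it in the Do-file Editor, change "gen int" to "gen long", and save
              doedit "`r(fn)'"

              * drop automatically loaded programs so the edited file is re-read on next use
              discard

              * then rerun the earlier test (assumes teststring from #5 is still in memory)
              * egen num3 = noccur(teststring), string("a")
              ```

              Note that reinstalling the package later, for example with ssc install egenmore, replace, would overwrite a local edit like this.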



              • #8
                The function noccur() was written by Nick Winter in 2002 for Stata 6. The best advice I can give is that it has been superseded by the advice cited in #2.
