  • Counting the appearance of numbers

    Hi,

    The following dataset contains numbers stored as strings, separated by either a comma (,) or a dot (.). For every variable, I would like to create a sum variable that counts how many numbers appear in that variable.
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str3 AcqPRitself str1 IM_ER_pos str5 IM_ER_neut str1(IM_ER_neg IM_EG_pos) str2 IM_EG_neut str1(IM_EG_neg IM_Div_pos) str3 IM_Div_neut str1 IM_Div_neg str3 IM_SEO str5 IM_Conference str4 IM_Sh_Meeting str1 IM_hiring str14 IM_new_prod
    "4"  "" "5"     "" "" "" "" ""  ""    "" ""  ""  ""  ""  ""    
    "1"  "" ""      "" "" "" "" ""  ""    "" ""  ""  ""  ""  ""    
    "9"  "" ""      "" "" "" "" ""  ""    "" ""  ""  ""  "8" "7.34"
    "1"  "" ""      "" "" "" "" ""  ""    "" ""  ""  ""  ""  ""    
    "4"  "" ""      "" "" "" "" ""  ""    "" ""  "6" ""  ""  ""    
    "2"  "" ""      "" "" "" "" ""  ""    "" "1" ""  ""  ""  ""    
    "10" "" ""      "" "" "" "" ""  ""    "" ""  ""  ""  ""  "12"  
    "1"  "" ""      "" "" "" "" ""  ""    "" ""  ""  ""  ""  ""    
    "3"  "" ""      "" "" "" "" ""  ""    "" ""  ""  ""  ""  ""    
    "2"  "" "3,4,5" "" "" "" "" ""  "3,8" "" ""  ""  "5" ""  ""    
    "2"  "" ""      "" "" "" "" ""  ""    "" ""  ""  ""  ""  ""    
    "1"  "" ""      "" "" "" "" ""  ""    "" ""  ""  ""  ""  ""    
    "2"  "" "3"     "" "" "" "" ""  ""    "" ""  ""  ""  ""  ""    
    "1"  "" ""      "" "" "" "" ""  ""    "" ""  ""  ""  ""  ""    
    "7"  "" "8"     "" "" "" "" ""  ""    "" ""  ""  ""  ""  ""    
    "3"  "" "5"     "" "" "" "" ""  "4"   "" ""  ""  ""  ""  ""    
    "3"  "" ""      "" "" "" "" ""  ""    "" ""  ""  ""  ""  ""    
    "5"  "" "4"     "" "" "" "" "2" ""    "" ""  ""  ""  ""  "3.1" 
    "2"  "" "1"     "" "" "" "" ""  ""    "" ""  ""  ""  ""  ""    
    "3"  "" ""      "" "" "" "" ""  ""    "" ""  ""  "4" ""  ""    
    "5"  "" "6"     "" "" "" "" ""  ""    "" ""  ""  ""  ""  ""    
    "2"  "" ""      "" "" "" "" ""  ""    "" ""  ""  ""  ""  ""    
    "3"  "" "6"     "" "" "" "" ""  ""    "" ""  ""  ""  ""  ""    
    "14" "" ""      "" "" "" "" ""  ""    "" ""  ""  ""  ""  ""    
    "3"  "" ""      "" "" "" "" ""  ""    "" ""  ""  ""  ""  ""    
    "2"  "" ""      "" "" "" "" ""  ""    "" ""  ""  ""  ""  ""    
    "3"  "" ""      "" "" "" "" ""  ""    "" ""  ""  ""  ""  ""    
    "20" "" ""      "" "" "" "" ""  ""    "" ""  ""  ""  ""  ""    
    "4"  "" ""      "" "" "" "" ""  ""    "" ""  ""  "5" ""  ""    
    "1"  "" ""      "" "" "" "" ""  ""    "" ""  ""  ""  ""  ""    
    end

    Example:
    For the first observation every sum variable should take on 0 except the sum variables for AcqPRitself and IM_ER_neut. In these two cases, the sum variables should take on the value 1 because each one contains 1 number.
    In the third observation, the sum variables of AcqPRitself and IM_hiring should take on the value 1. The sum variable for IM_new_prod has to equal 2, as two numbers (7 and 34) appear here. All other sum variables should equal 0.

    Is there an easier way than using the split command for every variable?

    Thanks in advance

  • #2
    This should do the job, though it is a bit clunky. You'll need to install egenmore first; its noccur() function is used here to count the periods and commas:

    Code:
    ssc install egenmore
    And then here is the loop:

    Code:
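    foreach x of varlist AcqPRitself-IM_new_prod {
        * Count the commas and periods in each value (noccur() comes from egenmore)
        egen numcomma  = noccur(`x'), string(,)
        egen numperiod = noccur(`x'), string(.)
        gen  wordcount = strlen(`x')
        * A nonempty value with no separator holds exactly one number
        gen n_`x' = 1 if wordcount > 0 & numcomma == 0 & numperiod == 0
        * Otherwise the count is the number of separators plus one
        replace n_`x' = numcomma + numperiod + 1 if missing(n_`x') & wordcount > 0
        * Empty strings contain no numbers
        replace n_`x' = 0 if missing(n_`x')
        drop numcomma numperiod wordcount
    }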
    Sample results:
    Code:
    . list AcqPRitself IM_ER_neut IM_new_prod n_AcqPRitself n_IM_ER_neut n_IM_new_prod, sep(0)
    
         +-----------------------------------------------------------------+
         | AcqPRi~f   IM_ER_~t   IM_new~d   n_AcqP~f   n~R_neut   n_IM_n~d |
         |-----------------------------------------------------------------|
      1. |        4          5                     1          1          0 |
      2. |        1                                1          0          0 |
      3. |        9                  7.34          1          0          2 |
      4. |        1                                1          0          0 |
      5. |        4                                1          0          0 |
      6. |        2                                1          0          0 |
      7. |       10                    12          1          0          1 |
      8. |        1                                1          0          0 |
      9. |        3                                1          0          0 |
     10. |        2      3,4,5                     1          3          0 |
     11. |        2                                1          0          0 |
     12. |        1                                1          0          0 |
     13. |        2          3                     1          1          0 |
     14. |        1                                1          0          0 |
     15. |        7          8                     1          1          0 |
     16. |        3          5                     1          1          0 |
     17. |        3                                1          0          0 |
     18. |        5          4        3.1          1          1          2 |
     19. |        2          1                     1          1          0 |
     20. |        3                                1          0          0 |
     21. |        5          6                     1          1          0 |
     22. |        2                                1          0          0 |
     23. |        3          6                     1          1          0 |
     24. |       14                                1          0          0 |
     25. |        3                                1          0          0 |
     26. |        2                                1          0          0 |
     27. |        3                                1          0          0 |
     28. |       20                                1          0          0 |
     29. |        4                                1          0          0 |
     30. |        1                                1          0          0 |
         +-----------------------------------------------------------------+
    Last edited by Ken Chui; 23 Nov 2022, 06:51.
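If installing egenmore is not an option, the same "separators plus one" logic can be sketched with built-in string functions only. This is a hedged illustration, not the poster's code: it uses a three-row excerpt of the example data and assumes every nonempty value is a well-formed list such as "3,4,5" or "7.34".

```stata
clear
input str5 IM_ER_neut str14 IM_new_prod
"5"     ""
""      "7.34"
"3,4,5" ""
end

foreach x of varlist IM_ER_neut IM_new_prod {
    * Characters lost when commas and periods are deleted = number of separators
    gen int n_`x' = strlen(`x') - strlen(subinstr(subinstr(`x', ",", "", .), ".", "", .))
    * A nonempty value holds one more number than it has separators
    replace n_`x' = n_`x' + 1 if `x' != ""
}

list, sep(0)
```

Empty strings keep a count of 0 because both lengths are 0 and the replace step skips them.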



    • #3
      Here is another way, with the advantage of using only built-in Stata commands:

      Code:
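      foreach var of varlist AcqPRitself-IM_new_prod {
          * Turn both separators into blanks so each number becomes a word
          gen _`var' = subinstr(`var',","," ",.)
          replace _`var' = subinstr(_`var',"."," ",.)
          * wordcount() returns 0 for an empty string
          gen int n_`var' = wordcount(_`var')
          drop _`var'
      }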
      Sample output matching #2:

      Code:
      . list AcqPRitself IM_ER_neut IM_new_prod n_AcqPRitself n_IM_ER_neut n_IM_new_prod, sep(0)
      
           +-----------------------------------------------------------------+
           | AcqPRi~f   IM_ER_~t   IM_new~d   n_AcqP~f   n~R_neut   n_IM_n~d |
           |-----------------------------------------------------------------|
        1. |        4          5                     1          1          0 |
        2. |        1                                1          0          0 |
        3. |        9                  7.34          1          0          2 |
        4. |        1                                1          0          0 |
        5. |        4                                1          0          0 |
        6. |        2                                1          0          0 |
        7. |       10                    12          1          0          1 |
        8. |        1                                1          0          0 |
        9. |        3                                1          0          0 |
       10. |        2      3,4,5                     1          3          0 |
       11. |        2                                1          0          0 |
       12. |        1                                1          0          0 |
       13. |        2          3                     1          1          0 |
       14. |        1                                1          0          0 |
       15. |        7          8                     1          1          0 |
       16. |        3          5                     1          1          0 |
       17. |        3                                1          0          0 |
       18. |        5          4        3.1          1          1          2 |
       19. |        2          1                     1          1          0 |
       20. |        3                                1          0          0 |
       21. |        5          6                     1          1          0 |
       22. |        2                                1          0          0 |
       23. |        3          6                     1          1          0 |
       24. |       14                                1          0          0 |
       25. |        3                                1          0          0 |
       26. |        2                                1          0          0 |
       27. |        3                                1          0          0 |
       28. |       20                                1          0          0 |
       29. |        4                                1          0          0 |
       30. |        1                                1          0          0 |
           +-----------------------------------------------------------------+
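One point worth checking before relying on the word-count approach is how it treats malformed entries. The sketch below uses made-up values (the variable name raw and the entries "3,,4" and "5," are hypothetical and do not occur in the posted data) to show that repeated or trailing separators do not create extra words, whereas a "separators plus one" count would report 3 and 2 for those two entries.

```stata
clear
input str6 raw
"3,4,5"
"7.34"
"3,,4"
"5,"
""
end

* Same idea as the loop above, written as one nested expression:
* replace commas and periods with blanks, then count the words
gen int n_word = wordcount(subinstr(subinstr(raw, ",", " ", .), ".", " ", .))

* Counts by row: 3, 2, 2, 1, 0
list, sep(0)
```

Which behavior is preferable depends on whether an empty slot between two separators should count as a number; the thread does not say, so this is left as a judgment call.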



      • #4
        Both ways work like a charm. Thank you.



        • #5
          Incidentally, if you're okay with using regular expressions, the code could have been even shorter:

          Code:
          foreach var of varlist AcqPRitself-IM_new_prod {
              gen int n_`var' = wordcount(ustrregexra(`var',",|\."," "))
          }
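A small variation, offered as a sketch rather than a correction: the alternation `,|\.` can be written as the character class `[,.]`, where the period needs no escape. The example runs on a four-row excerpt of the posted data and uses assert to spot-check the counts stated earlier in the thread. Note that the ustrregex functions require a Stata version with Unicode support (Stata 14 or later, to the best of my knowledge).

```stata
clear
input str5 IM_ER_neut str14 IM_new_prod
"5"     ""
""      "7.34"
"3,4,5" ""
""      "12"
end

foreach var of varlist IM_ER_neut IM_new_prod {
    * [,.] matches a single comma or a single period
    gen int n_`var' = wordcount(ustrregexra(`var', "[,.]", " "))
}

* Spot-checks: "7.34" holds two numbers, "3,4,5" three, "" none
assert n_IM_new_prod == 2 in 2
assert n_IM_ER_neut  == 3 in 3
assert n_IM_ER_neut  == 0 in 2

list, sep(0)
```

If any assert fails, Stata stops with an error, which makes this a cheap guard to leave in a do-file while the data are still changing.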
