
  • Panel data: Identify recurrent strings across columns

    Dear Statalist,

    I have a large dataset that contains several million emails (one row per email sent). I am trying to create a variable that indicates whether a given email sender has written to an email recipient before. To illustrate the problem, I created a short example:

    message_id  sender_id    time              recipient_1  recipient_2  recipient_3  num_recipients  new_recipients
    1           john_peters  31.05.2010 02:57  abc          def          ghi          3               3
    2           john_peters  31.05.2010 03:47  klm                                    1               1
    3           john_peters  31.05.2010 03:54  nop          ghi                       2               1
    4           john_peters  31.05.2010 05:04  abc          klm                       2               0
    5           john_peters  31.05.2010 06:28  def                                    1               0
    6           john_peters  31.05.2010 06:42  abc          klm                       2               0
    7           john_peters  31.05.2010 06:59  qrs          def                       2               1
    8           mary_clark   01.06.2010 06:28  abc          klm                       2               2
    9           mary_clark   01.06.2010 06:42  tuv                                    1               1
    10          mary_clark   01.06.2010 06:59  tuv          xyz                       2               1

    The variable message_id is a unique identifier for each email.
    The variable sender_id is a unique identifier for each email sender.
    The variable time is the time stamp when the email was sent.
    The variables recipient_* are unique string identifiers for each email recipient.
    The variable num_recipients is the number of recipients (= rowtotal of recipient_*).

    The variable new_recipients is the variable of interest. It counts the number of email recipients in a given email that have not received an email from a given sender before. For instance, in message 4, john_peters sent an email to recipients abc and klm. Given that john_peters had already sent a message to abc before (see message 1) but not to klm, the number of new recipients is one for message 4.

    It would be relatively easy to create the variable new_recipients using a reshape command. However, given the size of the dataset, reshape is too time-consuming and memory-intensive. Thus, I was wondering whether there is an alternative solution (maybe a loop) that allows me to create the variable new_recipients in a more efficient manner.
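    For reference, the reshape approach I would like to avoid looks roughly like this (a sketch that is correct on the example above but too slow on the full data):

    Code:
    reshape long recipient_, i(message_id) j(slot)
    gen byte first = 0
    bysort sender_id recipient_ (time): replace first = 1 if _n == 1 & recipient_ != ""
    bysort message_id: egen new_recipients = total(first)
    drop first
    reshape wide recipient_, i(message_id) j(slot)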

    Best,

    Marvin

  • #2
    Please use dataex for data examples (FAQ Advice #12).

    The number of non-missing values is given by the egen function rownonmiss(). See the help for egen.
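    For the example posted, that is (new variable name illustrative):

    Code:
    egen num_recipients2 = rownonmiss(recipient_*), strok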

    Getting the number of distinct string values across rows is discussed in Section 7 of https://www.stata-journal.com/sjpdf....iclenum=pr0046 -- otherwise identifiable as

    SJ-9-1 pr0046 . . . . . . . . . . . . . . . . . . . Speaking Stata: Rowwise
    (help rowsort, rowranks if installed) . . . . . . . . . . . N. J. Cox
    Q1/09 SJ 9(1):137--157
    shows how to exploit functions, egen functions, and Mata
    for working rowwise; rowsort and rowranks are introduced

    The code for the egen function rowsvals() is part of the egenmore package on SSC, which can be installed with

    Code:
    ssc install egenmore
    after which

    Code:
    help egenmore


    provides some documentation. Don't expect much efficiency, however, whatever that means; the price of your data structure is a loop over observations.
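    Applied to the example data, a rowwise count of distinct recipient strings would then look something like this (a sketch; see help egenmore for the exact syntax of rowsvals()):

    Code:
    egen n_distinct = rowsvals(recipient_1 recipient_2 recipient_3)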



    • #3
      Dear Nick,

      Thank you for your response. I repost the above example using dataex below:

      Code:
      clear
      input byte message_id str11 sender_id double time str3(recipient_1 recipient_2 recipient_3) byte(num_recipients new_recipients)
       1 "john_peters"      1590893820000 "abc" "def" "ghi" 3 3
       2 "john_peters" 1590896819999.9998 "klm" ""    ""    1 1
       3 "john_peters" 1590897239999.9998 "nop" "ghi" ""    2 1
       4 "john_peters" 1590901439999.9998 "abc" "klm" ""    2 0
       5 "john_peters" 1590906479999.9998 "def" ""    ""    1 0
       6 "john_peters"      1590907320000 "abc" "klm" ""    2 0
       7 "john_peters" 1590908340000.0002 "qrs" "def" ""    2 1
       8 "mary_clark"  1590992879999.9998 "abc" "klm" ""    2 2
       9 "mary_clark"       1590993720000 "tuv" ""    ""    1 1
      10 "mary_clark"  1590994740000.0002 "tuv" "xyz" ""    2 1
      end
      format %tcnn/dd/ccYY_hh:MM time

      I read your article and I found it helpful. However, I am still at a loss with my problem. I am looking for a solution that is similar to what you have described here:
      https://www.stata.com/support/faqs/d...t-occurrences/

      with the exception that the variable "state" in your example is split across different columns in my case. I am very sorry if my problem seems trivial but even after days of searching forum entries, I still could not find a solution that fits this problem (other than reshaping the dataset first).
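      For a single recipient column, the technique from that FAQ would be straightforward (a sketch with a hypothetical long-format variable recipient):

      Code:
      bysort sender_id recipient (time): gen byte first_contact = _n == 1

      My difficulty is that the recipients are spread across the columns recipient_1 to recipient_3.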

      Best,

      Marvin



      • #4
        Dear Statalist,

        Following up on my earlier question, I wanted to share a solution to the problem above, in case someone else encounters a similar problem with large datasets. It is certainly not the most elegant way, but it works.


        Code:
        *** 1) Input data from above
        
        *** 2) Split dataset by sender
        egen split_id_2 = group(sender_id)
         xtile split_id = split_id_2, nq(2)    // Here the dataset is split in two; other values are possible by changing nq().
        sum split_id
        local max = r(max)
        save main, replace
        
        forval i = 1/`max' {
            use main if split_id == `i', clear
            keep message_id sender_id time recipient_1 recipient_2 recipient_3
            save split_id`i', replace
        }
        erase main.dta
        
        *** 3) Create variable "num_recipients" and "new_recipients"
        local files: dir . files "*.dta"
        
         foreach aaa of local files {
             use `aaa', clear
             egen num_recipients = rownonmiss(recipient_*), strok
             reshape long recipient_, i(message_id) j(receivers)
             gen new_recipients_temp = 0
             bysort sender_id recipient_ (time): replace new_recipients_temp = 1 if _n == 1 & recipient_ != ""
             reshape wide recipient_ new_recipients_temp, i(message_id) j(receivers)
             egen new_recipients = rowtotal(new_recipients_temp*)
             drop new_recipients_temp*
             save `aaa', replace
         }
        
        *** 4) Recombine and save dataset
        use split_id1, clear
        foreach num of numlist 2/`max' {
            append using split_id`num'
        }
        save dataset, replace
        
        foreach num of numlist 1/`max' {
            erase split_id`num'.dta
        }

        If there is a better way to create these variables without splitting the dataset first, I would be curious to learn about it.
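        Edit: one possible alternative that avoids both reshape and the file splitting would be a single pass in Mata, keeping already-seen (sender, recipient) pairs in an associative array (a sketch based on the example above, not benchmarked; variable names are illustrative):

        Code:
        sort sender_id time message_id
        gen long new_recipients2 = 0
        mata:
        snd = st_sdata(., "sender_id")
        rec = st_sdata(., ("recipient_1", "recipient_2", "recipient_3"))
        cnt = J(rows(rec), 1, 0)
        A = asarray_create()                      // keys are sender|recipient strings
        for (i = 1; i <= rows(rec); i++) {
            for (j = 1; j <= cols(rec); j++) {
                if (rec[i, j] != "") {
                    key = snd[i] + "|" + rec[i, j]
                    if (!asarray_contains(A, key)) {   // first contact for this pair
                        asarray(A, key, 1)
                        cnt[i] = cnt[i] + 1
                    }
                }
            }
        }
        st_store(., "new_recipients2", cnt)
        end

        This visits each observation only once and needs no reshape, at the cost of holding one key per (sender, recipient) pair in memory.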

        Best,

        Marvin



        • #5
          Marvin Hanisch

          Thanks for coming back and posting the solution to your problem! I'm sure this will benefit many people going forward!
