
  • Panel data: Identify recurrent strings across columns

    Dear Statalist,

    I have a large dataset that contains several million emails (one row per email sent). I am trying to create a variable that indicates whether a given email sender has written to an email recipient before. To illustrate the problem, I created a short example:

    message_id  sender_id    time              recipient_1  recipient_2  recipient_3  num_recipients  new_recipients
    1           john_peters  31.05.2010 02:57  abc          def          ghi          3               3
    2           john_peters  31.05.2010 03:47  klm                                    1               1
    3           john_peters  31.05.2010 03:54  nop          ghi                       2               1
    4           john_peters  31.05.2010 05:04  abc          klm                       2               0
    5           john_peters  31.05.2010 06:28  def                                    1               0
    6           john_peters  31.05.2010 06:42  abc          klm                       2               0
    7           john_peters  31.05.2010 06:59  qrs          def                       2               1
    8           mary_clark   01.06.2010 06:28  abc          klm                       2               2
    9           mary_clark   01.06.2010 06:42  tuv                                    1               1
    10          mary_clark   01.06.2010 06:59  tuv          xyz                       2               1

    The variable message_id is a unique identifier for each email.
    The variable sender_id is a unique identifier for each email sender.
    The variable time is the time stamp when the email was sent.
    The variables recipient_* are unique string identifiers for each email recipient.
    The variable num_recipients is the number of recipients (= rowtotal of recipient_*).

    The variable new_recipients is the variable of interest. It counts the number of email recipients in a given email that have not received an email from a given sender before. For instance, in message 4, john_peters sent an email to recipients abc and klm. Given that john_peters had already sent a message to abc before (see message 1) but not to klm, the number of new recipients is one for message 4.

    It would be relatively easy to create the variable new_recipients using a reshape command. However, given the size of the dataset, reshape is too time-consuming and memory-intensive. Thus, I was wondering whether there is an alternative solution (maybe a loop) that allows me to create the variable new_recipients in a more efficient manner.
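    For reference, the reshape approach I would like to avoid looks roughly like this (a sketch that is correct on the example above but too slow on the full data):

    Code:
    reshape long recipient_, i(message_id) j(slot)
    gen byte first = 0
    bysort sender_id recipient_ (time): replace first = 1 if _n == 1 & recipient_ != ""
    bysort message_id: egen new_recipients = total(first)
    drop first
    reshape wide recipient_, i(message_id) j(slot)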

    Best,

    Marvin

  • #2
    Please use dataex for data examples (FAQ Advice #12).

    The number of non-missing values is given by the egen function rownonmiss(). See the help for egen.
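    For the example posted, that is (new variable name illustrative):

    Code:
    egen num_recipients2 = rownonmiss(recipient_*), strok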

    Getting the number of distinct string values across rows is discussed in Section 7 of https://www.stata-journal.com/sjpdf....iclenum=pr0046 -- otherwise identifiable as

    SJ-9-1 pr0046 . . . . . . . . . . . . . . . . . . . Speaking Stata: Rowwise
    (help rowsort, rowranks if installed) . . . . . . . . . . . N. J. Cox
    Q1/09 SJ 9(1):137--157
    shows how to exploit functions, egen functions, and Mata
    for working rowwise; rowsort and rowranks are introduced

    The code for the egen function rowsvals() is part of the egenmore package on SSC, which can be installed with

    Code:
    ssc install egenmore
    after which

    Code:
    help egenmore


    provides some documentation. Don't expect much efficiency, however, whatever that means; the price of your data structure is a loop over observations.
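    Applied to the example data, a rowwise count of distinct recipient strings would then look something like this (a sketch; see help egenmore for the exact syntax of rowsvals()):

    Code:
    egen n_distinct = rowsvals(recipient_1 recipient_2 recipient_3)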



    • #3
      Dear Nick,

      Thank you for your response. I repost the above example using dataex below:

      Code:
      clear
      input byte message_id str11 sender_id double time str3(recipient_1 recipient_2 recipient_3) byte(num_recipients new_recipients)
       1 "john_peters"      1590893820000 "abc" "def" "ghi" 3 3
       2 "john_peters" 1590896819999.9998 "klm" ""    ""    1 1
       3 "john_peters" 1590897239999.9998 "nop" "ghi" ""    2 1
       4 "john_peters" 1590901439999.9998 "abc" "klm" ""    2 0
       5 "john_peters" 1590906479999.9998 "def" ""    ""    1 0
       6 "john_peters"      1590907320000 "abc" "klm" ""    2 0
       7 "john_peters" 1590908340000.0002 "qrs" "def" ""    2 1
       8 "mary_clark"  1590992879999.9998 "abc" "klm" ""    2 2
       9 "mary_clark"       1590993720000 "tuv" ""    ""    1 1
      10 "mary_clark"  1590994740000.0002 "tuv" "xyz" ""    2 1
      end
      format %tcnn/dd/ccYY_hh:MM time

      I read your article and I found it helpful. However, I am still at a loss with my problem. I am looking for a solution that is similar to what you have described here:
      https://www.stata.com/support/faqs/d...t-occurrences/

      with the exception that the variable "state" in your example is split across different columns in my case. I am very sorry if my problem seems trivial but even after days of searching forum entries, I still could not find a solution that fits this problem (other than reshaping the dataset first).
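      For a single recipient column, the technique from that FAQ would be straightforward (a sketch with a hypothetical long-format variable recipient):

      Code:
      bysort sender_id recipient (time): gen byte first_contact = _n == 1

      My difficulty is that the recipients are spread across the columns recipient_1 to recipient_3.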

      Best,

      Marvin



      • #4
        Dear Statalist,

        Following up on my earlier question, I wanted to share a solution to the problem above, in case someone else encounters a similar problem with large datasets. It is certainly not the most elegant way, but it works.


        Code:
        *** 1) Input data from above
        
        *** 2) Split dataset by sender
        egen split_id_2 = group(sender_id)
         xtile split_id = split_id_2, nq(2)    // Here the dataset is split in two; other values are possible by changing nq().
        sum split_id
        local max = r(max)
        save main, replace
        
        forval i = 1/`max' {
            use main if split_id == `i', clear
            keep message_id sender_id time recipient_1 recipient_2 recipient_3
            save split_id`i', replace
        }
        erase main.dta
        
        *** 3) Create variable "num_recipients" and "new_recipients"
        local files: dir . files "*.dta"
        
         foreach aaa of local files {
             use `aaa', clear
             egen num_recipients = rownonmiss(recipient_*), strok
             reshape long recipient_, i(message_id) j(receivers)
             gen new_recipients_temp = 0
             bysort sender_id recipient_ (time): replace new_recipients_temp = 1 if _n == 1 & recipient_ != ""
             reshape wide recipient_ new_recipients_temp, i(message_id) j(receivers)
             egen new_recipients = rowtotal(new_recipients_temp*)
             drop new_recipients_temp*
             save `aaa', replace
         }
        
        *** 4) Recombine and save dataset
        use split_id1, clear
        foreach num of numlist 2/`max' {
            append using split_id`num'
        }
        save dataset, replace
        
        foreach num of numlist 1/`max' {
            erase split_id`num'.dta
        }

        If there is a better way to create these variables without splitting the dataset first, I would be curious to learn about it.
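        Edit: one possible alternative that avoids both reshape and the file splitting would be a single pass in Mata, keeping already-seen (sender, recipient) pairs in an associative array (a sketch based on the example above, not benchmarked; variable names are illustrative):

        Code:
        sort sender_id time message_id
        gen long new_recipients2 = 0
        mata:
        snd = st_sdata(., "sender_id")
        rec = st_sdata(., ("recipient_1", "recipient_2", "recipient_3"))
        cnt = J(rows(rec), 1, 0)
        A = asarray_create()                      // keys are sender|recipient strings
        for (i = 1; i <= rows(rec); i++) {
            for (j = 1; j <= cols(rec); j++) {
                if (rec[i, j] != "") {
                    key = snd[i] + "|" + rec[i, j]
                    if (!asarray_contains(A, key)) {   // first contact for this pair
                        asarray(A, key, 1)
                        cnt[i] = cnt[i] + 1
                    }
                }
            }
        }
        st_store(., "new_recipients2", cnt)
        end

        This visits each observation only once and needs no reshape, at the cost of holding one key per (sender, recipient) pair in memory.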

        Best,

        Marvin



        • #5
          Marvin Hanisch

          Thanks for coming back and posting the solution to your problem! I'm sure this will benefit many people going forward!
