Combining strings by groups

Anton Ivanov

Join Date: Sep 2014
Posts: 267

Combining strings by groups

18 Sep 2018, 12:51

Hello! I am seeking your help with the following task. My data set (example given below) consists of comments (string) identified by listing_id and month. What would be the appropriate way of combining (appending) all comments for a given listing_id by month?

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long listing_id strL comments byte month
2515 "A"   1
2515 "B"   1
2515 "C"   2
2515 "D"   4
2515 "E"   4
2515 "F"   5
2515 "G"   6
2515 "H"   6
2515 "I"   7
2515 "J"   7
2515 "K"   7
2515 "L"   7
2515 "M"   8
2515 "N"   8
2515 "O"   9
2515 "P"  10
2515 "Q"  11
2515 "R"  11
2539 "S"  12
2595 "T"   3
2595 "U"   4
2595 "V"   5
2595 "W"   9
2595 "X"   9
3330 "Y"   1
3330 "Z"   4
3330 "AA"  4
3330 "BB"  5
3330 "CC"  5
3330 "DD"  9
3330 "EE" 10
3330 "FF" 12
3831 "GG"  1
3831 "HH"  5
3831 "II"  7
3831 "JJ"  8
3831 "KK"  8
3831 "LL"  8
3831 "MM"  8
end

Thankfully,
Anton

Last edited by Anton Ivanov; 18 Sep 2018, 13:21.

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 30116
#2

18 Sep 2018, 13:31

Do you mean this?

Code:

by listing_id month, sort: gen combined_comments = comments if _n == 1 by listing_id month: replace combined_comments = combined_comments[_n-1] /// + ", " + comments if _n > 1 & !missing(comments) by listing_id month: keep if _n == _N

If this isn't what you had in mind, please post back showing an example of what you want the result to look like and a clearer explanation of how it relates to the inputs.
1 like
Comment
Carole J. Wilson

Join Date: Jan 2015

Posts: 932
#3

18 Sep 2018, 13:35

If the maximum number of comments for a listing_id/month is not too large, I would use reshape and then combine the variables with -egen concat()-. In the case above, your maximum number of comments within a month for a listing_id is 4, so that works easily.

Code:

bysort listing_id month: gen j_var=_n reshape wide comments, i(listing_id month) j(j_var) egen all_comments=concat(comments1-comments4) *to reshape back to long: reshape long drop if comments==""

Last edited by Carole J. Wilson; 18 Sep 2018, 13:37.

Stata/MP 14.1 (64-bit x86-64)
Revision 19 May 2016
Win 8.1
1 like
Comment
Anton Ivanov

Join Date: Sep 2014

Posts: 267
#4

18 Sep 2018, 13:37

That's exactly what I meant. Thank you very much for help, Clyde.
Comment
Anton Ivanov

Join Date: Sep 2014

Posts: 267
#5

18 Sep 2018, 13:38

Carole, I also appreciate your suggestion. Thank you!
Comment

Anton Ivanov

Join Date: Sep 2014
Posts: 267

08 Jun 2019, 20:18

Hello!

I am using the approach suggested by Clyde for a new dataset. The example is given below (number of rows ~4 million; number of unique ids ~3200) :

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long fips str852 content_lemmatized
1001 "time quit lazy get get ready hit gym" 
1001 "im devil get deatsville al"           
1001 "need hurry tf amp get already"        
1001 "wont go fishing"                      
1001 "forgot rubberband og dropped"         
1001 "might well stay high school"          
1001 "dont anyone important keep around tbh"
1001 "agree crocs god nd mistake st mistake"
1001 "im sorry get distracted"              
1001 "blue eyed baby show way"              
end

However, it results in the following error:

Code:

bysort fips: gen j_var=_n
reshape wide content_lemmatized, i(fips) j(j_var)
egen all_comments=concat(comments1-comments4)

*to reshape back to long:
reshape long
drop if comments==""
______________

variable j_var takes on too many values
r(134);

end of do-file

r(134);

.

I would appreciate if anybody could suggest a plausible solution.

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35717
#7

09 Jun 2019, 05:47

Some confusion here. See #3 from Carole J. Wilson which begins with a clear reservation (emphasis added):

If the maximum number of comments for a listing_id/month is not too large, I would use reshape

So, the implication is to retreat to what Clyde Schechter originally suggested.
Comment
Anton Ivanov

Join Date: Sep 2014

Posts: 267
#8

09 Jun 2019, 09:34

Mon dieu! I just realized that mistakenly used Carole's code under the impressing that I am using Clyde's. I apologize for the confusion and my unfortunate oversight. Gotta take a break from work. Thank you, Nick!
Comment

Straso Jovanovski

Join Date: Feb 2016
Posts: 70

08 Mar 2023, 22:31

This code doesn't work for me:

Code:

sort personid surveynum, stable
foreach l of var * {
    by personid surveynum : gen `l'_concat = `l'[1]
    by personid surveynum : replace `l'_concat = `l'_concat[_n-1] + " " + `l' if _n > 1 & !missing(`l')
    by personid surveynum : replace `l'_concat = `l'_concat[_N]
    by personid surveynum : keep if _n == _N
}

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35717
#10

09 Mar 2023, 03:06

Code:

by personid surveynum :

likely identifies groups with just one observation, so in that case "doesn't work" could mean here that the code works, but doesn't give you what you want.

Perhaps you need

Code:

by personid (surveynum):

If that doesn't help then we need please (1) a data example (2) more details on what you mean by "doesn't work". FAQ Advice #12 explains both points.
1 like
Comment
Straso Jovanovski

Join Date: Feb 2016

Posts: 70
#11

09 Mar 2023, 12:02

Thank you for your help. I will try to explain in more detail, then follow up with an example if needed. My dataset is comprised of survey responses from just under 200 respondents (each with a unique identifier 'personid') surveyed on two different occasions, pre- and post- intervention (identified by variable 'surveynum'). The data has several groups of variables in which each group basically represents a single variable that accepts multiple answer choices from survey takers represented by codes, so a person for example can select one or more answer options for each of several questions. A variable may be named multichoice1, multichoice2, multichoice3, etc. Each of which contains the corresponding number or code if a person has selected it among their answer(s). For any one survey-taker there may be numbers in one, more, or all of the above sets of variables. These sets have different code ranges: some go from 1-4, some 1-6, others 1-9.
There are quite a few of these variable sets; I had to first identify them by isolating those variables whose names end in a number, then going through and removing some that did not really belong in the 'multi-choice' category to finally get at the list of 'stubs' (with the help of -levelsof- after a few preceding steps) that would go into the -reshape- operation that followed.
Then I did -reshape long- (extracting the 'codes' / numbers at the end of the sets of multi-choice variable names into a new variable 'code').
Now, the below code works perfectly (to an extent) as written (just need to -collapse- to get to the final answer from there), except that that leaves a lot of empty spaces between and around the codes, due to having cells with missing values;

Code:

sort personid surveynum, stable foreach l of var * { by personid surveynum : gen `l'_concat = `l'[1] by personid surveynum : replace `l'_concat = `l'_concat[_n-1] + " " + `l' if _n > 1 //& !missing(`l') by personid surveynum : replace `l'_concat = `l'_concat[_N] *by personid surveynum : keep if _n == _N }

when I run the 4th line above inclusive of the last, commented out part:

Code:

by personid surveynum : replace `l'_concat = `l'_concat[_n-1] + " " + `l' if _n > 1 & !missing(`l')

I just get 1s or 9s in the concatenated variables.

when I do run the above code including the 6th line:

Code:

by personid surveynum : keep if _n == _N

I end up with only a single number in the concatenated variables, most often a '9' which is the highest number option in many cases.

When I change the above to:

Code:

by personid surveynum : keep if _n == 1

I get lots of `1's, - 1 is the first option across all cases.
:
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35717
#12

09 Mar 2023, 18:03

Sorry, you gave detail as requested, but I won't try to absorb all that without an example.
Comment

Announcement

Combining strings by groups

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment