Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Combining strings by groups

    Hello! I am seeking your help with the following task. My data set (example given below) consists of comments (string) identified by listing_id and month. What would be the appropriate way of combining (appending) all comments for a given listing_id by month?

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long listing_id strL comments byte month
    2515 "A"   1
    2515 "B"   1
    2515 "C"   2
    2515 "D"   4
    2515 "E"   4
    2515 "F"   5
    2515 "G"   6
    2515 "H"   6
    2515 "I"   7
    2515 "J"   7
    2515 "K"   7
    2515 "L"   7
    2515 "M"   8
    2515 "N"   8
    2515 "O"   9
    2515 "P"  10
    2515 "Q"  11
    2515 "R"  11
    2539 "S"  12
    2595 "T"   3
    2595 "U"   4
    2595 "V"   5
    2595 "W"   9
    2595 "X"   9
    3330 "Y"   1
    3330 "Z"   4
    3330 "AA"  4
    3330 "BB"  5
    3330 "CC"  5
    3330 "DD"  9
    3330 "EE" 10
    3330 "FF" 12
    3831 "GG"  1
    3831 "HH"  5
    3831 "II"  7
    3831 "JJ"  8
    3831 "KK"  8
    3831 "LL"  8
    3831 "MM"  8
    end
    Thankfully,
    Anton
    Last edited by Anton Ivanov; 18 Sep 2018, 13:21.

  • #2
    Do you mean this?

    Code:
    by listing_id month, sort: gen combined_comments = comments if _n == 1
    by listing_id month: replace combined_comments = combined_comments[_n-1] ///
        + ", " + comments if _n > 1 & !missing(comments)
    by listing_id month: keep if _n == _N
    If this isn't what you had in mind, please post back showing an example of what you want the result to look like and a clearer explanation of how it relates to the inputs.

    Comment


    • #3
      If the maximum number of comments for a listing_id/month is not too large, I would use reshape and then combine the variables with -egen concat()-. In the case above, your maximum number of comments within a month for a listing_id is 4, so that works easily.


      Code:
      bysort listing_id month: gen j_var=_n
      reshape wide comments, i(listing_id month) j(j_var)
      egen all_comments=concat(comments1-comments4)
      
      *to reshape back to long:
      reshape long
      drop if comments==""
      Last edited by Carole J. Wilson; 18 Sep 2018, 13:37.
      Stata/MP 14.1 (64-bit x86-64)
      Revision 19 May 2016
      Win 8.1

      Comment


      • #4
        That's exactly what I meant. Thank you very much for help, Clyde.

        Comment


        • #5
          Carole, I also appreciate your suggestion. Thank you!

          Comment


          • #6
            Hello!

            I am using the approach suggested by Clyde for a new dataset. The example is given below (number of rows ~4 million; number of unique ids ~3200) :
            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input long fips str852 content_lemmatized
            1001 "time quit lazy get get ready hit gym" 
            1001 "im devil get deatsville al"           
            1001 "need hurry tf amp get already"        
            1001 "wont go fishing"                      
            1001 "forgot rubberband og dropped"         
            1001 "might well stay high school"          
            1001 "dont anyone important keep around tbh"
            1001 "agree crocs god nd mistake st mistake"
            1001 "im sorry get distracted"              
            1001 "blue eyed baby show way"              
            end
            However, it results in the following error:

            Code:
            bysort fips: gen j_var=_n
            reshape wide content_lemmatized, i(fips) j(j_var)
            egen all_comments=concat(comments1-comments4)
            
            *to reshape back to long:
            reshape long
            drop if comments==""
            ______________
            
            variable j_var takes on too many values
            r(134);
            
            end of do-file
            
            r(134);
            
            .
            I would appreciate if anybody could suggest a plausible solution.


            Comment


            • #7
              Some confusion here. See #3 from Carole J. Wilson which begins with a clear reservation (emphasis added):


              If the maximum number of comments for a listing_id/month is not too large, I would use reshape
              So, the implication is to retreat to what Clyde Schechter originally suggested.

              Comment


              • #8
                Mon dieu! I just realized that mistakenly used Carole's code under the impressing that I am using Clyde's. I apologize for the confusion and my unfortunate oversight. Gotta take a break from work. Thank you, Nick!

                Comment


                • #9
                  This code doesn't work for me:
                  Code:
                  sort personid surveynum, stable
                  foreach l of var * {
                      by personid surveynum : gen `l'_concat = `l'[1]
                      by personid surveynum : replace `l'_concat = `l'_concat[_n-1] + " " + `l' if _n > 1 & !missing(`l')
                      by personid surveynum : replace `l'_concat = `l'_concat[_N]
                      by personid surveynum : keep if _n == _N
                  }

                  Comment


                  • #10
                    Code:
                    by personid surveynum :
                    likely identifies groups with just one observation, so in that case "doesn't work" could mean here that the code works, but doesn't give you what you want.

                    Perhaps you need

                    Code:
                    by personid (surveynum):
                    If that doesn't help then we need please (1) a data example (2) more details on what you mean by "doesn't work". FAQ Advice #12 explains both points.

                    Comment


                    • #11
                      Thank you for your help. I will try to explain in more detail, then follow up with an example if needed. My dataset is comprised of survey responses from just under 200 respondents (each with a unique identifier 'personid') surveyed on two different occasions, pre- and post- intervention (identified by variable 'surveynum'). The data has several groups of variables in which each group basically represents a single variable that accepts multiple answer choices from survey takers represented by codes, so a person for example can select one or more answer options for each of several questions. A variable may be named multichoice1, multichoice2, multichoice3, etc. Each of which contains the corresponding number or code if a person has selected it among their answer(s). For any one survey-taker there may be numbers in one, more, or all of the above sets of variables. These sets have different code ranges: some go from 1-4, some 1-6, others 1-9.
                      There are quite a few of these variable sets; I had to first identify them by isolating those variables whose names end in a number, then going through and removing some that did not really belong in the 'multi-choice' category to finally get at the list of 'stubs' (with the help of -levelsof- after a few preceding steps) that would go into the -reshape- operation that followed.
                      Then I did -reshape long- (extracting the 'codes' / numbers at the end of the sets of multi-choice variable names into a new variable 'code').
                      Now, the below code works perfectly (to an extent) as written (just need to -collapse- to get to the final answer from there), except that that leaves a lot of empty spaces between and around the codes, due to having cells with missing values;
                      Code:
                      sort personid surveynum, stable
                      foreach l of var * {
                          by personid surveynum : gen `l'_concat = `l'[1]
                          by personid surveynum : replace `l'_concat = `l'_concat[_n-1] + " " + `l' if _n > 1 //& !missing(`l')
                          by personid surveynum : replace `l'_concat = `l'_concat[_N]
                          *by personid surveynum : keep if _n == _N
                      }
                      when I run the 4th line above inclusive of the last, commented out part:
                      Code:
                      by personid surveynum : replace `l'_concat = `l'_concat[_n-1] + " " + `l' if _n > 1 & !missing(`l')
                      I just get 1s or 9s in the concatenated variables.

                      when I do run the above code including the 6th line:
                      Code:
                      by personid surveynum : keep if _n == _N
                      I end up with only a single number in the concatenated variables, most often a '9' which is the highest number option in many cases.

                      When I change the above to:
                      Code:
                      by personid surveynum : keep if _n == 1
                      I get lots of `1's, - 1 is the first option across all cases.
                      :

                      Comment


                      • #12
                        Sorry, you gave detail as requested, but I won't try to absorb all that without an example.

                        Comment

                        Working...
                        X