  • How can I return the last distinct substring in a string?

    I use Stata for some of my data analysis. I wanted to ask how to go about solving this problem: I would like to return the last distinct substring in a given string.

    For example, given the string "orange, banana, melon, cocoa, cocoa", you'll agree that the last distinct substring is cocoa, which is the result I want.

    Also, we could have a string like "orange,cocoa, banana, orange, orange", and I would like to return the index of the last stable substring in the string (i.e. orange, because no other substring appears after orange at the end of the string).

    I look forward to your response.

    Many thanks,
    Last edited by Richie Oyeleke; 14 Feb 2020, 10:30.

  • #2
    I can think of various Stata functions that might be relevant to your problem (e.g., strrpos(), strpos(), wordcount(), word()); the -split- command might also be useful. Check the help on those items. However, I have no idea what you mean by "last distinct substring," or what the "last stable substring" is. "Cocoa" recurs in your first example, which would not fit any ordinary sense of "distinct." And I don't get what is stable (as opposed to "unstable") about "orange." I think you'll need to get a colleague to help you express those ideas in a different way. Perhaps someone else here on Statalist will understand.
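For reference, a minimal sketch of what the functions named above return on one of the example strings. It assumes a Stata version that provides strrpos() (Stata 14 or later, as far as I know), and the local macro names s and t are chosen here for illustration. Note that word() and wordcount() split on blanks, not commas, so the commas are first replaced with blanks; that shortcut only works when no item contains a blank of its own.

```stata
* the example string, held in a local macro (the name s is an assumption)
local s "orange,cocoa, banana, orange, orange"

* position of the first comma and of the last comma
display strpos("`s'", ",")       // 7
display strrpos("`s'", ",")      // 29

* turn every comma into a blank so that word() and wordcount() see five words
local t = subinstr("`s'", ",", " ", .)
display wordcount("`t'")         // 5
display word("`t'", -1)          // orange (a negative index counts from the end)
```

None of these calls identifies where the final run of repeated items begins; they only locate delimiters and words.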



    • #3
      Mike Lacy Thank you for your comment. Here is what I mean. Given the string "cocoa, orange, banana, melon, cocoa, cocoa", there are four unique substrings or words (orange, banana, melon, cocoa); I used "unique"/"distinct" for lack of a better word, and it doesn't matter whether a substring recurs. A substring is "stable" (again, for lack of a better word) if it ends the sequence of substrings in the given string, regardless of whether it appeared earlier and was followed by a different substring before appearing again. For example, in "cocoa, orange, banana, melon, cocoa, cocoa" the substring "cocoa" appears three times, but its last two occurrences end the string, i.e. they are not followed by any other substring. I would like to return the index (position) at which "cocoa" begins that final run: here the index is 5, since its successor at index 6 is also "cocoa" and no different substring follows. I hope this simplifies things; essentially, my previous two questions are now combined into one. Please note that I want a general solution, not one that finds the index of "cocoa" specifically, since I do not know beforehand which substring is stable. I only used this as an example.
      Last edited by Richie Oyeleke; 14 Feb 2020, 12:13.
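One way to read the definition above: walk through the comma-separated pieces from left to right and remember the position of the most recent piece that differs from the piece before it. What remains at the end is the start of the final run. The following is a rough sketch of that reading, not a tested general solution; the variable names tok*, prev, index and last are assumptions made here, and pieces are trimmed so that a blank after a comma does not affect the comparison.

```stata
clear
input str60 fruits
"cocoa, orange, banana, melon, cocoa, cocoa"
"orange, banana, melon, cocoa, cocoa"
"orange,cocoa, banana, orange, orange"
"nuts"
end

* one variable per piece: tok1, tok2, ...; split leaves the count in r(nvars)
split fruits, parse(",") generate(tok)
local k = r(nvars)

generate index = .
generate last = ""
generate prev = ""
forvalues j = 1/`k' {
    replace tok`j' = trim(tok`j')
    * a non-empty piece that differs from the previous piece starts a new run
    replace index = `j'    if tok`j' != "" & tok`j' != prev
    replace last  = tok`j' if tok`j' != "" & tok`j' != prev
    replace prev  = tok`j' if tok`j' != ""
}
drop prev tok*
list, clean
```

For the six-item string this should give index 5 and last "cocoa", matching the position described above; the two five-item strings should give index 4.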



      • #4
        I'll have to leave this to someone else to answer, as I'm still not getting your intent. My best advice for you here would be to explain what you want by showing several (say 5 or so) examples of strings and the position you would like to extract.



        • #5
          Perhaps the following example will start you on your way. What it does is find the final comma, take everything from the next position to the end of the string, then remove leading and trailing blanks from what is left, and return that as the result.
          Code:
          clear all
          cls
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input str36 fruit
          "orange, banana, melon, cocoa, cocoa" 
          "orange,cocoa, banana, orange, orange"
          "nuts"
          end
          generate last = trim(substr(fruit,strrpos(fruit,",")+1,.))
          list, clean
          Code:
          . list, clean
          
                                                fruit     last  
            1.    orange, banana, melon, cocoa, cocoa    cocoa  
            2.   orange,cocoa, banana, orange, orange   orange  
            3.                                   nuts     nuts
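The one-line expression above can be unpacked into its three steps, which also shows why a string without any comma ("nuts") comes back unchanged. This is only an illustrative breakdown; the intermediate variable names pos and tail are made up here.

```stata
clear
input str36 fruit
"orange, banana, melon, cocoa, cocoa"
"nuts"
end

* step 1: position of the final comma; strrpos() returns 0 when there is none
generate pos = strrpos(fruit, ",")

* step 2: everything after that comma; with pos == 0 this starts at
* character 1, so a string without commas is returned whole
generate tail = substr(fruit, pos + 1, .)

* step 3: strip the blank that follows the comma
generate last = trim(tail)

list, clean
```

Note that this returns the final item only, not the position at which the final run of that item begins.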



          • #6
            Code:
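            * split the string variable (here named fruits) at each comma into last1, last2, ...
            split fruits , parse(",") gen(last)
            * go long: one row per piece, with index holding the piece's position
            reshape long last , i(fruits) j(index)  
            * drop empty pieces and any piece equal to the one immediately before it
            bys fruits (index) : drop if mi(last) | last == last[_n-1]
            * keep the last remaining piece, i.e. the start of the final run
            bys fruits (index) : keep if _n == _N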
            Code:
            . list
            
                 +--------------------------------------------------------+
                 |                               fruits   index      last |
                 |--------------------------------------------------------|
              1. |                                 nuts       1      nuts |
              2. |  orange, banana, melon, cocoa, cocoa       4     cocoa |
              3. | orange,cocoa, banana, orange, orange       4    orange |
                 +--------------------------------------------------------+



            • #7
              Thank you William Lisowski your solution will be quite helpful. I appreciate your time.



              • #8
                Bjarte Aagnes Thank you so much for this. This is exactly what I am looking for. But I get this error when I try it: "variable id does not uniquely identify the observations
                Your data are currently wide. You are performing a reshape long. You specified i(fruits) and j(index). In the current wide form,
                variable fruits should uniquely identify the observations." I am still trying to figure out what the problem is, but in case you have any suggestions on how to resolve it, I would appreciate them.
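The message says that the variable given in i() must take a different value in every observation before reshape long can run. A quick way to check this, sketched here on made-up data that repeats one of the example strings:

```stata
clear
input str36 fruits
"nuts"
"orange,cocoa, banana, orange, orange"
"orange,cocoa, banana, orange, orange"
end

* tabulate how many observations share each value of fruits
duplicates report fruits

* isid exits with an error when fruits does not uniquely identify the
* observations; capture traps that error so the do-file keeps running
capture isid fruits
if _rc != 0 {
    display "fruits is not unique, so reshape long with i(fruits) will fail"
}
```

If duplicates turn up, either drop the repeats or use a different identifier in i().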



                • #9
                  Originally posted by Mike Lacy
                  I'll have to leave this to someone else to answer, as I'm still not getting your intent. My best advice for you here would be to explain what you want by showing several (say 5 or so) examples of strings and the position you would like to extract.
                  Thanks for your help.



                  • #10
                    Response to #8:

                    Code:
                    bysort fruits : keep if ( _n == 1 )
                    
                    split fruits , parse(",") gen(last)
                    reshape long last , i(fruits) j(index)
                    bys fruits (index) : drop if mi(last) | last == last[_n-1]
                    bys fruits (index) : keep if _n == _N
                    Last edited by Bjarte Aagnes; 17 Feb 2020, 11:00.



                    • #11
                      Bjarte Aagnes Thank you. This resolves the previous error messages but the index value is no longer being displayed?



                      • #12
                        Originally posted by Richie Oyeleke

                        Bjarte Aagnes Thank you. This resolves the previous error messages but the index value is no longer being displayed?
                        Bjarte Aagnes This actually works; I can see the index value. I guess I missed it previously. Thank you so much!



                        • #13
                          Code:
                          clear
                          input str36 fruits  
                          "nuts"                                 
                          "orange, banana, melon, cocoa, cocoa"  
                          "orange,cocoa, banana, orange, orange" 
                          "orange,cocoa, banana, orange, orange" 
                          "orange,cocoa, banana, orange, orange" 
                          "orange,cocoa, banana, orange, orange" 
                          "orange,cocoa, banana, orange, orange" 
                          end
                          
                          bysort fruits : keep if ( _n == 1 )
                          
                          split fruits , parse(",") gen(last)
                          reshape long last , i(fruits) j(index)
                          bys fruits (index) : drop if mi(last) | last == last[_n-1]
                          bys fruits (index) : keep if _n == _N
                          
                          list, clean
                          Code:
                                                               fruits   index      last  
                            1.                                   nuts       1      nuts  
                            2.    orange, banana, melon, cocoa, cocoa       4     cocoa  
                            3.   orange,cocoa, banana, orange, orange       4    orange



                          • #14
                            I would suggest the following code to overcome the difficulties introduced by having multiple observations with the same value for fruits.
                            Code:
                            split fruits , parse(",") gen(last)
                            generate seq = _n
                            reshape long last , i(seq) j(index)
                            bys seq (index) : drop if mi(last) | last == last[_n-1]
                            bys seq (index) : keep if _n == _N
                            drop seq
                            list, clean
                            Code:
                            . list, clean
                            
                                   index                                 fruits      last  
                              1.       1                                   nuts      nuts  
                              2.       4    orange, banana, melon, cocoa, cocoa     cocoa  
                              3.       4   orange,cocoa, banana, orange, orange    orange  
                              4.       4   orange,cocoa, banana, orange, orange    orange  
                              5.       4   orange,cocoa, banana, orange, orange    orange  
                              6.       4   orange,cocoa, banana, orange, orange    orange  
                              7.       4   orange,cocoa, banana, orange, orange    orange
                            Last edited by William Lisowski; 17 Feb 2020, 11:45.
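One caveat with comparing the pieces directly: as far as I understand -split-, each piece keeps the blank that follows the comma, so "orange" and " orange" would be treated as different items when the first item has no leading blank but the second does. A hedged variant of the code above that trims every piece before comparing (the test string "orange, orange" is made up here to show the case):

```stata
clear
input str36 fruits
"orange, orange"
"orange,cocoa, banana, orange, orange"
"orange,cocoa, banana, orange, orange"
end

split fruits , parse(",") gen(last)

* remove leading and trailing blanks from every piece before comparing
foreach v of varlist last* {
    replace `v' = trim(`v')
}

* seq identifies each observation, so repeated strings are not a problem
generate seq = _n
reshape long last , i(seq) j(index)
bys seq (index) : drop if mi(last) | last == last[_n-1]
bys seq (index) : keep if _n == _N
drop seq
list, clean
```

With the trimming step, the first string should report index 1 rather than 2, and the repeated strings should each report index 4.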



                            • #15
                              Originally posted by Bjarte Aagnes
                              Thank you very much. Your solution is sleek and works perfectly!
