Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Splitting String

    Hi, I am trying to split the movie names from its year of release. Most of the years are put in parentheses, which could be put into the a variable easily. But for some movies, a translation of the movie title is put before the year.

    Data example:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str102 title
    "Ghostbusters (1984 Original) (1984)"                  
    "Rio Bravo (1959)"                                     
    "The Seventh Seal (Det Sjunde inseglet) (1957)"        
    "The Lion King (1994)"                                 
    "Deadpool 2 (2018)"                                    
    "Back to the Future (1985)"                            
    "The Fugitive (1993)"                                  
    "Crouching Tiger, Hidden Dragon (2001)"                
    "The Searchers (1956)"                                 
    "Bullitt (1968)"                                       
    end
    The codes that I tried are:
    Code:
     split title, p("(") gen(year)
    variables created as string: 
    year1  year2  year3  year4
    
    . split year2, p(")")
    variable created as string: 
    year21
    The result is messy, as below:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str68 year1 str51 year2 str27 year3 str5 year4 str49 year21
    "Black Panther "                                  "2018)"                     ""      "" "2018"
    "Avengers: Endgame "                              "2019)"                     ""      "" "2019"
    "Mission: Impossible - Fallout "                  "2018)"                     ""      "" "2018"
    "Mad Max: Fury Road "                             "2015)"                     ""      "" "2015"
    "Spider-Man: Into the Spider-Verse "              "2018)"                     ""      "" "2018"
    "Wonder Woman "                                   "2017)"                     ""      "" "2017"
    "Logan "                                          "2017)"                     ""      "" "2017"
    "Coco "                                           "2017)"                     ""      "" "2017"
    "Dunkirk "                                        "2017)"                     ""      "" "2017"
    "Star Wars: The Last Jedi "                       "2017)"                     ""      "" "2017"
    "Thor: Ragnarok "                                 "2017)"                     ""      "" "2017"
    "Seven Samurai (Shichinin no Samurai) "           "Shichinin no Samurai) "    "1956)" "" "1954"
    "Star Wars: Episode VII - The Force Awakens "     "2015)"                     ""      "" "2015"
    "Metropolis "                                     "1927)"                     ""      "" "1927"
    "Soul "                                           "2020)"                     ""      "" "2020"
    "The Adventures of Robin Hood "                   "1938)"                     ""      "" "1938"
    "Spider-Man: Far From Home "                      "2019)"                     ""      "" "2019"
    "Incredibles 2 "                                  "2018)"                     ""      ""
    
    end
    The end result I wish to get: a column with movie titles, keeping the original content inside parentheses, but make the 'year' a new variable.

    Question: How should I edit codes to split the variable more accurately/without needing to make corrections manually?
    This is my first time posting. I apologize in advance if I didn't use the Statalist function correctly.
    Thank you very much for your help.




  • #2
    Thank you for this very well written question!

    Code:
    gen result = regexs(0) if(regexm(title, "\([0-9]+\)"))
    replace result = regexs(0) if(regexm(result, "[0-9]+"))
    list result
    In the code above, I use a regular expression to exact any substring that consists of numbers between parentheses. e.g. (2018). In the second line I extract only the numbers, removing the parentheses from the final result. Looks to me like this works with your example data. It should work with your full data as long as your movie titles don't have any other substrings with the "number inside parentheses" pattern.
    Last edited by Daniel Schaefer; 23 Oct 2022, 21:14.

    Comment


    • #3
      You can also achieve the same result as in #2 with a single line of code:

      Code:
      gen result2 = regexs(1) if regexm(title,"\(([0-9]+)\)")
      You get:
      Code:
      . list , noobs sep(0)
      
        +------------------------------------------------------------------+
        |                                         title   result   result2 |
        |------------------------------------------------------------------|
        |           Ghostbusters (1984 Original) (1984)     1984      1984 |
        |                              Rio Bravo (1959)     1959      1959 |
        | The Seventh Seal (Det Sjunde inseglet) (1957)     1957      1957 |
        |                          The Lion King (1994)     1994      1994 |
        |                             Deadpool 2 (2018)     2018      2018 |
        |                     Back to the Future (1985)     1985      1985 |
        |                           The Fugitive (1993)     1993      1993 |
        |         Crouching Tiger, Hidden Dragon (2001)     2001      2001 |
        |                          The Searchers (1956)     1956      1956 |
        |                                Bullitt (1968)     1968      1968 |
        +------------------------------------------------------------------+
      A slightly more careful regular expression, which picks up the year only when there are exactly four digits enclosed within brackets, would be:
      Code:
      gen result = ustrregexs(1) if ustrregexm(title,"\(([0-9]{4})\)")
      Last edited by Hemanshu Kumar; 23 Oct 2022, 22:23.

      Comment


      • #4
        Daniel Schaefer Thank you so much for your prompt reply! It works!

        Comment


        • #5
          @Henmanshu Kumar Thank you very much!

          Comment


          • #6
            Whenever there is a regular expression solution, that's great, but i always wonder about ways of doing it otherwise. Here's one.

            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input str102 title
            "Ghostbusters (1984 Original) (1984)"                  
            "Rio Bravo (1959)"                                    
            "The Seventh Seal (Det Sjunde inseglet) (1957)"        
            "The Lion King (1994)"                                
            "Deadpool 2 (2018)"                                    
            "Back to the Future (1985)"                            
            "The Fugitive (1993)"                                  
            "Crouching Tiger, Hidden Dragon (2001)"                
            "The Searchers (1956)"                                
            "Bullitt (1968)"                                      
            end
            
            gen date = word(title, -1)
            gen movie = trim(subinstr(title, word(title,-1), "", 1))
            destring date, ignore("()") replace
            
            list movie date
            
                 +-----------------------------------------------+
                 |                                  movie   date |
                 |-----------------------------------------------|
              1. |           Ghostbusters (1984 Original)   1984 |
              2. |                              Rio Bravo   1959 |
              3. | The Seventh Seal (Det Sjunde inseglet)   1957 |
              4. |                          The Lion King   1994 |
              5. |                             Deadpool 2   2018 |
                 |-----------------------------------------------|
              6. |                     Back to the Future   1985 |
              7. |                           The Fugitive   1993 |
              8. |         Crouching Tiger, Hidden Dragon   2001 |
              9. |                          The Searchers   1956 |
             10. |                                Bullitt   1968 |
                 +-----------------------------------------------+
            There is a small risk here that subinstr() catches the wrong phrase Another way to approach this is to trim whatever precedes the last space.



            Code:
            gen  movie2 = trim(substr(title, 1, strrpos(title, " ")))
            Last edited by Nick Cox; 24 Oct 2022, 01:44.

            Comment

            Working...
            X