  • Extracting hashtags from string and saving as separate variables

    Hello dear forum members,

    I am seeking your help with the following task. I have a data set that contains about 16,000 individual tweets stored in a string variable called content (%846s). Some of the tweets contain one or more hashtags denoted with # (e.g., "text #hashtag1 #hashtag2"). I would like to remove all hashtags from content and store them in separate string variables hashtag1, hashtag2, ..., hashtagN.

    Thanks in advance,
    Anton

    P.S. Because of the character limit, I cannot use -dataex-, so I am posting a couple of example tweets below:

    Code:
    Got to hea Hear how NBA legend Grant Hill managed his postsurgical pain without using opioids. #OurChoicesMatter #adhttp://h3.sml360.com/-/2u0rtÂ
    Too many families are missing a loved one lost to opioids this holiday weekend.  Help us amplify their voices as we demand #billionsnotmillions #hearmyroar #wagingaroar #notonemore #fedup #facingaddiction http://ow.ly/ExaS30lM3KxÂ
    Physical therapy as an alternative to #opioid use in #painmanagement has gained national attention as more light shines on #addiction, overdose and death due to opioids. @ohiouhttp://goo.gl/C15FWhÂ

  • #2
    Anton,

    Take a look at split. Something like this might work...
    Code:
    clear
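    * str300 rather than str200: the second tweet is longer than 200 characters and would otherwise be cut off
    input str300 text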
    "Got to hea Hear how NBA legend Grant Hill managed his postsurgical pain without using opioids. #OurChoicesMatter #adhttp://h3.sml360.com/-/2u0rtÂ"
    "Too many families are missing a loved one lost to opioids this holiday weekend.  Help us amplify their voices as we demand #billionsnotmillions #hearmyroar #wagingaroar #notonemore #fedup #facingaddiction http://ow.ly/ExaS30lM3KxÂ"
    "Physical therapy as an alternative to #opioid use in #painmanagement has gained national attention as more light shines on #addiction, overdose and death due to opioids. @ohiouhttp://goo.gl/C15FWhÂ"
    end
    
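    * split cuts text at every #: text1 holds whatever precedes the first hashtag;
    * each later piece holds one hashtag (without the #) plus any text up to the next #
    split text, parse(#)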
    Lance
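
    For the original request (hashtags removed from content and stored in hashtag1, hashtag2, ...), split on its own leaves the text that follows each hashtag attached to it. What follows is only a minimal sketch of one alternative, not a tested solution for the full data set. It assumes Stata 14 or later (for ustrregexm() and ustrregexs()) and assumes that a hashtag is a # followed by letters, digits, or underscores. The two data lines are illustrative: the first is the pattern quoted in the question, the second is made up.

    Code:
    * Sketch only: assumes Stata 14+ and a hashtag pattern of # plus letters, digits, or underscores
    clear
    input str40 content
    "text #hashtag1 #hashtag2"
    "no hashtags here"
    end

    local i = 0
    while 1 {
        local ++i
        * leftmost remaining hashtag in each observation; empty if there is none
        generate hashtag`i' = ustrregexs(0) if ustrregexm(content, "#[A-Za-z0-9_]+")
        quietly count if hashtag`i' != ""
        if r(N) == 0 {
            * no observation has a hashtag left: drop the empty variable and stop
            drop hashtag`i'
            continue, break
        }
        * remove that hashtag (first occurrence only) from content
        replace content = subinstr(content, hashtag`i', "", 1) if hashtag`i' != ""
    }

    * tidy up the blanks left behind
    replace content = stritrim(strtrim(content))
    list, clean

    Each pass peels off one hashtag per observation, so the loop runs once more than the largest number of hashtags in any tweet. A hashtag that runs straight into a URL, such as #adhttp in the first example tweet, would still be captured whole by this pattern.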


    • #3
      Lance, thank you very much for this suggestion -- works really well.


      • #4
        I used moss from SSC. It gets you part of the way, but my regular expression catches too much: in the first tweet it returns #adhttp, because the hashtag #ad runs straight into the URL.

        Code:
        . moss text, match("(\#[a-zA-Z]*)") regex
        
        . list _match*
        
             +-------------------------------------------------------------------------------+
          1. |              _match1 |         _match2 |      _match3 |     _match4 | _match5 |
             |    #OurChoicesMatter |         #adhttp |              |             |         |
             |-------------------------------------------------------------------------------|
             |                                        _match6                                |
             |                                                                               |
             +-------------------------------------------------------------------------------+
        
             +-------------------------------------------------------------------------------+
          2. |              _match1 |         _match2 |      _match3 |     _match4 | _match5 |
             | #billionsnotmillions |     #hearmyroar | #wagingaroar | #notonemore |  #fedup |
             |-------------------------------------------------------------------------------|
             |                                        _match6                                |
             |                               #facingaddiction                                |
             +-------------------------------------------------------------------------------+
        
             +-------------------------------------------------------------------------------+
          3. |              _match1 |         _match2 |      _match3 |     _match4 | _match5 |
             |              #opioid | #painmanagement |   #addiction |             |         |
             |-------------------------------------------------------------------------------|
             |                                        _match6                                |
             |                                                                               |
             +-------------------------------------------------------------------------------+
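
        One possible way to stop the match from swallowing the start of a URL is to delete the URLs before calling moss. This is a sketch under stated assumptions rather than part of the original answer: it assumes Stata 14 or later (for ustrregexra()), and it assumes that a URL starts with http:// or https:// and runs up to the next blank. The moss call itself is unchanged.

        Code:
        * Sketch only: the URL pattern below is an assumption, not a general URL matcher
        replace text = ustrregexra(text, "https?://[^ ]*", "")

        * clear the results of an earlier moss run, if any
        capture drop _count _match* _pos*

        moss text, match("(\#[a-zA-Z]*)") regex
        list _match*

        With the URL gone, the first tweet should yield #ad rather than #adhttp. Note that this changes text in place, so work on a copy if the URLs are still needed.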


        • #5
          Thank you, Nick. This is a really good way to do it as well.


          • #6
            Additionally, what would be the appropriate way to remove all URLs that start with http:// from a string?

            E.g.,
            Code:
            #Opioids should be the last resort for treating pain. Are you listening, doctors? #addiction http://www.npr.org/sections/health-shots/2015/09/22/436905063/to-curb-pain-without-opioids-oregon-looks-to-alternative-treatments …
            #Opioids such as fentanyl—the drug that killed Prince—rose by nearly 75 percent in 2015. http://www.slate.com/blogs/the_slatest/2016/12/09/executives_at_major_fentanyl_producer_arrested_in_overprescription_case.html …


            • #7
              With 191 posts to your name, you should by now be experienced with using dataex to present data examples. Please do so in the future.

              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input str233 text
              "Opioids should be the last resort for treating pain. Are you listening, doctors? #addiction http://www.npr.org/sections/health-shots/2015/09/22/436905063/to-curb-pain-without-opioids-oregon-looks-to-alternative-treatments …"   
              "Opioids such as fentanyl—the drug that killed Prince—rose by nearly 75 percent in 2015. http://www.slate.com/blogs/the_slatest/2016/12/09/executives_at_major_fentanyl_producer_arrested_in_overprescription_case.html …"
              end
              
              replace text = subinstr(text, substr(text, strpos(lower(text),"http"),.), "", .)
              Result:

              Code:
              . l, clean
              
                                                                                                             text  
                1.   Opioids should be the last resort for treating pain. Are you listening, doctors? #addiction   
                2.   Opioids such as fentanyl—the drug that killed Prince—rose by nearly 75 percent in 2015.
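
              The one-line replace above does three things at once: strpos() finds where the first "http" starts, substr() takes everything from there to the end of the string, and subinstr() deletes that tail. The sketch below spells out the same idea in separate steps and makes the no-link case explicit. The helper variable pos and both data lines are made up for illustration.

              Code:
              * Sketch only: the data lines are made-up examples, not from the thread
              clear
              input str60 text
              "Some text #addiction http://example.com/page"
              "No link in this one"
              end

              * position of the first "http", ignoring case; 0 when there is none
              generate pos = strpos(lower(text), "http")

              * keep only what precedes that position; rows without a link are left untouched
              replace text = substr(text, 1, pos - 1) if pos > 0
              drop pos
              list, clean

              Like the original, this removes everything from the first "http" onward, including any text that follows the URL.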


              • #8
                Andrew, this has worked perfectly! Thank you very much.
