
  • Extract Variables from a sentence

    Hey,
    I opened a new topic because this is a new problem; if that is not okay, just message me.

    I have several observations containing sentences like this:
    Obs Sentence
    1 Mr. Jonas Mueller has a job at the university Munich, but has applied to a job in the university Berlin.
    2 Mr. Thomas Schmidt, university/college in Hamburg, but has applied to a job on the college Bremen.
    3 Mr. Johannes Klaus has applied a job at university Munich.

    So I need to extract the locations (Munich, Hamburg, Berlin, Bremen) from these different sentences into separate variables, as in TABLE 2:
    Obs  location1  location2 (working now)
    1    Munich     Berlin
    2    Hamburg    Bremen
    3               Munich

    So my question is: is there code to get these locations into separate variables?
    More exactly, is there, for example, code that checks whether the word "university" or "college" appears in the sentence and, if so, puts the word after it (the location) into a variable?
    That is:
    Scan the sentence; if the word university or college appears, put the word after it into the variable location1 (Munich, Hamburg).
    For the second university/college in the sentence, OR if there is only one university/college in the sentence, put the word after it into the variable location2 (Berlin, Bremen, Munich).
    That way I will get TABLE 2.

    I hope someone has had a similar issue and can help me.

  • #2
    New question, new thread == exactly right.

    moss from SSC will let you extract proper names. That is at least part of your problem.

    Code:
     
    input Obs str244 Sentence
    1 "Mr. Jonas Mueller has a job at the university Munich, but has applied to a job in the university Berlin."
    2 "Mr. Thomas Schmidt, university/college in Hamburg, but has applied to a job on the college Bremen."
    3 "Mr. Johannes Klaus has applied a job at university Munich." 
    end 
    
    . moss Sentence , regex match("([A-Z][a-z]*)")
    
    . egen Names = concat(_match*) , p(" ")
    
    . l Names
    
         +----------------------------------+
         |                            Names |
         |----------------------------------|
      1. |   Mr Jonas Mueller Munich Berlin |
      2. | Mr Thomas Schmidt Hamburg Bremen |
      3. |         Mr Johannes Klaus Munich |
         +----------------------------------+
    There are other things you can do:

    1. Zap "Mr" before you look for names.

    2. Consider accented letters too (e.g. those with umlauts) if they are part of the real problem.

    3. Use the results to compile dictionaries of desired and undesired names, whichever is easier. reshape and merge are your friends.
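One way to read suggestion 3, as a rough sketch and not a tested recipe: compile a dictionary of wanted place names, put the moss results in long layout, and merge against the dictionary. The dictionary contents, the tempfile name places, and the variable names word, k and j are made up for illustration; moss must already be installed (ssc install moss).

```stata
* Sketch only: the dictionary below is a hypothetical list of wanted place names.
clear
input str20 word
"Munich"
"Berlin"
"Hamburg"
"Bremen"
end
tempfile places
save `places'

clear
input Obs str244 Sentence
1 "Mr. Jonas Mueller has a job at the university Munich, but has applied to a job in the university Berlin."
2 "Mr. Thomas Schmidt, university/college in Hamburg, but has applied to a job on the college Bremen."
3 "Mr. Johannes Klaus has applied a job at university Munich."
end

* one _match# variable per capitalised word (moss is from SSC)
moss Sentence, regex match("([A-Z][a-z]*)")

* long layout: one row per candidate word
rename _match* word*
keep Obs word*
reshape long word, i(Obs) j(k)
drop if word == ""

* keep only the words that appear in the dictionary
merge m:1 word using `places', keep(match) nogenerate

* back to one row per sentence, words in their original order
sort Obs k
by Obs: gen j = _n
drop k
reshape wide word, i(Obs) j(j)
list
```

A dictionary of undesired words works the same way with keep(master) in place of keep(match), whichever list is shorter to compile.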




    • #3
      In addition to Nick's suggestions, you can pre-process the strings to remove people's names using a similar pattern.

      Code:
      clear
      input str244 Sentence
      "Mr. Jonas Mueller has a job at the university Munich, but has applied to a job in the university Berlin."
      "Mr. Thomas Schmidt, university/college in Hamburg, but has applied to a job on the college Bremen."
      "Mr. Johannes Klaus has applied a job at university Munich." 
      "Mrs. Eve B. Good has applied a job at university Munich." 
      "Mrs. Eve B Good has applied a job at university Munich." 
      end 
      
      * remove titles
      gen s = subinstr(Sentence,"Mr. ","",.)
      replace s = subinstr(s,"Mrs. ","",.)
      
      * remove middle initials
      replace s = regexr(s," [A-Z]\.? "," ")
      
      * remove people's names (i.e. two consecutive proper names)
      replace s = regexr(s,"[A-Z][a-z]+ [A-Z][a-z]+","")
      
      moss s , regex match("([A-Z][a-z]*)")
      egen Names = concat(_match*) , p(" ")
      list Names
      The " [A-Z]\.? " pattern breaks down to
      1. " " a space
      2. "[A-Z]" a single uppercase letter
      3. "\." The period is a special character that must be escaped to match just a period
      4. "?" modify 3 to match zero or one (i.e. make the period optional)
      5. " " a space
      and the string that matches the above is replaced by a single space.

      The "[A-Z][a-z]+ [A-Z][a-z]+" pattern breaks down to
      1. "[A-Z]" a single uppercase letter
      2. "[a-z]" a single lowercase letter
      3. "+" modify #2 to match one or more
      4. " " a space
      5. "[A-Z]" a single uppercase letter
      6. "[a-z]" a single lowercase letter
      7. "+" modify #6 to match one or more
      and will therefore match two consecutive proper names.
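To check the two patterns by hand, they can be tried on literal strings with display. This is only a small illustration; the test strings are shortened versions of the example sentences with the title already removed.

```stata
* regexr() replaces only the FIRST substring that matches the pattern

* middle initial, with or without a period
display regexr("Eve B. Good has applied a job at university Munich.", " [A-Z]\.? ", " ")
* displays: Eve Good has applied a job at university Munich.
display regexr("Eve B Good has applied a job at university Munich.", " [A-Z]\.? ", " ")
* displays: Eve Good has applied a job at university Munich.

* two consecutive capitalised words; quotes added to make the leading blank visible
display `"""' + regexr("Eve Good has applied a job at university Munich.", "[A-Z][a-z]+ [A-Z][a-z]+", "") + `"""'
* displays: " has applied a job at university Munich."
```

Because only the first match is replaced, a sentence with two middle initials or two people's names needs the replace repeated. The name pattern will also remove any other pair of adjacent capitalised words, for example a two-word place name, if that pair happens to come first in the sentence.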

      For more information on how to use regular expressions in Stata, see this FAQ.



      • #4
        Thanks for the help!
        It brings me at least closer to solving my problem, and I tried out a few things with your code.
        But the sentences I wrote above are among the simplest I have; the real ones differ a lot in their construction. So I need something more like the command I described above: if the sentence/variable contains e.g. "university", then I need the next name in that sentence in a new variable.
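A minimal sketch of that rule (find "university" or "college", then take the next capitalised word), using only the built-in regexm(), regexs() and regexr(). It assumes at most two hits per sentence and that the place name is the first capitalised word after the keyword; the local macro pat and the helper variables rest and single are names chosen here for illustration. Following the first post, a sentence with a single hit has it stored in location2.

```stata
clear
input str244 Sentence
"Mr. Jonas Mueller has a job at the university Munich, but has applied to a job in the university Berlin."
"Mr. Thomas Schmidt, university/college in Hamburg, but has applied to a job on the college Bremen."
"Mr. Johannes Klaus has applied a job at university Munich."
end

* keyword, then anything that is not uppercase (e.g. "/college in "),
* then one capitalised word, captured as subexpression 2
local pat "(university|college)[^A-Z]*([A-Z][a-z]+)"

gen str244 rest = Sentence
forvalues i = 1/2 {
    * store the word after the first remaining keyword, if there is one
    gen str40 location`i' = regexs(2) if regexm(rest, "`pat'")
    * cut that first match out so the next pass finds the following keyword
    replace rest = regexr(rest, "`pat'", "")
}

* assumption taken from the first post: a single hit belongs in location2
gen byte single = location1 != "" & location2 == ""
replace location2 = location1 if single
replace location1 = "" if single

drop rest single
list location1 location2
```

The skip part of the pattern is deliberately crude: in a sentence such as "university of Applied Sciences Munich" it would return "Applied", so real data will probably need a stop list or a dictionary of place names on top of this.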
