Extract Variables from a sentence

Jonathan Shan

Join Date: Mar 2015

Posts: 5
#1


20 Mar 2015, 21:34

Hey,
I just opened a new topic for a new problem. If this is not okay, just message me.

So I have several observations containing sentences like this:
Obs Sentence
1 Mr. Jonas Mueller has a job at the university Munich, but has applied to a job in the university Berlin.
2 Mr. Thomas Schmidt, university/college in Hamburg, but has applied to a job on the college Bremen.
3 Mr. Johannes Klaus has applied a job at university Munich.

So I need to extract the locations (Munich, Hamburg, Berlin, Bremen) from these different sentences into separate variables, as in TABLE 2:
Obs  location1  location2 (working now)
1    Munich     Berlin
2    Hamburg    Bremen
3               Munich

So my question is whether there is code to get these locations into separate variables.
More precisely, is there, for example, code that checks whether the word university or college appears in the sentence and then stores the word that follows it, the location, in a new variable?
That is:
Scan the sentence; if the word university or college appears, put the word after university/college into the variable location1 (Munich, Hamburg).
For the second university/college in the sentence, OR if there is only one university/college in the sentence, put the word after university/college into the variable location2 (Berlin, Bremen, Munich).
That would give me TABLE 2.

Hope someone has had a similar issue and can help me.

Nick Cox

Join Date: Mar 2014
Posts: 35734

21 Mar 2015, 03:22

New question, new thread == exactly right.

moss from SSC will let you extract proper names. That is at least part of your problem.

Code:

 
input Obs str244 Sentence
1 "Mr. Jonas Mueller has a job at the university Munich, but has applied to a job in the university Berlin."
2 "Mr. Thomas Schmidt, university/college in Hamburg, but has applied to a job on the college Bremen."
3 "Mr. Johannes Klaus has applied a job at university Munich." 
end 

. moss Sentence , regex match("([A-Z][a-z]*)")

. egen Names = concat(_match*) , p(" ")

. l Names

     +----------------------------------+
     |                            Names |
     |----------------------------------|
  1. |   Mr Jonas Mueller Munich Berlin |
  2. | Mr Thomas Schmidt Hamburg Bremen |
  3. |         Mr Johannes Klaus Munich |
     +----------------------------------+

There are other things you can do:

1. Zap "Mr" before you look for names.

2. Consider accented letters too (e.g. those with umlauts) if they are part of the real problem.

3. Use the results to compile dictionaries of desired and undesired names, whichever is easier. reshape and merge are your friends.


Robert Picard

Join Date: Mar 2014

Posts: 1536
#3

21 Mar 2015, 08:56

In addition to Nick's suggestions, you can pre-process the strings to remove people's names using a similar pattern.

Code:

clear
input str244 Sentence
"Mr. Jonas Mueller has a job at the university Munich, but has applied to a job in the university Berlin."
"Mr. Thomas Schmidt, university/college in Hamburg, but has applied to a job on the college Bremen."
"Mr. Johannes Klaus has applied a job at university Munich."
"Mrs. Eve B. Good has applied a job at university Munich."
"Mrs. Eve B Good has applied a job at university Munich."
end

* remove titles
gen s = subinstr(Sentence,"Mr. ","",.)
replace s = subinstr(s,"Mrs. ","",.)

* remove middle initials
replace s = regexr(s," [A-Z]\.? "," ")

* remove people's names (i.e. two consecutive proper names)
replace s = regexr(s,"[A-Z][a-z]+ [A-Z][a-z]+","")

moss s , regex match("([A-Z][a-z]*)")
egen Names = concat(_match*) , p(" ")
list Names

The " [A-Z]\.? " pattern breaks down to

1. " " a space
2. "[A-Z]" a single uppercase letter
3. "\." a period; the period is a special character that must be escaped to match just a period
4. "?" modifies 3 to match zero or one (i.e. makes the period optional)
5. " " a space

and the string that matches the above is replaced by a single space.

The "[A-Z][a-z]+ [A-Z][a-z]+" pattern breaks down to

1. "[A-Z]" a single uppercase letter
2. "[a-z]" a single lowercase letter
3. "+" modifies 2 to match one or more
4. " " a space
5. "[A-Z]" a single uppercase letter
6. "[a-z]" a single lowercase letter
7. "+" modifies 6 to match one or more

and will therefore match two consecutive proper names.
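
To see the two patterns in isolation, they can be tried on literal strings with display. This is only a small illustration; the test strings are made up from the example sentences. Note that regexr() replaces only the first match in a string, which is enough here because the person's name comes first.

```stata
* middle initial with a period: " B. " is replaced by a single space
display regexr("Eve B. Good has applied", " [A-Z]\.? ", " ")

* middle initial without a period: the "?" makes the period optional
display regexr("Eve B Good has applied", " [A-Z]\.? ", " ")

* two consecutive proper names are removed; the rest of the string is kept
display regexr("Eve Good has applied", "[A-Z][a-z]+ [A-Z][a-z]+", "")
```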

For more information on how to use regular expressions in Stata, see this FAQ.
Jonathan Shan

Join Date: Mar 2015

Posts: 5
#4

21 Mar 2015, 10:52

Thanks for the help!
It brings me at least closer to solving my problem, and I tried some things out with your code.
But the sentences I wrote above are among the simplest ones I have; the real ones differ a lot in their construction. So I need something more like the command I described above: if the sentence/variable contains e.g. university, then I need the next name in that sentence in a new variable.
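
The request above (take the next name after university or college) could be sketched with regexm(), regexs() and regexr() along the following lines. This is a minimal sketch, not a tested solution for the real data: it assumes that a "name" is the next capitalized word made of unaccented letters, that at most two keywords per sentence matter, and the variable names s, hit1 and hit2 are made up for the illustration.

```stata
clear
input str244 Sentence
"Mr. Jonas Mueller has a job at the university Munich, but has applied to a job in the university Berlin."
"Mr. Thomas Schmidt, university/college in Hamburg, but has applied to a job on the college Bremen."
"Mr. Johannes Klaus has applied a job at university Munich."
end

* keyword, then anything that is not an uppercase letter, then a capitalized word
local pat "(university|college)[^A-Z]*([A-Z][a-z]+)"

* first hit: the capitalized word after the first keyword
gen s = Sentence
gen hit1 = regexs(2) if regexm(s, "`pat'")

* cut out the first hit (regexr replaces only the first match), then search again
replace s = regexr(s, "`pat'", "")
gen hit2 = regexs(2) if regexm(s, "`pat'")

* a single hit goes to location2, as described in the first post
gen location1 = cond(hit2 != "", hit1, "")
gen location2 = cond(hit2 != "", hit2, hit1)

list location1 location2
```

Because "[^A-Z]*" skips everything up to the next uppercase letter, a keyword that is not followed by a location would pick up whatever capitalized word comes next, so the results need checking against the real sentences. Names with umlauts would need a wider character class than "[a-z]".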