Extract Business Titles And Time Periods From String
Solution 1:
The problem you are trying to solve is well known and well researched, and you will find a large number of research papers describing approaches and algorithms if you search for the terms "Named Entity Extraction" and "Relationship Extraction". Some good starting points are:
Chapter 7 of the book "Natural Language Processing with Python"; in fact, that entire book would probably be helpful. Chapter online here
This paper on "Named Entity Relation Mining using Wikipedia"
This paper, "Novel Algorithms for Relationship Mining", which describes mining job titles and organizations as one of the examples.
These are just a few links I've found interesting. There are a ton more, and probably better ones than these, but this should get you started.
Solution 2:
I don't think there is going to be a single regex that you can use for this, unless it's really nasty. I think the solution to this might be Natural Language Processing. Certainly there are packages for this, but using them might not be simple.
Essentially you want to take a sentence like "X is/was Y", and figure out which part is a name, which part is a list of job titles, and which parts are irrelevant. Maybe look for sequences of words that are either capitalized or small words like "and" and "of"?
(?:\u\w+)(?: (?:\u\w*|of|and))* #Note the space before each following word
The \u means that the next single character (the first character of the \w+ group) is uppercase. Not every regex flavor supports \u this way; where it doesn't, [A-Z] is the usual substitute. Haven't tested it, but it seems like it should work. This may be a non-trivial problem.
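Here is a rough Python sketch of the same capitalized-run idea, not a tested solution to the original question. Python's re module does not treat \u as an uppercase class, so this assumes [A-Z] (ASCII capitals only) instead. The names TITLE_RUN and capitalized_runs are made up for illustration, and the connector list is limited to "of" and "and" as in the regex above.

```python
import re

# Assumption: ASCII capitals only, and only "of"/"and" may join capitalized words.
# \b after the connectors keeps "of" from matching the start of a word like "office".
TITLE_RUN = re.compile(r"\b[A-Z]\w*(?: (?:[A-Z]\w*|of\b|and\b))*")

# Connectors left dangling at the end of a run, e.g. "Vice President and".
TRAILING_CONNECTORS = re.compile(r"(?: (?:of|and))+$")


def capitalized_runs(sentence):
    """Return runs of capitalized words, allowing 'of' and 'and' inside a run."""
    runs = []
    for match in TITLE_RUN.finditer(sentence):
        # Strip any connector words that ended up at the tail of the run.
        runs.append(TRAILING_CONNECTORS.sub("", match.group(0)))
    return runs


if __name__ == "__main__":
    text = "John Smith was Chief Executive Officer and Chairman of Acme Corp from 2001 to 2005."
    for run in capitalized_runs(text):
        print(run)
```

On a sentence like the one in the example, the name comes out as one run, but the job titles and the organization are merged into a single run, because "of" joins them. Telling those apart (and picking up the dates) is where the NLP approaches from Solution 1 come in.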