
Extract Business Titles And Time Periods From String

I am extracting information about certain companies from Reuters using Python. I have been able to get the officer/executive names, biographies, and compensation from this page No

Solution 1:

The problem you are trying to solve is well known and well researched. You will find a large number of research papers describing approaches and algorithms if you search for the terms "Named Entity Extraction" and "Relationship Extraction". Some good starting points are:

These are just a few links I've found interesting. There are many more, and probably better ones, but these should get you started.

Solution 2:

I don't think there is going to be a single regex that you can use for this, unless it's really nasty. I think the solution to this might be Natural Language Processing. Certainly there are packages for this, but using them might not be simple.

Essentially you want to take a sentence like "X is/was Y", and figure out which part is a name, which part is a list of job titles, and which parts are irrelevant. Maybe look for sequences of words that are either capitalized or small words like "and" and "of"?

[A-Z]\w*(?: (?:[A-Z]\w*|of|and))*  # note the space before each following word

The [A-Z] means that the first character of each word must be an uppercase letter (Python's re module has no \u shorthand for "uppercase", so a character class is used instead). I haven't tested it, but it seems like it should work. This may be a non-trivial problem.
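A minimal Python sketch of the capitalized-word heuristic described above, in case it helps. It is only an illustration under stated assumptions: the sample sentence, the names `TITLE_RUN`, `LINKING_VERB`, `capitalized_runs`, and `split_name_and_titles` are all made up for this example, and the pattern is a slightly stricter variant that accepts "of" or "and" only when another capitalized word follows, so a match never ends on a dangling connector.

```python
import re

# Assumed pattern: a capitalized word, followed by more capitalized words,
# each optionally preceded by the connector "of" or "and".
TITLE_RUN = re.compile(r"[A-Z]\w*(?: (?:(?:of|and) )?[A-Z]\w*)*")

# Assumed linking verbs for sentences of the form "X is/was Y".
LINKING_VERB = re.compile(r"\b(?:is|was)\b")


def capitalized_runs(text):
    """Return every run of capitalized words found in text."""
    return [m.group(0) for m in TITLE_RUN.finditer(text)]


def split_name_and_titles(sentence):
    """Split on the first 'is'/'was' and collect capitalized runs on each side.

    Returns (runs_before, runs_after), or None if no linking verb is found.
    """
    m = LINKING_VERB.search(sentence)
    if m is None:
        return None
    return capitalized_runs(sentence[:m.start()]), capitalized_runs(sentence[m.end():])


# Hypothetical sample sentence, not taken from Reuters.
sample = "John Smith was Chief Executive Officer and President of Acme Corp from 2005 to 2010."
print(split_name_and_titles(sample))
```

This only separates candidate names from candidate titles. It will still lump a company name in with the titles, and it knows nothing about the time period, so treat it as a first pass rather than a full solution.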
