Stanford Ner In Nltk Not Tagging Multiple Sentences Correctly - Python
I have a function which returns the named entities in a given body of text, using the Stanford NER. def get_named_entities(text): load_ner_files() print text[:100] # to s
Solution 1:
Assuming that one has setup the NLTK v3.2 properly, see
- Stanford Parser and NLTK
- https://gist.github.com/alvations/e1df0ba227e542955a8a
- https://gist.github.com/alvations/0ed8641d7d2e1941b9f9
TL;DR:
pip install -U nltk
or
conda update nltk
After setting up NLTK and Stanford Tools (remember to set environment variables):
import time
import urllib.request
from itertools import chain
from bs4 import BeautifulSoup
from nltk import word_tokenize, sent_tokenize
from nltk.tag import StanfordNERTagger
classArticle:
def__init__(self, url, encoding='utf8'):
self.url = url
self.encoding='utf8'
self.text = self.fetch_url_text()
self.process_text()
deffetch_url_text(self):
response = urllib.request.urlopen(self.url)
self.data = response.read().decode(self.encoding)
self.bsoup = BeautifulSoup(self.data, 'html.parser')
return'\n'.join([paragraph.text for paragraph
in self.bsoup.find_all('p')])
defprocess_text(self):
self.paragraphs = [sent_tokenize(p.strip())
for p in self.text.split('\n') if p]
_sents = list(chain(*self.paragraphs))
self.sents = [word_tokenize(sent) for sent in _sents]
self.words = list(chain(*self.sents))
url = 'https://aeon.co/essays/elon-musk-puts-his-case-for-a-multi-planet-civilisation'
a1 = Article(url)
three_sentences = a1.sents[20:23]
st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')
# Tag multiple sentences at one go.
start = time.time()
tagged_sents = st.tag_sents(three_sentences)
print ("Tagging took:", time.time() - start)
print (tagged_sents, end="\n\n")
for sent in tagged_sents:
print (sent)
print()
# (Much slower) Tagging sentences one at the time and # Stanford NER is refired every time.
start = time.time()
tagged_sents = [st.tag(sent) for sent in three_sentences]
print ("Tagging took:", time.time() - start)
for sent in tagged_sents:
print (sent)
print()
[out]:
Taggingtook: 2.537247657775879[[('Musk', 'PERSON'), ('was', 'O'), ('laughing', 'O'), ('because', 'O'), ('he', 'O'), ('was', 'O'), ('joking', 'O'), (':', 'O'), ('he', 'O'), ('cares', 'O'), ('a', 'O'), ('great', 'O'), ('deal', 'O'), ('about', 'O'), ('Earth', 'LOCATION'), ('.', 'O')], [('When', 'O'), ('he', 'O'), ('is', 'O'), ('not', 'O'), ('here', 'O'), ('at', 'O'), ('SpaceX', 'ORGANIZATION'), (',', 'O'), ('he', 'O'), ('is', 'O'), ('running', 'O'), ('an', 'O'), ('electric', 'O'), ('car', 'O'), ('company', 'O'), ('.', 'O')], [('But', 'O'), ('this', 'O'), ('is', 'O'), ('his', 'O'), ('manner', 'O'), ('.', 'O')]]
[('Musk', 'PERSON'), ('was', 'O'), ('laughing', 'O'), ('because', 'O'), ('he', 'O'), ('was', 'O'), ('joking', 'O'), (':', 'O'), ('he', 'O'), ('cares', 'O'), ('a', 'O'), ('great', 'O'), ('deal', 'O'), ('about', 'O'), ('Earth', 'LOCATION'), ('.', 'O')][('When', 'O'), ('he', 'O'), ('is', 'O'), ('not', 'O'), ('here', 'O'), ('at', 'O'), ('SpaceX', 'ORGANIZATION'), (',', 'O'), ('he', 'O'), ('is', 'O'), ('running', 'O'), ('an', 'O'), ('electric', 'O'), ('car', 'O'), ('company', 'O'), ('.', 'O')][('But', 'O'), ('this', 'O'), ('is', 'O'), ('his', 'O'), ('manner', 'O'), ('.', 'O')]Taggingtook: 7.375355243682861[('Musk', 'PERSON'), ('was', 'O'), ('laughing', 'O'), ('because', 'O'), ('he', 'O'), ('was', 'O'), ('joking', 'O'), (':', 'O'), ('he', 'O'), ('cares', 'O'), ('a', 'O'), ('great', 'O'), ('deal', 'O'), ('about', 'O'), ('Earth', 'LOCATION'), ('.', 'O')][('When', 'O'), ('he', 'O'), ('is', 'O'), ('not', 'O'), ('here', 'O'), ('at', 'O'), ('SpaceX', 'ORGANIZATION'), (',', 'O'), ('he', 'O'), ('is', 'O'), ('running', 'O'), ('an', 'O'), ('electric', 'O'), ('car', 'O'), ('company', 'O'), ('.', 'O')][('But', 'O'), ('this', 'O'), ('is', 'O'), ('his', 'O'), ('manner', 'O'), ('.', 'O')]
Post a Comment for "Stanford Ner In Nltk Not Tagging Multiple Sentences Correctly - Python"