
SpaCy Sentence Segmentation Failing On Quotes

I am parsing some news data with spaCy and am noticing a consistent failure regarding sentence segmentation where there is a quote. Has anyone else solved this issue? Here is a r

Solution 1:

I looked up the original news article to figure out why your data looks the way it does (whitespace is missing between sentences, which I wouldn't expect in a formal news article). The underlying problem appears to be that no whitespace is inserted between HTML paragraphs. If you can fix how the article text is extracted from the original HTML (insert whitespace whenever you encounter <p> or </p>), you won't have this problem with spaCy or other tools.
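A minimal sketch of that extraction fix, assuming you have the raw HTML and can use Python's standard `html.parser` module. The class name `ParagraphTextExtractor` and the helper `html_to_text` are made up for illustration; your actual scraping code may use a different library.

```python
from html.parser import HTMLParser


class ParagraphTextExtractor(HTMLParser):
    """Collect text content, emitting a space at every <p> and </p>."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # A paragraph opening marks a boundary, so make sure there is whitespace.
        if tag == "p":
            self.parts.append(" ")

    def handle_endtag(self, tag):
        # Same for a paragraph closing.
        if tag == "p":
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html: str) -> str:
    """Hypothetical helper: strip tags but keep paragraphs separated by a space."""
    parser = ParagraphTextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace into single spaces and trim the ends.
    return " ".join("".join(parser.parts).split())


text = html_to_text('<p>He left.</p><p>"We stayed," she said.</p>')
print(text)
```

Without the inserted spaces, the two paragraphs would be glued together as `He left."We stayed," she said.`, which is the kind of input the segmenter struggles with.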

The models available in standard tools are often trained on news data, so it's reasonable to expect them to work well on data like this, but they expect whitespace between sentences. Unless you retrain the models on data that includes missing whitespace between sentences (or preprocess your data as suggested in a comment), you're going to have these kinds of problems.
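If you can't re-extract the text, the preprocessing route could look roughly like the sketch below. This is a heuristic of my own, not the exact approach from the comment: it inserts a space after sentence-final punctuation (optionally followed by a closing quote) when an uppercase letter or an opening curly quote follows immediately. The function name `add_missing_sentence_spaces` is hypothetical, and the rule assumes punctuation sits inside the closing quote, as in most American news copy. It will miss some cases and may misfire on unusual tokens, so check it against your own data.

```python
import re

# A lowercase letter, then sentence-final punctuation with an optional closing
# quote, immediately followed by an uppercase letter or an opening curly quote.
# Requiring a lowercase letter first avoids splitting abbreviations like "U.S."
_MISSING_SPACE = re.compile(r'(?<=[a-z])([.!?][”’"]?)(?=[“A-Z])')


def add_missing_sentence_spaces(text: str) -> str:
    """Hypothetical helper: restore a space between glued-together sentences."""
    return _MISSING_SPACE.sub(r"\1 ", text)


print(add_missing_sentence_spaces('He left."We stayed," she said.Then it rained.'))
```

The cleaned string can then be passed to the spaCy pipeline as usual.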

