XPATH - How To Get Inner Text Data Littered With Tags?
I have HTML text like this:
<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>
Solution 1:
Update
Based on your edit, you can probably use the XPath string() function, which returns the concatenated text of the first matching node and all of its descendants. For example:
>>> doc.xpath('string(//p)')
'\n This is some important data\n \n Even this is data\n \n this is useful too\n '
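If the leftover line breaks and indentation are unwanted, the XPath 1.0 normalize-space() function is one option: it trims both ends and collapses each run of whitespace to a single space. A minimal sketch, assuming the sample markup is inlined as a string instead of being read from myfile.html (the names `html` and `text` are only for illustration):

```python
from lxml import etree

# Assumed sample input, modeled on the <p> from the question.
html = """<p>
    This is some important data
    <br>
    Even this is data
    <br>
    this is useful too
</p>"""

doc = etree.fromstring(html, parser=etree.HTMLParser())

# normalize-space() takes the string value of the first <p>, strips it,
# and collapses internal whitespace runs to single spaces.
text = doc.xpath('normalize-space(//p)')
print(repr(text))
```

Like string(), this only looks at the first p in document order, so it suits the single-paragraph case shown here.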
(original answer follows)
If you're getting back the text you want in multiple pieces:
('This is some important data','Even this is data','this is useful too')
Why not just join those pieces?
>>> ' '.join(doc.xpath('//p/text()'))
'\n This is some important data\n  \n Even this is data\n  \n this is useful too\n '
You can even get rid of the line breaks:
>>> ' '.join(x.strip() for x in doc.xpath('//p/text()'))
'This is some important data Even this is data this is useful too'
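One caveat with //p/text(): it only returns text nodes that are direct children of the p, so text wrapped in an inline element would be dropped. The sketch below illustrates the difference against //p//text(). The <b> element in the sample is a hypothetical addition, not part of the original markup, and join_text is just a helper name chosen for this example.

```python
from lxml import etree

# Hypothetical variation of the sample: one word is wrapped in <b>.
html = "<p>\n This is <b>important</b> data\n <br>\n Even this is data\n</p>"
doc = etree.fromstring(html, parser=etree.HTMLParser())

def join_text(pieces):
    """Strip each piece, drop empty ones, and join with single spaces."""
    stripped = (piece.strip() for piece in pieces)
    return ' '.join(piece for piece in stripped if piece)

direct_only = join_text(doc.xpath('//p/text()'))   # direct children only
all_levels = join_text(doc.xpath('//p//text()'))   # also text inside <b>

print(repr(direct_only))
print(repr(all_levels))
```

With the markup from the question both expressions give the same result, since the br elements contain no text of their own.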
If you wanted the "inner html" of the p element, you could call lxml.etree.tostring on all of its children:
>>> ''.join(etree.tostring(x, encoding='unicode') for x in doc.xpath('//p')[0])
'<br/>\n Even this is data\n <br/>\n this is useful too\n '
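The join above starts at the first child, so the text that precedes it ("This is some important data") is missing from the result. A rough sketch of a fuller version, assuming a helper named inner_html (hypothetical, not an lxml function) and HTML-style serialization:

```python
from html import escape
from lxml import etree

def inner_html(element):
    """Return the serialized content of element, without its own tags."""
    # Leading text lives on element.text and must be escaped by hand.
    parts = [escape(element.text or '', quote=False)]
    # tostring() includes each child's tail text by default (with_tail=True).
    parts.extend(
        etree.tostring(child, encoding='unicode', method='html')
        for child in element
    )
    return ''.join(parts)

doc = etree.fromstring('<p>one<br>two<br>three</p>', parser=etree.HTMLParser())
result = inner_html(doc.xpath('//p')[0])
print(result)
```

Passing method='html' makes the br serialize as <br> instead of the XML-style <br/> shown above.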
NB: All of these examples assume:
>>> from lxml import etree
>>> doc = etree.parse(open('myfile.html'),
... parser=etree.HTMLParser())
Solution 2:
You can also expose your own functions in XPath:
import lxml.html, lxml.etree
raw_doc = '''
<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>
'''
doc = lxml.html.fromstring(raw_doc)
ns = lxml.etree.FunctionNamespace(None)
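def cat(context, a):
    # a is the list of text nodes matched by the XPath argument;
    # returning a plain string makes the XPath call evaluate to a string
    return ''.join(a)
ns['cat'] = cat
print(repr(doc.xpath('cat(//p/text())')))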
which prints
'\nThis is some important data\n\nEven this is data\n\nthis is useful too\n'
Because cat is ordinary Python, you can perform whatever transformation you like inside it.
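As one illustration of such a transformation, the sketch below registers a function that strips each text piece and joins the non-empty ones with a caller-supplied separator. The name clean_join and its separator argument are made up for this example. FunctionNamespace(None) registers the function globally, as in the snippet above.

```python
import lxml.etree
import lxml.html

def clean_join(context, pieces, separator):
    """Strip every text piece and join the non-empty ones."""
    stripped = (piece.strip() for piece in pieces)
    # Returning a plain str makes the XPath call evaluate to a string.
    return separator.join(piece for piece in stripped if piece)

ns = lxml.etree.FunctionNamespace(None)
ns['clean_join'] = clean_join

raw_doc = '''
<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>
'''
doc = lxml.html.fromstring(raw_doc)

# The second argument is an XPath string literal passed through as str.
result = doc.xpath("clean_join(//p/text(), ' ')")
print(repr(result))
```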