
XPATH - How To Get Inner Text Data Littered With Tags?

I have HTML text like this

This is some important data
Even this is data
this is useful too

data &l

Solution 1:

Update

Based on your edit, maybe you can use the XPath string() function. For example:

>>> doc.xpath('string(//p)')
'\n    This is some important data\n    \n    Even this is data\n    \n    this is useful too\n  '
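If the leftover line breaks and indentation are a nuisance, one option is to wrap the call in XPath's normalize-space(), which trims the ends and collapses runs of whitespace. This is only a sketch: the inline markup below is assumed for illustration (it mirrors the paragraph in the question) and is parsed from a string rather than from a file.

```python
from lxml import etree

# Sample markup assumed for illustration; it mirrors the <p> from the question.
html = """<p>
    This is some important data
    <br>
    Even this is data
    <br>
    this is useful too
  </p>"""

doc = etree.fromstring(html, parser=etree.HTMLParser())

# string() concatenates every descendant text node of the first <p>;
# normalize-space() then trims the ends and collapses whitespace runs.
text = doc.xpath('normalize-space(string(//p))')
print(text)
```

Note that string(//p) only looks at the first p in document order, so this suits a document with a single paragraph of interest.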

(original answer follows)

If you're getting back the text you want in multiple pieces:

('This is some important data','Even this is data','this is useful too')

Why not just join those pieces?

>>> ' '.join(doc.xpath('//p/text()'))
'\n    This is some important data\n     \n    Even this is data\n     \n    this is useful too\n  '

You can even get rid of the line breaks:

>>> ' '.join(x.strip() for x in doc.xpath('//p/text()'))
'This is some important data Even this is data this is useful too'
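One caveat: //p/text() matches the text of every p in the document, so the join above merges all paragraphs into one string. If the page has several paragraphs, something like the following keeps them apart. The two-paragraph markup is made up for illustration.

```python
from lxml import etree

# Hypothetical two-paragraph document, used only to show the grouping.
html = """<div>
  <p>first<br>second</p>
  <p>third<br>fourth</p>
</div>"""

doc = etree.fromstring(html, parser=etree.HTMLParser())

# Evaluate text() relative to each <p> so the pieces stay grouped
# per paragraph; whitespace-only pieces are skipped.
paragraphs = [
    ' '.join(t.strip() for t in p.xpath('text()') if t.strip())
    for p in doc.xpath('//p')
]
print(paragraphs)
```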

If you wanted the "inner html" of the p element, you could call lxml.etree.tostring on all of its children:

>>> ''.join(etree.tostring(x, encoding='unicode') for x in doc.xpath('//p')[0])
'<br/>\n    Even this is data\n    <br/>\n    this is useful too\n  '

NB: All of these examples assume:

>>> from lxml import etree
>>> doc = etree.parse('myfile.html', parser=etree.HTMLParser())

Solution 2:

You can also expose your own functions in XPath:

import lxml.html, lxml.etree

raw_doc = '''
<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>
'''

doc = lxml.html.fromstring(raw_doc)
ns = lxml.etree.FunctionNamespace(None)

def cat(context, a):
    return [''.join(a)]
ns['cat'] = cat

print(repr(doc.xpath('cat(//p/text())')))

which prints

'\nThis is some important data\n\nEven this is data\n\nthis is useful too\n'

You can perform the transformations however you like using this method.
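As an example of a different transformation, here is a sketch of an extension function that strips each text piece and joins the pieces with single spaces. The function name squash is made up for illustration; the registration mechanism is the same FunctionNamespace used above.

```python
import lxml.etree
import lxml.html

raw_doc = '''
<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>
'''

doc = lxml.html.fromstring(raw_doc)
ns = lxml.etree.FunctionNamespace(None)

def squash(context, nodes):
    # nodes is the list of text nodes matched by the XPath argument;
    # strip each piece, drop empty ones, and join with single spaces.
    return ' '.join(n.strip() for n in nodes if n.strip())

# "squash" is a hypothetical name chosen for this sketch.
ns['squash'] = squash

result = doc.xpath('squash(//p/text())')
print(repr(result))
```

Keep in mind that FunctionNamespace(None) registers the function globally, so it becomes visible to every XPath expression lxml evaluates in the process.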

