How To Get Value Of Specified Tag Attribute From Xml Using Regexp + Python?
I have a script that parses some XML. The XML, assigned here to the variable xml, contains:
xml = '''<SD TITLE="A" FLAGS="" HOST="9511.com"><TITLE TEXT="9511 domain"/><ADDR STREET="Pmb#400, San Pablo Ave" CITY="Berkeley" STATE="CA" COUNTRY="US"/><CREATED DATE="13-Oct-1990" DAY="13" MONTH="10" YEAR="1990"/><OWNER NAME="9511.Org Domain Name Proxy Agents"/><EMAIL ADDR="proxy@9511.org"/><LANG LEX="en" CODE="us-ascii"/><LINKSIN NUM="75"/><SPEED TEXT="3158" PCT="17"/><CHILD SRATING="0"/></SD><SD><POPULARITY URL="9511.com/" TEXT="1417678" SOURCE="panel"/></SD>'''
Solution 1:
import BeautifulSoup  # BeautifulSoup 3 (Python 2)
soup = BeautifulSoup.BeautifulSoup(xml)
print(soup.find('popularity')['text'])
Output
1417678
Solution 2:
You are just matching the first sequence of decimal digits that occurs after the element's name. The first sequence of digits '(\d+)' after an arbitrary number of non-digits '[^\d]*' is 9511, which comes from the URL attribute value "9511.com/" rather than from TEXT.
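To see where 9511 comes from, here is a minimal sketch. The pattern is an assumption pieced together from the two fragments quoted above ('[^\d]*' and '(\d+)'); the exact expression from the question is not shown here. The XML is shortened to the POPULARITY element.

```python
import re

# Shortened sample: only the POPULARITY element from the question's XML.
xml = '<SD><POPULARITY URL="9511.com/" TEXT="1417678" SOURCE="panel"/></SD>'

# Assumed pattern: element name, any run of non-digits, then the first digits.
match = re.search(r'<POPULARITY[^\d]*(\d+)', xml)
first_digits = match.group(1)
print(first_digits)  # 9511, taken from URL="9511.com/", not from TEXT
```

The non-digit run stops at the first digit it meets, which sits inside the URL value, so the capture group never reaches TEXT.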
To find all values of the @TEXT attribute with findall, something like this would work:
import re
my_values = re.findall(r'<POPULARITY(?:\D+="\S*")*\s+TEXT="(\d*)"', xml)  # findall returns a list
Or, if no attribute other than @TEXT will have a digit-only value:
re.findall(r'<POPULARITY\s+(?:\S+\s+)*\w+="(\d+)"', xml)
Here (?:...) matches the enclosed expression but does not act as an addressable group the way (...) does. The special sequences \S and \D are the inversions of their lowercase counterparts, matching anything but whitespace and anything but a digit, respectively.
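As a quick check, the two patterns above can be run against a shortened sample. This is only a sketch: the XML is trimmed to the SPEED and POPULARITY elements, and the variable names by_name and digits_only are made up for illustration.

```python
import re

# Shortened sample; SPEED is kept to show that its TEXT value is not picked up.
xml = ('<SD><SPEED TEXT="3158" PCT="17"/></SD>'
       '<SD><POPULARITY URL="9511.com/" TEXT="1417678" SOURCE="panel"/></SD>')

# First pattern: skip attributes before TEXT, then capture the TEXT value.
by_name = re.findall(r'<POPULARITY(?:\D+="\S*")*\s+TEXT="(\d*)"', xml)

# Second pattern: capture the attribute value that consists only of digits.
digits_only = re.findall(r'<POPULARITY\s+(?:\S+\s+)*\w+="(\d+)"', xml)

print(by_name)      # ['1417678']
print(digits_only)  # ['1417678']
```

Both patterns depend on the attributes being separated by whitespace and on attribute values containing no spaces, so they are tied to this particular input.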
However, as already mentioned, regular expressions are not meant to be used on XML, because XML is not a regular language.
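For comparison, a parser-based version might look like the sketch below, using the standard library's xml.etree.ElementTree. It assumes the shortened sample shown here; because the fragment has two top-level SD elements, it is wrapped in a hypothetical ROOT element so that it parses as one document.

```python
import xml.etree.ElementTree as ET

# Shortened sample with two top-level <SD> elements, as in the question.
fragment = ('<SD><SPEED TEXT="3158" PCT="17"/></SD>'
            '<SD><POPULARITY URL="9511.com/" TEXT="1417678" SOURCE="panel"/></SD>')

# ROOT is a made-up wrapper; a document needs exactly one root element.
root = ET.fromstring('<ROOT>' + fragment + '</ROOT>')

# Collect the TEXT attribute of every POPULARITY element.
texts = [el.get('TEXT') for el in root.iter('POPULARITY')]
print(texts)  # ['1417678']
```

Attribute order, extra whitespace, and digits in other attributes no longer matter here, since the lookup is by element and attribute name.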