Python: Parsing Numeric Values From String Using Regular Expressions
Solution 1:
Firstly, I suspect that the period in the first part of the regex should be escaped with a leading backslash (if it is intended to match a decimal point). Currently it matches any character, which is why you have a match containing a space: '$26 '. The 2,333 is therefore matched by the first part of your regex (the ',' is matched by the unescaped '.'), which is why it didn't match the ,450 part of that number.
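To make the effect concrete, here is a minimal sketch. The question's exact regex is not reproduced here, so the two patterns below are hypothetical stand-ins that differ only in whether the dot is escaped:

```python
import re

# Hypothetical pattern with an unescaped dot (an assumption for illustration,
# not the asker's actual regex): '.' matches any character.
UNESCAPED = re.compile(r'\$?\d+.?\d*')
# The same pattern with the dot escaped, so it only matches a literal '.'.
ESCAPED = re.compile(r'\$?\d+\.?\d*')

sample = "bob $26 and 2,333,450"

unescaped_matches = UNESCAPED.findall(sample)
escaped_matches = ESCAPED.findall(sample)

print(unescaped_matches)  # ['$26 ', '2,333', '450']
print(escaped_matches)    # ['$26', '2', '333', '450']
```

Escaping the dot removes the stray space and the accidental comma match, but 2,333,450 is then split into three numbers, so comma grouping still has to be handled explicitly.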
Whilst your (corrected) regex works with your limited sample data, which might be good enough, it may be too broad for general use: for instance, it matches '($1267.3%'. You could build up a bigger regex out of smaller parts; however, this can get ugly fast:
import re
test_string = "Distributions $54.00 bob $26 and 0.30 5% ($0.23) 2,333,450"
test_string += " $12,354.00 43 43.12 1234,12 ($123,456.78"
COMMA_SEP_NUMBER = r'\d{1,3}(?:,\d{3})*'  # require groups of 3
DECIMAL_NUMBER = r'\d+(?:\.\d*)?'
COMMA_SEP_DECIMAL = COMMA_SEP_NUMBER + r'(?:\.(?:\d{3},)*\d{0,3})?'  # are commas used after the decimal point?
# Alternatives are tried in this order; the first one that matches wins.
regex_items = []
regex_items.append(r'\$' + COMMA_SEP_DECIMAL)
regex_items.append(r'\$' + DECIMAL_NUMBER)
regex_items.append(COMMA_SEP_DECIMAL + '%')
regex_items.append(DECIMAL_NUMBER + '%')
regex_items.append(COMMA_SEP_DECIMAL)
regex_items.append(DECIMAL_NUMBER)
r = re.compile('|'.join(regex_items))
print(r.findall(test_string))
Note that this doesn't account for parentheses around the numbers, and it fails on 1234,12 (which should probably be interpreted as two numbers, 1234 and 12) because 123 is matched by the COMMA_SEP_NUMBER pattern.
This is a weakness of the technique: the alternatives are tried in order and the first one that matches wins, so if the DECIMAL_NUMBER pattern comes first, COMMA_SEP_NUMBER will never be matched.
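The ordering problem can be seen in isolation with a small sketch. The names COMMA_FIRST, PLAIN_FIRST and GUARDED are made up for this illustration, and the negative lookahead in GUARDED is just one possible mitigation, not part of the code above:

```python
import re

# Comma-separated alternative first: grabs '123' out of '1234'.
COMMA_FIRST = re.compile(r'\d{1,3}(?:,\d{3})*|\d+')
# Plain digits first: the comma-separated alternative is never reached.
PLAIN_FIRST = re.compile(r'\d+|\d{1,3}(?:,\d{3})*')
# One possible mitigation (an assumption, not from the original answer):
# reject the comma-separated alternative if a digit follows it.
GUARDED = re.compile(r'\d{1,3}(?:,\d{3})*(?!\d)|\d+')

comma_first = {s: COMMA_FIRST.findall(s) for s in ("1234,12", "2,333,450")}
plain_first = {s: PLAIN_FIRST.findall(s) for s in ("1234,12", "2,333,450")}
guarded = {s: GUARDED.findall(s) for s in ("1234,12", "2,333,450")}

print(comma_first)  # {'1234,12': ['123', '4', '12'], '2,333,450': ['2,333,450']}
print(plain_first)  # {'1234,12': ['1234', '12'], '2,333,450': ['2', '333', '450']}
print(guarded)      # {'1234,12': ['1234', '12'], '2,333,450': ['2,333,450']}
```

The lookahead handles these two inputs, but it is only a sketch; other edge cases (for example a trailing comma group of the wrong length) would need their own checks.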
Finally, here's a nice tool for visualising regex, applied to the expanded COMMA_SEP_DECIMAL pattern:
\d{1,3}(?:,\d{3})*(?:\.(?:\d{3},)*\d{0,3})?
Solution 2:
How about merging the two parts into one?
>>> import re
>>> test_string = "Distributions $54.00 bob $26 and 0.30 5% ($0.23) 2,333,450"
>>> re.findall(r'\(?\$?\d+(?:,\d+)*\.?\d*%?\)?', test_string)
['$54.00', '$26', '0.30', '5%', '($0.23)', '2,333,450']
- Replaced . with \. to match a dot literally instead of matching any character.
- Replaced [0-9] with \d (\d matches a digit).
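As a quick check of how permissive the merged pattern is, the sketch below runs it on a few extra inputs. The extra string is made up for illustration, reusing values mentioned in Solution 1:

```python
import re

# The merged pattern from this solution, unchanged.
PATTERN = re.compile(r'\(?\$?\d+(?:,\d+)*\.?\d*%?\)?')

# Extra inputs chosen for illustration only.
extra = "($1267.3% 1234,12 ($123,456.78"

permissive_matches = PATTERN.findall(extra)
print(permissive_matches)  # ['($1267.3%', '1234,12', '($123,456.78']
```

Unbalanced parentheses, a '$' combined with a '%', and comma groups of any length are all accepted, which may or may not matter depending on the data.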