Python To Extract The @user And Url Link In Twitter Text Data With Regex
I have a list of strings containing Twitter text data, for example the following data (the real data set is much larger than these samples). I want to extract all the user name a
Solution 1:
Note that your regex, pn = re.compile(r'@(\S+)'), captures one or more non-whitespace characters after @, so a trailing : ends up in the captured name.
To exclude : from the match, convert the shorthand class \S to its negated character class equivalent, [^\s], and add : to it:
pn = re.compile(r'@([^\s:]+)')
Now the capture stops at the first whitespace character or the first :, whichever comes first.
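To make the difference concrete, here is a minimal sketch that runs both patterns on one line of the sample text. The names `greedy` and `strict` are illustrative only and do not come from the original answer.

```python
import re

sample = "RT @BestOfGalaxies: Let's sit under the stars ..."

# \S+ takes every non-whitespace character, so the trailing colon is captured.
greedy = re.compile(r'@(\S+)')
# [^\s:]+ stops at whitespace or at a colon, whichever comes first.
strict = re.compile(r'@([^\s:]+)')

print(greedy.findall(sample))  # ['BestOfGalaxies:']
print(strict.findall(sample))  # ['BestOfGalaxies']
```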
If you need to capture up to the last : instead, add : after the capturing group: pn = re.compile(r'@(\S+):'). Note that this pattern only matches when the run of non-whitespace characters after @ contains a :.
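The following sketch contrasts the "last colon" and "first colon" variants. The input `"RT @user:name: hello"` is a made-up string chosen only to show the behaviour, and the variable names are hypothetical.

```python
import re

until_last_colon = re.compile(r'@(\S+):')
until_first_colon = re.compile(r'@([^\s:]+)')

# Made-up input with a colon inside the non-whitespace run after @.
text = "RT @user:name: hello"
print(until_last_colon.findall(text))   # ['user:name']
print(until_first_colon.findall(text))  # ['user']

# Without any colon after the mention, the first pattern finds nothing.
no_colon = "@galaxy5univ I like you"
print(until_last_colon.findall(no_colon))   # []
print(until_first_colon.findall(no_colon))  # ['galaxy5univ']
```

Because `\S+` is greedy, it first consumes the whole run and then backtracks to the last `:` it can find.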
As for a URL matching regex, there are many on the Web; choose the one that works best for your data.
Here is some example code:
import re

# Capture the user name after @, stopping at whitespace or a colon
p = re.compile(r'@([^\s:]+)')
test_str = "@galaxy5univ I like you\nRT @BestOfGalaxies: Let's sit under the stars ...\n@jonghyun__bot .........((thanks)\nRT @yosizo: thanks.ddddd <https://yahoo.com>\nRT @LDH_3_yui: #fam, ccccc https://msn.news.com"
print(p.findall(test_str))
# => ['galaxy5univ', 'BestOfGalaxies', 'jonghyun__bot', 'yosizo', 'LDH_3_yui']

# Match http, https and ftp URLs
p2 = re.compile(r'(?:http|ftp|https)://(?:[\w_-]+(?:(?:\.[\w_-]+)+))(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?')
print(p2.findall(test_str))
# => ['https://yahoo.com', 'https://msn.news.com']
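Since the question mentions a list of strings rather than one joined string, a possible way to apply both patterns per tweet is sketched below. The function name `extract_entities`, the constant names, and the `example.com` tweet are assumptions made for illustration; they are not part of the original answer.

```python
import re

# Same two patterns as in the answer, given hypothetical constant names.
MENTION_RE = re.compile(r'@([^\s:]+)')
URL_RE = re.compile(
    r'(?:http|ftp|https)://(?:[\w_-]+(?:(?:\.[\w_-]+)+))'
    r'(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?'
)

def extract_entities(tweets):
    """Return one dict per tweet holding its user names and URLs."""
    return [
        {"users": MENTION_RE.findall(t), "urls": URL_RE.findall(t)}
        for t in tweets
    ]

tweets = [
    "@galaxy5univ I like you",
    "RT @BestOfGalaxies: Let's sit under the stars ...",
    "RT @example_user: see https://example.com/page?id=1 now",  # made-up tweet
]

for row in extract_entities(tweets):
    print(row)
```

Compiling the patterns once at module level avoids recompiling them for every tweet, which may matter for a large data set.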