Python: Extracting @user Mentions and URL Links from Twitter Text Data with Regex

July 22, 2023

There is a list of Twitter text strings, for example the following data (in practice there is a large amount of text, not just these samples). I want to extract all the user name a

Solution 1:

Note that your regex, pn = re.compile(r'@(\S+)'), captures one or more non-whitespace characters after @, so a trailing : ends up in the captured name.

To exclude :, replace the shorthand \S class with its negated character class equivalent, [^\s], and add : to it:

pn = re.compile(r'@([^\s:]+)')

Now the capture stops before the first : (or at the first whitespace character, whichever comes first).
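
To make the difference between the two patterns concrete, here is a minimal sketch. The sample string is a single line taken from the test data used later in this post; the variable names are only for illustration.

```python
import re

# One retweet line from the sample data; the mention is followed by a colon
sample = "RT @BestOfGalaxies: Let's sit under the stars ..."

greedy = re.compile(r'@(\S+)')        # any run of non-whitespace after @
no_colon = re.compile(r'@([^\s:]+)')  # same, but also stops before a colon

print(greedy.findall(sample))    # ['BestOfGalaxies:']
print(no_colon.findall(sample))  # ['BestOfGalaxies']
```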

If you need to capture until the last :, you can just add : after the capturing group: pn = re.compile(r'@(\S+):').
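
A short sketch of how the "capture until the last :" variant behaves. The input containing a colon inside the name is hypothetical (it is only there to show the backtracking), and note the side effect that mentions not followed by a colon are no longer matched at all.

```python
import re

until_last_colon = re.compile(r'@(\S+):')
until_first_colon = re.compile(r'@([^\s:]+)')

# Hypothetical input with a colon inside the non-whitespace run
sample = "RT @user:name: hello"
print(until_last_colon.findall(sample))   # ['user:name']
print(until_first_colon.findall(sample))  # ['user']

# A mention without a trailing colon is skipped by the first pattern
plain = "@galaxy5univ I like you"
print(until_last_colon.findall(plain))    # []
print(until_first_colon.findall(plain))   # ['galaxy5univ']
```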

As for a URL-matching regex, there are many on the Web; choose the one that works best for your data.
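
If a strict pattern is more than you need, one possible lightweight alternative is sketched below. This is not the pattern used in this answer: URL_RE, find_urls and the example.com/example.org inputs are all assumptions made for illustration. It grabs everything after http:// or https:// up to the next whitespace, quote, or angle bracket, then trims trailing punctuation.

```python
import re

# Deliberately loose: scheme, then any run of characters that cannot
# reasonably belong to a URL embedded in free text
URL_RE = re.compile(r'https?://[^\s<>"]+')

def find_urls(text):
    """Return the URLs in text, minus trailing sentence punctuation."""
    return [m.rstrip('.,;:!?') for m in URL_RE.findall(text)]

print(find_urls("docs at https://example.com/a?b=1, mirror: http://example.org."))
# ['https://example.com/a?b=1', 'http://example.org']
```

The trade-off is that a loose pattern accepts strings that are not valid URLs, so it suits quick extraction better than validation.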


Here is some example code:

import re

# Capture the user name after @, stopping at whitespace or a colon
p = re.compile(r'@([^\s:]+)')
test_str = "@galaxy5univ I like you\nRT @BestOfGalaxies: Let's sit under the stars ...\n@jonghyun__bot .........((thanks)\nRT @yosizo: thanks.ddddd https://yahoo.com\nRT @LDH_3_yui: #fam, ccccc https://msn.news.com"
print(p.findall(test_str))
# => ['galaxy5univ', 'BestOfGalaxies', 'jonghyun__bot', 'yosizo', 'LDH_3_yui']

# Match http, https and ftp URLs
p2 = re.compile(r'(?:http|ftp|https)://(?:[\w_-]+(?:(?:\.[\w_-]+)+))(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?')
print(p2.findall(test_str))
# => ['https://yahoo.com', 'https://msn.news.com']
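
Since the question mentions a list of tweet strings rather than one long string, the same two patterns can be applied per tweet. The following is a minimal sketch: the helper name extract, the constant names, and the example.com URL are assumptions for illustration, not part of the original answer.

```python
import re

MENTION_RE = re.compile(r'@([^\s:]+)')
URL_RE = re.compile(
    r'(?:http|ftp|https)://(?:[\w_-]+(?:(?:\.[\w_-]+)+))'
    r'(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?'
)

def extract(tweets):
    """Return one (mentions, urls) pair per tweet, in input order."""
    return [(MENTION_RE.findall(t), URL_RE.findall(t)) for t in tweets]

tweets = [
    "@galaxy5univ I like you",
    "RT @yosizo: thanks https://example.com/page",  # made-up URL
]
for mentions, urls in extract(tweets):
    print(mentions, urls)
# ['galaxy5univ'] []
# ['yosizo'] ['https://example.com/page']
```

Compiling the patterns once at module level avoids recompiling them for every tweet, which matters when the list is large.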