Python is an excellent choice for web scraping because of its powerful libraries.
Libraries like BeautifulSoup and Scrapy allow you to extract information from web pages.
Consider the following example, which tries to extract email addresses from a webpage:
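import requests
from bs4 import BeautifulSoup
import re
url = 'http://example.com'  # specify the URL here
# Fetch the page; give up after 10 seconds and raise on an HTTP error status
response = requests.get(url, timeout=10)
response.raise_for_status()
# Parse the HTML so that only the page text is searched, not the markup
soup = BeautifulSoup(response.text, 'html.parser')
# Collect every substring of the page text that looks like an email address
emails = re.findall(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', soup.text)
print(emails)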
The regular expression has three parts:
- [a-zA-Z0-9._%+-]+: Matches the local part of the email (before the @).
- @[a-zA-Z0-9.-]+: Matches the @ sign followed by the domain name.
- \.[a-zA-Z]{2,}: Matches the domain extension (e.g., .com, .org), which must be at least two letters long.
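To see how the three parts of the pattern behave, it can help to run it offline on a short string, without fetching any page. The sketch below is only an illustration: the sample text and the names EMAIL_PATTERN, sample and matches are made up for this example and are not part of the original script.

```python
import re

# Same pattern as in the scraping example, compiled once for reuse
EMAIL_PATTERN = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')

# Hypothetical sample text: two well-formed addresses and one fragment with no domain
sample = "Contact alice@example.com or bob.smith@mail.example.org; not-an-email@ is skipped."

matches = EMAIL_PATTERN.findall(sample)
print(matches)  # ['alice@example.com', 'bob.smith@mail.example.org']
```

The fragment "not-an-email@" is not matched because nothing that fits the domain part follows the @ sign. Note that the pattern is a practical approximation and will not accept every address that the email standards allow.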
The full script sends a request to the specified webpage, parses the returned HTML, searches the page text for email addresses using a regular expression, and prints a list of every match.
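Because re.findall returns every match, an address that appears several times on a page will appear several times in the printed list. One possible way to tidy the result is sketched below. The helper unique_emails and the sample page_text are hypothetical, and lowercasing addresses before comparing them is an assumption that suits most sites but is not guaranteed to be correct for every mail server.

```python
import re

# Same pattern as in the scraping example
EMAIL_PATTERN = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')

def unique_emails(text):
    """Return addresses found in text, lowercased, in first-seen order, without duplicates."""
    # dict.fromkeys keeps insertion order and silently drops repeated keys
    return list(dict.fromkeys(m.lower() for m in EMAIL_PATTERN.findall(text)))

# Hypothetical page text; in the script above this would be soup.text
page_text = "Sales: sales@example.com. Support: help@example.com. Again: Sales@Example.com"
print(unique_emails(page_text))  # ['sales@example.com', 'help@example.com']
```

In the scraping script, the same helper could be called as unique_emails(soup.text) in place of the direct re.findall call.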