Web Scraping and Data Retrieval
Introduction to web scraping
Web scraping is the process of collecting and parsing raw data from the Web. It is a powerful technique for gathering data online and exporting it to an Excel, CSV, or JSON file so you can better understand the information you've collected. In this tutorial, you'll learn how to parse website data using string methods and regular expressions, parse website data using an HTML parser, and interact with forms and other website components.
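Before turning to an HTML parser, here is what the string-method and regular-expression approach mentioned above looks like in practice; a minimal sketch using the same tutorial URL as the examples below. Regular expressions are brittle against real-world HTML, which is why the examples that follow use BeautifulSoup instead.
# Parsing with String Methods and a Regular Expression
import re
import requests
# URL of the website (the same tutorial page used in the examples below)
url = "https://www.realpython.com/python-web-scraping-practical-introduction/"
html = requests.get(url).text
# String-method approach: locate the title tags by index
start = html.find("<title>") + len("<title>")
end = html.find("</title>")
print("Title via string methods:", html[start:end])
# Regex approach: capture the text between the title tags
match = re.search(r"<title.*?>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
if match:
    print("Title via regex:", match.group(1))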
Example
# Importing Required Libraries
import requests
from bs4 import BeautifulSoup
# URL of the website
url = "https://www.realpython.com/python-web-scraping-practical-introduction/"
# Sending Request to the Website
response = requests.get(url)
# Parsing the HTML Content
soup = BeautifulSoup(response.content, 'html.parser')
# Extracting the Title of the Website
title = soup.title.string
# Printing the Title of the Website
print("Title of the Website: ", title)
# Importing Required Libraries
import requests
from bs4 import BeautifulSoup
# URL of the website
url = "https://www.realpython.com/python-web-scraping-practical-introduction/"
# Sending Request to the Website
response = requests.get(url)
# Parsing the HTML Content
soup = BeautifulSoup(response.content, 'html.parser')
# Extracting all the Headings of the Website
headings = soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])
# Printing all the Headings of the Website
for heading in headings:
    print(heading.text.strip())
# Importing Required Libraries
import requests
from bs4 import BeautifulSoup
# URL of the website
url = "https://www.realpython.com/python-web-scraping-practical-introduction/"
# Sending Request to the Website
response = requests.get(url)
# Parsing the HTML Content
soup = BeautifulSoup(response.content, 'html.parser')
# Extracting all the Links of the Website
links = soup.find_all('a')
# Printing all the Links of the Website
for link in links:
    print(link.get('href'))
Exercises
Exercise: Scrape the title of the website https://www.nytimes.com/ using BeautifulSoup.
Exercise: Scrape the first paragraph of the article https://www.nytimes.com/2022/12/31/world/europe/ukraine-russia-war.html using BeautifulSoup.
Exercise: Scrape the image of the day from https://apod.nasa.gov/apod/ using BeautifulSoup.
Exercise: Scrape the top 10 trending repositories on GitHub using BeautifulSoup.
Exercise: Scrape the top 10 most viewed videos on YouTube using BeautifulSoup.
Parsing HTML with libraries like BeautifulSoup
HTML is a markup language used to create web pages, and parsing HTML is the process of extracting data from HTML documents. Python provides several libraries for this, including BeautifulSoup, lxml, and html5lib. BeautifulSoup is a popular Python library for parsing HTML and XML documents: it provides a simple way to navigate and search a document, and it can handle malformed HTML. In this tutorial, you'll learn how to use BeautifulSoup to parse HTML documents, extract data from them, and navigate their structure.
Example
# Importing Required Libraries
from bs4 import BeautifulSoup
# HTML Content
html_content = '''
<html>
  <head>
    <title>Page Title</title>
  </head>
  <body>
    <h1>Heading 1</h1>
    <p>Paragraph 1</p>
    <p>Paragraph 2</p>
    <ul>
      <li>List Item 1</li>
      <li>List Item 2</li>
    </ul>
  </body>
</html>
'''
# Parsing the HTML Content
soup = BeautifulSoup(html_content, 'html.parser')
# Extracting the Title of the Page
title = soup.title.string
# Printing the Title of the Page
print("Title of the Page: ", title)
# Extracting all the Paragraphs of the Page
paragraphs = soup.find_all('p')
# Printing all the Paragraphs of the Page
for paragraph in paragraphs:
    print(paragraph.text)
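The searches above use find_all(), but BeautifulSoup can also navigate a parsed document: down into children, up to parents, and sideways between siblings. Here is a minimal, self-contained sketch of these standard navigation methods:
# Navigating a Parsed Document
from bs4 import BeautifulSoup
html = "<ul><li>List Item 1</li><li>List Item 2</li></ul>"
soup = BeautifulSoup(html, 'html.parser')
# Moving down: the first <li> inside the <ul>
first_item = soup.ul.li
print(first_item.text)  # List Item 1
# Moving up: the name of the parent tag
print(first_item.parent.name)  # ul
# Moving sideways: the next <li> at the same level
print(first_item.find_next_sibling('li').text)  # List Item 2
# CSS selectors offer another way to search the document
for item in soup.select('ul li'):
    print(item.text)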
Scraping data from websites
Web scraping is the process of extracting data from websites so that you can work with it elsewhere, for example in an Excel, CSV, or JSON file. The same tools introduced above apply here: requests to download a page and a parser such as BeautifulSoup, lxml, or html5lib to extract data from it, with BeautifulSoup being a popular choice because it is simple to use and tolerant of malformed HTML. This section puts the pieces together into a complete scraping workflow: fetching a page, parsing it, and exporting the extracted data.
Example
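As an end-to-end sketch, the example below fetches the same tutorial page used earlier, extracts every heading, and writes the results to a CSV file; the output filename headings.csv is just an illustrative choice.
# Scraping Headings and Saving Them to a CSV File
import csv
import requests
from bs4 import BeautifulSoup
# URL of the website
url = "https://www.realpython.com/python-web-scraping-practical-introduction/"
# Fetching and Parsing the Page
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
# Collecting every heading along with its level (h1-h6)
rows = [(tag.name, tag.text.strip()) for tag in soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])]
# Writing the results to a CSV file with a header row
with open('headings.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['level', 'text'])
    writer.writerows(rows)
print("Saved", len(rows), "headings to headings.csv")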
Handling authentication and pagination
Pagination is the practice of breaking a large dataset into smaller, more manageable chunks; many APIs serve results one page at a time, so a scraper has to request pages in sequence. Authentication is the process of verifying the identity of a user or application, and many APIs require it before granting access to their data. In this tutorial, you'll learn how to handle authentication and pagination in Python using the requests library: making authenticated requests to an API, handling both offset-based and cursor-based pagination, and writing code that automates making multiple requests.
Example
# Importing Required Libraries
import requests
# URL of the API (a hypothetical endpoint)
url = "https://api.example.com/data"
# Authentication Credentials
username = "your_username"
password = "your_password"
# Making a Single Authenticated Request
response = requests.get(url, auth=(username, password))
print(response.status_code)  # Checking that authentication succeeded
# Handling Pagination Using an Offset
offset = 0
limit = 100
while True:
    # Sending a Request for the Next Page of Results
    response = requests.get(url, params={"offset": offset, "limit": limit}, auth=(username, password))
    # Processing the Response
    data = response.json()
    results = data["results"]
    if not results:
        break
    # Processing the Results
    for result in results:
        print(result)  # Do something with each result
    # Updating the Offset to Fetch the Next Page
    offset += limit
# Importing Required Libraries
import requests
# URL of the API (a hypothetical endpoint)
url = "https://api.example.com/data"
# Authentication Credentials
username = "your_username"
password = "your_password"
# Handling Pagination Using a Cursor
cursor = None
while True:
    # Sending a Request to the API; include the cursor once we have one
    params = {"cursor": cursor} if cursor else {}
    response = requests.get(url, params=params, auth=(username, password))
    # Processing the Response
    data = response.json()
    results = data["results"]
    if not results:
        break
    # Processing the Results
    for result in results:
        print(result)  # Do something with each result
    # Updating the Cursor; stop when the API no longer returns one
    cursor = data.get("cursor")
    if cursor is None:
        break
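To automate this pattern, you can wrap the paging loop in a generator so the rest of your program can treat the API as one continuous stream of results. This is a minimal sketch assuming the same hypothetical endpoint and response shape (a results list plus an optional cursor field) as above:
# Wrapping Cursor Pagination in a Reusable Generator
import requests
def fetch_all(url, username, password):
    """Yield every result from a cursor-paginated API, one item at a time."""
    cursor = None
    while True:
        params = {"cursor": cursor} if cursor else {}
        response = requests.get(url, params=params, auth=(username, password))
        data = response.json()
        results = data["results"]
        if not results:
            return
        yield from results
        # Stop once the API no longer returns a cursor
        cursor = data.get("cursor")
        if cursor is None:
            return
# Usage: iterate over all pages as if they were a single sequence
for result in fetch_all("https://api.example.com/data", "your_username", "your_password"):
    print(result)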
Exercises
Exercise: Scrape the first 10 pages of search results for the query “python” on https://www.google.com using requests and BeautifulSoup.
Exercise: Scrape the first 10 pages of search results for the query “python” on https://www.bing.com using requests and BeautifulSoup.
Exercise: Scrape the first 10 pages of search results for the query “python” on https://www.yahoo.com using requests and BeautifulSoup.
Exercise: Scrape the first 10 pages of search results for the query “python” on https://www.duckduckgo.com using requests and BeautifulSoup.
Exercise: Scrape the first 10 pages of search results for the query “python” on https://www.ask.com using requests and BeautifulSoup.