Python Practice Set (Part 20): Introduction to Web Scraping

Anubhav Somani
Sep 3

Welcome to the 20th installment of our foundational Python practice series! We've saved one of the most exciting topics for last: Web Scraping. This is the art of writing a script to automatically visit a website and extract specific information from its HTML structure. It's a powerful technique used for market research, price comparison, data journalism, and much more.

In this set, we'll use two essential third-party libraries:

requests: To fetch the content of a webpage.
BeautifulSoup: To parse the HTML and find the data we need.

Note: You'll need to install these libraries first by running: pip install requests beautifulsoup4

1. Fetching a Webpage's HTML

The first step in scraping is to get the raw HTML source code of a page. Write a function get_page_html that takes a url as an argument. The function should use the requests library to make a GET request to that URL and return the page's HTML content as a string.

The examples in this set use a placeholder URL that does not point to a real site, so the focus is on the code structure rather than on the output.
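One possible solution is sketched below. The timeout parameter and the call to raise_for_status() are additions that the exercise does not ask for; they are included on the assumption that a scraper should fail loudly rather than hang or return an error page.

```python
import requests


def get_page_html(url, timeout=10):
    """Fetch a URL with a GET request and return the response body as a string."""
    # timeout is an assumed default so the call cannot wait forever.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx responses instead of returning an error page.
    response.raise_for_status()
    # .text is the body decoded to str using the encoding requests detects.
    return response.text
```

Calling this with a URL that does not exist raises a requests.exceptions.RequestException subclass (typically a connection error), so real scripts usually wrap the call in try/except.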

2. Parsing HTML and Finding the Title

Once you have the HTML, you need to parse it to make it searchable. Write a function get_page_title that takes HTML content as input. The function should:

Create a BeautifulSoup object from the HTML.
Find and extract the text from the <title> tag of the page.
Return the title string.
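A minimal sketch of these three steps follows. It uses Python's built-in "html.parser" backend, and returning None when the page has no <title> tag is an assumption, since the exercise does not say what should happen in that case.

```python
from bs4 import BeautifulSoup


def get_page_title(html):
    """Return the text of the page's <title> tag, or None if there is none."""
    # Step 1: build a searchable tree from the raw HTML string.
    soup = BeautifulSoup(html, "html.parser")
    # Step 2: soup.title is the first <title> tag, or None when it is missing.
    title_tag = soup.title
    if title_tag is None:
        return None
    # Step 3: return the tag's text with surrounding whitespace removed.
    return title_tag.get_text(strip=True)


sample_html = "<html><head><title>My Test Page</title></head><body></body></html>"
print(get_page_title(sample_html))  # My Test Page
```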

3. Finding an Element by Its ID

Web developers often use id attributes to uniquely identify important elements. Imagine you have a webpage with a main heading <h1 id="main-title">Welcome to the Scraper's Zone</h1>. Write a function find_main_title that takes HTML content, parses it, and returns the text of the element with the ID "main-title".
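One way to write this is with BeautifulSoup's find method and its id keyword argument, as in the sketch below. Returning None when no element has that ID is an assumption made for this example.

```python
from bs4 import BeautifulSoup


def find_main_title(html):
    """Return the text of the element whose id is "main-title", or None."""
    soup = BeautifulSoup(html, "html.parser")
    # find() returns the first matching element, or None if nothing matches.
    element = soup.find(id="main-title")
    if element is None:
        return None
    return element.get_text(strip=True)


sample_html = '<body><h1 id="main-title">Welcome to the Scraper\'s Zone</h1></body>'
print(find_main_title(sample_html))  # Welcome to the Scraper's Zone
```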

4. Extracting All Links from a Page

A common scraping task is to gather all the links from a webpage. Write a function find_all_links that takes HTML content and returns a list of all the URLs found in the href attribute of every <a> (anchor) tag on the page.
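A possible solution is sketched here. Passing href=True to find_all skips anchor tags that have no href attribute, which is a design choice for this example rather than something the exercise requires. The links are returned exactly as written in the HTML, so relative URLs stay relative.

```python
from bs4 import BeautifulSoup


def find_all_links(html):
    """Return the href value of every <a> tag that has one."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True keeps only <a> tags that actually carry an href attribute.
    return [anchor["href"] for anchor in soup.find_all("a", href=True)]


sample_html = """
<a href="/about">About</a>
<a href="https://example.com/contact">Contact</a>
<a name="top">No link here</a>
"""
print(find_all_links(sample_html))  # ['/about', 'https://example.com/contact']
```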

5. Scraping a Blog Post Feed

Let's combine our skills. You are given a block of HTML representing a simple blog feed. Write a function scrape_blog_titles that parses this HTML and returns a list containing the title of each blog post. The title of each post is inside an <h2> tag with the class post-title.
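The sketch below shows one approach. The SAMPLE_FEED markup is a made-up stand-in for the blog feed, invented here only so the function has something to run against; the one detail taken from the exercise is that titles live in <h2> tags with the class post-title.

```python
from bs4 import BeautifulSoup

# Hypothetical feed markup, used only for illustration.
SAMPLE_FEED = """
<div class="feed">
  <article>
    <h2 class="post-title">Getting Started with Python</h2>
    <p>An introduction.</p>
  </article>
  <article>
    <h2 class="post-title">Understanding Lists</h2>
    <p>All about lists.</p>
  </article>
  <h2 class="sidebar-heading">Archive</h2>
</div>
"""


def scrape_blog_titles(html):
    """Return the text of every <h2 class="post-title"> element, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ (with a trailing underscore) filters by CSS class, because
    # "class" itself is a reserved word in Python.
    headings = soup.find_all("h2", class_="post-title")
    return [heading.get_text(strip=True) for heading in headings]


print(scrape_blog_titles(SAMPLE_FEED))
# ['Getting Started with Python', 'Understanding Lists']
```

The <h2> with the class sidebar-heading is left out of the result because it does not carry the post-title class.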