WebScraping(BeautifulSoup) with Python

  • What is Web Scraping?
    • BeautifulSoup
    • Getting Everything Set
    • Let's Work with an Example
    • find_all and find functions
    • Conclusion

What is Web Scraping?

Web scraping is the process of using bots to extract content and data from a website. Unlike screen scraping, which only copies pixels displayed onscreen, web scraping extracts underlying HTML code and, with it, data stored in a database.


BeautifulSoup

Beautiful Soup is a Python library designed for quick turnaround projects like screen-scraping. Three features make it powerful:

Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. It doesn't take much code to write an application.

Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup can't detect one. Then you just have to specify the original encoding.

Beautiful Soup sits on top of popular Python parsers like lxml and html5lib, allowing you to try out different parsing strategies or trade speed for flexibility. For more click here.


Getting Everything Set

Now let's start writing some lines of code. We are going to import 02 libraries Beautifulsoup and requests.

from bs4 import Beautifulsoup
import requests

The requests Library is a python library used for making HTTP request from a server inorder to generate the HTML code of a specific website passed as a parameter. Click here to learn more.

If you are unable to import the bs4(Beautifulsoup4) library make sure you install it.

$ pip install beautifulsoup4
#Also do same for the requests library
$ pip install requests

Let's Work with an Example

We will use this Commercial Booking Website books.toscrape.com So let's start. This is an image of the website we will scrape Screenshot from 2022-05-18 12-46-56.png.

html_text = request.get('https://books.toscrape.com/catalogue/category/books/sports-and-games_17/index.html').text
print(html_text)

In the code above we are using the get function from the requests library, which takes a parameter the link address of the website we want to scrape. Then the .text is used at the end of the line to return only text(html lines) found inside the code. Printing html_text we get the html source code as follows. Screenshot from 2022-05-18 13-04-16.png

Now let's create an object from the Beautifulsoup class. From this object we will be able to make some research on the html code we requested earlier. let's create an instance of Beautifulsoup class.

soup = Beautifulsoup(html_text, 'lxml')
#lxml is a python library which facilitates the handling of xml and html languages
#so lxml will help us handle scraping more easily.

From here soup handles the code of the url passed in the get method. As you can imagine all our scraping will carried out using soup.


Find_all and Find functions

We want to scrape the name of the first book, in the list of books on the website. This is what we are to do.

  1. select the name of book
  2. right click on it
  3. take the option inspect
  4. from there we get the class name and the html tag

Screenshot from 2022-05-18 14-59-57.png The class holding this name is class = product_pod and it is found inside an a html tag. Using now the find() function which take the both the html tage where the book name is found and the class name of the html tag. This is a simple example

#for find()
book = soup.find('li', class_='col-xs-6 col-sm-4 col-md-3 col-lg-3')
#here the variable book holds all the information held by the the first book on the website. Find returns only the first element it finds.

#for find_all()
books = soup.find_all('li', class_='col-xs-6 col-sm-4 col-md-3 col-lg-3')
#Here we return all the elements having that class and contains an html tag <li>. So here all the books are stored inside the books variable.

Finding the Book name Now we can scrape the name using the find method still. The find methos will be used on the book variable this time to get the book name.

#No need specifying the class in this case
book_name = book.find('h3').text
#Output
#The Book of Basketball: ...

Congratulations you just scraped a website.

Conclusion

This aricle is just an introduction to web Scraping . If you want to learn more about it, visit the freecodecamp YouTube. To see the complete code for this article visit my Github account For more follow me on Twitter. Please Comment, Like and Share.