In this blog, I’ll take you through how I developed a Python script to scrape player ranking data from the ICC (International Cricket Council) website. The project involves extracting the ranking tables from the webpage and saving the data into a CSV file. This can be useful for cricket analysts, fans, or anyone looking to automate the process of fetching ICC rankings. Along the way, I encountered several challenges, which I will explain along with their solutions.
Step 1: Importing the Necessary Libraries
Before starting, we have to import the required libraries. I used three main libraries:
- requests: To fetch the web page.
- BeautifulSoup: For parsing the HTML and extracting the data.
- pandas: To create and manipulate data tables in the form of DataFrames and to save the extracted table to a CSV file.
Following is the code to import these libraries:

from bs4 import BeautifulSoup
import requests
import pandas as pd
Step 2: Providing the URL and Extracting HTML
We need to specify what kind of rankings we want to scrape. In this example, we are fetching the rankings for T20 Batting. To make the script flexible, I allowed the match format (Test, ODI, or T20) and the player category (Batting, Bowling, or All-rounder) to be passed as variables.
Code:

match_type = ['t20', 'batting']
url = f'https://www.relianceiccrankings.com/ranking/{match_type[0].lower()}/{match_type[1].lower()}/'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
Explanation:
- match_type[0] represents the match format (e.g., “t20”, “odi”, or “test”).
- match_type[1] represents the player category (e.g., “batting”, “bowling”, or “all-rounder”).
- I construct the URL by inserting these values into an f-string.
- The requests.get() method fetches the HTML of the page.
- The BeautifulSoup library is then used to parse the HTML content (a small helper wrapping these steps is sketched below).
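Since a typo in the format or category silently produces a broken URL, it can help to validate both values and fail fast on HTTP errors. Here is a minimal sketch of that idea; the helper name, the validation sets, and the timeout are my own additions, not part of the original script:

import requests
from bs4 import BeautifulSoup

VALID_FORMATS = {'test', 'odi', 't20'}
VALID_CATEGORIES = {'batting', 'bowling', 'all-rounder'}

def fetch_rankings_page(match_format, category):
    """Fetch and parse a rankings page; raise on bad input or HTTP errors."""
    if match_format.lower() not in VALID_FORMATS:
        raise ValueError(f'Unknown format: {match_format}')
    if category.lower() not in VALID_CATEGORIES:
        raise ValueError(f'Unknown category: {category}')
    url = f'https://www.relianceiccrankings.com/ranking/{match_format.lower()}/{category.lower()}/'
    page = requests.get(url, timeout=10)
    page.raise_for_status()  # fail loudly on 4xx/5xx responses
    return BeautifulSoup(page.text, 'html.parser')

soup = fetch_rankings_page('t20', 'batting')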
Step 3: Extracting the Tables from the HTML
The ranking data is usually organized in a table, so we need to extract that table from the HTML. Depending on the page, there may be multiple tables, but we only need the one containing the rankings.
soup.find('table')
soup.find_all('table')
- In some cases, the table I needed was not the first one returned by find(‘table’). By switching to find_all(‘table’), I could loop through all available tables to find the correct one, as in the sketch below.
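One way to pick out the rankings table from several candidates is a simple heuristic: keep the first table whose headers mention “Rating”. This is an illustrative sketch, not the post’s exact code:

ranking_table = None
for candidate in soup.find_all('table'):
    headers = [th.get_text(strip=True) for th in candidate.find_all('th')]
    if any('Rating' in h for h in headers):
        ranking_table = candidate  # first table that looks like a rankings table
        break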
Step 4: Extracting the Table Headers
With the right table located, we can pull out its column headers.
Code:

title = soup.find_all('th')
table = [header.text.strip() for header in title]
df = pd.DataFrame(columns=table)
Explanation:
- find_all(‘th’) searches for all the headers (<th> tags) in the table.
- I used a list comprehension to clean up the header text by stripping extra spaces and storing them in the list table.
- The pandas DataFrame is then initialized with these headers.
Challenges:
- I initially struggled with how to clean up the header names because some headers had extra whitespace or formatting. The solution was to use .strip() to remove these unwanted spaces; a slightly more thorough variant is sketched below.
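If headers contain internal newlines or runs of spaces, .strip() alone only trims the ends. A sketch of a more aggressive clean-up that also collapses any internal whitespace to a single space:

import re

headers = [re.sub(r'\s+', ' ', th.text).strip() for th in soup.find_all('th')]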
Step 5: Extracting Data from Each Row
After collecting the headers, we need to extract the data from each row and assign it to the corresponding columns.
Code:

column_data = soup.find_all('tr')
for row in column_data[1:]:
    row_data = row.find_all('td')
    ind_row_data = [data.text for data in row_data]
    row_data = {
        "Rating": ind_row_data[1],
        "Name": ind_row_data[2],
        f"Career Best {match_type[0].capitalize()} Rating": ind_row_data[4],
        f"Career Best {match_type[0].capitalize()} Ranking": ind_row_data[5]
    }
    length = len(df)
    df.loc[length] = row_data
Explanation:
- I first find all rows using find_all(‘tr’).
- I then loop through each row and use find_all(‘td’) to grab the data inside table cells (<td> tags).
- A dictionary (row_data) stores each cell’s value under the corresponding header. The keys are dynamically created based on the match type (Test, ODI, T20).
- I append the row to the DataFrame using df.loc[length].
Challenges:
- Parsing the rows correctly was tricky, as the HTML structure sometimes varied depending on the table. For instance, some rows might have extra empty cells. I solved this by adding checks to ensure I was always fetching the correct data cells; a defensive version of the loop is sketched below.
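Here is what such a check can look like. This sketch reuses soup, df, and match_type from the steps above and assumes the same cell positions; rows without enough <td> cells are simply skipped:

for row in soup.find_all('tr')[1:]:
    cells = [td.get_text(strip=True) for td in row.find_all('td')]
    if len(cells) < 6:  # header, spacer, or malformed row: skip it
        continue
    df.loc[len(df)] = {
        "Rating": cells[1],
        "Name": cells[2],
        f"Career Best {match_type[0].capitalize()} Rating": cells[4],
        f"Career Best {match_type[0].capitalize()} Ranking": cells[5],
    }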
Step 6: Viewing Data
At this stage, the data has been successfully extracted into the DataFrame. You can preview it to make sure everything is working correctly.
Code:

df.head(8)
df.tail(7)
Explanation:
- head(8) displays the first 8 rows of the DataFrame.
- tail(7) displays the last 7 rows of the DataFrame.
Step 7: Saving Data to a CSV File
Finally, I saved the data as a CSV file. The file name is generated based on the match type and player category.
Code:

df.to_csv(f'Top 100 {match_type[0].capitalize()} {match_type[1].capitalize()}.csv', index=False)
Explanation:
- The file name is formatted based on the match type and player category (e.g., “Top 100 T20 Batting.csv”).
- The to_csv() method saves the DataFrame as a CSV file; a sketch that repeats this for every format and category follows below.
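Because the format and category were kept as variables throughout, the whole scrape can be repeated for every combination. Here is a condensed, self-contained sketch of that idea; it mirrors Steps 2-5 but is not the original notebook's exact code, and it keeps only rows with as many cells as there are headers:

import requests
import pandas as pd
from bs4 import BeautifulSoup

LABELS = {'test': 'Test', 'odi': 'ODI', 't20': 'T20'}  # file-name spellings

def scrape_rankings(fmt, cat):
    """Scrape one rankings page into a DataFrame (condenses Steps 2-5)."""
    url = f'https://www.relianceiccrankings.com/ranking/{fmt}/{cat}/'
    soup = BeautifulSoup(requests.get(url, timeout=10).text, 'html.parser')
    headers = [th.get_text(strip=True) for th in soup.find_all('th')]
    rows = [[td.get_text(strip=True) for td in tr.find_all('td')]
            for tr in soup.find_all('tr')[1:]]
    rows = [r for r in rows if len(r) == len(headers)]  # drop malformed rows
    return pd.DataFrame(rows, columns=headers)

for fmt in ['test', 'odi', 't20']:
    for cat in ['batting', 'bowling', 'all-rounder']:
        df = scrape_rankings(fmt, cat)
        df.to_csv(f'Top 100 {LABELS[fmt]} {cat.capitalize()}.csv', index=False)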
After scraping the data, I then cleaned it (i.e., removed missing values) and visualized it in the form of a table. The overall step-by-step guide for the visualization follows in the next section:
Visualizing ICC Player Rankings Using Python
In this blog, I will guide you through a Python script to visualize the ICC player ranking data that I previously scraped from the ICC rankings website. I’ll explain how to load, clean, and visualize the data, which gives us deeper insights into the top players’ ratings. Along the way, I’ll share some of the challenges I encountered and how I handled them.
Step 1: Importing Required Libraries
First of all, let’s import the necessary libraries for data manipulation and visualization:
- pandas: For reading and processing CSV data.
- matplotlib: For creating visualizations, such as bar charts and tables.
Code:

import pandas as pd
import matplotlib.pyplot as plt
Step 2: Loading the Data
Since we scraped multiple ranking files (Batting, Bowling, and All-rounders for different formats), we’ll focus on one dataset at a time. In this case, let’s start with the ODI Batting rankings file.
Code:

file_name = ['Top 100 ODI Batting.csv', 'Top 100 T20 Batting.csv', 'Top 100 Test Batting.csv', 'Top 100 ODI Bowling.csv', 'Top 100 T20 Bowling.csv', 'Top 100 Test Bowling.csv', 'Top 10 ODI All-rounder.csv', 'Top 10 T20 All-rounder.csv', 'Top 10 Test All-rounder.csv']
data = pd.read_csv(file_name[0])
The list file_name contains all the scraped datasets. For this example, I’m working with the first dataset, the Top 100 ODI Batting rankings; a sketch for loading all of them at once follows.
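If you want every dataset in memory at once, a dict keyed by file name is a simple structure. A sketch, assuming all the CSVs sit in the current working directory:

datasets = {name: pd.read_csv(name) for name in file_name}
data = datasets['Top 100 ODI Batting.csv']  # same dataset as above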
Step 3: Exploring the Data
Before visualizing, it’s important to explore the data to understand its structure and check for any missing values or irregularities.
data.shape
data.info()
- shape tells us the dimensions of the dataset (i.e., number of rows and columns).
- info() provides an overview of the dataset, showing column data types and whether there are missing values.
print("\nAfter Handling Missing Values:")
print(data.isnull().sum())
Code:
This gives a count of missing values in each column, so we know whether any cleaning is needed before visualizing; one simple way to handle gaps is sketched below.
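If the counts show gaps, the simplest fix is to drop incomplete rows. This is a sketch of one option; fillna() is the alternative when you would rather keep every row:

data = data.dropna()
print(data.isnull().sum())  # should now report 0 for every column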
Step 4: Cleaning the Data
I noticed that the dataset had an unnamed column, which didn’t contain any information. So, I removed it using the following line of code:
clean_data = data.drop(columns=['Unnamed: 2'])
I also observed that some cells contained unwanted characters like \n, which were causing formatting issues. To fix this, I replaced those characters with a space using a lambda function and a regular expression:
clean_data = clean_data.apply(lambda x: x.str.replace(r'\n+', ' ', regex=True) if x.dtype == 'object' else x)
This cleaned up any newline characters that were messing up the formatting of names or other text data.
To focus only on the top 10 players, I used the head() function:
clean_data = clean_data.head(10)
Step 5: Visualizing the Data
Displaying Data as a Table
To visualize the cleaned data in a table format using matplotlib, I created a table that displays the top 10 players:
Code:

fig, ax = plt.subplots(figsize=(14, 7))
ax.axis('off')
table = ax.table(cellText=clean_data.values, colLabels=clean_data.columns, cellLoc='left', loc='center', fontsize=18.0)
plt.show()
In the code above, ax.table creates a table using the data from clean_data. The ax.axis(‘off’) line hides the axes since I’m only interested in showing the table.
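If the default table looks cramped, matplotlib’s Table object can also be tuned after creation. A small variation on the snippet above; the font size and scale values here are just illustrative choices:

fig, ax = plt.subplots(figsize=(14, 7))
ax.axis('off')
table = ax.table(cellText=clean_data.values, colLabels=clean_data.columns,
                 cellLoc='left', loc='center')
table.auto_set_font_size(False)  # stop matplotlib from auto-shrinking the text
table.set_fontsize(12)
table.scale(1, 1.5)              # keep column widths, make rows 1.5x taller
plt.show()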
Bar Chart: Player Ratings
Next, I created a bar chart to compare the ratings of the top 10 players in the ODI format. This visualization makes it easy to see how players rank against each other based on their rating.
Code:

plt.figure(figsize=(12, 6))
plt.bar(clean_data['Name'], clean_data['Rating'], color='skyblue')
plt.xlabel('Player Name', fontsize=12)
plt.ylabel('Rating', fontsize=12)
plt.title('Top 10 Players Rating Chart')
plt.xticks(rotation=90)
plt.show()
In this chart:
- The plt.bar() function generates the bar chart.
- The clean_data[‘Name’] column provides the player names for the x-axis.
- The clean_data[‘Rating’] column gives the ratings for the y-axis.
- The labels and titles are added using plt.xlabel(), plt.ylabel(), and plt.title() respectively.
- Finally, plt.xticks(rotation=90) rotates the player names to make them easier to read. A numeric-sorting variant of this chart is sketched below.
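One caveat: scraped columns are often read in as strings, which can leave the y-axis categorical. A sketch that coerces Rating to a number before plotting (to_numeric with errors='coerce' turns unparsable cells into NaN) and sorts the bars from highest to lowest:

ordered = clean_data.assign(
    Rating=pd.to_numeric(clean_data['Rating'], errors='coerce')
).sort_values('Rating', ascending=False)

plt.figure(figsize=(12, 6))
plt.bar(ordered['Name'], ordered['Rating'], color='skyblue')
plt.xticks(rotation=90)
plt.show()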
Challenges Faced During the Process
- Handling Missing Data: Although the dataset was mostly clean, missing data is a common issue when working with scraped datasets. To ensure clean results, I used isnull() to check for missing values.
- Unnamed Columns: In many CSV files, we might encounter columns with no name or irrelevant data. I had to manually inspect the dataset and drop these columns.
- Formatting Issues: Some player names or data points contained newline characters (\n) that affected the table’s readability. To fix this, I applied a regular expression to clean up the text. All three fixes are folded into one reusable helper in the sketch below.
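To close, here is a sketch that consolidates the three fixes above into one helper. This is my own consolidation of the post’s steps, not the original notebook’s code; the 'Unnamed' prefix check generalizes the hard-coded 'Unnamed: 2' column from Step 4:

def clean_rankings(df):
    """Drop unnamed columns, strip newlines from text cells, report gaps."""
    unnamed = [c for c in df.columns if c.startswith('Unnamed')]
    df = df.drop(columns=unnamed)
    df = df.apply(lambda x: x.str.replace(r'\n+', ' ', regex=True)
                  if x.dtype == 'object' else x)
    print(df.isnull().sum())  # surface any remaining missing values
    return df

top10 = clean_rankings(data).head(10)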