In this blog, I’ll take you through how I developed a Python script to scrape player ranking data from the ICC (International Cricket Council) website. The project involves extracting the ranking tables from the webpage and saving the data into a CSV file. This can be useful for cricket analysts, fans, or anyone looking to automate the process of fetching ICC rankings. Along the way, I encountered several challenges, which I will explain along with their solutions.
Step 1: Importing the Necessary Libraries
Before starting, we have to import the required libraries. I used three main libraries:
- requests: To fetch the web page.
- BeautifulSoup: For parsing the HTML and extracting the data.
- pandas: To create and manipulate data tables in the form of DataFrames and to save the extracted table to a CSV file.
Following is the code to import these libraries:

from bs4 import BeautifulSoup
import requests
import pandas as pd
Step 2: Providing the URL and Extracting HTML
We need to specify what kind of rankings we want to scrape. In this example, we are fetching the rankings for T20 Batting. To make the script flexible, I allowed the match format (Test, ODI, or T20) and the player category (Batting, Bowling, or All-rounder) to be passed as variables.
Code:

match_type = ['t20', 'batting']
url = f'https://www.relianceiccrankings.com/ranking/{match_type[0].lower()}/{match_type[1].lower()}/'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
Explanation:
- match_type[0] represents the match format (e.g., “t20”, “odi”, or “test”).
- match_type[1] represents the player category (e.g., “batting”, “bowling”, or “all-rounder”).
- I construct the URL by inserting these values into an f-string.
- The requests.get() method fetches the HTML of the page.
- The BeautifulSoup library is then used to parse the HTML content (a small helper wrapping these steps is sketched below).
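Since a typo in the format or category silently produces a broken URL, it can help to validate both values and fail fast on HTTP errors. Here is a minimal sketch of that idea; the helper name, the validation sets, and the timeout are my own additions, not part of the original script:

import requests
from bs4 import BeautifulSoup

VALID_FORMATS = {'test', 'odi', 't20'}
VALID_CATEGORIES = {'batting', 'bowling', 'all-rounder'}

def fetch_rankings_page(match_format, category):
    """Fetch and parse a rankings page; raise on bad input or HTTP errors."""
    if match_format.lower() not in VALID_FORMATS:
        raise ValueError(f'Unknown format: {match_format}')
    if category.lower() not in VALID_CATEGORIES:
        raise ValueError(f'Unknown category: {category}')
    url = f'https://www.relianceiccrankings.com/ranking/{match_format.lower()}/{category.lower()}/'
    page = requests.get(url, timeout=10)
    page.raise_for_status()  # fail loudly on 4xx/5xx responses
    return BeautifulSoup(page.text, 'html.parser')

soup = fetch_rankings_page('t20', 'batting')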
Step 3: Extracting the Tables from the HTML
The ranking data is usually organized in a table, so we need to extract that table from the HTML. Depending on the page, there may be multiple tables, but we only need the one containing the rankings.
soup.find('table')
soup.find_all('table')
- In some cases, the table I needed was not the first one returned by find(‘table’). By switching to find_all(‘table’), I could loop through all available tables to find the correct one, as in the sketch below.
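One way to pick out the rankings table from several candidates is a simple heuristic: keep the first table whose headers mention “Rating”. This is an illustrative sketch, not the post’s exact code:

ranking_table = None
for candidate in soup.find_all('table'):
    headers = [th.get_text(strip=True) for th in candidate.find_all('th')]
    if any('Rating' in h for h in headers):
        ranking_table = candidate  # first table that looks like a rankings table
        break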
Step 4: Extracting the Table Headers
With the right table located, we can pull out its column headers.
Code:

title = soup.find_all('th')
table = [header.text.strip() for header in title]
df = pd.DataFrame(columns=table)
Explanation:
- find_all(‘th’) searches for all the headers (<th> tags) in the table.
- I used a list comprehension to clean up the header text by stripping extra spaces and storing them in the list table.
- The pandas DataFrame is then initialized with these headers.
Challenges:
- I initially struggled with how to clean up the header names because some headers had extra whitespace or formatting. The solution was to use .strip() to remove these unwanted spaces; a slightly more thorough variant is sketched below.
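If headers contain internal newlines or runs of spaces, .strip() alone only trims the ends. A sketch of a more aggressive clean-up that also collapses any internal whitespace to a single space:

import re

headers = [re.sub(r'\s+', ' ', th.text).strip() for th in soup.find_all('th')]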
Step 5: Extracting Data from Each Row
After collecting the headers, we need to extract the data from each row and assign it to the corresponding columns.
Code:

column_data = soup.find_all('tr')
for row in column_data[1:]:
    row_data = row.find_all('td')
    ind_row_data = [data.text for data in row_data]
    row_data = {
        "Rating": ind_row_data[1],
        "Name": ind_row_data[2],
        f"Career Best {match_type[0].capitalize()} Rating": ind_row_data[4],
        f"Career Best {match_type[0].capitalize()} Ranking": ind_row_data[5]
    }
    length = len(df)
    df.loc[length] = row_data
Explanation:
- I first find all rows using find_all(‘tr’).
- I then loop through each row and use find_all(‘td’) to grab the data inside table cells (<td> tags).
- A dictionary (row_data) stores each cell’s value under the corresponding header. The keys are dynamically created based on the match type (Test, ODI, T20).
- I append the row to the DataFrame using df.loc[length].
Challenges:
- Parsing the rows correctly was tricky, as the HTML structure sometimes varied depending on the table. For instance, some rows might have extra empty cells. I solved this by adding checks to ensure I was always fetching the correct data cells; a defensive version of the loop is sketched below.
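Here is what such a check can look like. This sketch reuses soup, df, and match_type from the steps above and assumes the same cell positions; rows without enough <td> cells are simply skipped:

for row in soup.find_all('tr')[1:]:
    cells = [td.get_text(strip=True) for td in row.find_all('td')]
    if len(cells) < 6:  # header, spacer, or malformed row: skip it
        continue
    df.loc[len(df)] = {
        "Rating": cells[1],
        "Name": cells[2],
        f"Career Best {match_type[0].capitalize()} Rating": cells[4],
        f"Career Best {match_type[0].capitalize()} Ranking": cells[5],
    }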
Step 6: Viewing Data
At this stage, the data has been successfully extracted into the DataFrame. You can preview it to make sure everything is working correctly.
Code:

df.head(8)
df.tail(7)
Explanation:
- head(8) displays the first 8 rows of the DataFrame.
- tail(7) displays the last 7 rows of the DataFrame.
Step 7: Saving Data to a CSV File
Finally, I saved the data as a CSV file. The file name is generated based on the match type and player category.
Code:

df.to_csv(f'Top 100 {match_type[0].capitalize()} {match_type[1].capitalize()}.csv', index=False)
Explanation:
- The file name is formatted based on the match type and player category (e.g., “Top 100 T20 Batting.csv”).
- The to_csv() method saves the DataFrame as a CSV file; a sketch that repeats this for every format and category follows below.
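Because the format and category were kept as variables throughout, the whole scrape can be repeated for every combination. Here is a condensed, self-contained sketch of that idea; it mirrors Steps 2-5 but is not the original notebook's exact code, and it keeps only rows with as many cells as there are headers:

import requests
import pandas as pd
from bs4 import BeautifulSoup

LABELS = {'test': 'Test', 'odi': 'ODI', 't20': 'T20'}  # file-name spellings

def scrape_rankings(fmt, cat):
    """Scrape one rankings page into a DataFrame (condenses Steps 2-5)."""
    url = f'https://www.relianceiccrankings.com/ranking/{fmt}/{cat}/'
    soup = BeautifulSoup(requests.get(url, timeout=10).text, 'html.parser')
    headers = [th.get_text(strip=True) for th in soup.find_all('th')]
    rows = [[td.get_text(strip=True) for td in tr.find_all('td')]
            for tr in soup.find_all('tr')[1:]]
    rows = [r for r in rows if len(r) == len(headers)]  # drop malformed rows
    return pd.DataFrame(rows, columns=headers)

for fmt in ['test', 'odi', 't20']:
    for cat in ['batting', 'bowling', 'all-rounder']:
        df = scrape_rankings(fmt, cat)
        df.to_csv(f'Top 100 {LABELS[fmt]} {cat.capitalize()}.csv', index=False)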
After scraping the data, I then cleaned it (i.e., removed missing values) and visualized it in the form of a table. The overall step-by-step guide for the visualization follows in the next section:
Visualizing ICC Player Rankings Using Python
In this blog, I will guide you through a Python script to visualize the ICC player ranking data that I previously scraped from the ICC rankings website. I’ll explain how to load, clean, and visualize the data, which gives us deeper insights into the top players’ ratings. Along the way, I’ll share some of the challenges I encountered and how I handled them.
Step 1: Importing Required Libraries
First of all, let’s import the necessary libraries for data manipulation and visualization:
- pandas: For reading and processing CSV data.
- matplotlib: For creating visualizations, such as bar charts and tables.
Code:

import pandas as pd
import matplotlib.pyplot as plt
Step 2: Loading the Data
Since we scraped multiple ranking files (Batting, Bowling, and All-rounders for different formats), we’ll focus on one dataset at a time. In this case, let’s start with the ODI Batting rankings file.
Code:

file_name = ['Top 100 ODI Batting.csv', 'Top 100 T20 Batting.csv', 'Top 100 Test Batting.csv', 'Top 100 ODI Bowling.csv', 'Top 100 T20 Bowling.csv', 'Top 100 Test Bowling.csv', 'Top 10 ODI All-rounder.csv', 'Top 10 T20 All-rounder.csv', 'Top 10 Test All-rounder.csv']
data = pd.read_csv(file_name[0])
The list file_name contains all the scraped datasets. For this example, I’m working with the first dataset, the Top 100 ODI Batting rankings; a sketch for loading all of them at once follows.
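If you want every dataset in memory at once, a dict keyed by file name is a simple structure. A sketch, assuming all the CSVs sit in the current working directory:

datasets = {name: pd.read_csv(name) for name in file_name}
data = datasets['Top 100 ODI Batting.csv']  # same dataset as above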
Step 3: Exploring the Data
Before visualizing, it’s important to explore the data to understand its structure and check for any missing values or irregularities.
data.shape
data.info()
- shape tells us the dimensions of the dataset (i.e., number of rows and columns).
- info() provides an overview of the dataset, showing column data types and whether there are missing values.
print("\nAfter Handling Missing Values:")
print(data.isnull().sum())
Code:
This gives a count of missing values in each column, so we know whether any cleaning is needed before visualizing; one simple way to handle gaps is sketched below.
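If the counts show gaps, the simplest fix is to drop incomplete rows. This is a sketch of one option; fillna() is the alternative when you would rather keep every row:

data = data.dropna()
print(data.isnull().sum())  # should now report 0 for every column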
Step 4: Cleaning the Data
I noticed that the dataset had an unnamed column, which didn’t contain any information. So, I removed it using the following line of code:
clean_data = data.drop(columns=['Unnamed: 2'])
I also observed that some cells contained unwanted characters like \n, which were causing formatting issues. To fix this, I replaced those characters with a space using a lambda function and a regular expression:
clean_data = clean_data.apply(lambda x: x.str.replace(r'\n+', ' ', regex=True) if x.dtype == 'object' else x)
This cleaned up any newline characters that were messing up the formatting of names or other text data.
To focus only on the top 10 players, I used the head() function:
clean_data = clean_data.head(10)
Step 5: Visualizing the Data
Displaying Data as a Table
To visualize the cleaned data in a table format using matplotlib, I created a table that displays the top 10 players:
Code:

fig, ax = plt.subplots(figsize=(14, 7))
ax.axis('off')
table = ax.table(cellText=clean_data.values, colLabels=clean_data.columns, cellLoc='left', loc='center', fontsize=18.0)
plt.show()
In the code above, ax.table creates a table using the data from clean_data. The ax.axis(‘off’) line hides the axes since I’m only interested in showing the table.
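If the default table looks cramped, matplotlib’s Table object can also be tuned after creation. A small variation on the snippet above; the font size and scale values here are just illustrative choices:

fig, ax = plt.subplots(figsize=(14, 7))
ax.axis('off')
table = ax.table(cellText=clean_data.values, colLabels=clean_data.columns,
                 cellLoc='left', loc='center')
table.auto_set_font_size(False)  # stop matplotlib from auto-shrinking the text
table.set_fontsize(12)
table.scale(1, 1.5)              # keep column widths, make rows 1.5x taller
plt.show()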
Bar Chart: Player Ratings
Next, I created a bar chart to compare the ratings of the top 10 players in the ODI format. This visualization makes it easy to see how players rank against each other based on their rating.
Code:

plt.figure(figsize=(12, 6))
plt.bar(clean_data['Name'], clean_data['Rating'], color='skyblue')
plt.xlabel('Player Name', fontsize=12)
plt.ylabel('Rating', fontsize=12)
plt.title('Top 10 Players Rating Chart')
plt.xticks(rotation=90)
plt.show()
In this chart:
- The plt.bar() function generates the bar chart.
- The clean_data[‘Name’] column provides the player names for the x-axis.
- The clean_data[‘Rating’] column gives the ratings for the y-axis.
- The labels and titles are added using plt.xlabel(), plt.ylabel(), and plt.title() respectively.
- Finally, plt.xticks(rotation=90) rotates the player names to make them easier to read. A numeric-sorting variant of this chart is sketched below.
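One caveat: scraped columns are often read in as strings, which can leave the y-axis categorical. A sketch that coerces Rating to a number before plotting (to_numeric with errors='coerce' turns unparsable cells into NaN) and sorts the bars from highest to lowest:

ordered = clean_data.assign(
    Rating=pd.to_numeric(clean_data['Rating'], errors='coerce')
).sort_values('Rating', ascending=False)

plt.figure(figsize=(12, 6))
plt.bar(ordered['Name'], ordered['Rating'], color='skyblue')
plt.xticks(rotation=90)
plt.show()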
Challenges Faced During the Process
- Handling Missing Data: Although the dataset was mostly clean, missing data is a common issue when working with scraped datasets. To ensure clean results, I used isnull() to check for missing values.
- Unnamed Columns: In many CSV files, we might encounter columns with no name or irrelevant data. I had to manually inspect the dataset and drop these columns.
- Formatting Issues: Some player names or data points contained newline characters (\n) that affected the table’s readability. To fix this, I applied a regular expression to clean up the text. All three fixes are folded into one reusable helper in the sketch below.
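To close, here is a sketch that consolidates the three fixes above into one helper. This is my own consolidation of the post’s steps, not the original notebook’s code; the 'Unnamed' prefix check generalizes the hard-coded 'Unnamed: 2' column from Step 4:

def clean_rankings(df):
    """Drop unnamed columns, strip newlines from text cells, report gaps."""
    unnamed = [c for c in df.columns if c.startswith('Unnamed')]
    df = df.drop(columns=unnamed)
    df = df.apply(lambda x: x.str.replace(r'\n+', ' ', regex=True)
                  if x.dtype == 'object' else x)
    print(df.isnull().sum())  # surface any remaining missing values
    return df

top10 = clean_rankings(data).head(10)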