
- 29th Jul 2024, 18:08
In this assignment you need to do the following:
1. You decide to look for an external data source and realise that Wikipedia maintains the details you are looking for on its website. However, this data is not readily available for download, so you decide to use web scraping to extract it from the website.
a. You need to extract data for Australian universities from the Universities table at the following URL: https://en.wikipedia.org/wiki/List_of_universities_in_Australia
2. Write a Python programme to scrape universities and their details (e.g., University: Australian Catholic University, Type: Public, Campus: Sydney, Brisbane, Canberra, Ballarat, Melbourne, State/Territory: National, Established: 1991, University status: 1991).
3. You need to use the following libraries for web scraping:
- BeautifulSoup (from bs4)
- requests
- re
Important note: You CANNOT use Pandas to extract data.
4. The main steps include:
a. Extract university name, university type, university campus, university state/territory, university establishment year and university status for all the universities in Australia.
- You need to use the functions provided by the BeautifulSoup library for the remaining web scraping tasks.
- Modularise the programme (avoid unnecessary for loops).
b. Save the data into a CSV file (see the note on comma handling below).
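A note on step 4b: campus values such as "Sydney, Brisbane, Canberra" contain commas, but Python's csv module quotes such fields automatically, so they do not need to be altered before writing. A minimal illustration (the row below is a hypothetical example):

import csv, io

buf = io.StringIO()
writer = csv.writer(buf)
# A hypothetical row whose Campus field contains commas
writer.writerow(["Australian Catholic University", "Public", "Sydney, Brisbane, Canberra"])
print(buf.getvalue())
# -> Australian Catholic University,Public,"Sydney, Brisbane, Canberra"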
Free Assignment Solution – Web Scraping Task Using the BeautifulSoup Library
The solution is a Colab notebook (RDataScrapping.ipynb) made up of five code cells, shown below with their output where relevant.

# Cell 1 – Installing and importing the scraping libraries
!pip install beautifulsoup4
from bs4 import BeautifulSoup
import re

Output:
Requirement already satisfied: beautifulsoup4 in /usr/local/lib/python3.7/dist-packages (4.6.3)
},
{
"cell_type": "code",
"source": [
"# Importing the Request, pages in text format\n",
"import requests as r\n",
"\n",
"wikiURL=\"https://en.wikipedia.org/wiki/List_of_universities_in_Australia\"\n",
"wiki_page_request = r.get(wikiURL)\n",
"wiki_page_text = wiki_page_request.text\n",
"\n"
],
"metadata": {
"id": "-RUlKlrjHoNZ"
},
"execution_count": 31,
"outputs": []
},
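One optional hardening step, not in the original notebook: check that the request actually succeeded before parsing. requests provides raise_for_status() for exactly this (reusing wikiURL from the cell above):

wiki_page_request = r.get(wikiURL)
wiki_page_request.raise_for_status()  # raises requests.HTTPError on a 4xx/5xx response
wiki_page_text = wiki_page_request.text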
# Cell 3 – Parsing the page with BeautifulSoup
# wiki_page_text was already fetched in the previous cell,
# so there is no need to request the page a second time
soup = BeautifulSoup(wiki_page_text, 'html.parser')
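Wikipedia list pages often contain more than one wikitable, and the function below simply takes the first one. A quick check like the following (a debugging aid, not part of the assignment) helps confirm that the first table is indeed the Universities table:

# Count candidate tables and peek at the first header cell of each
tables = soup.find_all('table', {'class': 'wikitable'})
print(len(tables))
for t in tables:
    first_th = t.find('th')
    print(first_th.text.strip() if first_th else '(no header row)')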
# Cell 4 – Function that returns the table data in a workable format
def returnTableData():
    # The Universities table is the first 'wikitable' on the page
    table = soup.find('table', {'class': 'wikitable'})

    # The first six <th> cells are the column headers
    headers = [header.text.strip() for header in table.find_all('th')[0:6]]

    # Strip footnote markers such as '[1]' from the header text
    for index, header in enumerate(headers):
        if re.search(r'\[', header):
            headers[index] = re.split(r'\[', header, maxsplit=1)[0]

    rows = []
    # Each <tr> holds one university; keep the first six <td> cells
    for row in table.find_all('tr'):
        beautified_value = [ele.text.strip() for ele in row.find_all('td')[0:6]]
        # Skip rows with no <td> cells (the header row)
        if len(beautified_value) == 0:
            continue
        # Multi-campus values contain commas; csv.writer quotes such
        # fields automatically, so they are kept intact
        rows.append(beautified_value)

    return headers, rows
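To see what the footnote-stripping step does, here is a quick illustration on a made-up header string:

import re

header = "Established[1]"  # hypothetical header text with a footnote marker attached
clean = re.split(r'\[', header, maxsplit=1)[0]
print(clean)  # -> Established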
# Cell 5 – Saving the data into a CSV file
import csv

headers, rows = returnTableData()
with open('universityList.csv', 'w', newline="") as output:
    writer = csv.writer(output)
    writer.writerow(headers)   # column headers first
    writer.writerows(rows)     # one line per university
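A quick sanity check, not part of the original notebook, confirms the export worked (the file name matches the one written above):

import csv

with open('universityList.csv', newline="") as f:
    reader = csv.reader(f)
    for i, row in enumerate(reader):
        print(row)
        if i == 2:   # the header row plus the first two universities
            break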
Get the best help and tutoring for this web scraping task using the BeautifulSoup library from our experts now!
This sample Python assignment solution was completed by our team of Python programmers. The solutions provided are designed exclusively for research and reference purposes. If you find value in reviewing the reports and code, our Python tutors would be delighted to assist you further.
- For a comprehensive solution package including code, reports, and screenshots, please visit our Python Assignment Sample Solution page.
- Contact our Python experts for personalized online tutoring sessions focused on clarifying any doubts related to this assignment.
- Explore the partial solution for this assignment, available in the blog above, for further insights.
About The Author - Jamie Taylor
Jamie Taylor is a skilled web scraping specialist with extensive experience in Python programming. Jamie excels in using BeautifulSoup, Requests, and regular expressions to extract and analyze data from websites. Known for creating efficient and modular scraping solutions, Jamie has worked on projects such as extracting detailed information about Australian universities from Wikipedia, ensuring accuracy and ease of data manipulation.