How to Crawl XML Sitemaps with Python

Learn how to efficiently crawl and extract URLs from XML sitemaps using Python’s Ultimate Sitemap Parser (USP). Discover how to handle nested sitemaps, filter URLs, and save results for SEO audits, web scraping, and competitor analysis.

  • Introduction

    Sitemaps are a critical part of SEO and web crawling because they provide a structured list of URLs that a website wants search engines to index. Instead of scraping a site link by link, crawling the XML sitemap lets you discover all of a site's listed URLs far more efficiently.

    • The Challenges of Manual Sitemap Parsing

      While sitemaps make crawling easier, manually parsing them can be tricky:

      • Many websites use index sitemaps that link to multiple smaller sitemaps, requiring additional steps to process.

      • Large sitemaps can contain thousands of URLs, making extraction time-consuming and inefficient.

    • The Solution: Ultimate Sitemap Parser (USP)

      This is where ultimate-sitemap-parser (USP) comes in. It’s a powerful Python library built to handle complex sitemaps with ease. USP can:

      • Fetch and parse XML sitemaps automatically.

      • Handle nested index sitemaps without extra coding.

      • Extract URLs efficiently with just a simple function call.

    • What You’ll Learn in This Guide

      In this tutorial, we’ll walk you through how to use ultimate-sitemap-parser in Python to crawl the ASOS sitemap and extract all available URLs quickly. This approach saves time, reduces complexity, and provides a scalable way to handle sitemap crawling for SEO projects.

  • Prerequisites

    Before we start crawling XML sitemaps with Python, ensure your system is ready with the following setup:

    • Install Python

      To run the sitemap crawler, you’ll need Python 3.x installed on your computer.

      • Download the latest version of Python from the official Python website.

      • After installation, confirm it’s working by running the command below in your terminal or command prompt:

          python3 --version

      If Python is installed correctly, this will display the installed version number.

    • Install Ultimate Sitemap Parser (USP)

      The ultimate-sitemap-parser library makes it simple to fetch and parse sitemaps in Python. Install it using pip:

          pip install ultimate-sitemap-parser

      Once installed, you’re ready to start building your Python sitemap crawler.
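
      If you want a quick sanity check before writing any code, you can try importing the library's main entry point from the command line; it should print the message without errors:

          python3 -c "from usp.tree import sitemap_tree_for_homepage; print('USP is ready')"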

  • Crawling Sitemaps using "ultimate-sitemap-parser"

    Now that you’ve installed ultimate-sitemap-parser (USP), let’s see how to use it to crawl the ASOS sitemap and extract URLs efficiently. We’ll walk through the core features step by step.

    • Fetching the Sitemap and Extracting URLs

      The USP library makes it simple to fetch and parse XML sitemaps without manually handling XML files. With just a few lines of code, you can extract every URL from a sitemap.

      from usp.tree import sitemap_tree_for_homepage

      # Define the target website
      url = "https://www.asos.com/"

      # Fetch and parse the sitemap
      tree = sitemap_tree_for_homepage(url)

      # Extract and print all URLs
      for page in tree.all_pages():
          print(page.url)

      This script will automatically fetch the ASOS sitemap and print all available URLs.
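
      Each item yielded by all_pages() is a page object rather than a plain string. In current USP versions these objects also expose metadata parsed from the sitemap, such as last_modified and priority, which may be empty when the sitemap omits them. Here's a small sketch that previews the first few pages along with that metadata:

      from usp.tree import sitemap_tree_for_homepage

      # Fetch and parse the sitemap
      tree = sitemap_tree_for_homepage("https://www.asos.com/")

      # Preview the first five pages together with their optional metadata
      for index, page in enumerate(tree.all_pages()):
          print(page.url, page.last_modified, page.priority)
          if index >= 4:
              break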

    • Handling Nested Sitemaps

      A big advantage of USP is that it can automatically handle nested index sitemaps. Many websites (including ASOS) split their sitemaps into smaller files, such as:

      • Product pages
      • Category pages
      • Blog or content pages


      USP will:

      • Detect index sitemaps.
      • Fetch linked sub-sitemaps.
      • Recursively extract URLs from all of them.

      👉 The script above already handles this seamlessly — no extra code needed.
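
      If you'd like to see that nested structure for yourself, you can walk the parsed tree. The sketch below assumes, as in current USP versions, that index sitemaps expose a sub_sitemaps list while leaf sitemaps don't, so getattr keeps the recursion safe either way:

      from usp.tree import sitemap_tree_for_homepage

      # Fetch and parse the sitemap
      tree = sitemap_tree_for_homepage("https://www.asos.com/")

      def print_sitemap_tree(sitemap, depth=0):
          """Print each sitemap URL, indenting sub-sitemaps under their parent index."""
          print("  " * depth + sitemap.url)
          # Only index sitemaps carry a sub_sitemaps list; leaf sitemaps do not
          for sub_sitemap in getattr(sitemap, "sub_sitemaps", []):
              print_sitemap_tree(sub_sitemap, depth + 1)

      print_sitemap_tree(tree)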

    • Extracting Only a Subset of URLs

      Sometimes you only want specific types of pages, such as product URLs. Because all_pages() yields every URL in the tree, a simple list comprehension is all you need to filter them.

      For example, ASOS product URLs usually contain /product/. Here’s how you can filter only product pages:

      # Extract and filter only product page URLs
      product_urls = [page.url for page in tree.all_pages() if "/product/" in page.url]

      # Print filtered URLs
      for url in product_urls:
          print(url)

      This way, you only crawl the URLs that matter to your project.
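
      If you need more than one pattern, or want to guard against duplicate entries, a regular expression combined with a set works just as well. The path fragments below are purely illustrative; swap in the sections that matter to your project:

      import re

      from usp.tree import sitemap_tree_for_homepage

      # Fetch and parse the sitemap
      tree = sitemap_tree_for_homepage("https://www.asos.com/")

      # Illustrative path patterns -- adjust these to the sections you care about
      pattern = re.compile(r"/(product|men|women)/")

      # The set comprehension drops duplicate URLs before sorting
      matching_urls = sorted({page.url for page in tree.all_pages() if pattern.search(page.url)})

      for page_url in matching_urls:
          print(page_url)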

    • Storing URLs in a CSV File

      Instead of printing the URLs, you might want to save them for later use. Here’s how to export the extracted URLs into a CSV file:

      import csv
      from usp.tree import sitemap_tree_for_homepage

      # Define the target website
      url = "https://www.asos.com/"

      # Fetch and parse the sitemap
      tree = sitemap_tree_for_homepage(url)

      # Extract all URLs
      urls = [page.url for page in tree.all_pages()]

      # Save URLs to a CSV file
      csv_filename = "asos_sitemap_urls.csv"
      with open(csv_filename, "w", newline="", encoding="utf-8") as file:
          writer = csv.writer(file)
          writer.writerow(["URL"])  # Write header
          for page_url in urls:
              writer.writerow([page_url])

      print(f"Extracted {len(urls)} URLs and saved to {csv_filename}")

      Running this script will extract all sitemap URLs and save them into a CSV file, making them easy to analyze or use in other projects.
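
      You can extend the same idea to save extra columns alongside each URL. The variation below also writes each page's last_modified value, which USP typically parses from the sitemap's <lastmod> entries (it may be empty when the sitemap omits them):

      import csv
      from usp.tree import sitemap_tree_for_homepage

      # Fetch and parse the sitemap
      tree = sitemap_tree_for_homepage("https://www.asos.com/")

      # Save each URL together with its last-modified date (may be empty)
      csv_filename = "asos_sitemap_urls_with_lastmod.csv"
      with open(csv_filename, "w", newline="", encoding="utf-8") as file:
          writer = csv.writer(file)
          writer.writerow(["URL", "Last Modified"])  # Write header
          for page in tree.all_pages():
              writer.writerow([page.url, page.last_modified])

      print(f"Saved URLs and last-modified dates to {csv_filename}")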

  • Conclusion

    In this guide, we explored how to crawl XML sitemaps with Python using the ultimate-sitemap-parser (USP) library. Instead of manually parsing XML files or struggling with nested index sitemaps, USP simplifies the process with just a few lines of code.

    Here’s what we covered:

    • ✅ How to extract URLs directly from the ASOS sitemap.
    • ✅ How USP automatically handles nested sitemaps.
    • ✅ How to store extracted URLs in a CSV file for easy analysis.


    With USP, sitemap crawling becomes faster, cleaner, and more efficient, making it an excellent choice for SEO audits, competitor research, and large-scale web scraping projects.

    Next Steps
    If you’d like to dive deeper into USP, the library’s official documentation and its GitHub repository are good places to start.

    Whether you’re building a Python-based SEO tool, auditing your own website, or running a large-scale scraping project, USP is a powerful and reliable addition to your toolkit.

    🚀 Start experimenting today and unlock the full potential of sitemap crawling with Python.