How to Crawl XML Sitemaps with Python

Learn how to efficiently crawl and extract URLs from XML sitemaps using Python’s Ultimate Sitemap Parser (USP). Discover how to handle nested sitemaps, filter URLs, and save results for SEO audits, web scraping, and competitor analysis.

  • Introduction

    Sitemaps are a critical part of SEO and web crawling because they provide a structured list of URLs that a website wants search engines to index. Instead of scraping a site link by link, crawling the XML sitemap lets you discover all of a site's listed URLs far more efficiently.

    • The Challenges of Manual Sitemap Parsing

      While sitemaps make crawling easier, manually parsing them can be tricky:

      • Many websites use index sitemaps that link to multiple smaller sitemaps, requiring additional steps to process.

      • Large sitemaps can contain thousands of URLs, making extraction time-consuming and inefficient.

    • The Solution: Ultimate Sitemap Parser (USP)

      This is where ultimate-sitemap-parser (USP) comes in. It’s a powerful Python library built to handle complex sitemaps with ease. USP can:

      • Fetch and parse XML sitemaps automatically.

      • Handle nested index sitemaps without extra coding.

      • Extract URLs efficiently with just a simple function call.

    • What You’ll Learn in This Guide

      In this tutorial, we’ll walk you through how to use ultimate-sitemap-parser in Python to crawl the ASOS sitemap and extract all available URLs quickly. This approach saves time, reduces complexity, and provides a scalable way to handle sitemap crawling for SEO projects.

  • Prerequisites

    Before we start crawling XML sitemaps with Python, ensure your system is ready with the following setup:

    • Install Python

      To run the sitemap crawler, you’ll need Python 3.x installed on your computer.

      • Download the latest version of Python from the official Python website.

      • After installation, confirm it’s working by running the command below in your terminal or command prompt:

          python3 --version

      If Python is installed correctly, this will display the installed version number.

    • Install Ultimate Sitemap Parser (USP)

      The ultimate-sitemap-parser library makes it simple to fetch and parse sitemaps in Python. Install it using pip:

          pip install ultimate-sitemap-parser

      Once installed, you’re ready to start building your Python sitemap crawler.
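
      If you want a quick sanity check before writing any code, you can try importing the library's main entry point from the command line; it should print the message without errors:

          python3 -c "from usp.tree import sitemap_tree_for_homepage; print('USP is ready')"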

  • Crawling Sitemaps using "ultimate-sitemap-parser"

    Now that you’ve installed ultimate-sitemap-parser (USP), let’s see how to use it to crawl the ASOS sitemap and extract URLs efficiently. We’ll walk through the core features step by step.

    • Fetching the Sitemap and Extracting URLs

      The USP library makes it simple to fetch and parse XML sitemaps without manually handling XML files. With just a few lines of code, you can extract every URL from a sitemap.

      from usp.tree import sitemap_tree_for_homepage

      # Define the target website
      url = "https://www.asos.com/"

      # Fetch and parse the sitemap
      tree = sitemap_tree_for_homepage(url)

      # Extract and print all URLs
      for page in tree.all_pages():
          print(page.url)

      This script will automatically fetch the ASOS sitemap and print all available URLs.
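
      Each item yielded by all_pages() is a page object rather than a plain string. In current USP versions these objects also expose metadata parsed from the sitemap, such as last_modified and priority, which may be empty when the sitemap omits them. Here's a small sketch that previews the first few pages along with that metadata:

      from usp.tree import sitemap_tree_for_homepage

      # Fetch and parse the sitemap
      tree = sitemap_tree_for_homepage("https://www.asos.com/")

      # Preview the first five pages together with their optional metadata
      for index, page in enumerate(tree.all_pages()):
          print(page.url, page.last_modified, page.priority)
          if index >= 4:
              break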

    • Handling Nested Sitemaps

      A big advantage of USP is that it can automatically handle nested index sitemaps. Many websites (including ASOS) split their sitemaps into smaller files, such as:

      • Product pages
      • Category pages
      • Blog or content pages


      USP will:

      • Detect index sitemaps.
      • Fetch linked sub-sitemaps.
      • Recursively extract URLs from all of them.

      👉 The script above already handles this seamlessly — no extra code needed.
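
      If you'd like to see that nested structure for yourself, you can walk the parsed tree. The sketch below assumes, as in current USP versions, that index sitemaps expose a sub_sitemaps list while leaf sitemaps don't, so getattr keeps the recursion safe either way:

      from usp.tree import sitemap_tree_for_homepage

      # Fetch and parse the sitemap
      tree = sitemap_tree_for_homepage("https://www.asos.com/")

      def print_sitemap_tree(sitemap, depth=0):
          """Print each sitemap URL, indenting sub-sitemaps under their parent index."""
          print("  " * depth + sitemap.url)
          # Only index sitemaps carry a sub_sitemaps list; leaf sitemaps do not
          for sub_sitemap in getattr(sitemap, "sub_sitemaps", []):
              print_sitemap_tree(sub_sitemap, depth + 1)

      print_sitemap_tree(tree)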

    • Extracting Only a Subset of URLs

      Sometimes you only want specific types of pages, such as product URLs. Because all_pages() yields every URL in the tree, a simple list comprehension is all you need to filter them.

      For example, ASOS product URLs usually contain /product/. Here’s how you can filter only product pages:

      # Extract and filter only product page URLs
      product_urls = [page.url for page in tree.all_pages() if "/product/" in page.url]

      # Print filtered URLs
      for url in product_urls:
          print(url)

      This way, you only crawl the URLs that matter to your project.
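
      If you need more than one pattern, or want to guard against duplicate entries, a regular expression combined with a set works just as well. The path fragments below are purely illustrative; swap in the sections that matter to your project:

      import re

      from usp.tree import sitemap_tree_for_homepage

      # Fetch and parse the sitemap
      tree = sitemap_tree_for_homepage("https://www.asos.com/")

      # Illustrative path patterns -- adjust these to the sections you care about
      pattern = re.compile(r"/(product|men|women)/")

      # The set comprehension drops duplicate URLs before sorting
      matching_urls = sorted({page.url for page in tree.all_pages() if pattern.search(page.url)})

      for page_url in matching_urls:
          print(page_url)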

    • Storing URLs in a CSV File

      Instead of printing the URLs, you might want to save them for later use. Here’s how to export the extracted URLs into a CSV file:

      import csv
      from usp.tree import sitemap_tree_for_homepage

      # Define the target website
      url = "https://www.asos.com/"

      # Fetch and parse the sitemap
      tree = sitemap_tree_for_homepage(url)

      # Extract all URLs
      urls = [page.url for page in tree.all_pages()]

      # Save URLs to a CSV file
      csv_filename = "asos_sitemap_urls.csv"
      with open(csv_filename, "w", newline="", encoding="utf-8") as file:
          writer = csv.writer(file)
          writer.writerow(["URL"])  # Write header
          for page_url in urls:
              writer.writerow([page_url])

      print(f"Extracted {len(urls)} URLs and saved to {csv_filename}")

      Running this script will extract all sitemap URLs and save them into a CSV file, making them easy to analyze or use in other projects.
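
      You can extend the same idea to save extra columns alongside each URL. The variation below also writes each page's last_modified value, which USP typically parses from the sitemap's <lastmod> entries (it may be empty when the sitemap omits them):

      import csv
      from usp.tree import sitemap_tree_for_homepage

      # Fetch and parse the sitemap
      tree = sitemap_tree_for_homepage("https://www.asos.com/")

      # Save each URL together with its last-modified date (may be empty)
      csv_filename = "asos_sitemap_urls_with_lastmod.csv"
      with open(csv_filename, "w", newline="", encoding="utf-8") as file:
          writer = csv.writer(file)
          writer.writerow(["URL", "Last Modified"])  # Write header
          for page in tree.all_pages():
              writer.writerow([page.url, page.last_modified])

      print(f"Saved URLs and last-modified dates to {csv_filename}")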

  • Conclusion

    In this guide, we explored how to crawl XML sitemaps with Python using the ultimate-sitemap-parser (USP) library. Instead of manually parsing XML files or struggling with nested index sitemaps, USP simplifies the process with just a few lines of code.

    Here’s what we covered:

    • ✅ How to extract URLs directly from the ASOS sitemap.
    • ✅ How USP automatically handles nested sitemaps.
    • ✅ How to store extracted URLs in a CSV file for easy analysis.


    With USP, sitemap crawling becomes faster, cleaner, and more efficient, making it an excellent choice for SEO audits, competitor research, and large-scale web scraping projects.

    Next Steps
    If you’d like to dive deeper into USP, the library’s official documentation and its GitHub repository are good places to start.

    Whether you’re building a Python-based SEO tool, auditing your own website, or running a large-scale scraping project, USP is a powerful and reliable addition to your toolkit.

    🚀 Start experimenting today and unlock the full potential of sitemap crawling with Python.