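How to Parse Sitemap.xml with Python and Extract Every URL

Site owners and developers sometimes need to pull the URLs out of a site's sitemap for SEO analysis, manual submission to search engines, content audits, or other automation tasks. This article shows how to do that with a short Python script.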

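What Is a Sitemap?

A sitemap is an XML file that lists the URLs a site wants search engines to crawl, optionally with metadata such as each page's last-modified date. It helps search engines crawl the site and understand its structure. It usually lives in the site's root directory as sitemap.xml, for example https://xsfly.com/sitemap.xml.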
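
To make the file format concrete, here is a minimal sketch that parses a hand-written sitemap held in a string. The sample document, the /about path, and the helper name list_locs are assumptions made for illustration; a real sitemap will contain different entries.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-entry sitemap, written by hand for illustration only.
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://xsfly.com/</loc>
    <lastmod>2024-01-01</lastmod>
  </url>
  <url>
    <loc>https://xsfly.com/about</loc>
  </url>
</urlset>
"""

# Every element in a sitemap belongs to this XML namespace.
NS = {"ns": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def list_locs(xml_text):
    """Return the text of every <loc> that sits inside a <url> element."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [loc.text.strip() for loc in root.findall("ns:url/ns:loc", NS) if loc.text]


print(list_locs(SAMPLE_SITEMAP))
```

Each page is one <url> element whose <loc> child holds the address. Because the root element declares a default namespace, lookups need the namespace mapping; a bare findall("url") would match nothing.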

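Why Extract URLs from a Sitemap?

SEO optimization: analyze the site's content structure and coverage
Content audit: make sure every important page is indexed correctly
Data analysis: get a list of all the site's pages for further processing
Monitoring: periodically check that the site's pages are reachable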

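Hands-On Code: Extracting Sitemap URLs with Python

Below is a complete Python script that downloads a sitemap and extracts every URL belonging to a specific site (here, https://xsfly.com/). It depends on the third-party requests library: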

import requests
import xml.etree.ElementTree as ET

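def fetch_and_extract_sitemap_links(sitemap_url, output_file='xsfly_links.txt'):
    """
    Fetch the sitemap at the given URL, extract every link that starts with
    https://xsfly.com/, and save the links to a text file.
    
    Parameters:
    sitemap_url (str): URL of the sitemap.xml file
    output_file (str): name of the output file
    """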
    try:
        # Set request headers to mimic a browser
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
        }
        
        # Send a GET request to fetch the sitemap
        print(f"Fetching sitemap data from {sitemap_url} ...")
        response = requests.get(sitemap_url, headers=headers, timeout=10)
        
        # Raise an exception on HTTP error status codes (4xx/5xx)
        response.raise_for_status()
        
        # Parse the XML content
        root = ET.fromstring(response.content)
        
        # Define the sitemap XML namespace
        namespaces = {
            'ns': 'http://www.sitemaps.org/schemas/sitemap/0.9'
        }
        
        # Find all <url> elements
        urls = root.findall('.//ns:url', namespaces)
        
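        # Keep only links that start with https://xsfly.com/
        xsfly_links = []
        for url in urls:
            loc = url.find('ns:loc', namespaces)
            # loc.text is None for an empty <loc/>, so fall back to ''
            link = (loc.text or '').strip() if loc is not None else ''
            if link.startswith('https://xsfly.com/'):
                xsfly_links.append(link)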
        
        # Save the links to a text file, one per line
        with open(output_file, 'w', encoding='utf-8') as file:
            for link in xsfly_links:
                file.write(link + '\n')
        
        print(f"Extracted {len(xsfly_links)} links")
        print(f"Results saved to {output_file}")
        
        return xsfly_links
        
    except requests.exceptions.RequestException as e:
        print(f"Network request error: {e}")
        return []
    except ET.ParseError as e:
        print(f"XML parse error: {e}")
        return []
    except Exception as e:
        print(f"Unexpected error: {e}")
        return []

# Usage example
if __name__ == "__main__":
    sitemap_url = "https://xsfly.com/sitemap.xml"
    links = fetch_and_extract_sitemap_links(sitemap_url)
    
    # Optional: print the first few links as a preview
    if links:
        print("\nPreview of the first 5 links:")
        for i, link in enumerate(links[:5], 1):
            print(f"{i}. {link}")
        if len(links) > 5:
            print("... (more links saved to the file)")
    else:
        print("[WARNING] No links were extracted; check the sitemap URL or your network connection")
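
One case the script above does not cover: on larger sites, sitemap.xml is sometimes a sitemap index, whose <sitemap> entries point to further sitemap files rather than to pages. In that case the script finds no <url> elements and reports zero links. The sketch below shows one possible way to follow such an index recursively. The function name collect_urls, the fetch parameter (any callable that maps a URL to raw XML bytes), and the in-memory FAKE_SITE data are all assumptions made for illustration, not part of the script above.

```python
import xml.etree.ElementTree as ET

NS = {"ns": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def collect_urls(sitemap_url, fetch, seen=None):
    """Return page URLs from a sitemap, following sitemap index files.

    fetch is a hypothetical callable: it takes a URL and returns XML bytes.
    seen tracks visited sitemaps so a circular index cannot loop forever.
    """
    seen = set() if seen is None else seen
    if sitemap_url in seen:
        return []
    seen.add(sitemap_url)

    root = ET.fromstring(fetch(sitemap_url))
    urls = []
    if root.tag.endswith("sitemapindex"):
        # Index file: each <sitemap><loc> points to another sitemap.
        for loc in root.findall("ns:sitemap/ns:loc", NS):
            if loc.text:
                urls.extend(collect_urls(loc.text.strip(), fetch, seen))
    else:
        # Regular sitemap: each <url><loc> is a page address.
        for loc in root.findall("ns:url/ns:loc", NS):
            if loc.text:
                urls.append(loc.text.strip())
    return urls


# Made-up in-memory "site" so the sketch runs without network access.
FAKE_SITE = {
    "https://xsfly.com/sitemap.xml": (
        b'<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        b"<sitemap><loc>https://xsfly.com/sitemap-posts.xml</loc></sitemap>"
        b"</sitemapindex>"
    ),
    "https://xsfly.com/sitemap-posts.xml": (
        b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        b"<url><loc>https://xsfly.com/post-1</loc></url>"
        b"</urlset>"
    ),
}

print(collect_urls("https://xsfly.com/sitemap.xml", FAKE_SITE.__getitem__))
```

To run this against a live site, fetch could be a small wrapper around requests.get that returns response.content, using the same headers and timeout as the main script.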