zhenxiauto/GEO-Common-AI-Keyword-Results-Collection

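GEO Common AI Keyword Phone Number Extraction Tool

A multi-platform AI chat automation tool built on DrissionPage. It submits a keyword to each supported AI platform and extracts phone numbers from the answers.

Core Features

Object-oriented design: an abstract base class plus the template method pattern keep per-platform code small and reusable
8 AI platforms supported: DeepSeek, Doubao (豆包), Wenxin Yiyan (文心一言), Kimi, Tongyi Qianwen (通义千问), Tencent Yuanbao (腾讯元宝), Zhipu Qingyan (智谱清言), iFlytek Xinghuo (讯飞星火)
Automatic phone extraction: recognizes several phone number formats (400 hotlines, mobile, landline, and others)
Browser management: connects to an already-open browser or launches a new instance
Configurable: wait times, browser port, and other parameters can be adjusted
Exception handling: a dedicated exception hierarchy with clear error messages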

Project Structure


GEO-Common-AI-Keyword-Results-Collection/
├── core/                     # Core modules
│   ├── __init__.py          # Unified exports
│   ├── config.py            # Configuration management (singleton pattern)
│   ├── phone_extractor.py   # Phone number extractor
│   ├── browser_manager.py   # Browser lifecycle management
│   ├── base_platform.py     # Abstract platform base class
│   └── exceptions.py        # Custom exception classes
├── platforms/               # Platform implementations
│   ├── __init__.py         # Platform exports + PLATFORMS dict
│   ├── deepseek.py         # DeepSeek
│   ├── doubao.py           # Doubao
│   ├── wenxin.py           # Wenxin Yiyan
│   ├── kimi.py             # Kimi
│   ├── tongyi.py           # Tongyi Qianwen
│   ├── yuanbao.py          # Tencent Yuanbao
│   ├── chatglm.py          # Zhipu Qingyan
│   └── xinghuo.py          # iFlytek Xinghuo
└── run_all.py             # Runs all platforms in a batch

Quick Start

Requirements

Python 3.8+
DrissionPage (browser automation library)

Installing dependencies


pip install DrissionPage

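Basic usage


from platforms import DeepSeekPlatform

# Create a platform instance
platform = DeepSeekPlatform()

try:
    # Search for the keyword ("Rolex repair service center") and extract phone numbers
    phones = platform.search("劳力士维修服务中心")
finally:
    # Release resources even if the search raises
    platform.cleanup()

print(f"Phone numbers found: {phones}")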

Using the convenience function


from platforms import search_deepseek

# Search directly (resources are cleaned up automatically)
phones = search_deepseek("劳力士维修服务中心", "output.txt")

Searching all platforms in a batch


from platforms import PLATFORMS

keyword = "劳力士维修服务中心"
all_phones = []

for platform_name, platform_info in PLATFORMS.items():
    print(f"Searching {platform_name}...")
    search_func = platform_info["function"]
    phones = search_func(keyword)
    all_phones.extend(phones)

print(f"Found {len(all_phones)} phone numbers in total")
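
Several platforms can return the same number, and the loop above stops at the first platform that raises. The following is a minimal sketch of one way to merge results, assuming only the `PLATFORMS` layout shown above (a dict whose values hold a callable under `"function"`). `collect_unique` and the stub registry are hypothetical names used for illustration and are not part of the project.

```python
from typing import Callable, Dict, List


def collect_unique(platforms: Dict[str, dict], keyword: str) -> List[str]:
    """Run every platform's search function and merge results without duplicates."""
    seen = set()
    merged: List[str] = []
    for name, info in platforms.items():
        search_func: Callable[[str], List[str]] = info["function"]
        try:
            phones = search_func(keyword)
        except Exception as exc:  # one failing platform should not stop the batch
            print(f"{name} failed: {exc}")
            continue
        for phone in phones:
            if phone not in seen:  # keep the first occurrence, preserve order
                seen.add(phone)
                merged.append(phone)
    return merged


# Stub search functions standing in for the real ones (hypothetical data).
def _stub_a(keyword: str) -> List[str]:
    return ["400-123-4567", "021-12345678"]


def _stub_b(keyword: str) -> List[str]:
    return ["021-12345678", "13812345678"]


def _stub_broken(keyword: str) -> List[str]:
    raise RuntimeError("input box not found")


STUB_PLATFORMS = {
    "a": {"function": _stub_a},
    "broken": {"function": _stub_broken},
    "b": {"function": _stub_b},
}

unique_phones = collect_unique(STUB_PLATFORMS, "劳力士维修服务中心")
print(unique_phones)
```

With the real registry, `collect_unique(PLATFORMS, keyword)` would be the equivalent call, provided each search function returns a list of strings as the examples above suggest.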

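Core Classes

1. Config (configuration management)

Manages all configuration parameters in one place and supports updates at runtime.


from core import Config

# Get all settings
config = Config.get_all()

# Update settings
Config.update(BROWSER_PORT=9222, PAGE_LOAD_WAIT=5)

# Reset to defaults
Config.reset()

Default configuration:

Parameter	Default	Description
BROWSER_PORT	9222	Browser debugging port
BROWSER_HEADLESS	False	Whether to run headless
PAGE_LOAD_WAIT	3	Wait after page load (seconds)
INPUT_WAIT	1	Wait after input (seconds)
DEFAULT_OUTPUT_FILE	"phones.txt"	Default output file name
FILE_ENCODING	"utf-8"	Output file encoding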
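
To make the `get_all` / `update` / `reset` behaviour concrete, here is a minimal sketch of a class-level settings store with the defaults from the table. It is not the project's `core/config.py`; the name `SketchConfig`, the internal attributes, and the choice to reject unknown keys are all assumptions made for illustration.

```python
class SketchConfig:
    """Class-level settings store; all access goes through classmethods."""

    _DEFAULTS = {
        "BROWSER_PORT": 9222,
        "BROWSER_HEADLESS": False,
        "PAGE_LOAD_WAIT": 3,
        "INPUT_WAIT": 1,
        "DEFAULT_OUTPUT_FILE": "phones.txt",
        "FILE_ENCODING": "utf-8",
    }
    _values = dict(_DEFAULTS)

    @classmethod
    def get_all(cls) -> dict:
        # Return a copy so callers cannot mutate the shared state by accident.
        return dict(cls._values)

    @classmethod
    def update(cls, **overrides) -> None:
        # Assumption: unknown keys are treated as typos and rejected.
        unknown = set(overrides) - set(cls._DEFAULTS)
        if unknown:
            raise KeyError(f"Unknown config keys: {sorted(unknown)}")
        cls._values.update(overrides)

    @classmethod
    def reset(cls) -> None:
        cls._values = dict(cls._DEFAULTS)


SketchConfig.update(PAGE_LOAD_WAIT=5)
print(SketchConfig.get_all()["PAGE_LOAD_WAIT"])
SketchConfig.reset()
```

Keeping the state on the class rather than on instances gives every caller the same view of the settings, which is the effect the singleton pattern mentioned in the project structure aims for.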

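Wait mechanism:

Waiting for an AI response does not use a fixed delay. Instead, the code watches page element content and treats the output as complete once it stops changing:

Doubao: checks that the text fragment around "已完成思考" ("thinking complete") is stable, then watches the answer content until it stops changing
Other platforms: follow the Doubao implementation and override _wait_for_response()

Example (the Doubao implementation, using DrissionPage):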


import time

def _wait_for_thinking_complete(self) -> None:
    """Wait for Doubao to finish thinking by checking that the content is stable."""
    stable_count = 0
    stable_threshold = 3   # consecutive unchanged checks required
    check_interval = 1     # seconds between checks
    last_content = ""

    print("Waiting for thinking to complete...")

    while True:
        body_text = self._page.run_js('return document.body.innerText;')

        if '已完成思考' in body_text:
            # Take the text fragment around the "已完成思考" ("thinking complete") marker
            think_index = body_text.index('已完成思考')
            start = max(0, think_index - 50)
            end = min(len(body_text), think_index + 50)
            current_content = body_text[start:end]

            if current_content == last_content:
                stable_count += 1
                print(f"Thinking content stable ({stable_count}/{stable_threshold})")
                if stable_count >= stable_threshold:
                    print("Thinking complete")
                    return
            else:
                stable_count = 0
                last_content = current_content
                print("Thinking content updated")
        else:
            stable_count = 0
            last_content = ""
            print("Waiting for the '已完成思考' marker...")

        time.sleep(check_interval)

def _wait_for_answer_complete(self) -> None:
    """Wait for Doubao to finish answering by checking that the content is stable."""
    stable_count = 0
    stable_threshold = 3
    check_interval = 1
    last_answer = ""

    print("Waiting for the answer to complete...")

    while True:
        # Locate the answer element
        answer = self._page.ele('css:.answer-content', timeout=2)
        # Skip empty text so a blank placeholder is not mistaken for a finished answer
        current_answer = answer.text if answer else ""
        if current_answer:
            if current_answer == last_answer:
                stable_count += 1
                print(f"Answer content stable ({stable_count}/{stable_threshold})")
                if stable_count >= stable_threshold:
                    print("Answer complete")
                    return
            else:
                stable_count = 0
                last_answer = current_answer
                print(f"Answer content updated, current length: {len(current_answer)}")
        time.sleep(check_interval)

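Core principles:

No timeout parameter is used; a while True loop keeps checking
Completion is judged by whether the content is still changing
Thinking complete: the content around a specific marker (such as "已完成思考") is stable
Answer complete: the content of the answer area no longer changes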
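
The stability check can be separated from the browser so that it is testable on its own. The following is a sketch under stated assumptions: `wait_until_stable` and its parameters are hypothetical names, and `read_content` stands in for whatever reads text from the page. The optional `max_wait` is an addition that the original code does not have; leaving it at `None` keeps the unbounded behaviour described above.

```python
import time
from typing import Callable, Optional


def wait_until_stable(
    read_content: Callable[[], str],
    stable_threshold: int = 3,
    check_interval: float = 1.0,
    max_wait: Optional[float] = None,
) -> str:
    """Poll read_content() until it returns the same non-empty text
    stable_threshold times in a row after first being seen, then return it."""
    stable_count = 0
    last = ""
    start = time.monotonic()
    while True:
        current = read_content()
        if current and current == last:
            stable_count += 1
            if stable_count >= stable_threshold:
                return current
        else:
            # Content changed (or is still empty): restart the count.
            stable_count = 0
            last = current
        if max_wait is not None and time.monotonic() - start > max_wait:
            raise TimeoutError(f"content not stable after {max_wait}s")
        time.sleep(check_interval)


# Simulated streaming output: the text grows, then stops changing.
readings = iter(["a", "ab", "abc", "abc", "abc", "abc"])
final_text = wait_until_stable(lambda: next(readings), check_interval=0)
print(final_text)
```

In a platform class, `read_content` could be a small lambda that returns the text of the answer element, so the loop itself would not need to be rewritten for each platform.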

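2. PhoneExtractor (phone number extractor)

Extracts phone numbers from text and saves them.


from core import PhoneExtractor

extractor = PhoneExtractor()

# Extract phone numbers (the sample text reads "Customer hotline: 400-123-4567")
phones = extractor.extract("客服热线：400-123-4567")

# Save to a file
extractor.save_to_file(phones, "output.txt", "DeepSeek")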

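Important: extract only from the AI answer area

PhoneExtractor is a general-purpose text extractor and does not identify regions of the page. Each platform should override _extract_content() so that it returns only the AI answer content, which avoids picking up unrelated phone numbers from navigation bars, footers, and similar areas.

Correct example (using DrissionPage):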


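def _extract_content(self) -> str:
    """Extract only the AI answer area."""
    # Option 1: locate the answer element directly (recommended)
    answer_element = self._page.ele('css:.ai-answer-content', timeout=5)
    if answer_element:
        return answer_element.text

    # Option 2: scan for an element that looks like the answer
    all_divs = self._page.eles('xpath://div')
    for div in all_divs:
        text = div.text
        # Keep only long blocks that contain "电话" (phone) and exclude
        # blocks containing "用户" (user) or "搜索" (search)
        if '电话' in text and '用户' not in text and '搜索' not in text and len(text) > 500:
            return text[:5000]

    # Fallback: no specific area found, so return the whole page.
    # Note that this can include numbers from outside the answer.
    return self._page.html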

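DrissionPage element operations:

Method / property	Description	Example
`.ele()`	Locate a single element	`page.ele('xpath://div[@class="answer"]')`
`.eles()`	Locate multiple elements	`page.eles('css:.message')`
`.text`	Get the element's text	`element.text`
`.html`	Get the element's HTML	`element.html`
`.click()`	Click the element	`element.click()`
`.input()`	Enter text	`element.input('内容')`

Supported formats (maintained centrally in the base class):

400/800 numbers: 400-123-4567
Mobile numbers: 13812345678, 138-1234-5678
Landline numbers: 021-12345678, 010-12345678
International numbers: +86 21 2319 3688

To support a new format, add a regular expression in PhoneExtractor._get_patterns().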
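
As an illustration of how such patterns might look, here is a sketch that covers the sample numbers listed above. The regular expressions, `PHONE_PATTERNS`, and `extract_phones` are assumptions written for this example; the project's actual patterns in `PhoneExtractor._get_patterns()` may differ.

```python
import re
from typing import List

# Hypothetical patterns, ordered so that the most specific form is tried first.
PHONE_PATTERNS = [
    r"\+86[\s-]?\d{2,3}[\s-]?\d{4}[\s-]?\d{4}",  # international, e.g. +86 21 2319 3688
    r"[48]00-?\d{3}-?\d{4}",                      # 400/800 hotline, e.g. 400-123-4567
    r"1[3-9]\d-?\d{4}-?\d{4}",                    # mobile, e.g. 138-1234-5678
    r"0\d{2,3}-\d{7,8}",                          # landline, e.g. 021-12345678
]

# Digit guards on both sides avoid matching inside longer digit runs.
_PHONE_RE = re.compile(r"(?<!\d)(?:" + "|".join(PHONE_PATTERNS) + r")(?!\d)")


def extract_phones(text: str) -> List[str]:
    """Return phone-like substrings in order of appearance, without duplicates."""
    found: List[str] = []
    for match in _PHONE_RE.finditer(text):
        number = match.group(0)
        if number not in found:
            found.append(number)
    return found


sample = (
    "Hotline: 400-123-4567, mobile 13812345678 or 138-1234-5678, "
    "landline 021-12345678, international +86 21 2319 3688."
)
print(extract_phones(sample))
```

Adding a format under this scheme means appending one more entry to the pattern list, which matches the extension point described above.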

3. BrowserManager (browser manager)

Manages the browser connection and its lifecycle.


from core import BrowserManager

# Option 1: manual management
manager = BrowserManager(port=9222)
page = manager.get_page()
# Use page...
manager.close()

# Option 2: context manager (recommended)
with BrowserManager() as manager:
    page = manager.get_page()
    # Use page...
    # Closed automatically on exit
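
The context-manager form is recommended because cleanup still runs when the body raises. The sketch below shows the mechanism with a stand-in class; `ManagedResource` is hypothetical, starts no browser, and is not the project's `BrowserManager`.

```python
class ManagedResource:
    """Hypothetical stand-in for a browser manager; no real browser is started."""

    def __init__(self, port: int = 9222) -> None:
        self.port = port
        self.closed = False

    def get_page(self) -> str:
        if self.closed:
            raise RuntimeError("manager already closed")
        return f"page@{self.port}"

    def close(self) -> None:
        self.closed = True

    def __enter__(self) -> "ManagedResource":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        self.close()   # always runs, whether or not the body raised
        return False   # do not swallow the exception


manager = ManagedResource()
try:
    with manager:
        page = manager.get_page()
        raise ValueError("simulated failure while using the page")
except ValueError:
    pass

print(manager.closed)
```

With manual management, the same guarantee requires wrapping the work in `try` / `finally` and calling `close()` in the `finally` branch.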

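4. PlatformBase (platform base class)

The abstract base class for all platforms; it defines the common interface.

Core methods:

search(keyword, output_file) - search and extract phone numbers (abstract method)
_execute_search(keyword, output_file) - template method that defines the search flow
_navigate_to_platform() - navigate to the platform
_input_keyword(keyword) - enter the keyword
_send_message() - send the message
_wait_for_response() - wait for the response (overridable)
_extract_content() - extract the content (overridable)

Overridable methods:

Subclasses can override the following methods to implement platform-specific logic:

_get_input_xpath() - custom locator for the input box
_get_send_button_xpath() - custom locator for the send button
_wait_for_response() - custom wait logic
_extract_content() - custom content extraction logic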
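
The relationship between `search()`, `_execute_search()`, and the overridable hooks is the template method pattern. The following sketch shows that structure without a browser. `SketchPlatformBase` and `DemoPlatform` are hypothetical, the hook bodies only record that they ran, the extraction step is reduced to a single regex, and `output_file` is accepted but ignored.

```python
import re
from abc import ABC, abstractmethod
from typing import List, Optional


class SketchPlatformBase(ABC):
    """Illustrative base class; the real PlatformBase drives a browser instead."""

    def __init__(self) -> None:
        self.steps: List[str] = []  # records the order in which hooks run

    @abstractmethod
    def search(self, keyword: str, output_file: Optional[str] = None) -> List[str]:
        """Public entry point that each platform must provide."""

    def _execute_search(self, keyword: str, output_file: Optional[str] = None) -> List[str]:
        # Template method: the sequence is fixed here, the hooks vary per platform.
        self._navigate_to_platform()
        self._input_keyword(keyword)
        self._send_message()
        self._wait_for_response()
        content = self._extract_content()
        return re.findall(r"400-\d{3}-\d{4}", content)  # simplified extraction

    def _navigate_to_platform(self) -> None:
        self.steps.append("navigate")

    def _input_keyword(self, keyword: str) -> None:
        self.steps.append(f"input:{keyword}")

    def _send_message(self) -> None:
        self.steps.append("send")

    def _wait_for_response(self) -> None:
        self.steps.append("wait")

    def _extract_content(self) -> str:
        self.steps.append("extract")
        return ""


class DemoPlatform(SketchPlatformBase):
    def search(self, keyword: str, output_file: Optional[str] = None) -> List[str]:
        return self._execute_search(keyword, output_file)

    def _extract_content(self) -> str:
        # Platform-specific hook: only this step differs from the base class.
        self.steps.append("extract")
        return "Hotline: 400-123-4567"


demo = DemoPlatform()
print(demo.search("repair"))
print(demo.steps)
```

Because the sequence lives in one place, a new platform only overrides the hooks whose behaviour differs.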

5. Exception classes

A dedicated exception hierarchy makes error handling straightforward.


from core.exceptions import (
    PlatformError,
    BrowserConnectionError,
    InputBoxNotFoundError,
    ResponseTimeoutError
)
from platforms import DeepSeekPlatform

try:
    platform = DeepSeekPlatform()
    phones = platform.search("关键词")  # "关键词" means "keyword"
except InputBoxNotFoundError as e:
    print(f"Input box not found: {e}")
except ResponseTimeoutError as e:
    print(f"Response timed out: {e}")
except PlatformError as e:
    print(f"Platform error: {e}")
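
The order of the `except` clauses above matters only if the specific exceptions derive from `PlatformError`. The sketch below assumes that they do; this is an inference from the handler order, not something confirmed by `core/exceptions.py`. The classes are redefined locally for illustration, and `describe` is a hypothetical helper.

```python
class PlatformError(Exception):
    """Assumed common base class for platform failures."""


class BrowserConnectionError(PlatformError):
    """Assumed subclass: the browser could not be reached."""


class InputBoxNotFoundError(PlatformError):
    """Assumed subclass: the chat input box was not located."""


class ResponseTimeoutError(PlatformError):
    """Assumed subclass: the AI response did not arrive in time."""


def describe(exc: Exception) -> str:
    """Route an exception through specific handlers first, then the base class."""
    try:
        raise exc
    except InputBoxNotFoundError as e:
        return f"input box not found: {e}"
    except ResponseTimeoutError as e:
        return f"response timed out: {e}"
    except PlatformError as e:
        # Catches every remaining subclass, such as BrowserConnectionError.
        return f"platform error: {e}"


print(describe(InputBoxNotFoundError("textarea missing")))
print(describe(BrowserConnectionError("port 9222 refused")))
```

If `PlatformError` were listed first, it would catch the subclasses as well and the more specific handlers would never run.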

Supported AI Platforms

Platform	Class	Function	Status
DeepSeek	`DeepSeekPlatform`	`search_deepseek()`	✅ Tested
Doubao	`DoubaoPlatform`	`search_doubao()`	✅ Tested
Wenxin Yiyan	`WenxinPlatform`	`search_wenxin()`	Pending rewrite
Kimi	`KimiPlatform`	`search_kimi()`	Pending rewrite
Tongyi Qianwen	`TongyiPlatform`	`search_tongyi()`	Pending rewrite
Tencent Yuanbao	`YuanbaoPlatform`	`search_yuanbao()`	Pending rewrite
Zhipu Qingyan	`ChatglmPlatform`	`search_chatglm()`	Pending rewrite
iFlytek Xinghuo	`XinghuoPlatform`	`search_xinghuo()`	Pending rewrite

Adding a New Platform

Inherit from PlatformBase, define two class attributes, and implement search():


from core import PlatformBase
from typing import List, Optional

class NewPlatform(PlatformBase):
    PLATFORM_NAME = "New platform name"
    PLATFORM_URL = "https://example.com"

    def search(self, keyword: str, output_file: Optional[str] = None) -> List[str]:
        return self._execute_search(keyword, output_file)

    # Optional: override methods to implement platform-specific logic
    def _get_input_xpath(self) -> str:
        # "输入内容" ("enter content") is the placeholder text shown on the page
        return 'xpath://textarea[@placeholder="输入内容"]'

    def _get_send_button_xpath(self) -> str:
        return 'xpath://button[@type="submit"]'

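Important: implement _wait_for_response()

Completion indicators differ between platforms, so overriding _wait_for_response() is recommended. Decide that output has finished by watching page element content for changes: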


import time

def _wait_for_response(self) -> None:
    """Custom wait logic."""
    stable_count = 0
    stable_threshold = 5   # consecutive unchanged checks required
    check_interval = 1     # seconds between checks
    last_content = ""

    print("Waiting for the AI response...")

    while True:
        # Method 1: check whether a specific marker is stable, such as
        # "生成完成" ("generation complete"); replace it with the platform's own marker
        body_text = self._page.run_js('return document.body.innerText;')
        if '生成完成' in body_text:
            # Take the text fragment around the marker
            index = body_text.index('生成完成')
            start = max(0, index - 30)
            end = min(len(body_text), index + 30)
            current_content = body_text[start:end]

            if current_content == last_content:
                stable_count += 1
                print(f"Completion marker stable ({stable_count}/{stable_threshold})")
                if stable_count >= stable_threshold:
                    print("AI response complete")
                    return
            else:
                stable_count = 0
                last_content = current_content
                print("Content updated")
        else:
            # Method 2: check whether the answer area content is stable
            answer_element = self._page.ele('css:.ai-answer-content', timeout=2)
            # Skip empty text so a blank placeholder is not mistaken for a finished answer
            current_content = answer_element.text if answer_element else ""
            if current_content:
                if current_content == last_content:
                    stable_count += 1
                    print(f"Answer content stable ({stable_count}/{stable_threshold})")
                    if stable_count >= stable_threshold:
                        print("AI response complete")
                        return
                else:
                    stable_count = 0
                    last_content = current_content
                    print(f"Answer content updated, current length: {len(current_content)}")

        time.sleep(check_interval)