
Commit e712282

Merge pull request #15 from SHUzhangshuo/main
Added test_model_extractor, an extractor that can be used directly for evaluation
2 parents 129c079 + 74c0036 commit e712282

8 files changed

Lines changed: 128 additions & 5 deletions


data/test_model.jsonl

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,4 @@
1+
{"id": "sample-001-programming-tutorial", "html": "<html><body>\n <h1 cc-select=\"true\">Python编程教程</h1>\n <p cc-select=\"true\">这是一个Python基础教程,展示如何定义函数。</p>\n <pre cc-select=\"true\"><code>def greet(name):\n \"\"\"问候函数\"\"\"\n return f\"Hello, {name}!\"\n\n# 使用示例\nresult = greet(\"World\")\nprint(result)</code></pre>\n <p cc-select=\"true\">这个函数可以用来问候任何人。</p>\n </body></html>", "groundtruth_llm_webkit_md": "# Python编程教程\n\n这是一个Python基础教程,展示如何定义函数。\n\n```python\ndef greet(name):\n \"\"\"问候函数\"\"\"\n return f\"Hello, {name}!\"\n\n# 使用示例\nresult = greet(\"World\")\nprint(result)\n```\n\n这个函数可以用来问候任何人。", "groundtruth_content_list": [{"type": "heading", "content": "Python编程教程", "level": 1}, {"type": "paragraph", "content": "这是一个Python基础教程,展示如何定义函数。"}, {"type": "code", "content": "def greet(name):\n \"\"\"问候函数\"\"\"\n return f\"Hello, {name}!\"\n\n# 使用示例\nresult = greet(\"World\")\nprint(result)"}, {"type": "paragraph", "content": "这个函数可以用来问候任何人。"}], "llm_webkit_md": "# Python编程教程\n\n这是一个Python基础教程,展示如何定义函数。\n\n```python\ndef greet(name):\n \"\"\"问候函数\"\"\"\n return f\"Hello, {name}!\"\n\n# 使用示例\nresult = greet(\"World\")\nprint(result)\n```\n\n这个函数可以用来问候任何人。", "content_list": [{"type": "heading", "content": "Python编程教程", "level": 1}, {"type": "paragraph", "content": "这是一个Python基础教程,展示如何定义函数。"}, {"type": "code", "content": "def greet(name):\n \"\"\"问候函数\"\"\"\n return f\"Hello, {name}!\"\n\n# 使用示例\nresult = greet(\"World\")\nprint(result)"}, {"type": "paragraph", "content": "这个函数可以用来问候任何人。"}], "url": "https://python-tutorial.example.com/functions", "domain": null, "language": "en", "content_type": "programming", "difficulty": null, "tags": null}
2+
{"id": "sample-002-math-formulas", "html": "<html><body>\n <h1 cc-select=\"true\">数学公式示例</h1>\n <p cc-select=\"true\">这里展示一些基本的数学公式。</p>\n <p cc-select=\"true\">勾股定理:a² + b² = c²</p>\n <div cc-select=\"true\" class=\"formula\">\n <p>二次方程的解为:</p>\n <p>x = (-b ± √(b² - 4ac)) / 2a</p>\n </div>\n <p cc-select=\"true\">欧拉公式是数学中最美丽的公式之一:e^(iπ) + 1 = 0</p>\n <table cc-select=\"true\">\n <tr><th>函数</th><th>导数</th></tr>\n <tr><td>x²</td><td>2x</td></tr>\n <tr><td>sin(x)</td><td>cos(x)</td></tr>\n </table>\n </body></html>", "groundtruth_llm_webkit_md": "# 数学公式示例\n\n这里展示一些基本的数学公式。\n\n勾股定理:$a^2 + b^2 = c^2$\n\n二次方程的解为:\n\n$$x = \\frac{-b \\pm \\sqrt{b^2 - 4ac}}{2a}$$\n\n欧拉公式是数学中最美丽的公式之一:$e^{i\\pi} + 1 = 0$\n\n| 函数 | 导数 |\n|------|------|\n| x² | 2x |\n| sin(x) | cos(x) |", "groundtruth_content_list": [{"type": "heading", "content": "数学公式示例", "level": 1}, {"type": "paragraph", "content": "这里展示一些基本的数学公式。"}, {"type": "paragraph", "content": "勾股定理:a² + b² = c²"}, {"type": "paragraph", "content": "二次方程的解为:"}, {"type": "equation-interline", "content": "x = \\frac{-b \\pm \\sqrt{b^2 - 4ac}}{2a}"}, {"type": "paragraph", "content": "欧拉公式是数学中最美丽的公式之一:e^(iπ) + 1 = 0"}, {"type": "table", "content": "| 函数 | 导数 |\n|------|------|\n| x² | 2x |\n| sin(x) | cos(x) |"}], "llm_webkit_md": "# 数学公式示例\n\n这里展示一些基本的数学公式。\n\n勾股定理:$a^2 + b^2 = c^2$\n\n二次方程的解为:\n\n$$x = \\frac{-b \\pm \\sqrt{b^2 - 4ac}}{2a}$$\n\n欧拉公式是数学中最美丽的公式之一:$e^{i\\pi} + 1 = 0$\n\n| 函数 | 导数 |\n|------|------|\n| x² | 2x |\n| sin(x) | cos(x) |", "content_list": [{"type": "heading", "content": "数学公式示例", "level": 1}, {"type": "paragraph", "content": "这里展示一些基本的数学公式。"}, {"type": "paragraph", "content": "勾股定理:a² + b² = c²"}, {"type": "paragraph", "content": "二次方程的解为:"}, {"type": "equation-interline", "content": "x = \\frac{-b \\pm \\sqrt{b^2 - 4ac}}{2a}"}, {"type": "paragraph", "content": "欧拉公式是数学中最美丽的公式之一:e^(iπ) + 1 = 0"}, {"type": "table", "content": "| 函数 | 导数 |\n|------|------|\n| x² | 2x |\n| sin(x) | cos(x) |"}], "url": 
"https://math-examples.edu/formulas", "domain": null, "language": "zh", "content_type": "academic", "difficulty": null, "tags": null}
3+
{"id": "sample-003-data-analysis", "html": "<html><body>\n <h1 cc-select=\"true\">数据分析报告</h1>\n <p cc-select=\"true\">以下是2024年第一季度的销售数据分析。</p>\n <h2 cc-select=\"true\">数据处理代码</h2>\n <pre cc-select=\"true\"><code>import pandas as pd\nimport numpy as np\n\n# 读取数据\ndf = pd.read_csv('sales_q1_2024.csv')\n\n# 计算统计信息\nmonthly_avg = df.groupby('month')['sales'].mean()\nprint(f\"平均销售额: {monthly_avg}\")</code></pre>\n <h2 cc-select=\"true\">销售统计</h2>\n <table cc-select=\"true\">\n <tr><th>月份</th><th>销售额(万元)</th><th>增长率</th></tr>\n <tr><td>1月</td><td>120.5</td><td>+15.2%</td></tr>\n <tr><td>2月</td><td>135.8</td><td>+12.7%</td></tr>\n <tr><td>3月</td><td>148.3</td><td>+9.2%</td></tr>\n </table>\n <p cc-select=\"true\">标准差公式:σ = √(Σ(xi - μ)² / n)</p>\n <p cc-select=\"true\">总体来看,第一季度销售表现良好,呈现稳定增长趋势。</p>\n </body></html>", "groundtruth_llm_webkit_md": "# 数据分析报告\n\n以下是2024年第一季度的销售数据分析。\n\n## 数据处理代码\n\n```python\nimport pandas as pd\nimport numpy as np\n\n# 读取数据\ndf = pd.read_csv('sales_q1_2024.csv')\n\n# 计算统计信息\nmonthly_avg = df.groupby('month')['sales'].mean()\nprint(f\"平均销售额: {monthly_avg}\")\n```\n\n## 销售统计\n\n| 月份 | 销售额(万元) | 增长率 |\n|------|-------------|--------|\n| 1月 | 120.5 | +15.2% |\n| 2月 | 135.8 | +12.7% |\n| 3月 | 148.3 | +9.2% |\n\n标准差公式:$\\sigma = \\sqrt{\\frac{\\Sigma(x_i - \\mu)^2}{n}}$\n\n总体来看,第一季度销售表现良好,呈现稳定增长趋势。", "groundtruth_content_list": [{"type": "heading", "content": "数据分析报告", "level": 1}, {"type": "paragraph", "content": "以下是2024年第一季度的销售数据分析。"}, {"type": "heading", "content": "数据处理代码", "level": 2}, {"type": "code", "content": "import pandas as pd\nimport numpy as np\n\n# 读取数据\ndf = pd.read_csv('sales_q1_2024.csv')\n\n# 计算统计信息\nmonthly_avg = df.groupby('month')['sales'].mean()\nprint(f\"平均销售额: {monthly_avg}\")"}, {"type": "heading", "content": "销售统计", "level": 2}, {"type": "table", "content": "| 月份 | 销售额(万元) | 增长率 |\n|------|-------------|--------|\n| 1月 | 120.5 | +15.2% |\n| 2月 | 135.8 | +12.7% |\n| 3月 | 148.3 | +9.2% |"}, {"type": "paragraph", "content": 
"标准差公式:σ = √(Σ(xi - μ)² / n)"}, {"type": "paragraph", "content": "总体来看,第一季度销售表现良好,呈现稳定增长趋势。"}], "llm_webkit_md": "# 数据分析报告\n\n以下是2024年第一季度的销售数据分析。\n\n## 数据处理代码\n\n```python\nimport pandas as pd\nimport numpy as np\n\n# 读取数据\ndf = pd.read_csv('sales_q1_2024.csv')\n\n# 计算统计信息\nmonthly_avg = df.groupby('month')['sales'].mean()\nprint(f\"平均销售额: {monthly_avg}\")\n```\n\n## 销售统计\n\n| 月份 | 销售额(万元) | 增长率 |\n|------|-------------|--------|\n| 1月 | 120.5 | +15.2% |\n| 2月 | 135.8 | +12.7% |\n| 3月 | 148.3 | +9.2% |\n\n标准差公式:$\\sigma = \\sqrt{\\frac{\\Sigma(x_i - \\mu)^2}{n}}$\n\n总体来看,第一季度销售表现良好,呈现稳定增长趋势。", "content_list": [{"type": "heading", "content": "数据分析报告", "level": 1}, {"type": "paragraph", "content": "以下是2024年第一季度的销售数据分析。"}, {"type": "heading", "content": "数据处理代码", "level": 2}, {"type": "code", "content": "import pandas as pd\nimport numpy as np\n\n# 读取数据\ndf = pd.read_csv('sales_q1_2024.csv')\n\n# 计算统计信息\nmonthly_avg = df.groupby('month')['sales'].mean()\nprint(f\"平均销售额: {monthly_avg}\")"}, {"type": "heading", "content": "销售统计", "level": 2}, {"type": "table", "content": "| 月份 | 销售额(万元) | 增长率 |\n|------|-------------|--------|\n| 1月 | 120.5 | +15.2% |\n| 2月 | 135.8 | +12.7% |\n| 3月 | 148.3 | +9.2% |"}, {"type": "paragraph", "content": "标准差公式:σ = √(Σ(xi - μ)² / n)"}, {"type": "paragraph", "content": "总体来看,第一季度销售表现良好,呈现稳定增长趋势。"}], "url": "https://data-report.company.com/q1-2024-analysis", "domain": null, "language": "zh", "content_type": "business", "difficulty": null, "tags": null}
4+
{"id": "sample-004-algorithm-explanation", "html": "<html><body>\n <h1 cc-select=\"true\">算法复杂度分析</h1>\n <p cc-select=\"true\">这里介绍常见算法的时间复杂度。</p>\n <h2 cc-select=\"true\">快速排序实现</h2>\n <pre cc-select=\"true\"><code>def quicksort(arr):\n if len(arr) <= 1:\n return arr\n \n pivot = arr[len(arr) // 2]\n left = [x for x in arr if x < pivot]\n middle = [x for x in arr if x == pivot]\n right = [x for x in arr if x > pivot]\n \n return quicksort(left) + middle + quicksort(right)</code></pre>\n <h2 cc-select=\"true\">复杂度对比</h2>\n <table cc-select=\"true\">\n <tr><th>算法</th><th>最好情况</th><th>平均情况</th><th>最坏情况</th></tr>\n <tr><td>快速排序</td><td>O(n log n)</td><td>O(n log n)</td><td>O(n²)</td></tr>\n <tr><td>归并排序</td><td>O(n log n)</td><td>O(n log n)</td><td>O(n log n)</td></tr>\n <tr><td>冒泡排序</td><td>O(n)</td><td>O(n²)</td><td>O(n²)</td></tr>\n </table>\n <p cc-select=\"true\">Master定理:T(n) = aT(n/b) + f(n)</p>\n <p cc-select=\"true\">其中 a ≥ 1, b > 1 是常数,f(n) 是正函数。</p>\n </body></html>", "groundtruth_llm_webkit_md": "# 算法复杂度分析\n\n这里介绍常见算法的时间复杂度。\n\n## 快速排序实现\n\n```python\ndef quicksort(arr):\n if len(arr) <= 1:\n return arr\n \n pivot = arr[len(arr) // 2]\n left = [x for x in arr if x < pivot]\n middle = [x for x in arr if x == pivot]\n right = [x for x in arr if x > pivot]\n \n return quicksort(left) + middle + quicksort(right)\n```\n\n## 复杂度对比\n\n| 算法 | 最好情况 | 平均情况 | 最坏情况 |\n|------|----------|----------|----------|\n| 快速排序 | O(n log n) | O(n log n) | O(n²) |\n| 归并排序 | O(n log n) | O(n log n) | O(n log n) |\n| 冒泡排序 | O(n) | O(n²) | O(n²) |\n\nMaster定理:$T(n) = aT(n/b) + f(n)$\n\n其中 $a \\geq 1, b > 1$ 是常数,$f(n)$ 是正函数。", "groundtruth_content_list": [{"type": "heading", "content": "算法复杂度分析", "level": 1}, {"type": "paragraph", "content": "这里介绍常见算法的时间复杂度。"}, {"type": "heading", "content": "快速排序实现", "level": 2}, {"type": "code", "content": "def quicksort(arr):\n if len(arr) <= 1:\n return arr\n \n pivot = arr[len(arr) // 2]\n left = [x for x in arr if x < pivot]\n middle = [x for x 
in arr if x == pivot]\n right = [x for x in arr if x > pivot]\n \n return quicksort(left) + middle + quicksort(right)"}, {"type": "heading", "content": "复杂度对比", "level": 2}, {"type": "table", "content": "| 算法 | 最好情况 | 平均情况 | 最坏情况 |\n|------|----------|----------|----------|\n| 快速排序 | O(n log n) | O(n log n) | O(n²) |\n| 归并排序 | O(n log n) | O(n log n) | O(n log n) |\n| 冒泡排序 | O(n) | O(n²) | O(n²) |"}, {"type": "equation-inline", "content": "T(n) = aT(n/b) + f(n)"}, {"type": "paragraph", "content": "其中 a ≥ 1, b > 1 是常数,f(n) 是正函数。"}], "llm_webkit_md": "# 算法复杂度分析\n\n这里介绍常见算法的时间复杂度。\n\n## 快速排序实现\n\n```python\ndef quicksort(arr):\n if len(arr) <= 1:\n return arr\n \n pivot = arr[len(arr) // 2]\n left = [x for x in arr if x < pivot]\n middle = [x for x in arr if x == pivot]\n right = [x for x in arr if x > pivot]\n \n return quicksort(left) + middle + quicksort(right)\n```\n\n## 复杂度对比\n\n| 算法 | 最好情况 | 平均情况 | 最坏情况 |\n|------|----------|----------|----------|\n| 快速排序 | O(n log n) | O(n log n) | O(n²) |\n| 归并排序 | O(n log n) | O(n log n) | O(n log n) |\n| 冒泡排序 | O(n) | O(n²) | O(n²) |\n\nMaster定理:$T(n) = aT(n/b) + f(n)$\n\n其中 $a \\geq 1, b > 1$ 是常数,$f(n)$ 是正函数。", "content_list": [{"type": "heading", "content": "算法复杂度分析", "level": 1}, {"type": "paragraph", "content": "这里介绍常见算法的时间复杂度。"}, {"type": "heading", "content": "快速排序实现", "level": 2}, {"type": "code", "content": "def quicksort(arr):\n if len(arr) <= 1:\n return arr\n \n pivot = arr[len(arr) // 2]\n left = [x for x in arr if x < pivot]\n middle = [x for x in arr if x == pivot]\n right = [x for x in arr if x > pivot]\n \n return quicksort(left) + middle + quicksort(right)"}, {"type": "heading", "content": "复杂度对比", "level": 2}, {"type": "table", "content": "| 算法 | 最好情况 | 平均情况 | 最坏情况 |\n|------|----------|----------|----------|\n| 快速排序 | O(n log n) | O(n log n) | O(n²) |\n| 归并排序 | O(n log n) | O(n log n) | O(n log n) |\n| 冒泡排序 | O(n) | O(n²) | O(n²) |"}, {"type": "equation-inline", "content": "T(n) = aT(n/b) + f(n)"}, 
{"type": "paragraph", "content": "其中 a ≥ 1, b > 1 是常数,f(n) 是正函数。"}], "url": "https://algorithm-guide.cs.edu/complexity-analysis", "domain": null, "language": "zh", "content_type": "computer_science", "difficulty": null, "tags": null}
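Each line of the data file above is one standalone JSON object. A minimal sketch of reading such a file record by record follows. It assumes nothing about `DataLoader.load_jsonl`, and the two records and the `read_jsonl` helper are invented for illustration, reusing only field names that appear in the file.

```python
import io
import json
from types import SimpleNamespace

# Two made-up records that reuse field names from data/test_model.jsonl.
jsonl_text = (
    '{"id": "demo-1", "html": "<h1>A</h1>", "llm_webkit_md": "# A", "language": "en"}\n'
    '{"id": "demo-2", "html": "<p>B</p>", "llm_webkit_md": "B", "language": "zh"}\n'
)


def read_jsonl(stream):
    """Yield one attribute-style record per non-empty line (hypothetical helper)."""
    for line in stream:
        line = line.strip()
        if line:
            yield SimpleNamespace(**json.loads(line))


samples = list(read_jsonl(io.StringIO(jsonl_text)))
print([s.id for s in samples])
```

Wrapping each dict in `SimpleNamespace` mirrors what the unit test in this commit does to give a sample attribute access.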

examples/test_model.py

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+from webmainbench import DataLoader, Evaluator, ExtractorFactory
+
+# 1. Load the evaluation dataset
+dataset = DataLoader.load_jsonl("WebMainBench/data/WebMainBench_llm-webkit_v1_WebMainBench_dataset_merge_2549_llm_webkit.jsonl")
+
+# 2. Create the extractor
+extractor = ExtractorFactory.create("test-model")
+
+# 3. Run the evaluation
+evaluator = Evaluator()
+result = evaluator.evaluate(dataset, extractor)
+
+# 4. Inspect the results
+print(f"Overall Score: {result.overall_metrics}")
+print(f"Category Metrics: {result.category_metrics}")
+print(f"Error Analysis: {result.error_analysis}")

tests/test_test_model_extractor.py

Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
+import unittest
+from webmainbench.extractors.test_model_extractor import TestModelExtractor
+
+class TestTestModelExtractor(unittest.TestCase):
+    """Test the basic functionality of TestModelExtractor."""
+
+    def setUp(self):
+        """Initialize the extractor instance used by the tests."""
+        self.extractor = TestModelExtractor("test-model")
+
+        # Use test_model.jsonl from the data directory as test data
+        import json
+        from pathlib import Path
+
+        # Read the first sample as the test case
+        data_path = Path(__file__).parent.parent / "data" / "test_model.jsonl"
+        with open(data_path, "r", encoding="utf-8") as f:
+            first_line = f.readline()
+            sample_dict = json.loads(first_line)
+
+        # TestModelExtractor expects the sample to support attribute access, so wrap it in SimpleNamespace
+        from types import SimpleNamespace
+        self.sample_data = SimpleNamespace(**sample_dict)
+
+    def test_extract_from_sample(self):
+        """Test the extract_from_sample method."""
+        result = self.extractor.extract_from_sample(self.sample_data)
+        self.assertTrue(result.success)
+        self.assertEqual(result.content, self.sample_data.llm_webkit_md)
+        self.assertEqual(result.content_list, self.sample_data.content_list)
+        self.assertEqual(result.language, self.sample_data.language)
+        self.assertEqual(result.confidence_score, 1.0)
+
+    def test_extract_with_empty_html(self):
+        """Test the extract method when given empty HTML."""
+        result = self.extractor.extract("")
+        self.assertFalse(result.success)
+        self.assertIn("Empty HTML input", result.error_message)
+
+if __name__ == "__main__":
+    unittest.main()
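The extractor module itself is not shown in this view, so the following is only a sketch of behavior consistent with the assertions in the test above. `SketchResult` and `SketchTestModelExtractor` are hypothetical stand-ins, not the project's classes, and the handling of non-empty HTML in `extract` is an assumption.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any, Dict, List, Optional


@dataclass
class SketchResult:
    """Hypothetical stand-in for the benchmark's extraction result type."""
    success: bool
    content: str = ""
    content_list: Optional[List[Dict[str, Any]]] = None
    language: Optional[str] = None
    confidence_score: float = 0.0
    error_message: Optional[str] = None


class SketchTestModelExtractor:
    """Returns output that is already stored on the sample instead of running a model."""

    def extract(self, html: str, url: Optional[str] = None) -> SketchResult:
        if not html:
            return SketchResult(success=False, error_message="Empty HTML input")
        # Assumption: with no sample attached there is nothing precomputed to return.
        return SketchResult(success=False, error_message="No precomputed output for raw HTML")

    def extract_from_sample(self, sample: Any) -> SketchResult:
        # Copy the precomputed fields straight from the sample.
        return SketchResult(
            success=True,
            content=sample.llm_webkit_md,
            content_list=sample.content_list,
            language=sample.language,
            confidence_score=1.0,
        )


sample = SimpleNamespace(
    llm_webkit_md="# Title",
    content_list=[{"type": "heading", "content": "Title", "level": 1}],
    language="zh",
)
extractor = SketchTestModelExtractor()
ok = extractor.extract_from_sample(sample)
bad = extractor.extract("")
```

Because the result is copied from the sample, evaluating such an extractor scores the stored `llm_webkit_md` and `content_list` against the ground truth without rerunning any model.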

webmainbench/data/dataset.py

Lines changed: 6 additions & 1 deletion
@@ -18,14 +18,16 @@ class DataSample:
     html: str  # HTML with cc-select=true annotations
     groundtruth_content: str  # Groundtruth markdown content
     groundtruth_content_list: List[Dict[str, Any]]  # Groundtruth content_list from llm-webkit
-
+    content_list: List[Dict[str, Any]] = None  # Content_list from llm-webkit
+    content: str = None  # Content from llm-webkit
     # Optional metadata
     url: Optional[str] = None
     domain: Optional[str] = None
     language: Optional[str] = None
     content_type: Optional[str] = None  # article, forum, blog, etc.
     difficulty: Optional[str] = None  # easy, medium, hard
     tags: Optional[List[str]] = None
+    llm_webkit_md: Optional[str] = None
 
     # Extracted results (populated during evaluation)
     extracted_results: Optional[Dict[str, Any]] = None
@@ -37,6 +39,9 @@ def to_dict(self) -> Dict[str, Any]:
             "html": self.html,
             "groundtruth_content": self.groundtruth_content,
             "groundtruth_content_list": self.groundtruth_content_list,
+            "content_list": self.content_list,
+            "content": self.content,
+            "llm_webkit_md": self.llm_webkit_md,
             "url": self.url,
             "domain": self.domain,
             "language": self.language,

webmainbench/evaluator/evaluator.py

Lines changed: 5 additions & 2 deletions
@@ -307,8 +307,11 @@ def _process_batch(self, batch_samples: List[DataSample], extractor: BaseExtract
 
     def _evaluate_sample(self, sample: DataSample, extractor: BaseExtractor) -> Dict[str, Any]:
         """Evaluate a single sample."""
-        # Extract content
-        extraction_result = extractor.extract(sample.html, sample.url)
+        if extractor.__class__.__name__ == 'TestModelExtractor':
+            extraction_result = extractor.extract_from_sample(sample)
+        else:
+            # Extract content
+            extraction_result = extractor.extract(sample.html, sample.url)
 
         # Prepare result
         sample_result = {
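The change above picks the extraction path by comparing the extractor's class name. One possible alternative, shown purely as a sketch and not as what the commit does, is to check whether the extractor offers a sample-level method. `run_extraction`, `HtmlOnly`, and `SampleAware` are hypothetical names introduced for illustration.

```python
from types import SimpleNamespace
from typing import Any


def run_extraction(sample: Any, extractor: Any) -> Any:
    """Prefer a sample-level hook when the extractor provides one."""
    from_sample = getattr(extractor, "extract_from_sample", None)
    if callable(from_sample):
        return from_sample(sample)
    # Fall back to the ordinary HTML-based path.
    return extractor.extract(sample.html, sample.url)


class HtmlOnly:
    """Toy extractor that only understands raw HTML."""

    def extract(self, html, url=None):
        return f"html:{html}"


class SampleAware(HtmlOnly):
    """Toy extractor that can also read precomputed output from a sample."""

    def extract_from_sample(self, sample):
        return f"sample:{sample.llm_webkit_md}"


sample = SimpleNamespace(html="<p>x</p>", url=None, llm_webkit_md="x")
html_path = run_extraction(sample, HtmlOnly())
sample_path = run_extraction(sample, SampleAware())
```

A capability check of this kind would keep working if the class were renamed or subclassed, at the cost of treating any extractor that defines `extract_from_sample` the same way.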

webmainbench/extractors/__init__.py

Lines changed: 2 additions & 1 deletion
@@ -8,7 +8,7 @@
 from .factory import ExtractorFactory
 from .llm_webkit_extractor import LlmWebkitExtractor
 from .jina_extractor import JinaExtractor
-
+from .test_model_extractor import TestModelExtractor
 
 
 __all__ = [
@@ -17,4 +17,5 @@
     "ExtractorFactory",
     "LlmWebkitExtractor",
     "JinaExtractor",
+    "TestModelExtractor",
 ]
