maxframe.learn.contrib.llm.text.extract

maxframe.learn.contrib.llm.text.extract(series, model: TextGenLLM, schema: Any, description: str | None = None, examples: List[Tuple[str, str]] | None = None, index=None)

Extract structured information from the text content of a Series using a language model.

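Parameters:

series (Series) -- A maxframe Series containing the text data to extract information from. Each element should be a text string.
model (TextGenLLM) -- The language model instance used for information extraction.
schema (Any) -- The schema definition for the extraction. Either a dictionary that defines the structure, or a Pydantic BaseModel class, which is converted to a JSON schema.
description (str, optional) -- A description of the extraction task that helps the model understand what to extract.
examples (List[Tuple[str, str]], optional) -- Examples of the extraction task, in the format [(input_text, expected_output), ...], that help the LLM better understand the extraction requirements.
index (array-like, optional) -- The index of the output. Defaults to None, in which case a new index is generated.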
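
The examples argument pairs each input text with its expected output as a string. The following is a minimal sketch of one way to build such pairs, assuming the expected output is meant to be JSON text whose keys match the schema (as in the example on this page). make_example is a hypothetical helper written for illustration; it is not part of maxframe.

```python
import json
from typing import Any, Dict, List, Tuple


def make_example(text: str, fields: Dict[str, Any]) -> Tuple[str, str]:
    """Hypothetical helper: pair an input text with its expected output.

    The expected output is serialized with json.dumps so that it is
    always valid JSON, rather than a hand-written string.
    """
    return text, json.dumps(fields)


# One (input_text, expected_output) pair in the documented format.
examples: List[Tuple[str, str]] = [
    make_example(
        "Bob Brown, 35, Manager at Apple",
        {"name": "Bob Brown", "age": 35, "job_title": "Manager", "company": "Apple"},
    )
]
```

Serializing from a dictionary avoids quoting mistakes that are easy to make when the expected output is typed by hand.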

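Returns:

A DataFrame containing the extracted information and a success status. The columns are 'output' (the extracted structured data) and 'success' (a boolean status). If 'success' is False, the 'output' column contains an error message instead of the expected output.

Return type: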

DataFrame
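
Because a failed row carries an error message in 'output', callers will usually want to separate rows by 'success' before using the data. The sketch below shows that split on plain Python dictionaries that stand in for rows of the executed result. The sample rows are made up for illustration, split_results is a hypothetical helper, and the code assumes that successful outputs are JSON strings, which this page does not confirm.

```python
import json
from typing import Any, Dict, List, Tuple


def split_results(
    rows: List[Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], List[str]]:
    """Hypothetical helper: separate parsed outputs from error messages."""
    parsed: List[Dict[str, Any]] = []
    errors: List[str] = []
    for row in rows:
        if row["success"]:
            # Assumption: a successful 'output' is a JSON string.
            parsed.append(json.loads(row["output"]))
        else:
            # On failure, 'output' holds the error message.
            errors.append(row["output"])
    return parsed, errors


# Made-up rows that mimic the documented 'output' and 'success' columns.
rows = [
    {"output": '{"name": "John Smith", "age": 30}', "success": True},
    {"output": "model returned invalid JSON", "success": False},
]
parsed, errors = split_results(rows)
```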

Examples

>>> from maxframe.learn.contrib.llm.models.managed import ManagedTextGenLLM
>>> from maxframe.learn.contrib.llm.text import extract
>>> import maxframe.dataframe as md
>>>
>>> # Initialize the model
>>> llm = ManagedTextGenLLM(name="Qwen3-0.6B")
>>>
>>> # Create sample data
>>> texts = md.Series([
...     "John Smith, age 30, works as a Software Engineer at Google.",
...     "Alice Johnson, 25 years old, is a Data Scientist at Microsoft."
... ])
>>>
>>> # Define extraction schema
>>> schema = {
...     "name": "string",
...     "age": "integer",
...     "job_title": "string",
...     "company": "string"
... }
>>>
>>> # Extract structured information
>>> description = "Extract person information from text"
>>> examples = [
...     ("Bob Brown, 35, Manager at Apple", '{"name": "Bob Brown", "age": 35, "job_title": "Manager", "company": "Apple"}')
... ]
>>> result = extract(texts, llm, schema=schema, description=description, examples=examples)
>>> result.execute()

Notes

Preview: This API is in preview and may be unstable. The interface may change in future releases.