maxframe.dataframe.read_parquet

maxframe.dataframe.read_parquet(path, engine: str = 'auto', columns: list = None, groups_as_chunks: bool = False, filters: list = None, dtype_backend: str = <no_default>, default_index_type: DefaultIndexType | str = None, storage_options: dict = None, use_nullable_dtypes: bool = <no_default>, *, dtypes: Series = None, index_dtypes: Series = None, memory_scale: int = None, merge_small_files: bool = True, merge_small_file_options: dict = None, gpu: bool = None, session=None, run_kwargs: dict = None, **kwargs)

Load a parquet object from a file path and return a DataFrame.

Parameters:

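path (str, path object or file-like object) -- Any valid string path is acceptable. The string can be a URL. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.parquet. A file URL can also be a path to a directory that contains multiple partitioned parquet files. Both pyarrow and fastparquet support directory paths as well as file URLs. A directory path could be: file://localhost/path/to/tables. A file-like object is any object with a read() method, such as a file handle (for example, one returned by the builtin open function) or StringIO.
engine ({'auto', 'pyarrow'}, default 'auto') -- Parquet library to use. The default behavior is to try 'pyarrow'.
storage_options (dict, optional) -- Options for the storage connection.
columns (list, default=None) -- If not None, only these columns are read from the file.
groups_as_chunks (bool, default False) -- If True, each row group corresponds to one chunk; if False, each file corresponds to one chunk. Only supported by the 'pyarrow' engine.
filters (list, default=None) -- Used to filter data. Filter syntax: [[(column, op, val), ...], ...], where op is one of [==, =, >, >=, <, <=, !=, in, not in]. The innermost tuples form a set of filters that are combined with AND. The outer list combines these sets of filters with OR. A single list of tuples can also be used, meaning that no OR operation is applied between sets of filters.
default_index_type ({None, 'range', 'incremental'}, default None) -- The type of index to generate when index_col is not specified. If not given, options.dataframe.default_index_type is used.
dtype_backend ({'numpy', 'pyarrow'}, default 'numpy') -- Back-end data type applied to the resulting DataFrame (still experimental).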
memory_scale (int, optional) -- Ratio of actual memory usage to the raw file size.
merge_small_files (bool, default True) -- Whether to merge small files.
**kwargs -- Any additional kwargs are passed to the engine.
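
The AND/OR semantics of the filters parameter can be easier to follow with a small worked example. The sketch below is plain Python and does not call MaxFrame or any Parquet engine; it only mimics the boolean logic described above (inner tuples combined with AND, outer groups combined with OR). The helper names row_matches and _normalize and the sample rows are hypothetical, introduced purely for illustration, and the sketch says nothing about how an engine actually prunes files or row groups.

```python
import operator

# Map each documented operator string to a Python predicate.
_OPS = {
    "==": operator.eq,
    "=": operator.eq,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "!=": operator.ne,
    "in": lambda value, choices: value in choices,
    "not in": lambda value, choices: value not in choices,
}


def _normalize(filters):
    """Wrap a flat list of tuples into a single AND group."""
    if filters and isinstance(filters[0], tuple):
        return [filters]
    return filters


def row_matches(row, filters):
    """Return True if the row satisfies any AND group in filters."""
    if not filters:
        return True  # no filters: keep every row
    return any(
        all(_OPS[op](row[column], value) for column, op, value in group)
        for group in _normalize(filters)
    )


# Hypothetical sample rows, standing in for parquet records.
rows = [
    {"year": 2023, "country": "CN"},
    {"year": 2024, "country": "US"},
    {"year": 2022, "country": "DE"},
]

# (year >= 2023 AND country in {CN, JP}) OR (country == DE)
nested = [
    [("year", ">=", 2023), ("country", "in", {"CN", "JP"})],
    [("country", "==", "DE")],
]

# A single list of tuples: year >= 2023 AND country != US
flat = [("year", ">=", 2023), ("country", "!=", "US")]

nested_hits = [r["country"] for r in rows if row_matches(r, nested)]
flat_hits = [r["country"] for r in rows if row_matches(r, flat)]
```

With these sample rows, the nested form keeps the CN and DE records, while the flat form keeps only the CN record.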

Return type:

MaxFrame DataFrame
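
As a usage illustration, the following minimal sketch combines several of the parameters documented above in one call. It assumes MaxFrame is installed and that a session is available when the result is eventually executed; neither is shown here. The directory path is the example path from the parameter description, while the column names, the filter, and the function name load_example are hypothetical. The import is placed inside the function so that the snippet can be loaded even where MaxFrame is not installed.

```python
# Example directory path taken from the parameter description above.
EXAMPLE_PATH = "file://localhost/path/to/tables"

# Hypothetical column names and filter, used only for illustration.
EXAMPLE_COLUMNS = ["year", "country", "amount"]
EXAMPLE_FILTERS = [("year", ">=", 2023)]


def load_example(path=EXAMPLE_PATH):
    """Call read_parquet with a few of the documented parameters."""
    # Imported lazily; this assumes the maxframe package is installed.
    import maxframe.dataframe as md

    return md.read_parquet(
        path,
        engine="pyarrow",
        columns=EXAMPLE_COLUMNS,     # read only these columns
        filters=EXAMPLE_FILTERS,     # single AND group: year >= 2023
        groups_as_chunks=True,       # one chunk per row group
        default_index_type="range",  # one of the documented index types
    )
```

Setting groups_as_chunks=True is a choice made for this sketch only; per the parameter description, leaving it at the default False yields one chunk per file instead.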