maxframe.dataframe.read_lance
- maxframe.dataframe.read_lance(path, version: int = None, asof: str = None, columns: list = None, filters: str = None, index_col=None, dtype_backend: str = <no_default>, default_index_type: DefaultIndexType | str = None, storage_options: dict = None, *, dtypes: Series = None, index_dtypes: Series = None, memory_scale: int = None, merge_small_files: bool = True, merge_small_file_options: dict = None, session=None, run_kwargs: dict = None, **kwargs)
Load a Lance dataset from the file path, returning a DataFrame.
- Parameters:
path (str) – Any valid string path is acceptable. The string could be a URL. For Aliyun OSS URLs, the format is oss://<endpoint>/<bucket>/<path>. Example: oss://oss-cn-beijing.aliyuncs.com/my-bucket/dataset. For S3 URLs, the format is s3://<bucket>/<path>.
version (int, optional) – Specific version of the dataset to read. If not specified, the latest version is read.
asof (str, optional) – Timestamp for point-in-time read. Format: ISO 8601 datetime string. Cannot be used together with version.
columns (list, optional) – If not None, only these columns will be read from the dataset.
filters (str or list, optional) – Filter expression for predicate pushdown. Supported forms:
  - SQL-like filter strings accepted by Lance
  - CNF filters like [[('age', '>', 18), ('city', '==', 'Beijing')]], as used by MaxFrame predicate pushdown
  The alias filter=... is also accepted for compatibility.
index_col (int, str, sequence of int/str, or False, default None) – Column(s) to use as the row labels of the DataFrame, given either as a string name or as a column index. If a sequence of int/str is given, a MultiIndex is used. If False, no column is used as the index (ignoring any pandas metadata in the dataset). If None, pandas metadata is used if available; otherwise the index falls back to default_index_type.
default_index_type ({None, 'range', 'incremental'}, default None) – Type of index to generate when index_col is not specified. If not given, options.dataframe.default_index_type is used.
dtype_backend ({'numpy', 'pyarrow'}, default 'numpy') – Back-end data type applied to the resultant DataFrame.
storage_options (dict, optional) – Options for the storage connection. For Aliyun OSS with a RAM role: {'role_arn': 'acs:ram::xxx:role/name'}
memory_scale (int, optional) – Ratio of actual memory usage to raw file size.
merge_small_files (bool, default True) – Merge small Lance fragments into larger chunks for better parallel processing efficiency.
merge_small_file_options (dict, optional) – Options for merging small files.
**kwargs – Any additional kwargs are passed to lance.
- Return type:
MaxFrame DataFrame
Examples
>>> import maxframe.dataframe as md
>>> # Read from Aliyun OSS with RAM role
>>> df = md.read_lance(
...     "oss://oss-cn-beijing.aliyuncs.com/my-bucket/dataset",
...     storage_options={"role_arn": "acs:ram::1234567890:role/maxframe-oss"}
... )
>>> # Read specific version
>>> df = md.read_lance(
...     "oss://oss-cn-beijing.aliyuncs.com/my-bucket/dataset",
...     version=1,
...     storage_options={"role_arn": "acs:ram::1234567890:role/maxframe-oss"}
... )
>>> # Read with filter
>>> df = md.read_lance(
...     "oss://oss-cn-beijing.aliyuncs.com/my-bucket/dataset",
...     filters="`age` > 18",
...     storage_options={"role_arn": "acs:ram::1234567890:role/maxframe-oss"}
... )
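The parameter rules above (the OSS URL layout, the ISO 8601 asof string, the CNF filter form, and the restriction that version and asof cannot be combined) can be sketched as a small argument-preparation step. This is only an illustration: the helpers oss_url and read_lance_kwargs are hypothetical and not part of MaxFrame, the timestamp is an arbitrary example value, and the check here simply mirrors the documented restriction rather than reproducing how MaxFrame itself validates arguments.

```python
from datetime import datetime


def oss_url(endpoint: str, bucket: str, path: str) -> str:
    """Hypothetical helper: build oss://<endpoint>/<bucket>/<path>."""
    return f"oss://{endpoint}/{bucket}/{path.lstrip('/')}"


def read_lance_kwargs(version=None, asof=None, columns=None, filters=None):
    """Hypothetical helper: collect keyword arguments for read_lance.

    Rejects version together with asof, as the parameter docs require,
    and drops arguments that were left as None.
    """
    if version is not None and asof is not None:
        raise ValueError("version and asof cannot be used together")
    candidates = {
        "version": version,
        "asof": asof,
        "columns": columns,
        "filters": filters,
    }
    return {name: value for name, value in candidates.items() if value is not None}


url = oss_url("oss-cn-beijing.aliyuncs.com", "my-bucket", "dataset")

# Point-in-time read: asof is an ISO 8601 datetime string (example value).
asof = datetime(2024, 1, 1).isoformat()

# CNF filter in the list form shown in the filters parameter description.
kwargs = read_lance_kwargs(
    asof=asof,
    columns=["age", "city"],
    filters=[[("age", ">", 18), ("city", "==", "Beijing")]],
)

# With MaxFrame installed and a session available, the call would then be:
#   import maxframe.dataframe as md
#   df = md.read_lance(
#       url,
#       storage_options={"role_arn": "acs:ram::1234567890:role/maxframe-oss"},
#       **kwargs,
#   )
```

Keeping the argument assembly separate from the read call makes the version/asof conflict fail early on the client side, before any job is submitted.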