maxframe.dataframe.read_lance

maxframe.dataframe.read_lance(path, version: int = None, asof: str = None, columns: list = None, filters: str = None, index_col=None, dtype_backend: str = <no_default>, default_index_type: DefaultIndexType | str = None, storage_options: dict = None, *, dtypes: Series = None, index_dtypes: Series = None, memory_scale: int = None, merge_small_files: bool = True, merge_small_file_options: dict = None, session=None, run_kwargs: dict = None, **kwargs)

Load a Lance dataset from the file path, returning a DataFrame.

Parameters:

path (str) – Any valid string path is acceptable. The string could be a URL. For Aliyun OSS URLs, the format is: oss://<endpoint>/<bucket>/<path>. Example: oss://oss-cn-beijing.aliyuncs.com/my-bucket/dataset. For S3 URLs, the format is: s3://<bucket>/<path>.
version (int, optional) – Specific version of the dataset to read. If not specified, reads the latest version.
asof (str, optional) – Timestamp for point-in-time read. Format: ISO 8601 datetime string. Cannot be used together with version.
columns (list, optional) – If not None, only these columns will be read from the dataset.
filters (str or list, optional) – Filter expression for predicate pushdown. Supports:

- SQL-like filter strings accepted by Lance
- CNF filters like [[('age', '>', 18), ('city', '==', 'Beijing')]], used by MaxFrame predicate pushdown

The alias filter=... is also accepted for compatibility.
index_col (int, str, sequence of int/str, or False, default None) – Column(s) to use as the row labels of the DataFrame, either given as string name or column index. If a sequence of int/str is given, a MultiIndex is used. If False, no column is used as index (ignoring any pandas metadata in the dataset). If None, uses pandas metadata if available, otherwise falls back to default_index_type.
default_index_type ({None, 'range', 'incremental'}, default None) – Type of index to generate when index_col is not specified. If not given, options.dataframe.default_index_type is used.
dtype_backend ({'numpy', 'pyarrow'}, default 'numpy') – Back-end data type applied to the resultant DataFrame.
storage_options (dict, optional) – Options for storage connection. For Aliyun OSS with RAM role: {'role_arn': 'acs:ram::xxx:role/name'}
memory_scale (int, optional) – Ratio of actual in-memory size to raw file size.
merge_small_files (bool, default True) – Merge small Lance fragments into larger chunks for better parallel processing efficiency.
merge_small_file_options (dict, optional) – Options for merging small files.
**kwargs – Any additional kwargs are passed to lance.
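
The version and asof parameters cannot be combined, and asof expects an ISO 8601 datetime string. The following is a minimal sketch of preparing these arguments before the call. The helper snapshot_kwargs is hypothetical and not part of MaxFrame, and the source does not say which ISO 8601 variants (for example, with or without a UTC offset) are accepted, so the naive timestamp here is an assumption.

```python
from datetime import datetime
from typing import Optional


def snapshot_kwargs(version: Optional[int] = None,
                    asof: Optional[datetime] = None) -> dict:
    """Hypothetical helper: build the version/asof keyword arguments."""
    if version is not None and asof is not None:
        # The documentation states that asof cannot be used with version.
        raise ValueError("version and asof cannot be used together")
    if version is not None:
        return {"version": version}
    if asof is not None:
        # datetime.isoformat() yields an ISO 8601 string.
        return {"asof": asof.isoformat()}
    # Neither given: read_lance reads the latest version.
    return {}


kwargs = snapshot_kwargs(asof=datetime(2024, 1, 1))
print(kwargs)
```

The resulting dictionary can be unpacked into the call, for example md.read_lance(path, **kwargs).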

Return type:

MaxFrame DataFrame

Examples

>>> import maxframe.dataframe as md
>>> # Read from Aliyun OSS with RAM role
>>> df = md.read_lance(
...     "oss://oss-cn-beijing.aliyuncs.com/my-bucket/dataset",
...     storage_options={"role_arn": "acs:ram::1234567890:role/maxframe-oss"}
... )
>>> # Read specific version
>>> df = md.read_lance(
...     "oss://oss-cn-beijing.aliyuncs.com/my-bucket/dataset",
...     version=1,
...     storage_options={"role_arn": "acs:ram::1234567890:role/maxframe-oss"}
... )
>>> # Read with filter
>>> df = md.read_lance(
...     "oss://oss-cn-beijing.aliyuncs.com/my-bucket/dataset",
...     filters="`age` > 18",
...     storage_options={"role_arn": "acs:ram::1234567890:role/maxframe-oss"}
... )
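
The URL formats described for the path parameter can be assembled from their parts. This is a small sketch assuming two hypothetical helpers, oss_url and s3_url, which are not part of MaxFrame. They only format strings following the oss://<endpoint>/<bucket>/<path> and s3://<bucket>/<path> patterns.

```python
def oss_url(endpoint: str, bucket: str, path: str) -> str:
    """Hypothetical helper: format oss://<endpoint>/<bucket>/<path>."""
    return f"oss://{endpoint}/{bucket}/{path.lstrip('/')}"


def s3_url(bucket: str, path: str) -> str:
    """Hypothetical helper: format s3://<bucket>/<path>."""
    return f"s3://{bucket}/{path.lstrip('/')}"


url = oss_url("oss-cn-beijing.aliyuncs.com", "my-bucket", "dataset")
print(url)

# The string can then be passed as the path argument, for example:
#   import maxframe.dataframe as md
#   df = md.read_lance(url, columns=["age", "city"])
# The column names above are placeholders, not taken from a real dataset.
```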