maxframe.dataframe.read_odps_table

maxframe.dataframe.read_odps_table(table_name: str | Table, partitions: None | str | List[str] = None, columns: List[str] | None = None, index_col: None | str | List[str] = None, *, odps_entry: ODPS = None, string_as_binary: bool = None, append_partitions: bool = False, dtype_backend: str = None, default_index_type: DefaultIndexType = None, filters: str | List[List[Tuple]] = None, **kw)

Read data from a MaxCompute (ODPS) table into a DataFrame.

One or more columns can be used as the index. If index_col is not specified, a RangeIndex is generated.

Parameters:

table_name (Union[str, Table]) – Name of the table to read from.
partitions (Union[None, str, List[str]]) – Table partition or list of partitions to read from.
columns (Optional[List[str]]) – Table columns to read. Partition columns may also be listed here. If not specified, all table columns are included, plus the partition columns if append_partitions is True.
index_col (Union[None, str, List[str]]) – Column or columns to use as the index.
append_partitions (bool) – If True, adds all partition columns to the selected columns when columns is not specified.
dtype_backend ({'numpy', 'pyarrow'}, default 'numpy') – Back-end data type applied to the resulting DataFrame (still experimental).
filters (Union[str, List[List[Tuple]]], default None) –
Filter expression to apply when reading data.
- String format: a SQL WHERE clause passed directly to StorageAPI.
- List format: a nested list of tuples. Conditions within an inner list are ANDed, and the inner lists are ORed together. Example: [[('col1', '==', 'value'), ('col2', '>', 10)]]

Supported operators: ==, !=, <, >, <=, >=, in, not in.

Note

Owing to the implementation, complete filtering is not guaranteed for this argument.

Returns:

result – DataFrame read from the MaxCompute (ODPS) table.

Return type:

DataFrame

Examples

Before using read_odps_table, create an ODPS entry. Its parameters are stored globally in the current process.

>>> import maxframe.dataframe as md
>>> from odps import ODPS
>>>
>>> o = ODPS(...)  # Fill account information here

Simply read a table by name.

>>> df = md.read_odps_table("simple_table")

Read table by partition (or partitions).

>>> # Read a single partition
>>> df = md.read_odps_table("partitioned_table", partitions="pt=20230101")
>>> # Read with multiple partitions
>>> df = md.read_odps_table("partitioned_table", partitions=["pt=20230101", "pt=20230102"])
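When many partitions are needed, the list passed to partitions can be built programmatically rather than typed by hand. The following is a minimal sketch in plain Python. The helper name daily_partition_specs is hypothetical, and it assumes a single partition column named pt with YYYYMMDD values, as in the examples above; it is not part of MaxFrame.

```python
from datetime import date, timedelta
from typing import List


def daily_partition_specs(start: date, end: date, column: str = "pt") -> List[str]:
    """Build one 'column=YYYYMMDD' spec per day from start to end, inclusive.

    Hypothetical helper: assumes a single date-valued partition column.
    """
    days = (end - start).days
    return [
        f"{column}={(start + timedelta(days=offset)):%Y%m%d}"
        for offset in range(days + 1)
    ]


specs = daily_partition_specs(date(2023, 1, 1), date(2023, 1, 3))
print(specs)  # ['pt=20230101', 'pt=20230102', 'pt=20230103']
```

The resulting list has the same shape as the multi-partition example above, so it could be passed as partitions=specs.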

Read with column selection.

>>> df = md.read_odps_table("table_name", columns=["col1", "col2", "col3"])

Read with columns as index.

>>> # Read with index columns
>>> df = md.read_odps_table("table_name", index_col="id")
>>> # Read with multiple index columns
>>> df = md.read_odps_table("table_name", index_col=["id", "timestamp"])

Read with a filter condition. Note that complete filtering is not guaranteed.

>>> # Read table with string filter
>>> df = md.read_odps_table("source_table", filters="age > 18")
>>> # Read with list filter
>>> df = md.read_odps_table(
...     "table_name",
...     filters=[[('age', '>', 18), ('city', '==', 'Beijing')]]
... )
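The AND/OR semantics of the list format described under filters can be illustrated without a MaxCompute connection. The sketch below evaluates a nested filter list against plain Python dicts. The function row_matches and the operator table are illustrative assumptions only; this is not how MaxFrame or StorageAPI evaluates filters, and it says nothing about which rows a real read returns.

```python
import operator
from typing import Any, Dict, List, Tuple

# Plain-Python stand-ins for the operators listed under ``filters``.
_OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    ">": operator.gt,
    "<=": operator.le,
    ">=": operator.ge,
    "in": lambda value, options: value in options,
    "not in": lambda value, options: value not in options,
}


def row_matches(row: Dict[str, Any], filters: List[List[Tuple[str, str, Any]]]) -> bool:
    """Return True if at least one inner group has all of its conditions met."""
    return any(
        all(_OPS[op](row[col], value) for col, op, value in group)
        for group in filters
    )


filters = [
    [("age", ">", 18), ("city", "==", "Beijing")],  # group 1: both must hold
    [("city", "in", ["Shanghai", "Shenzhen"])],     # group 2: an alternative
]

print(row_matches({"age": 30, "city": "Beijing"}, filters))   # True (group 1)
print(row_matches({"age": 16, "city": "Shanghai"}, filters))  # True (group 2)
print(row_matches({"age": 16, "city": "Beijing"}, filters))   # False
```

A row passes when any one inner list is fully satisfied, which is why the second row matches even though it fails the age condition in the first group.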