Comparison with PyODPS DataFrame
PyODPS DataFrame is a DataFrame-like package provided by MaxCompute as part of the PyODPS package. It lets Python data analysts query MaxCompute data with a set of operators similar to those of pandas. Despite the similarity in operators, the two APIs are used quite differently, so it might not be easy for a developer who knows only pandas to dive deep into PyODPS DataFrame.
Though PyODPS DataFrame is still part of PyODPS, it is recommended to create new applications with MaxFrame to enjoy its compatibility with pandas.
Object abstraction
PyODPS DataFrame does not have indexes. This means that the majority of pandas APIs that rely on indexes either cannot be used or are not fully supported.
For instance, arithmetic operations in pandas rely on index alignment. That is, the two operands are first aligned by index, and then the arithmetic operation is performed.
>>> series1 = pd.Series([2, 1, 3], index=[1, 2, 4])
>>> series2 = pd.Series([1, 5, 6], index=[1, 3, 4])
>>> series1 + series2
1    3.0
2    NaN
3    NaN
4    9.0
dtype: float64
However, this kind of operation cannot be supported when indexes are absent.
To support it, MaxFrame requires every DataFrame or Series to carry an index. If no index is specified, a default RangeIndex is added, so the statement above is supported.
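To see what a default RangeIndex means for alignment, the following sketch uses plain pandas rather than MaxFrame, on the assumption that MaxFrame mirrors this pandas behavior as described above. Two Series built without an explicit index both receive a RangeIndex starting at 0, so they share the same labels and are added position by position.

```python
import pandas as pd

# No explicit index: pandas assigns RangeIndex(start=0, stop=3, step=1).
left = pd.Series([2, 1, 3])
right = pd.Series([1, 5, 6])

# Both operands carry the labels 0, 1, 2, so alignment is effectively positional.
total = left + right

print(type(left.index).__name__)  # RangeIndex
print(total.tolist())             # [3, 6, 9]
```

Contrast this with the earlier example, where the explicit indexes only partly overlap and the unmatched labels produce NaN.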
Another major difference between PyODPS DataFrame and MaxFrame is that PyODPS DataFrame mixes the representation of data objects with that of operators, which may confuse newcomers. For instance,
# o is a PyODPS entry object (an odps.ODPS instance)
df = o.get_table('table_name').to_df()  # df is a DataFrame instance
df2 = df["col1", "col2"]  # df2 is a CollectionExpr instance
In the second line, df2 is an instance of CollectionExpr, which means
it is an expression rather than a DataFrame instance. Nevertheless, every
DataFrame function can be applied directly to df2, and it behaves no
differently from a DataFrame instance.
In MaxFrame, however, data objects and operators are defined separately. The data
objects users interact with are all instances of a few data classes, namely
DataFrame, Series, or Index. In the MaxFrame version of the example above, both
objects are DataFrame instances.
import maxframe.dataframe as md

df = md.read_odps_table('table_name')  # df is a DataFrame instance
df2 = df[["col1", "col2"]]  # df2 is also a DataFrame instance
Functions
Functions in PyODPS DataFrame are not fully compatible with pandas, so users
need to read the documentation before they start coding with PyODPS DataFrame.
MaxFrame, in contrast, aims to provide a pandas-compatible API, which leads to
API differences between PyODPS DataFrame and MaxFrame.
These differences are listed below. Methods whose names start with mf. are non-pandas
methods added to MaxFrame to ease migration from PyODPS DataFrame.
Read the API documents of these functions before rewriting your code.
| PyODPS DataFrame API | MaxFrame API |
|---|---|
| DataFrame.append_id | Not needed. DataFrame index is added by default |
| DataFrame.bloom_filter | Not implemented yet |
| DataFrame.boxplot | DataFrame.plot.boxplot |
| DataFrame.concat | maxframe.dataframe.concat |
| DataFrame.describe | DataFrame.describe |
| DataFrame.distinct | DataFrame.drop_duplicates |
| DataFrame.except_ | DataFrame.merge with filter |
| DataFrame.exclude | DataFrame.drop |
| DataFrame.extract_kv | Not implemented yet |
| DataFrame.hist | DataFrame.plot.hist |
| DataFrame.inner_join | DataFrame.merge |
| DataFrame.intersect | DataFrame.merge |
| DataFrame.left_join | DataFrame.merge |
| DataFrame.limit | DataFrame.head |
| DataFrame.map_reduce | DataFrame.mf.map_reduce |
| DataFrame.minmax_scale | Not implemented yet |
| DataFrame.outer_join | DataFrame.merge |
| DataFrame.persist | DataFrame.to_odps_table |
| DataFrame.reshuffle | DataFrame.mf.reshuffle |
| DataFrame.right_join | DataFrame.merge |
| DataFrame.setdiff | DataFrame.merge |
| DataFrame.split | Not implemented yet |
| DataFrame.std_scale | Not implemented yet |
| DataFrame.sort | DataFrame.sort_values |
| DataFrame.switch | maxframe.dataframe.case_when |
| DataFrame.to_kv | Not implemented yet |
| DataFrame.union | maxframe.dataframe.concat |
| DatetimeSequenceExpr.date | Series.dt.date |
| DatetimeSequenceExpr.day | Series.dt.day |
| DatetimeSequenceExpr.dayofweek | Series.dt.dayofweek |
| DatetimeSequenceExpr.dayofyear | Series.dt.dayofyear |
| DatetimeSequenceExpr.hour | Series.dt.hour |
| DatetimeSequenceExpr.is_month_end | Series.dt.is_month_end |
| DatetimeSequenceExpr.is_month_start | Series.dt.is_month_start |
| DatetimeSequenceExpr.is_year_end | Series.dt.is_year_end |
| DatetimeSequenceExpr.is_year_start | Series.dt.is_year_start |
| DatetimeSequenceExpr.microsecond | Series.dt.microsecond |
| DatetimeSequenceExpr.min | Series.dt.min |
| DatetimeSequenceExpr.minute | Series.dt.minute |
| DatetimeSequenceExpr.month | Series.dt.month |
| DatetimeSequenceExpr.second | Series.dt.second |
| DatetimeSequenceExpr.strftime | Series.dt.strftime |
| DatetimeSequenceExpr.unix_timestamp | Not implemented yet |
| DatetimeSequenceExpr.week | Series.dt.week |
| DatetimeSequenceExpr.weekday | Series.dt.weekday |
| DatetimeSequenceExpr.weekofyear | Series.dt.weekofyear |
| DatetimeSequenceExpr.year | Series.dt.year |
| SequenceExpr.degrees | np.degrees(Series) |
| SequenceExpr.radians | np.radians(Series) |
| SequenceExpr.tolist | Series.to_numpy |
| SequenceExpr.to_datetime | maxframe.dataframe.to_datetime |
| SequenceExpr.topk | Not implemented yet |
| SequenceExpr.trunc | np.trunc(Series) |
| SequenceExpr.hll_count | Not implemented yet |
| StringSequenceExpr.capitalize | Series.str.capitalize |
| StringSequenceExpr.contains | Series.str.contains |
| StringSequenceExpr.count | Series.str.count |
| StringSequenceExpr.endswith | Series.str.endswith |
| StringSequenceExpr.find | Series.str.find |
| StringSequenceExpr.len | Series.str.len |
| StringSequenceExpr.ljust | Series.str.ljust |
| StringSequenceExpr.lower | Series.str.lower |
| StringSequenceExpr.lstrip | Series.str.lstrip |
| StringSequenceExpr.pad | Series.str.pad |
| StringSequenceExpr.repeat | Series.str.repeat |
| StringSequenceExpr.replace | Series.str.replace |
| StringSequenceExpr.rfind | Series.str.rfind |
| StringSequenceExpr.rjust | Series.str.rjust |
| StringSequenceExpr.rstrip | Series.str.rstrip |
| StringSequenceExpr.slice | Series.str.slice |
| StringSequenceExpr.startswith | Series.str.startswith |
| StringSequenceExpr.strip | Series.str.strip |
| StringSequenceExpr.swapcase | Series.str.swapcase |
| StringSequenceExpr.title | Series.str.title |
| StringSequenceExpr.translate | Series.str.translate |
| StringSequenceExpr.upper | Series.str.upper |
| StringSequenceExpr.zfill | Series.str.zfill |
| StringSequenceExpr.isalnum | Series.str.isalnum |
| StringSequenceExpr.isalpha | Series.str.isalpha |
| StringSequenceExpr.isdigit | Series.str.isdigit |
| StringSequenceExpr.isspace | Series.str.isspace |
| StringSequenceExpr.islower | Series.str.islower |
| StringSequenceExpr.isupper | Series.str.isupper |
| StringSequenceExpr.istitle | Series.str.istitle |
| StringSequenceExpr.isnumeric | Series.str.isnumeric |
| StringSequenceExpr.isdecimal | Series.str.isdecimal |
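Some rows of the table map to a combination of pandas calls rather than a single method. The sketch below uses plain pandas (not MaxFrame) to illustrate the pandas-side idioms for distinct, exclude, sort, limit, and except_/setdiff. The column names and data are made up for illustration, and whether MaxFrame's merge accepts the indicator argument is an assumption that should be checked against the MaxFrame API documents.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame(
    {"id": [1, 2, 2, 3], "name": ["a", "b", "b", "c"], "tmp": [0, 0, 0, 0]}
)
other = pd.DataFrame({"id": [2], "name": ["b"]})

# exclude -> drop, distinct -> drop_duplicates
unique_rows = df.drop(columns=["tmp"]).drop_duplicates()

# sort -> sort_values, limit -> head
top_two = unique_rows.sort_values("id", ascending=False).head(2)

# except_ / setdiff -> left merge, then keep rows found only on the left side
merged = unique_rows.merge(other, on=["id", "name"], how="left", indicator=True)
difference = merged[merged["_merge"] == "left_only"].drop(columns=["_merge"])

print(top_two["id"].tolist())     # [3, 2]
print(difference["id"].tolist())  # [1, 3]
```

The indicator column is one way to express the "merge with filter" entry; an anti-join built from isin on a single key column would be another option when only one column identifies a row.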
Execution
PyODPS DataFrame and MaxFrame both use lazy execution so that code can be optimized before it runs. However, the way jobs are invoked differs between the two.