Comparison with PyODPS DataFrame#

PyODPS DataFrame is a DataFrame-like package provided by MaxCompute as a part of PyODPS package. It provides capability for Python data analyzers to query MaxCompute data with a set of operators similar to pandas. Despite the similarity in operators, the usage between two sets of APIs are quite different. It might not be easy for a developer to dive deep into PyODPS DataFrame with knowledge about pandas only.

Though PyODPS DataFrame is still part of PyODPS, it is recommended to create new applications with MaxFrame to enjoy its compatibility with pandas.

Object abstraction#

PyODPS DataFrame does not have indexes. This means that a majority of pandas APIs with indexes cannot be used or not fully supported.

For instance, arithmetic operations in pandas relies on index alignment. That is, two DataFrames are aligned first, and then arithmetic operation is performed.

>>> series1 = pd.Series([2, 1, 3], index=[1, 2, 4])
>>> series2 = pd.Series([1, 5, 6], index=[1, 3, 4])
>>> series1 + series2
1    3.0
2    NaN
3    NaN
4    9.0
dtype: float64

However, when indexes are absent, this kind of operation is not supported.

To support this kind of operation, in MaxFrame, it is required to add an index column to DataFrame or Series. If the index is absent, a default RangeIndex is added. Therefore the statement above can be supported.

Another huge difference between PyODPS DataFrame and MaxFrame is that in PyODPS DataFrame, representation of data objects and operators are mixed, and this may confuse newcomers. For instance,

df = o.get_table('table_name').to_df()  # df is a DataFrame instance
df2 = df["col1", "col2"]  # df2 is a CollectionExpr instance

In the second line, df2 is an instance of CollectionExpr which means it is an expression and different from a DataFrame instance. However, all DataFrame functions can be applied directly onto df2 and there is nothing different from DataFrame instance.

In MaxFrame, however, data objects and operators are defined separately. Data objects users interact with are all instances of a few data classes, namely DataFrame, Series or Index. For the example above, now all instances are DataFrame now.

df = md.read_odps_table('table_name')  # df is a DataFrame instance
df2 = df[["col1", "col2"]]  # df2 is also a DataFrame instance

Functions#

Functions in PyODPS DataFrame are not fully compatible with pandas. Therefore to write code with PyODPS DataFrame, users need to read the documents first before start coding. However, the target of MaxFrame is to create a pandas-compatible API. Hence there are API differences between PyODPS DataFrame and MaxFrame. These differences are listed below. Methods starts with mf. mean that these non-pandas methods are added in MaxFrame to facilitate migrating from PyODPS DataFrame to MaxFrame. Note that you need to read API documents of these functions before rewriting your code.

PyODPS DataFrame API	MaxFrame API
DataFrame.append_id	Not needed. DataFrame index is added by default
DataFrame.bloom_filter	Not implemented yet
DataFrame.boxplot	DataFrame.plot.boxplot
DataFrame.concat	maxframe.dataframe.concat
DataFrame.describe	DataFrame.describe
DataFrame.distinct	DataFrame.drop_duplicates
DataFrame.except_	DataFrame.merge with filter
DataFrame.exclude	DataFrame.drop
DataFrame.extract_kv	Not implemented yet
DataFrame.hist	DataFrame.plot.hist
DataFrame.inner_join	DataFrame.merge
DataFrame.intersect	DataFrame.merge
DataFrame.left_join	DataFrame.merge
DataFrame.limit	DataFrame.head
DataFrame.map_reduce	DataFrame.mf.map_reduce
DataFrame.minmax_scale	Not implemented yet
DataFrame.outer_join	DataFrame.merge
DataFrame.persist	DataFrame.to_odps_table
DataFrame.reshuffle	DataFrame.mf.reshuffle
DataFrame.right_join	DataFrame.merge
DataFrame.setdiff	DataFrame.merge
DataFrame.split	Not implemented yet
DataFrame.std_scale	Not implemented yet
DataFrame.sort	DataFrame.sort_values
DataFrame.switch	maxframe.dataframe.case_when
DataFrame.to_kv	Not implemented yet
DataFrame.union	maxframe.dataframe.concat
DatetimeSequenceExpr.date	Series.dt.date
DatetimeSequenceExpr.day	Series.dt.day
DatetimeSequenceExpr.dayofweek	Series.dt.dayofweek
DatetimeSequenceExpr.dayofyear	Series.dt.dayofyear
DatetimeSequenceExpr.hour	Series.dt.hour
DatetimeSequenceExpr.is_month_end	Series.dt.is_month_end
DatetimeSequenceExpr.is_month_start	Series.dt.is_month_start
DatetimeSequenceExpr.is_year_end	Series.dt.is_year_end
DatetimeSequenceExpr.is_year_start	Series.dt.is_year_start
DatetimeSequenceExpr.microsecond	Series.dt.microsecond
DatetimeSequenceExpr.min	Series.dt.min
DatetimeSequenceExpr.minute	Series.dt.minute
DatetimeSequenceExpr.month	Series.dt.month
DatetimeSequenceExpr.second	Series.dt.second
DatetimeSequenceExpr.strftime	Series.dt.strftime
DatetimeSequenceExpr.unix_timestamp	Not implemented yet
DatetimeSequenceExpr.week	Series.dt.week
DatetimeSequenceExpr.weekday	Series.dt.weekday
DatetimeSequenceExpr.weekofyear	Series.dt.weekofyear
DatetimeSequenceExpr.year	Series.dt.year
SequenceExpr.degrees	np.degrees(Series)
SequenceExpr.radians	np.radians(Series)
SequenceExpr.tolist	Series.to_numpy
SequenceExpr.to_datetime	maxframe.dataframe.to_datetime
SequenceExpr.topk	Not implemented yet
SequenceExpr.trunc	np.trunc(Series)
SequenceExpr.hll_count	Not implemented yet
StringSequenceExpr.capitalize	Series.str.capitalize
StringSequenceExpr.contains	Series.str.contains
StringSequenceExpr.count	Series.str.count
StringSequenceExpr.endswith	Series.str.endswith
StringSequenceExpr.find	Series.str.find
StringSequenceExpr.len	Series.str.len
StringSequenceExpr.ljust	Series.str.ljust
StringSequenceExpr.lower	Series.str.lower
StringSequenceExpr.lstrip	Series.str.lstrip
StringSequenceExpr.pad	Series.str.pad
StringSequenceExpr.repeat	Series.str.repeat
StringSequenceExpr.replace	Series.str.replace
StringSequenceExpr.rfind	Series.str.rfind
StringSequenceExpr.rjust	Series.str.rjust
StringSequenceExpr.rstrip	Series.str.rstrip
StringSequenceExpr.slice	Series.str.slice
StringSequenceExpr.startswith	Series.str.startswith
StringSequenceExpr.strip	Series.str.strip
StringSequenceExpr.swapcase	Series.str.swapcase
StringSequenceExpr.title	Series.str.title
StringSequenceExpr.translate	Series.str.translate
StringSequenceExpr.upper	Series.str.upper
StringSequenceExpr.zfill	Series.str.zfill
StringSequenceExpr.isalnum	Series.str.isalnum
StringSequenceExpr.isalpha	Series.str.isalpha
StringSequenceExpr.isdigit	Series.str.isdigit
StringSequenceExpr.isspace	Series.str.isspace
StringSequenceExpr.islower	Series.str.islower
StringSequenceExpr.isupper	Series.str.isupper
StringSequenceExpr.istitle	Series.str.istitle
StringSequenceExpr.isnumeric	Series.str.isnumeric
StringSequenceExpr.isdecimal	Series.str.isdecimal

Execution#

PyODPS DataFrame and MaxFrame both use lazy execution to leverage efficiency of code optimization. However, the way to invoke these jobs is changed.