十分钟入门 MaxFrame#
这里以 movielens 100K 为例。假设已存在三张表,分别是 ``maxframe_ml_100k_movies``(电影相关数据)、``maxframe_ml_100k_users``(用户相关数据)和 ``maxframe_ml_100k_ratings``(评分相关数据)。
在开始以下步骤前,请先创建一个 MaxFrame 会话对象:
import os
from odps import ODPS
from maxframe import new_session
# Make sure environment variable ALIBABA_CLOUD_ACCESS_KEY_ID already set to Access Key ID of user
# while environment variable ALIBABA_CLOUD_ACCESS_KEY_SECRET set to Access Key Secret of user.
# Not recommended to hardcode Access Key ID or Access Key Secret in your code.
o = ODPS(
os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
project='**your-project**',
endpoint='**your-endpoint**',
)
session = new_session(o)
您只需要使用 read_odps_table API 来创建一个 DataFrame 对象。例如,
import maxframe.dataframe as md
users = md.read_odps_table('pyodps_ml_100k_users')
通过 dtypes 属性查看 DataFrame 的列及其类型,如下代码所示:
>>> users.dtypes
user_id int64
age int64
sex object
occupation object
zip_code object
dtype: object
简单查看对象的表示形式,将自动显示 DataFrame 的前几行和后几行。
>>> users
user_id age sex occupation zip_code
0 1 24 M technician 85711
1 2 53 F other 94043
2 3 23 M writer 32067
3 4 24 M technician 43537
4 5 33 F other 15213
...
5 6 42 M executive 98101
6 7 57 M administrator 91344
7 8 36 M administrator 05201
8 9 29 M student 01002
9 10 53 M lawyer 90703
您可以使用 head 方法获取前 N 条数据记录,以便快速预览数据。例如:
>>> users.head(10).execute().fetch()
user_id age sex occupation zip_code
0 1 24 M technician 85711
1 2 53 F other 94043
2 3 23 M writer 32067
3 4 24 M technician 43537
4 5 33 F other 15213
5 6 42 M executive 98101
6 7 57 M administrator 91344
7 8 36 M administrator 05201
8 9 29 M student 01002
9 10 53 M lawyer 90703
如果您不想查看所有列,可以在列上添加筛选器。例如:
>>> users[['user_id', 'age']].head(5).execute().fetch()
user_id age
0 1 24
1 2 53
2 3 23
3 4 24
4 5 33
您也可以删除某些列。例如:
>>> users.drop(columns=['zip_code', 'age']).head(5)
user_id sex occupation
0 1 M technician
1 2 F other
2 3 M writer
3 4 M technician
4 5 F other
在排除某些列时,您可能希望通过计算得到新列。例如,添加 sex_bool 属性,当性别为 Male 时设为 True,否则设为 False。例如:
>>> users = users.drop(['zip_code', 'sex'])
>>> users["sex_bool"] = users.sex == "M"
>>> users.head(5).execute().fetch()
user_id age occupation sex_bool
0 1 24 technician True
1 2 53 other False
2 3 23 writer True
3 4 24 technician True
4 5 33 other False
获取年龄在 20 到 25 岁之间的人数,如下代码所示:
>>> users[users.age.between(20, 25)].count().execute().fetch()
195
获取男性和女性用户的数量,如下代码所示:
>>> users.groupby(users.sex).user_id.size()
F 273
M 670
dtype: int64
按职业划分用户,获取人数最多的前 10 个职业,并按人数降序排列。见下文:
>>> df = users.groupby("occupation").agg({"user_id": "count"})
>>> df.sort_values("user_id", ascending=False)[:10]
user_id
occupation
student 196
other 105
educator 95
administrator 79
engineer 67
programmer 66
librarian 51
writer 45
executive 32
scientist 31
DataFrame API 提供了 value_counts 方法来快速实现相同的效果。示例如下。
>>> uses.occupation.value_counts()[:10]
student 196
other 105
educator 95
administrator 79
engineer 67
programmer 66
librarian 51
writer 45
executive 32
scientist 31
dtype: int64
以更直观的图表展示数据,如下代码所示:
%matplotlib inline
使用水平条形图来可视化数据,如下代码所示:
>>> users['occupation'].value_counts().plot(kind='barh', x='occupation', ylabel='prefession')
<matplotlib.axes._subplots.AxesSubplot at 0x10653cfd0>
_images/df-value-count-plot.png
将年龄分为 30 组,查看年龄分布的直方图,如下代码所示:
>>> users.age.hist(bins=30, title="Distribution of users' ages", xlabel='age', ylabel='count of users')
<matplotlib.axes._subplots.AxesSubplot at 0x10667a510>
_images/df-age-hist.png
使用 join 将三张表连接,并将连接后的表保存为一张新表。例如:
>>> movies = md.read_odps_table('pyodps_ml_100k_movies')
>>> ratings = md.read_odps_table('pyodps_ml_100k_ratings')
>>>
>>> o.delete_table('pyodps_ml_100k_lens', if_exists=True)
>>> lens = movies.join(ratings).join(users).persist('pyodps_ml_100k_lens')
>>>
>>> lens.dtypes
odps.Schema {
movie_id int64
title string
release_date string
video_release_date string
imdb_url string
user_id int64
rating int64
unix_timestamp int64
age int64
sex string
occupation string
zip_code string
}