10 minutes to MaxFrame#

Here, movielens 100K is used as an example. Assume that three tables already exist, which are maxframe_ml_100k_movies (movie-related data), maxframe_ml_100k_users (user-related data), and maxframe_ml_100k_ratings (rating-related data).

Create a MaxFrame session object before starting the following steps:

import os
from odps import ODPS
from maxframe import new_session

# Make sure environment variable ALIBABA_CLOUD_ACCESS_KEY_ID already set to Access Key ID of user
# while environment variable ALIBABA_CLOUD_ACCESS_KEY_SECRET set to Access Key Secret of user.
# Not recommended to hardcode Access Key ID or Access Key Secret in your code.
o = ODPS(
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
    project='**your-project**',
    endpoint='**your-endpoint**',
)
session = new_session(o)

You only need to use read_odps_table API to create a DataFrame object. For instance,

import maxframe.dataframe as md

users = md.read_odps_table('pyodps_ml_100k_users')

View columns of DataFrame and the types of the columns through the dtypes attribute, as shown in the following code:

>>> users.dtypes
user_id        int64
age            int64
sex           object
occupation    object
zip_code      object
dtype: object

Simply view the representation of the object will automatically show the first and last rows of the DataFrame.

>>> users
   user_id  age  sex     occupation  zip_code
0        1   24    M     technician     85711
1        2   53    F          other     94043
2        3   23    M         writer     32067
3        4   24    M     technician     43537
4        5   33    F          other     15213
...
5        6   42    M      executive     98101
6        7   57    M  administrator     91344
7        8   36    M  administrator     05201
8        9   29    M        student     01002
9       10   53    M         lawyer     90703

You can use the head method to obtain the first N data records for easy and quick data preview. For example:

>>> users.head(10).execute().fetch()
   user_id  age  sex     occupation  zip_code
0        1   24    M     technician     85711
1        2   53    F          other     94043
2        3   23    M         writer     32067
3        4   24    M     technician     43537
4        5   33    F          other     15213
5        6   42    M      executive     98101
6        7   57    M  administrator     91344
7        8   36    M  administrator     05201
8        9   29    M        student     01002
9       10   53    M         lawyer     90703

You can add a filter on the columns if you do not want to view all of them. For example:

>>> users[['user_id', 'age']].head(5).execute().fetch()
   user_id  age
0        1   24
1        2   53
2        3   23
3        4   24
4        5   33

You can also drop several columns. For example:

>>> users.drop(columns=['zip_code', 'age']).head(5)
   user_id  sex  occupation
0        1    M  technician
1        2    F       other
2        3    M      writer
3        4    M  technician
4        5    F       other

When excluding some columns, you may want to obtain new columns through computation. For example, add the sex_bool attribute and set it to True if sex is Male. Otherwise, set it to False. For example:

>>> users = users.drop(['zip_code', 'sex'])
>>> users["sex_bool"] = users.sex == "M"
>>> users.head(5).execute().fetch()
   user_id  age  occupation  sex_bool
0        1   24  technician      True
1        2   53       other     False
2        3   23      writer      True
3        4   24  technician      True
4        5   33       other     False

Obtain the number of persons at age of 20 to 25, as shown in the following code:

>>> users[users.age.between(20, 25)].count().execute().fetch()
195

Obtain the numbers of male and female users, as shown in the following code:

>>> users.groupby(users.sex).user_id.size()
F   273
M   670
dtype: int64

To divide users by job, obtain the first 10 jobs that have the largest population, and sort the jobs in the descending order of population. See the following:

>>> df = users.groupby("occupation").agg({"user_id": "count"})
>>> df.sort_values("user_id", ascending=False)[:10]
               user_id
occupation
student            196
other              105
educator            95
administrator       79
engineer            67
programmer          66
librarian           51
writer              45
executive           32
scientist           31

DataFrame APIs provide the value_counts method to quickly achieve the same result. An example is shown below.

>>> uses.occupation.value_counts()[:10]
student        196
other          105
educator        95
administrator   79
engineer        67
programmer      66
librarian       51
writer          45
executive       32
scientist       31
dtype: int64

Show data in a more intuitive graph, as shown in the following code:

%matplotlib inline

Use a horizontal bar chart to visualize data, as shown in the following code:

>>> users['occupation'].value_counts().plot(kind='barh', x='occupation', ylabel='prefession')
<matplotlib.axes._subplots.AxesSubplot at 0x10653cfd0>

_images/df-value-count-plot.png

Divide ages into 30 groups and view the histogram of age distribution, as shown in the following code:

>>> users.age.hist(bins=30, title="Distribution of users' ages", xlabel='age', ylabel='count of users')
<matplotlib.axes._subplots.AxesSubplot at 0x10667a510>

_images/df-age-hist.png

Use join to join the three tables and save the joined tables as a new table. For example:

>>> movies = md.read_odps_table('pyodps_ml_100k_movies')
>>> ratings = md.read_odps_table('pyodps_ml_100k_ratings')
>>>
>>> o.delete_table('pyodps_ml_100k_lens', if_exists=True)
>>> lens = movies.join(ratings).join(users).persist('pyodps_ml_100k_lens')
>>>
>>> lens.dtypes
odps.Schema {
movie_id                            int64
title                               string
release_date                        string
video_release_date                  string
imdb_url                            string
user_id                             int64
rating                              int64
unix_timestamp                      int64
age                                 int64
sex                                 string
occupation                          string
zip_code                            string
}