.. _examples_multimodal_audio_maxframe: MaxFrame Multimodal Audio Operator Practice =========================================== .. raw:: html
Available at MaxFrame 2.7.0
Background ---------- As multimodal foundation models and speech applications grow quickly, audio becomes a key data source for training and content understanding. Workloads such as ASR, subtitle generation, speech retrieval, and corpus preparation all require reliable text extraction from large-scale raw audio. In production, audio files are usually scattered across storage locations with inconsistent formats, durations, and quality. Traditional pipelines rely on many separate tools for download, transcoding, transcription, and aggregation, which increases development and operational complexity at scale. With MaxCompute + MaxFrame, you can run a unified distributed pipeline from OSS audio ingestion to structured transcription output. This tutorial focuses on practical audio processing with built-in MaxFrame ``.audio`` operators. Use cases --------- - **Speech transcription** for podcasts, courses, interviews, and meeting recordings. - **Subtitle generation** and text extraction from audio tracks in media files. - **Training data preparation** for speech recognition and speech understanding models. - **Structured processing** for customer service calls and business recordings. Pipeline -------- .. image:: ../_static/examples/audio-pipeline.png :alt: Audio processing pipeline with MaxFrame audio operators :width: 100% Prerequisites ------------- .. list-table:: :header-rows: 1 :widths: 8 24 68 * - # - Requirement - Description * - 1 - **MaxCompute enabled** - A MaxCompute project with valid Access ID / Access Key. * - 2 - **DPE enabled** - ``.url.download()`` and ``.audio.*`` operators run on DPE. * - 3 - **Audio uploaded to OSS** - Source audio files are available in a target OSS bucket. * - 4 - **OSS RAM role authorization** - Configure Role ARN for ``.url.download(storage_options={"role_arn": ...})``. * - 5 - **MaxFrame SDK version** - Use MaxFrame SDK **2.7.0** or above (``pip install maxframe>=2.7.0``). Configure OSS RAM role ---------------------- When using ``.url.download(storage_options={"role_arn": ...})``, MaxFrame reads OSS data by assuming a RAM role. Make sure: 1. The role has OSS read permission (for example ``AliyunOSSReadOnlyAccess``). 2. The role trust policy allows MaxCompute service to assume the role. Environment setup ----------------- MaxCompute placeholders: .. code-block:: python ODPS_ACCESS_ID = "" ODPS_ACCESS_KEY = "" ODPS_PROJECT = "" ODPS_ENDPOINT = "" OUTPUT_TABLE = "" OSS placeholders: .. code-block:: python ROLE_ARN = "" OSS_ENDPOINT = "" OSS_BUCKET_NAME = "" OSS_DATA_PREFIX = "" Complete code example --------------------- Initialize ODPS and MaxFrame session: .. code-block:: python import maxframe assert maxframe.__version__ >= "2.7.0", ( f"maxframe >= 2.7.0 is required, current version: {maxframe.__version__}. " f"Please run: pip install --upgrade maxframe" ) print(f"maxframe version: {maxframe.__version__} ✓") import pandas as pd import maxframe.dataframe as md from maxframe import new_session from maxframe.config import options from odps import ODPS o = ODPS( access_id=ODPS_ACCESS_ID, secret_access_key=ODPS_ACCESS_KEY, project=ODPS_PROJECT, endpoint=ODPS_ENDPOINT, ) options.sql.enable_mcqa = False options.dag.settings = {"engine_order": ["DPE"]} session = new_session(o) print(f"Session ID : {session.session_id}") print(f"LogView : {session.get_logview_address()}") Build OSS audio URL list: .. code-block:: python file_names = ["audio_011.flac"] audio_urls = [ f"oss://{OSS_ENDPOINT}/{OSS_BUCKET_NAME}/{OSS_DATA_PREFIX}{name}" for name in file_names ] print(f"Processing {len(audio_urls)} audio files:") for u in audio_urls: print(f" - {u}") Use ``.audio`` operators for decode, language detection, transcription, and VAD: .. code-block:: python df = md.DataFrame(pd.DataFrame({"name": file_names, "url": audio_urls})) # Download OSS audio as bytes via RAM role df["audio_bytes"] = df["url"].url.download( storage_options={"role_arn": ROLE_ARN}, errors="raise", ) # Decode to target sample rate df["decoded"] = df["audio_bytes"].audio.decode(target_sample_rate=16000) # Basic properties df["sample_rate"] = df["decoded"].audio.sample_rate df["duration"] = df["decoded"].audio.duration df["format"] = df["decoded"].audio.format # Language detection df["language"] = df["audio_bytes"].audio.detect_language( max_duration_sec=30.0, cpu=4, memory="16GiB", ) # Speech-to-text transcribed = df["audio_bytes"].audio.transcribe(cpu=4, memory="16GiB") df["text"] = transcribed["text"] # Voice activity detection df["vad"] = df["audio_bytes"].audio.vad_detect(threshold=0.5) result_df = df[[ "name", "sample_rate", "duration", "format", "language", "text", "vad", ]] md.to_odps_table(result_df, OUTPUT_TABLE, overwrite=True).execute() print(result_df.execute().fetch()) Cleanup: .. code-block:: python session.destroy() Technical highlights -------------------- - **Direct OSS ingestion** with ``.url.download()`` and no object table dependency. - **Built-in distributed audio operators** for decode, metadata, language detection, transcription, and VAD. - **Flexible resource control** for CPU/GPU style workloads through operator and engine parameters. - **Low operational overhead** by reusing MaxCompute elastic compute and storage stack. Troubleshooting --------------- OSS access denied ~~~~~~~~~~~~~~~~~ **Symptom**: ``.url.download(storage_options={"role_arn": ...})`` fails with an access denied error. **Cause**: Wrong or missing role permissions. **Solution**: Verify the following: 1. **Role ARN is correct** — double-check the ``role_arn`` value in ``storage_options`` matches the RAM role configured in the Alibaba Cloud console. 2. **OSS read permission** — ensure the RAM role has the ``AliyunOSSReadOnlyAccess`` policy (or equivalent custom policy) attached, granting ``oss:GetObject`` permission on the target bucket.