Case Study

Media Company

Category
Cloud Services

Project overview: Crafting exciting demos from extensive media

Imagine a world where you can experience the essence of a 60-minute video podcast in just a few short, engaging clips. By processing audio and video content, we generate demos highlighting the most captivating moments of games, players, and coaches. The magic lies in transforming a lengthy media file into a user-defined number of short, impactful clips, each meticulously selected to represent the media's most thrilling segments.

Solution overview

The following diagram shows the resources and services used in the solution.

The depicted architecture outlines a sophisticated cloud-based service engineered to distil lengthy video or audio media into concise, engaging clips. Users engage with this service through a web application, accessing it directly from their browsers. Once the user uploads the media, AWS Lambda acts as the orchestrator, invoking Amazon Transcribe to transcribe the spoken content into text. This textual data forms the substrate for the subsequent content curation process.
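
As a concrete illustration of this orchestration step, the sketch below shows how a Lambda handler might start a diarised transcription job. The event fields, bucket name, and speaker count are assumptions for illustration, not the production code.

```python
# Minimal sketch of the Lambda handler that kicks off transcription.
# Bucket, key, and job naming are illustrative; error handling is omitted.
import boto3

transcribe = boto3.client("transcribe")

def lambda_handler(event, context):
    # Assume the upload event carries the S3 location of the media file (hypothetical fields).
    bucket = event["bucket"]          # e.g. "media-uploads"
    key = event["key"]                # e.g. "podcasts/episode-42.mp4"

    transcribe.start_transcription_job(
        TranscriptionJobName=key.replace("/", "-"),
        Media={"MediaFileUri": f"s3://{bucket}/{key}"},
        MediaFormat=key.rsplit(".", 1)[-1],
        LanguageCode="en-US",
        OutputBucketName=bucket,
        Settings={
            # Diarised output: label each utterance with its speaker.
            "ShowSpeakerLabels": True,
            "MaxSpeakerLabels": 4,
        },
    )
    return {"status": "transcription started", "job": key}
```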

In the core phase of Automatic Clip Generation, the service employs a dual-strategy approach. First, 'Intelligent Chunking' segments the transcribed text into semantically coherent segments, identifying key moments within the media's narrative. Next, 'Ranking & clip generation' evaluates these segments for their informational richness and entertainment value, ultimately synthesizing them into a series of short, impactful clips. These clips are then processed through Amazon SageMaker and Amazon Bedrock, leveraging foundation models to refine the selection and ensure that each clip encapsulates the media's most exhilarating moments. The resultant clips are stored in an Amazon S3 bucket, while metadata such as named entities and summaries is catalogued, ready for retrieval and display. This transformative process enables users to experience the highlights of a 60-minute recording in just a few carefully crafted snippets.
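
To make the final step tangible, here is a minimal sketch of cutting one selected clip and cataloguing it in S3 with its metadata. The ffmpeg invocation, bucket name, and metadata fields are illustrative assumptions rather than the production implementation.

```python
# Minimal sketch: cut one selected clip with ffmpeg and store it in S3
# together with its catalogued metadata.
import json
import subprocess
import boto3

s3 = boto3.client("s3")

def store_clip(source_path, start, end, clip_id, summary, entities,
               bucket="demo-clips"):               # hypothetical bucket name
    clip_path = f"/tmp/{clip_id}.mp4"
    # Re-encode only the selected window of the source media.
    subprocess.run(
        ["ffmpeg", "-y", "-i", source_path,
         "-ss", str(start), "-to", str(end), clip_path],
        check=True,
    )
    s3.put_object(
        Bucket=bucket,
        Key=f"clips/{clip_id}.mp4",
        Body=open(clip_path, "rb"),
        Metadata={                                  # catalogued alongside the clip
            "summary": summary[:1024],
            "entities": json.dumps(entities)[:1024],
        },
    )
```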

Preprocessing: The first step to precision

Our journey begins with preprocessing, utilizing Amazon Transcribe for diarised, timestamped transcriptions. We break the media down into chunks, each a single sentence with its duration, disregarding any exceeding 40 seconds for brevity. We then normalize these chunks to ensure speaker consistency and incorporate chunk summaries for a concise, informative overview.
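
A minimal sketch of this preprocessing step is shown below, assuming the standard Amazon Transcribe JSON output with a speaker label on each word item. The 40-second cut-off mirrors the rule above; the function name and simplifications are ours.

```python
# Minimal sketch: turn a diarised Amazon Transcribe result into single-sentence
# chunks with durations, dropping any chunk longer than 40 seconds.
import json

MAX_CHUNK_SECONDS = 40

def build_chunks(transcript_json):
    items = transcript_json["results"]["items"]
    chunks, words, start, speaker = [], [], None, None

    for item in items:
        if item["type"] == "pronunciation":
            if start is None:
                start = float(item["start_time"])
                speaker = item.get("speaker_label", "spk_0")
            words.append(item["alternatives"][0]["content"])
            end = float(item["end_time"])
        else:  # punctuation items carry no timestamps
            if words:
                words[-1] += item["alternatives"][0]["content"]
                if item["alternatives"][0]["content"] in ".?!":
                    duration = end - start
                    if duration <= MAX_CHUNK_SECONDS:   # drop overly long sentences
                        chunks.append({
                            "text": " ".join(words),
                            "speaker": speaker,
                            "start": start,
                            "end": end,
                            "duration": duration,
                        })
                    words, start = [], None
    return chunks

# Usage: chunks = build_chunks(json.load(open("transcript.json")))
```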

The LLM trees solution: Harnessing AI for content selection

To elucidate the core functionality of our media processing solution, we adopt a structured approach, incorporating a Large Language Model (LLM) within a batched decision tree mechanism. This process effectively condenses extensive media content into short, engaging clips. The approach is encapsulated in the following pseudocode and detailed explanation:
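
In outline, a Python-style sketch of the approach looks as follows; helper names such as normalize_chunks, enrich, llm_select, and render_demo are placeholders rather than the production API.

```python
OVERSAMPLE = 3  # placeholder headroom factor before the final selection pass

def generate_demo(media, clips_per_zone, num_zones, max_length):
    chunks = normalize_chunks(media, max_length)            # 1. chunk creation
    chunks = [enrich(c) for c in chunks]                    # 2. summary / translation
    zones = split_into_zones(chunks, num_zones)             # 3. zoning

    selected = []
    for zone in zones:
        batches = make_batches(zone)                        # 4. batch formation
        # 5. initial pass: the LLM keeps the top 20% of each batch
        survivors = [c for b in batches for c in llm_select(b, keep_ratio=0.2)]
        # 6. iterative passes: re-batch and re-rank until few enough chunks remain
        while len(survivors) > clips_per_zone * OVERSAMPLE:
            batches = make_batches(survivors)
            survivors = [c for b in batches for c in llm_select(b, keep_ratio=0.2)]
        # 7. final pass: pick exactly the requested number of clips per zone
        selected += llm_select(survivors, keep_count=clips_per_zone)

    # 8. demo creation: a short cut and a longer cut with extra context
    return render_demo(selected, extended=False), render_demo(selected, extended=True)
```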

Detailed explanation:

  1. Chunk Creation: The media is split into normalized chunks, each at most 'max_length' seconds long. This segmentation ensures continuity and relevance.
  2. Chunk Enrichment: Each chunk is enhanced with a summary or a translated version (if non-English), streamlining the content for subsequent processing.
  3. Zoning: The media is divided into zones, facilitating a diverse selection of clips.
  4. Batch Formation: Chunks within each zone are batched, preparing them for LLM processing.
  5. Initial LLM Interaction (20%): The LLM evaluates each batch and selects 20% of the chunks, identifying those with the most engaging content (a sketch of one such call follows this list).
  6. Iterative LLM Interaction Loop: This process iteratively refines the selection, ensuring the final set aligns with the user-defined criteria.
  7. Final LLM Interaction: The final selection is fine-tuned to meet the specific requirements of clip quantity per zone.
  8. Demo Creation: Two versions of the demo are produced - a short version with the most impactful clips, and a longer version with additional relevant content.
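
The sketch below fills in the llm_select placeholder from the pseudocode above for the initial 20% pass, using Claude-V2 on Amazon Bedrock. The prompt wording and model ID are assumptions.

```python
# Minimal sketch of one batch-selection call through Amazon Bedrock.
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

def llm_select(batch, keep_ratio=None, keep_count=None):
    keep = keep_count if keep_count else max(1, int(len(batch) * keep_ratio))
    numbered = "\n".join(f"{i}: {c['summary']}" for i, c in enumerate(batch))
    prompt = (
        "\n\nHuman: Below are numbered summaries of segments from a sports "
        f"podcast. Pick the {keep} most engaging segments and reply with "
        f"their numbers as a JSON list only.\n\n{numbered}\n\nAssistant:"
    )
    response = bedrock.invoke_model(
        modelId="anthropic.claude-v2",
        body=json.dumps({"prompt": prompt, "max_tokens_to_sample": 200}),
    )
    completion = json.loads(response["body"].read())["completion"]
    # Naively assumes the model returns the bare JSON list it was asked for.
    indices = json.loads(completion.strip())
    return [batch[i] for i in indices]
```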

This methodology leverages the LLM's capacity to discern and prioritize media segments, ensuring the final demos are both concise and representative of the original content's essence. The use of zones and batch processing enables a systematic, comprehensive evaluation of the media, ensuring diversity and coverage across different sections.

Functionality extension: Beyond basic processing

Our project is not just about generating clips; it is about enriching the user experience with advanced functionalities:

  • User Interface (UI): We have revamped our UI, making it more user-friendly while integrating our new features seamlessly.
  • Semantic Tagging of Clips: Our "Enrich the Chunks" phase now includes semantic tagging, using an LLM to identify and tag named entities. This enriches each clip with meta-information and thematic keywords (see the sketch after this list).
  • Holistic Summary of Input Media: To provide an overarching context, we employ models like GPT-4 or Claude-V2, with their extended context capabilities, for generating comprehensive summaries of the entire media input.
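
A minimal sketch of the semantic-tagging step, again using Claude-V2 on Amazon Bedrock; the prompt and the JSON output schema are illustrative assumptions.

```python
# Minimal sketch: ask an LLM for named entities and thematic keywords per chunk.
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

def tag_chunk(chunk_text):
    prompt = (
        "\n\nHuman: Extract the named entities (people, teams, places) and up "
        "to five thematic keywords from the transcript excerpt below. Reply "
        'with JSON of the form {"entities": [...], "keywords": [...]} only.\n\n'
        f"{chunk_text}\n\nAssistant:"
    )
    response = bedrock.invoke_model(
        modelId="anthropic.claude-v2",
        body=json.dumps({"prompt": prompt, "max_tokens_to_sample": 300}),
    )
    completion = json.loads(response["body"].read())["completion"]
    return json.loads(completion.strip())   # {"entities": [...], "keywords": [...]}
```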

Future Work – LLM-Agnostic Architecture

One limitation of the existing approach lies in the suboptimal transitions between consecutive audio clips, even when intelligent chunking strategies are employed. Specifically, our algorithm aims to avoid truncating sentences midway and ensures that each chunk features a single speaker. Despite these measures, the auditory transition remains discontinuous in our demonstrations. To ameliorate this issue, we have developed an extension to the current framework. This extension is largely model-agnostic, capable of functioning with various large language models (LLMs) such as J2 Ultra, Claude-V2, or GPT-4. However, it has been specifically tailored for GPT-4 due to the inadequacies of other models in executing this particular task.

In this extension, we increase the maximum allowable length for each chunk during parameter initialization. Subsequently, after identifying the most salient clips via the original algorithm, we employ more capable LLMs such as Claude-V2 or GPT-4 to discern optimal starting and ending points within the elongated content of each chunk. This enhances the smoothness of the transitions, thereby elevating the overall quality of the audio experience.
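
A sketch of this boundary-refinement call is shown below, assuming word-level timestamps are available for the elongated chunk; the prompt and the choice of Claude-V2 on Amazon Bedrock are assumptions.

```python
# Minimal sketch: given an elongated chunk with per-word timestamps, ask a
# stronger model to pick cleaner start and end points.
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

def refine_boundaries(words):
    # 'words' is a list of {"word": str, "start": float, "end": float}
    timed = " ".join(f"[{w['start']:.1f}] {w['word']}" for w in words)
    prompt = (
        "\n\nHuman: The text below is a transcript with a timestamp in seconds "
        "before each word. Choose a start and end time that begin and end on "
        "natural sentence boundaries and keep the most exciting part. Reply "
        'with JSON {"start": <seconds>, "end": <seconds>} only.\n\n'
        f"{timed}\n\nAssistant:"
    )
    response = bedrock.invoke_model(
        modelId="anthropic.claude-v2",
        body=json.dumps({"prompt": prompt, "max_tokens_to_sample": 100}),
    )
    completion = json.loads(response["body"].read())["completion"]
    return json.loads(completion.strip())
```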

We use Amazon Bedrock, which helps us create more model-agnostic solutions in large language model (LLM) technology. Although we currently need prompt engineering specific to individual models, the models are gradually able to produce standardized output from the same prompts. By standardizing outputs across different LLMs, the aim is to enable programs that can interact with various models seamlessly, accommodating the subjective nature of tasks where there is no definitive right or wrong output.
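
A minimal sketch of such a model-agnostic wrapper over Amazon Bedrock follows: the same prompt is routed to different model families, each with its own request body. The model IDs and the request/response shapes for the AI21 branch are assumptions; supporting a new family only means adding a branch.

```python
# Minimal sketch of a model-agnostic completion wrapper over Amazon Bedrock.
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

def complete(prompt, model_id="anthropic.claude-v2", max_tokens=300):
    """Send the same prompt to different Bedrock model families."""
    if model_id.startswith("anthropic."):
        body = {"prompt": f"\n\nHuman: {prompt}\n\nAssistant:",
                "max_tokens_to_sample": max_tokens}
    elif model_id.startswith("ai21."):          # e.g. ai21.j2-ultra-v1 (assumed ID)
        body = {"prompt": prompt, "maxTokens": max_tokens}
    else:
        raise ValueError(f"No adapter for {model_id}")

    raw = bedrock.invoke_model(modelId=model_id, body=json.dumps(body))
    payload = json.loads(raw["body"].read())

    if model_id.startswith("anthropic."):
        return payload["completion"]
    return payload["completions"][0]["data"]["text"]   # assumed AI21 J2 shape
```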

A new horizon in media processing

Our project represents a significant leap in media processing, offering an AI-powered solution to distil the essence of lengthy media files into captivating, concise demos. By integrating advanced AI techniques and user-centric functionalities, we are not only transforming the way we consume media but also setting new standards in content processing technology.