Apologies for not writing proper commit messages.
In short, this fork removes redundant code and model dependencies, which lowers VRAM usage and the model storage/download burden. It also adds a Gradio interface for uploading audio/video and for previewing/downloading the target and separated audio/video; a minimal sketch of such an interface follows.
Only English text prompts are supported; span prompts and visual prompts are not, since those dependencies were removed.
That's about it, and it works.
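The notes above mention a Gradio UI; the following is a minimal sketch of what such an interface could look like for the audio-plus-English-text-prompt path. The layout and the separate() wrapper here are illustrative, not the shipped app code; they simply reuse the SAM-Audio calls documented below.

import gradio as gr
import torch
import torchaudio
from sam_audio import SAMAudio, SAMAudioProcessor

# Load the model once at startup (same calls as in the examples below)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SAMAudio.from_pretrained("facebook/sam-audio-large").to(device).eval()
processor = SAMAudioProcessor.from_pretrained("facebook/sam-audio-large")

def separate(audio_path, description):
    # Text-prompted separation, then write both tracks to disk for preview/download
    inputs = processor(audios=[audio_path], descriptions=[description]).to(device)
    with torch.inference_mode():
        result = model.separate(inputs)
    sr = processor.audio_sampling_rate
    torchaudio.save("target.wav", result.target[0].unsqueeze(0).cpu(), sr)
    torchaudio.save("residual.wav", result.residual[0].unsqueeze(0).cpu(), sr)
    return "target.wav", "residual.wav"

demo = gr.Interface(
    fn=separate,
    inputs=[gr.Audio(type="filepath", label="Input audio"),
            gr.Textbox(label="English text prompt")],
    outputs=[gr.Audio(label="Target"), gr.Audio(label="Residual")],
)
demo.launch()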
license: other
license_name: sam-license
license_link: LICENSE
extra_gated_fields:
  First Name: text
  Last Name: text
  Date of birth: date_picker
  Country: country
  Affiliation: text
  Job title:
    type: select
    options:
      - Student
      - Research Graduate
      - AI researcher
      - AI developer/engineer
      - Reporter
      - Other
  geo: ip_location
  By clicking Submit below I accept the terms of the license and acknowledge that the information I provide will be collected stored processed and shared in accordance with the Meta Privacy Policy: checkbox
extra_gated_description: >-
  The information you provide will be collected, stored, processed and shared in
  accordance with the Meta Privacy Policy.
extra_gated_button_content: Submit
language:
SAM-Audio is a model for isolating any sound in audio using text, visual, or temporal prompts. It can separate specific sounds from complex audio mixtures based on natural language descriptions, visual cues from video, or time spans.
SAM-Audio supports three types of prompting: text, visual, and span. Each method lets you specify which sounds to isolate in a different way.
Before using SAM-Audio, log in with your Hugging Face credentials:

huggingface-cli login
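If you prefer to authenticate from Python rather than the shell, the huggingface_hub library provides an equivalent login() helper:

from huggingface_hub import login

login()  # prompts for a Hugging Face access token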
Use natural language descriptions to isolate sounds.
import torch
import torchaudio
from sam_audio import SAMAudio, SAMAudioProcessor
# Load model and processor
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SAMAudio.from_pretrained("facebook/sam-audio-large").to(device).eval()
processor = SAMAudioProcessor.from_pretrained("facebook/sam-audio-large")
# Load audio file
audio_file = "path/to/audio.wav"
# Describe the sound you want to isolate
description = "A man speaking"
# Process and separate
inputs = processor(audios=[audio_file], descriptions=[description]).to(device)
with torch.inference_mode():
    result = model.separate(inputs)
# Save results
torchaudio.save("target.wav", result.target[0].unsqueeze(0).cpu(), processor.audio_sampling_rate)
torchaudio.save("residual.wav", result.residual[0].unsqueeze(0).cpu(), processor.audio_sampling_rate)
Text descriptions are short English phrases such as "A man speaking" or "A horn honking".
Isolate sounds associated with specific visual objects in a video using masked video frames.
import torch
import numpy as np
from sam_audio import SAMAudio, SAMAudioProcessor
from torchcodec.decoders import VideoDecoder
# NOTE: Requires SAM3 for creating masks
# pip install git+https://github.com/facebookresearch/sam3.git
from sam3.model_builder import build_sam3_video_predictor
# Load SAM-Audio model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SAMAudio.from_pretrained("facebook/sam-audio-large").to(device).eval()
processor = SAMAudioProcessor.from_pretrained("facebook/sam-audio-large")
# Load video
video_file = "path/to/video.mp4"
decoder = VideoDecoder(video_file)
frames = decoder[:]
# Create mask using SAM3 (example with text prompt)
video_predictor = build_sam3_video_predictor()
response = video_predictor.handle_request({
    "type": "start_session",
    "resource_path": video_file,
})
session_id = response["session_id"]
masks = []
for frame_index in range(len(decoder)):
    response = video_predictor.handle_request({
        "type": "add_prompt",
        "session_id": session_id,
        "frame_index": frame_index,
        "text": "The person on the left",  # Visual object to isolate
    })
    mask = response["outputs"]["out_binary_masks"]
    if mask.shape[0] == 0:
        # No mask returned for this frame: fall back to an all-False mask
        mask = np.zeros_like(frames[0, [0]], dtype=bool)
    masks.append(mask[:1])
mask = torch.from_numpy(np.concatenate(masks)).unsqueeze(1)
# Process with visual prompting
inputs = processor(
    audios=[video_file],
    descriptions=[""],  # no text description; the masked video serves as the prompt
    masked_videos=processor.mask_videos([frames], [mask]),
).to(device)
with torch.inference_mode():
    result = model.separate(inputs)
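The returned result has the same structure as in the text-prompting example, so the separated tracks can be saved the same way (this assumes the result and processor variables from the snippet above):

import torchaudio

# Save the isolated track and the remainder
torchaudio.save("target.wav", result.target[0].unsqueeze(0).cpu(), processor.audio_sampling_rate)
torchaudio.save("residual.wav", result.residual[0].unsqueeze(0).cpu(), processor.audio_sampling_rate)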
Specify time ranges where the target sound does or does not occur. This gives the model a concrete example of what to isolate.
import torch
import torchaudio
from sam_audio import SAMAudio, SAMAudioProcessor
# Load model and processor
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SAMAudio.from_pretrained("facebook/sam-audio-large").to(device).eval()
processor = SAMAudioProcessor.from_pretrained("facebook/sam-audio-large")
# Load audio file
audio_file = "path/to/audio.wav"
# Define anchors: [type, start_time, end_time]
# "+" means the sound IS present in this time range
# "-" means the sound is NOT present in this time range
anchors = [
    ["+", 6.3, 7.0],  # Sound occurs between 6.3 and 7.0 seconds
]
# Process with span prompting
inputs = processor(
    audios=[audio_file],
    descriptions=["A horn honking"],
    anchors=[anchors],
).to(device)
with torch.inference_mode():
    result = model.separate(inputs)
Example with multiple anchors:
anchors = [
    ["+", 2.0, 3.5],  # Sound present from 2.0 to 3.5 seconds
    ["+", 8.0, 9.0],  # Sound present from 8.0 to 9.0 seconds
    ["-", 0.0, 1.0],  # Sound NOT present from 0.0 to 1.0 seconds
]
The model.separate() method returns a result object with:
result.target: The isolated sound (what you asked for)
result.residual: Everything else (the remainder)

Both are list[torch.Tensor], where each tensor is a 1D waveform.
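Because both fields are plain Python lists of 1D waveforms, a batch of results can be written out with a simple loop; a minimal sketch, assuming result and processor from the examples above:

import torchaudio

# Each element is a 1D waveform, so add a channel dimension before saving
for i, (target, residual) in enumerate(zip(result.target, result.residual)):
    torchaudio.save(f"target_{i}.wav", target.unsqueeze(0).cpu(), processor.audio_sampling_rate)
    torchaudio.save(f"residual_{i}.wav", residual.unsqueeze(0).cpu(), processor.audio_sampling_rate)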
If you use SAM-Audio in your research, please cite:
@article{sam-audio,
  title={SAM-Audio: Segment Anything in Audio},
  author={Bowen Shi and Andros Tjandra and John Hoffman and Helin Wang and Yi-Chiao Wu and Luya Gao and Julius Richter and Matt Le and Apoorv Vyas and Sanyuan Chen and Christoph Feichtenhofer and Piotr Dollár and Wei-Ning Hsu and Ann Lee},
  year={2025},
  url={arxiv link coming soon}
}
This project is licensed under the SAM License. See the LICENSE file for details.