Shipfeed. AI News Channel

50 latest items

▶ gpt·02:00

Kimi K3 vs GPT-5.6 Sol on DeepSWE: Cost, Coding, and Routing

We ran 904 DeepSWE rollouts on Kimi K3 and GPT-5.6 Sol. Sol leads pass@1; Kimi K3 wins pass@4 at 2.8x the solves per dollar, and routing between them reaches ~85.6%.

Together AI — Blog
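
The pass@1 and pass@4 figures in the item above can be read through the commonly used unbiased pass@k estimator. The sketch below is illustrative only: the blurb does not say how the post computed its scores, and the function name and the sample counts are assumptions, not values from the post.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n sampled rollouts, c of which passed.

    Uses the standard unbiased estimator 1 - C(n - c, k) / C(n, k).
    Assumption: this is a generic formula, not necessarily the method
    used in the benchmark post.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # If fewer than k rollouts failed, every size-k draw contains a pass.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical task: 4 rollouts, 1 of them solved the task.
print(pass_at_k(n=4, c=1, k=1))  # chance that a single draw is a pass
print(pass_at_k(n=4, c=1, k=4))  # chance that at least one of 4 draws passes
```

Averaging this per-task value over all tasks gives the benchmark-level score. A model can trail on pass@1 and still lead on pass@4 when its successes are spread across more tasks.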

▶ claude·02:00

Kimi K3 vs Claude Fable 5 on DeepSWE: Cost and Coding

We ran 452 DeepSWE rollouts on Kimi K3 and Claude Fable 5. Fable leads pass@1 by 1.4 points; Kimi K3 wins pass@4 and delivers 2.8x the solves per dollar.

Together AI — Blog

▶ ai·02:00

The production platform for open-weight AI inference

Run open models in production with full control over performance, cost, and quality. Deploy in minutes, roll out safely, and scale to your SLOs.

Together AI — Blog

▶ ai·02:00

Together AI and Y Combinator partner to launch the first dedicated GPU cluster for the YC community

No more two-year compute contracts. Together AI and YC just gave YC startups a faster way to get GPUs.

Together AI — Blog

▶ ai·02:00

What does 99.9% uptime mean for inference?

Reliability numbers are easy to publish. We break down what 99%, 99.9%, and 99.99% uptime actually require, the failure domains each tier has to survive, and the questions to ask any inference provider before you commit.

Together AI — Blog
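
For a sense of scale, the three uptime tiers named in the item above translate into downtime budgets with simple arithmetic. This is a minimal sketch that assumes a 365-day year and counts every minute of unavailability against the budget; real SLAs define measurement windows and exclusions differently, and the function name is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability_pct: float,
                            period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the minutes of downtime allowed in a period at a given availability."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return period_minutes * (1 - availability_pct / 100)


for pct in (99.0, 99.9, 99.99):
    minutes = downtime_budget_minutes(pct)
    print(f"{pct}% uptime allows about {minutes:.1f} min/year "
          f"({minutes / 60:.2f} hours)")
```

Each added nine cuts the budget by a factor of ten: roughly 87.6 hours per year at 99%, 8.76 hours at 99.9%, and under an hour at 99.99%.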

▶ ai·02:00

New in Together GPU Clusters: Reliability and control for production GPU clusters

See how Together AI is improving production GPU clusters with passive health checks, node repair, stronger Slurm reliability, OIDC, and startup scripts.

Together AI — Blog

▶ ai·02:00

Together AI brings Thinking Machines Lab’s new model Inkling on day 0

Together AI — Blog

▶ ai·02:00

Open, convenient and predictable: Introducing Provisioned Throughput

Provisioned Throughput gives you reserved inference capacity for frontier open models like MiniMax M3 and GLM-5.2. Token-based pricing, a 99% uptime SLA, and up to 90% lower cost than proprietary APIs. No GPU-hour…

Together AI — Blog

▶ ai·02:00

Announcing our $800M Series C to accelerate the shift to open-source AI

We raised $800M to accelerate the shift to open-source AI. Here's why the economics of closed models don't scale, and what we're building next.

Together AI — Blog

▶ ai·02:00

Together AI at ICML 2026: frontier research across the full stack

Eight papers at ICML 2026 across the full stack. The research that becomes the Together platform. Find us at booth B714 in Seoul.

Together AI — Blog

▶ ai·02:00

ParallelKernelBench: Frontier LLMs can't write fast multi-GPU kernels (yet)

ParallelKernelBench tests whether LLMs can write fast multi-GPU CUDA kernels across 87 real workloads. The best model solves under a third, but a few generated kernels beat any public implementation.

Together AI — Blog

▶ claude·02:00

Kimi K2.7 Code vs Claude Fable 5: Landing pages that cost 94% less

We generated 12 landing pages with Kimi K2.7 Code and Claude Fable 5. Kimi cost 94% less and scored within a few points on every page. Here's what actually moved the needle.

Together AI — Blog

▶ ai·02:00

Building trust in enterprise AI: Together AI earns ISO 27001:2022 certification

Together AI has earned ISO 27001:2022 certification, validating our commitment to enterprise-grade security for production AI workloads.

Together AI — Blog

▶ ai·02:00

Serving MiniMax-M3 for efficient inference: Unlocking 1M-Token Context and Multimodality Without Regrets

How Together served MiniMax-M3 efficiently with KV-block-major sparse attention, paged MSA decode, optimized index scoring, and a Rust-based multimodal gateway.

Together AI — Blog

▶ ai·02:00

How Together AI built the world’s fastest speech-to-text stack

Together AI built the fastest speech-to-text stack on Artificial Analysis by treating ASR as a full-path systems problem, not just a GPU inference problem.

Together AI — Blog

▶ ai·02:00

Benchmarking inference at scale: coding agents

Real-world inference benchmarks for coding agents: 31% more TPS than TensorRT-LLM, 2× better TTFT at saturation, and 76% lower cost than Claude Opus 4.6.

Together AI — Blog

▶ ai·02:00

Together AI and Pearl Research Labs Team Up to Reduce the Cost of AI Inference

Together AI partners with Pearl Research Labs to launch a discounted Pearl-powered inference endpoint for Gemma-4-31B-it-pearl, using Proof of Useful Work to turn AI workloads into crypto emissions.

Together AI — Blog

▶ ai·02:00

Violin: An open-source video translation skill that breaks language barriers

Violin is an open-source AI video translation tool that combines speech recognition, LLM translation, and text-to-speech to make video content accessible across languages.

Together AI — Blog

▶ ai·02:00

Introducing voice finder — a new tool to quickly find the right voice for your app from 600+ voices

Voice finder helps developers search, match, filter, and audition 600+ voices across Together AI TTS models using natural-language prompts or uploaded audio samples.

Together AI — Blog

▶ deepseek·02:00

Serving DeepSeek-V4: why million-token context is an inference systems problem

DeepSeek-V4 makes million-token context a serving-systems problem. Together AI explores the inference work behind V4 on NVIDIA HGX B200, including compressed KV layouts, prefix caching, kernel maturity, and endpoint…

Together AI — Blog

▶ ai·02:00

Deploy and run inference on any model from Hugging Face

Learn how to deploy any Hugging Face model in one session using Goose and Together's Dedicated Container Inference. Skip the setup complexity — one prompt gets your model running in a production-grade GPU environment…

Together AI — Blog

▶ ai·02:00

Foundational research powering efficient inference at scale

As AI moves from research to production, the challenge for AI-native teams shifts from building models to running them — efficiently, reliably, and at scale.

Together AI — Blog

▶ ai·02:00

Announcing Together AI and Adaption Partnership

Together AI and Adaption partner to bring Together Fine-Tuning natively into Adaptive Data, helping teams optimize datasets, run fine-tuning, evaluate results, and deploy stronger open models.

Together AI — Blog

▶ ai·02:00

From 732 bytes to nowhere: shutting down Copy Fail in production

Together AI — Blog

▶ deepseek·02:00

DeepSeek-V4 Pro now available on Together AI

DeepSeek-V4 Pro is now available on Together AI with 512K context, controllable reasoning modes, and cached-input pricing for long-context reasoning workloads like code agents, document intelligence, and research…

Together AI — Blog

▶ nemotron·02:00

Together AI Brings NVIDIA Nemotron 3 Nano Omni to Developers on Day 0

NVIDIA Nemotron 3 Nano Omni is now on Together AI: a single open model that reasons across video, images, audio, and text, built for agentic workloads at scale.

Together AI — Blog

▶ ai·02:00

Accelerate RL rollouts by up to 50% with distribution-aware speculative decoding

Rollout is the silent bottleneck in RL post-training. DAS fixes it with adaptive speculative decoding — up to 50% faster, zero degradation in reward quality.

Together AI — Blog

▶ ai·02:00

Capacity without conflict: A guide to multi-tenant GPU cluster design for AI-native teams

Learn how AI-native companies design multi-tenant GPU clusters that pool capacity without sacrificing team isolation — and how Together AI makes it work in practice.

Together AI — Blog

▶ ai·02:00

Parcae: Doing more with fewer parameters using stable looped models

Parcae is a stable looped language model that matches the quality of a Transformer twice its size — a 770M model reaching 1.3B-level performance. We introduce the first scaling laws for looping and show that increasing…

Together AI — Blog

▶ ai·02:00

EinsteinArena: Harnessing the collective intelligence of agents in the wild to advance science

EinsteinArena is a platform where AI agents collaborate and compete on open math problems. AI agents on EinsteinArena have already set 11 new state-of-the-art results on open math problems — including pushing the…

Together AI — Blog

▶ ai·02:00

What is an AI Native Cloud?

AI-native companies need infrastructure built for models, not legacy workloads. Learn what defines an AI Native Cloud and why it matters for the next platform shift.

Together AI — Blog

▶ ai·02:00

Wan 2.7 video model suite now available on Together AI

A four-model video suite for generation, continuation, reference-driven workflows, and editing, rolling out on Together AI starting with text-to-video.

Together AI — Blog

▶ ai·02:00

AI for Systems: Using LLMs to Optimize Database Query Execution

New research shows LLMs can optimize database query execution plans—achieving up to 4.78x speedups by correcting the cardinality estimation errors that statistical heuristics miss.

Together AI — Blog

▶ ai·02:00

Deepgram speech-to-text and voice models now available natively on Together AI

Production STT and TTS from Deepgram, available on Together AI Dedicated Model Inference for real-time voice agents.

Together AI — Blog

▶ ai·02:00

Inside the Together AI kernels team

The team behind FlashAttention and ThunderKittens — how Together AI's kernel researchers close the gap between GPU hardware and production AI.

Together AI — Blog

▶ ai·02:00

Aurora

1.25x over a well-trained static speculator. Aurora is an open-source RL framework that turns speculative decoding from a one-time offline setup into a self-improving system that learns from every request it serves.

Together AI — Blog

▶ ai·01:00

Plan, divide, and conquer: How weak models excel at long context tasks

As context windows grow, LLM performance degrades in unexpected ways. We show how a "Divide & Conquer" framework — breaking long documents into parallel chunks with a planner, workers, and manager — lets smaller models…

Together AI — Blog

▶ ai·01:00

Together AI expands fine-tuning service with tool calling, reasoning, and vision support

Together AI expands fine-tuning with native support for tool calling, reasoning, and vision-language models, plus 100B+ model training, up to 6× higher throughput, and job cost and ETA estimates.

Together AI — Blog

▶ ai·01:00

Mamba-3

Meet Mamba-3: the SSM built for inference. Faster than Transformers at decode, stronger than Mamba-2, and open-source from day one.

Together AI — Blog

▶ ai·01:00

Together AI at NVIDIA GTC 2026: Explore our latest innovations across research and products

Together AI arrives at NVIDIA GTC 2026 with new launches in inference, agents, voice AI, and open models — plus technical sessions from its research and engineering leaders.

Together AI — Blog

▶ ai·01:00

Build real-time voice agents on Together AI

Build real-time voice agents on Together AI with co-located STT, LLM, and TTS infrastructure, native Deepgram and Cartesia support, and end-to-end latency under 500ms.

Together AI — Blog

▶ nemotron·01:00

Together AI Brings NVIDIA Nemotron 3 to Developers on Day 0

NVIDIA Nemotron 3 Super is now available on Together AI Dedicated Inference, delivering efficient multi-agent reasoning, a 1M-token context window, and production-grade deployment on managed infrastructure.

Together AI — Blog

▶ ai·01:00

New in Together GPU Clusters: Autoscaling, observability, and self-healing

Together GPU Clusters now include built-in autoscaling, RBAC, full-stack observability, and self-healing node repair—giving teams production-ready GPU infrastructure that scales efficiently, stays resilient, and…

Together AI — Blog

▶ ai·01:00

Key research and product announcements at the AI Native Conf

At AI Native Conf, Together AI announced breakthroughs across kernels, RL, and inference optimization — including FlashAttention-4, ThunderAgent, and together.compile. Research that ships to production. That's the AI…

Together AI — Blog

▶ ai·01:00

FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling

As GPU throughput outpaces memory bandwidth, kernels must evolve. We introduce FlashAttention-4, featuring new pipelining for maximum overlap, 2-CTA MMA modes to reduce shared memory traffic, and a hardware-software…

Together AI — Blog

▶ ai·01:00

Cache-aware prefill–decode disaggregation (CPD) for up to 40% faster long-context LLM serving

Serving long prompts doesn't have to mean slow responses. Learn how Together AI's CPD architecture separates warm and cold inference workloads to deliver 40% higher throughput and dramatically lower time-to-first-token…

Together AI — Blog

▶ ai·01:00

Introducing Together AI’s new look

We've refreshed our visual identity — designed with Pentagram to express how Together AI connects open-source innovation, systems research, and builders to unlock new possibilities.

Together AI — Blog

▶ ai·01:00

CoderForge-Preview: SOTA open dataset for training efficient coding agents

Together AI — Blog

▶ ai·01:00

How speech models fail where it matters the most and what to do about it

State-of-the-art speech models like Whisper and Deepgram score near-human on benchmarks — then fail 39% of the time on street names. New research from Together AI exposes the gap and a fix.

Together AI — Blog

▶ ai·01:00

Consistency diffusion language models: Up to 14x faster inference without sacrificing quality

Standard diffusion language models can't use KV caching and need too many refinement steps to be practical. CDLM fixes both with a post-training recipe that enables exact block-wise KV caching and trajectory-consistent…

Together AI — Blog