Using the Rust GPU compute crate cubecl-wgpu: high-performance GPU compute and render acceleration built on WGPU
CubeCL is a multi-platform high-performance compute language extension for Rust. It lets you write GPU programs in Rust, using zero-cost abstractions to build maintainable, flexible, and efficient compute kernels.
Supported platforms
Platform | Runtime | Compiler | Hardware
---|---|---|---
WebGPU | wgpu | WGSL | Most GPUs
CUDA | CUDA | C++ (CUDA) | NVIDIA GPUs
ROCm | HIP | C++ (HIP) | AMD GPUs
Metal | wgpu | MSL | Apple GPUs
Vulkan | wgpu | SPIR-V | Most GPUs on Linux and Windows
Example
Below is a complete matrix-multiplication example using cubecl-wgpu:
use cubecl::prelude::*;
use cubecl_wgpu::WgpuRuntime;

// Matrix-multiplication GPU kernel: each unit computes one output element
#[cube(launch_unchecked)]
fn matrix_multiply<F: Float>(
    a: &Array<F>,
    b: &Array<F>,
    output: &mut Array<F>,
    m: u32,
    n: u32,
    p: u32,
) {
    let row = ABSOLUTE_POS / p;
    let col = ABSOLUTE_POS % p;
    if row < m && col < p {
        let mut sum = F::new(0.0);
        for k in 0..n {
            // Row-major indexing: a is m x n, b is n x p
            let a_idx = row * n + k;
            let b_idx = k * p + col;
            sum = sum + a[a_idx] * b[b_idx];
        }
        output[ABSOLUTE_POS] = sum;
    }
}
fn main() {
    // Create a compute client for the default WGPU device;
    // CubeCL sets up the wgpu instance, adapter, and device internally.
    let client = WgpuRuntime::client(&Default::default());
    // Matrix dimensions: (2x3) * (3x2) = (2x2)
    let m: u32 = 2;
    let n: u32 = 3;
    let p: u32 = 2;
    // Input data, row-major
    let a = [1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0]; // 2x3 matrix
    let b = [7.0f32, 8.0, 9.0, 10.0, 11.0, 12.0]; // 3x2 matrix
    let vectorization: u8 = 1; // the kernel indexes scalars, so no vectorization
    // Allocate GPU buffers
    let output_size = (m * p) as usize;
    let output_handle = client.empty(output_size * core::mem::size_of::<f32>());
    let a_handle = client.create(f32::as_bytes(&a));
    let b_handle = client.create(f32::as_bytes(&b));
    // Launch the kernel: a single cube with one unit per output element
    unsafe {
        matrix_multiply::launch_unchecked::<f32, WgpuRuntime>(
            &client,
            CubeCount::Static(1, 1, 1),
            CubeDim::new(output_size as u32, 1, 1),
            ArrayArg::from_raw_parts::<f32>(&a_handle, a.len(), vectorization),
            ArrayArg::from_raw_parts::<f32>(&b_handle, b.len(), vectorization),
            ArrayArg::from_raw_parts::<f32>(&output_handle, output_size, vectorization),
            ScalarArg::new(m),
            ScalarArg::new(n),
            ScalarArg::new(p),
        )
    };
    // Read the result back to the CPU
    let bytes = client.read_one(output_handle.binding());
    let output = f32::from_bytes(&bytes);
    println!("Matrix multiplication result: {:?}", output);
    // Should print [58.0, 64.0, 139.0, 154.0]
}
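To sanity-check the GPU result, the same row-major matrix multiply can be mirrored on the CPU. This is a plain-Rust reference with no CubeCL dependency, using the same indexing scheme as the kernel:

```rust
// CPU reference for the row-major matrix multiply above:
// a is m x n, b is n x p, the result is m x p.
fn matmul_cpu(a: &[f32], b: &[f32], m: usize, n: usize, p: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; m * p];
    for row in 0..m {
        for col in 0..p {
            let mut sum = 0.0;
            for k in 0..n {
                // Same indexing as the kernel: row * n + k and k * p + col
                sum += a[row * n + k] * b[k * p + col];
            }
            out[row * p + col] = sum;
        }
    }
    out
}

fn main() {
    let a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]; // 2x3
    let b = [7.0, 8.0, 9.0, 10.0, 11.0, 12.0]; // 3x2
    let result = matmul_cpu(&a, &b, 2, 3, 2);
    println!("{:?}", result); // [58.0, 64.0, 139.0, 154.0]
    assert_eq!(result, vec![58.0, 64.0, 139.0, 154.0]);
}
```

Comparing this reference output against the bytes read back from the GPU is an easy way to catch indexing mistakes in the kernel.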
Key features
CubeCL provides several key features:
- Automatic vectorization: specify a vectorization factor for each input, and the runtime compiles the kernel to use the best instructions available
- Comptime: constants can be computed in plain Rust at runtime and injected into the kernel when it is compiled
- Autotuning: small benchmarks are run to pick the fastest kernel and configuration for the current hardware
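To build intuition for what a vectorization factor means, here is a CPU-side sketch (plain Rust, no CubeCL; the `as_lines` helper is purely illustrative) of how a factor of 4 regroups a flat scalar buffer into the "lines" that a vectorized kernel indexes, so one kernel invocation handles 4 scalars at once:

```rust
// A vectorization factor of 4 views a flat f32 buffer as "lines" of
// 4 scalars; each GPU unit then reads/writes one whole line per index.
fn as_lines(data: &[f32], factor: usize) -> Vec<Vec<f32>> {
    assert!(
        data.len() % factor == 0,
        "length must be divisible by the vectorization factor"
    );
    data.chunks(factor).map(|c| c.to_vec()).collect()
}

fn main() {
    let data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
    let lines = as_lines(&data, 4);
    // 8 scalars become 2 lines: indices 0 and 1 in the vectorized view
    println!("{} lines of width 4", lines.len());
    assert_eq!(lines.len(), 2);
    assert_eq!(lines[0], vec![1.0, 2.0, 3.0, 4.0]);
}
```

This is why a vectorized kernel's bounds check compares `ABSOLUTE_POS` against the line count, not the scalar count.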
Installation
Add the following to your Cargo.toml:
[dependencies]
cubecl-wgpu = "0.6.0"
or run:
cargo add cubecl-wgpu
Caveats
CubeCL is currently in alpha. It is already used in Burn, but many areas still need improvement.
1 Reply
cubecl-wgpu: high-performance GPU compute and render acceleration built on WGPU
Complete demo
Below is a complete example combining compute and rendering, showing how to run a GPU computation and visualize the result. Note: the `ComputeContext`, `ComputePipeline`, and `RenderPipeline` wrapper types used here are illustrative helpers layered over wgpu, not part of the published cubecl-wgpu API; the same flow can be written directly against wgpu.
use cubecl_wgpu::{ComputeContext, ComputeContextDescriptor, ComputePipeline, ComputePipelineDescriptor};
use cubecl_wgpu::{RenderPipeline, RenderPipelineDescriptor};
use winit::{
event::{Event, WindowEvent},
event_loop::{ControlFlow, EventLoop},
window::WindowBuilder,
};
async fn run_example() {
// Create the window and event loop
let event_loop = EventLoop::new();
let window = WindowBuilder::new()
.with_title("cubecl-wgpu example")
.build(&event_loop)
.unwrap();
// Initialize the compute context
let instance = wgpu::Instance::new(wgpu::Backends::all());
let surface = unsafe { instance.create_surface(&window) };
let adapter = instance
.request_adapter(&wgpu::RequestAdapterOptions {
power_preference: wgpu::PowerPreference::default(),
compatible_surface: Some(&surface),
force_fallback_adapter: false,
})
.await
.unwrap();
let descriptor = ComputeContextDescriptor {
device_features: wgpu::Features::empty(),
device_limits: wgpu::Limits::default(),
};
let compute_context = ComputeContext::new(adapter, descriptor).await.unwrap();
// Configure the surface (swap chain)
let config = wgpu::SurfaceConfiguration {
usage: wgpu::TextureUsages::RENDER_ATTACHMENT,
format: surface.get_supported_formats(&compute_context.adapter)[0],
width: 800,
height: 600,
present_mode: wgpu::PresentMode::Fifo,
};
surface.configure(&compute_context.device, &config);
// Create the compute pipeline
let compute_shader = r#"
@group(0) @binding(0)
var<storage, read_write> data: array<f32>;
@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) id: vec3<u32>) {
    let idx = id.x;
    // Guard: 2 workgroups x 64 = 128 invocations, but only 100 elements
    if (idx < arrayLength(&data)) {
        data[idx] = sin(f32(idx) / 10.0) * 0.5 + 0.5;
    }
}
"#;
let compute_pipeline = ComputePipeline::new(
&compute_context.device,
ComputePipelineDescriptor {
shader_source: compute_shader,
entry_point: "main",
label: Some("wave_compute_pipeline"),
},
);
// Create the render pipeline
let render_pipeline = RenderPipeline::new(
&compute_context.device,
RenderPipelineDescriptor {
shader_source: r#"
struct VertexOutput {
@builtin(position) position: vec4<f32>,
@location(0) color: vec4<f32>,
};
@group(0) @binding(0)
var<storage> data: array<f32>;
@vertex
fn vs_main(@builtin(vertex_index) vertex_index: u32) -> VertexOutput {
var output: VertexOutput;
let x = f32(vertex_index) / 100.0 - 0.5;
let y = data[vertex_index % 100] - 0.5;
output.position = vec4<f32>(x, y, 0.0, 1.0);
output.color = vec4<f32>(1.0, 0.0, 0.0, 1.0);
return output;
}
@fragment
fn fs_main(input: VertexOutput) -> @location(0) vec4<f32> {
return input.color;
}
"#,
vertex_entry: "vs_main",
fragment_entry: Some("fs_main"),
render_targets: vec![config.format],
label: Some("wave_render_pipeline"),
},
);
// Create the shared data buffer
let buffer_size = (100 * std::mem::size_of::<f32>()) as wgpu::BufferAddress;
let data_buffer = compute_context.device.create_buffer(&wgpu::BufferDescriptor {
label: Some("data_buffer"),
size: buffer_size,
usage: wgpu::BufferUsages::STORAGE | wgpu::BufferUsages::COPY_DST | wgpu::BufferUsages::COPY_SRC,
mapped_at_creation: false,
});
// Create the compute bind group
let bind_group_layout = compute_pipeline.get_bind_group_layout(0);
let bind_group = compute_context.device.create_bind_group(&wgpu::BindGroupDescriptor {
label: Some("compute_bind_group"),
layout: &bind_group_layout,
entries: &[wgpu::BindGroupEntry {
binding: 0,
resource: data_buffer.as_entire_binding(),
}],
});
// Render bind group, sharing the same buffer
let render_bind_group = compute_context.device.create_bind_group(&wgpu::BindGroupDescriptor {
label: Some("render_bind_group"),
layout: &render_pipeline.get_bind_group_layout(0),
entries: &[wgpu::BindGroupEntry {
binding: 0,
resource: data_buffer.as_entire_binding(),
}],
});
// Event loop
event_loop.run(move |event, _, control_flow| {
*control_flow = ControlFlow::Poll;
match event {
Event::RedrawRequested(window_id) if window_id == window.id() => {
// Create a command encoder
let mut encoder = compute_context.device.create_command_encoder(
&wgpu::CommandEncoderDescriptor {
label: Some("frame_encoder"),
},
);
// Run the compute pass
{
let mut compute_pass = encoder.begin_compute_pass(
&wgpu::ComputePassDescriptor {
label: Some("compute_pass"),
},
);
compute_pass.set_pipeline(&compute_pipeline);
compute_pass.set_bind_group(0, &bind_group, &[]);
compute_pass.dispatch_workgroups(2, 1, 1); // 2 workgroups x 64 threads cover the 100 data points
}
// Acquire the current frame
let frame = surface.get_current_texture().unwrap();
let view = frame.texture.create_view(&wgpu::TextureViewDescriptor::default());
// Run the render pass
{
let mut render_pass = encoder.begin_render_pass(
&wgpu::RenderPassDescriptor {
label: Some("render_pass"),
color_attachments: &[Some(wgpu::RenderPassColorAttachment {
view: &view,
resolve_target: None,
ops: wgpu::Operations {
load: wgpu::LoadOp::Clear(wgpu::Color::BLACK),
store: true,
},
})],
depth_stencil_attachment: None,
},
);
render_pass.set_pipeline(&render_pipeline);
render_pass.set_bind_group(0, &render_bind_group, &[]);
render_pass.draw(0..100, 0..1); // draw 100 vertices
}
// Submit the commands and present the frame
compute_context.queue.submit(Some(encoder.finish()));
frame.present();
}
Event::WindowEvent {
event: WindowEvent::CloseRequested,
window_id,
} if window_id == window.id() => {
*control_flow = ControlFlow::Exit;
}
Event::MainEventsCleared => {
window.request_redraw();
}
_ => {}
}
});
}
fn main() {
futures::executor::block_on(run_example());
}
Example walkthrough
- Initialization: create the window and WGPU context, and configure the surface
- Compute pipeline: a GPU compute pipeline that evaluates a sine wave
- Render pipeline: a pipeline that reads the computed data and draws the curve
- Data buffer: a shared buffer that passes data from compute to rendering
- Bind groups: separate bind groups for compute and rendering, both pointing at the same buffer
- Main loop: each frame runs the computation and then renders, visualizing the result
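The compute shader's output can be reproduced on the CPU to see what values end up in the shared buffer. This plain-Rust mirror of the WGSL expression `sin(f32(idx) / 10.0) * 0.5 + 0.5` shows why the vertex shader subtracts 0.5 afterwards:

```rust
// CPU mirror of the WGSL compute shader: data[idx] = sin(idx / 10) * 0.5 + 0.5,
// which maps the sine wave from [-1, 1] into [0, 1] for use as a y coordinate.
fn wave(len: usize) -> Vec<f32> {
    (0..len)
        .map(|idx| (idx as f32 / 10.0).sin() * 0.5 + 0.5)
        .collect()
}

fn main() {
    let data = wave(100);
    // Every value lies in [0, 1], so after the vertex shader's `- 0.5`
    // the curve fits vertically inside clip space.
    assert!(data.iter().all(|&v| (0.0..=1.0).contains(&v)));
    assert_eq!(data[0], 0.5); // sin(0) = 0, so the wave starts at 0.5
    println!("first values: {:?}", &data[..3]);
}
```
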
This example shows how to:
- create compute and render pipelines alongside cubecl-wgpu
- share data between compute and rendering
- visualize GPU compute results in real time
- build a complete GPU compute + rendering application