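Using the Rust hdf5 crate: reading and writing the HDF5 format for high-performance scientific data storage and analysis

Introduction

The hdf5 crate (formerly hdf5-rs) provides thread-safe Rust bindings and high-level wrappers for the HDF5 library API. Main features:

Thread safety, enforced through reentrant mutexes
Native support for most HDF5 types, including variable-length strings and arrays
Derive macros that automatically map user structs and enums to HDF5 types
Multi-dimensional array reading and writing through ndarray

Direct low-level bindings are also available through the hdf5-sys crate.

Requires HDF5 library version 1.8.4 or later.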
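
The derive-based mapping works on types with a fixed memory layout, which is why the examples in this thread mark structs `#[repr(C)]` and give enums an explicit integer `repr`. The following std-only sketch shows what that layout guarantee looks like. `FlatPixel` and `color_offset` are hypothetical names used for illustration, and nothing here calls into hdf5.

```rust
use std::mem::size_of;

// One-byte enum with explicit discriminants, as in the thread's example.
#[derive(Clone, Copy, Debug, PartialEq)]
#[repr(u8)]
pub enum Color {
    R = 1,
    G = 2,
    B = 3,
}

// Hypothetical stand-in for a compound record: with #[repr(C)] the fields
// are laid out in declaration order at predictable offsets.
#[derive(Clone, Copy, Debug, PartialEq)]
#[repr(C)]
pub struct FlatPixel {
    pub x: i64,
    pub y: i64,
    pub color: Color,
}

// Byte offset of the `color` field from the start of the struct.
pub fn color_offset() -> usize {
    let p = FlatPixel { x: 0, y: 0, color: Color::R };
    let base = &p as *const FlatPixel as usize;
    let field = &p.color as *const Color as usize;
    field - base
}

fn main() {
    println!("size_of::<Color>()     = {}", size_of::<Color>());
    println!("size_of::<FlatPixel>() = {}", size_of::<FlatPixel>());
    println!("offset of color        = {}", color_offset());
    println!("Color::B as u8         = {}", Color::B as u8);
}
```

A compound datatype needs exactly this kind of information (member offsets, member sizes, total size), so a layout the compiler is free to reorder would not be usable.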

Example code

#[cfg(feature = "blosc")]
use hdf5::filters::blosc_set_nthreads;
use hdf5::{File, H5Type, Result};
use ndarray::{arr2, s};

#[derive(H5Type, Clone, PartialEq, Debug)] // map this type to an HDF5 datatype
#[repr(u8)]
pub enum Color {
    R = 1,
    G = 2,
    B = 3,
}

#[derive(H5Type, Clone, PartialEq, Debug)] // map this type to an HDF5 datatype
#[repr(C)]
pub struct Pixel {
    xy: (i64, i64),
    color: Color,
}

impl Pixel {
    pub fn new(x: i64, y: i64, color: Color) -> Self {
        Self { xy: (x, y), color }
    }
}

fn write_hdf5() -> Result<()> {
    use Color::*;
    let file = File::create("pixels.h5")?; // open the file for writing
    let group = file.create_group("dir")?; // create a group
    #[cfg(feature = "blosc")]
    blosc_set_nthreads(2); // set the number of blosc threads
    let builder = group.new_dataset_builder();
    #[cfg(feature = "blosc")]
    let builder = builder.blosc_zstd(9, true); // zstd + shuffle
    let ds = builder
        .with_data(&arr2(&[
            // data for a 2-D array
            [Pixel::new(1, 2, R), Pixel::new(2, 3, B)],
            [Pixel::new(3, 4, G), Pixel::new(4, 5, R)],
            [Pixel::new(5, 6, B), Pixel::new(6, 7, G)],
        ]))
        // finalize and write the dataset
        .create("pixels")?;
    // create an attribute with a fixed shape, without writing data yet
    let attr = ds.new_attr::<Color>().shape([3]).create("colors")?;
    // write the attribute data
    attr.write(&[R, G, B])?;
    Ok(())
}

fn read_hdf5() -> Result<()> {
    use Color::*;
    let file = File::open("pixels.h5")?; // open the file for reading
    let ds = file.dataset("dir/pixels")?; // open the dataset
    assert_eq!(
        // read a slice of the 2-D dataset and verify it
        ds.read_slice::<Pixel, _, _>(s![1.., ..])?,
        arr2(&[
            [Pixel::new(3, 4, G), Pixel::new(4, 5, R)],
            [Pixel::new(5, 6, B), Pixel::new(6, 7, G)],
        ])
    );
    let attr = ds.attr("colors")?; // open the attribute
    assert_eq!(attr.read_1d::<Color>()?.as_slice().unwrap(), &[R, G, B]);
    Ok(())
}

fn main() -> Result<()> {
    write_hdf5()?;
    read_hdf5()?;
    Ok(())
}
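
The `blosc_zstd(9, true)` call in the example above turns on byte shuffling before compression. As a rough illustration of what a byte-shuffle step does, here is a std-only sketch. It is not Blosc's implementation, and `shuffle` and `unshuffle` are hypothetical helper names; the idea is only that grouping the k-th byte of every element together tends to produce longer runs of similar bytes for the compressor.

```rust
// Reorder bytes so that byte 0 of every element comes first, then byte 1
// of every element, and so on. `elem_size` is the element width in bytes.
pub fn shuffle(data: &[u8], elem_size: usize) -> Vec<u8> {
    assert!(elem_size > 0 && data.len() % elem_size == 0);
    let n = data.len() / elem_size; // number of elements
    let mut out = vec![0u8; data.len()];
    for i in 0..n {
        for b in 0..elem_size {
            out[b * n + i] = data[i * elem_size + b];
        }
    }
    out
}

// Inverse of `shuffle`: restore the original element-by-element order.
pub fn unshuffle(data: &[u8], elem_size: usize) -> Vec<u8> {
    assert!(elem_size > 0 && data.len() % elem_size == 0);
    let n = data.len() / elem_size;
    let mut out = vec![0u8; data.len()];
    for i in 0..n {
        for b in 0..elem_size {
            out[i * elem_size + b] = data[b * n + i];
        }
    }
    out
}

fn main() {
    // Three 2-byte elements: (1,2), (3,4), (5,6)
    let raw = [1u8, 2, 3, 4, 5, 6];
    let shuffled = shuffle(&raw, 2);
    println!("shuffled:  {:?}", shuffled);
    println!("roundtrip: {:?}", unshuffle(&shuffled, 2));
}
```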

Complete example

The following is a fuller HDF5 read/write example that creates a file, a group, a dataset, and an attribute:

use hdf5::{File, Result};
use ndarray::Array;

fn main() -> Result<()> {
    // Create the HDF5 file
    let file = File::create("example.h5")?;

    // Create a group
    let group = file.create_group("my_group")?;

    // Create a 2-D dataset and write a 3x3 array into it
    let data = Array::from_shape_vec((3, 3), vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])?;
    let dataset = group.new_dataset::<f64>().shape([3, 3]).create("my_data")?;
    dataset.write(&data)?;

    // Create an attribute and write to it
    let attr = dataset.new_attr::<i32>().shape([1]).create("my_attr")?;
    attr.write(&[42])?;

    // Read the data back
    let read_data = dataset.read_2d::<f64>()?;
    println!("Read data: {:?}", read_data);

    // Read the attribute back
    let read_attr = dataset.attr("my_attr")?.read_1d::<i32>()?;
    println!("Read attribute: {:?}", read_attr.as_slice().unwrap());

    Ok(())
}
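
`Array::from_shape_vec((3, 3), ...)` reads the flat vector in row-major (C) order by default, which is also the order in which HDF5 lays out dataset elements. A small std-only sketch of that index mapping, with `flat_index` and `at` as hypothetical helper names:

```rust
// Position of element (row, col) inside a row-major flat buffer.
pub fn flat_index(row: usize, col: usize, ncols: usize) -> usize {
    assert!(col < ncols);
    row * ncols + col
}

// Look up (row, col) in a flat row-major buffer with `ncols` columns.
pub fn at(flat: &[f64], row: usize, col: usize, ncols: usize) -> f64 {
    flat[flat_index(row, col, ncols)]
}

fn main() {
    // Same values as the 3x3 example above
    let flat = vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0];
    for row in 0..3 {
        let line: Vec<f64> = (0..3).map(|col| at(&flat, row, col, 3)).collect();
        println!("{:?}", line);
    }
}
```

So the element at row 1, column 2 of the dataset is the sixth value of the vector passed to `from_shape_vec`.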

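Compatibility

Platforms

The hdf5 crate is known to run on Linux, macOS, and Windows (tested on Ubuntu 16.04, 18.04, and 20.04; Windows Server 2019 with both the MSVC and GNU toolchains; macOS Catalina).

Rust

The hdf5 crate is continuously tested against all three official release channels and requires a reasonably recent Rust compiler (e.g. 1.51 or newer).

HDF5

Requires HDF5 version 1.8.4 or newer. User code is guaranteed to be thread-safe even if the library was built without the thread-safe option.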

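vueper, reply #1

Below is a usage guide for the Rust HDF5 library, put together from the content you provided, with complete example code:

A guide to the Rust hdf5 crate for HDF5 data handling

HDF5 is a high-performance storage format for scientific data. The Rust hdf5 crate provides comprehensive support for the format and suits workloads that involve large scientific datasets.

Installation

Add the dependencies to Cargo.toml: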

[dependencies]
hdf5 = "0.8"
ndarray = "0.15"  # for array operations
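
The first example in this thread gates its Blosc calls behind `#[cfg(feature = "blosc")]`, which tests a feature of your own crate rather than of hdf5. One way to wire that up, assuming a local feature named `blosc` (a name chosen here only to match that example), is to forward it to the `blosc` feature of hdf5:

```toml
[features]
# Local feature (the name is an assumption) that enables hdf5's Blosc filters
blosc = ["hdf5/blosc"]
```

With this in place, `cargo run --features blosc` compiles the Blosc-specific lines; without the flag they are skipped.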

Complete example code

use hdf5::{File, Result};
use ndarray::{arr2, ArrayD};

fn main() -> Result<()> {
    // Example 1: create an HDF5 file and write data
    create_and_write()?;

    // Example 2: read an HDF5 file
    read_file()?;

    // Example 3: work with a large dataset
    large_dataset()?;

    // Example 4: groups and compound data structures
    groups_and_compound()?;

    Ok(())
}

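// Example 1: create an HDF5 file and write data
fn create_and_write() -> Result<()> {
    // Create a new file
    let file = File::create("data.h5")?;

    // Create the dataset
    let dataset = file
        .new_dataset::<f64>()
        .shape([3, 3])
        .create("my_matrix")?;

    // Build the data with ndarray and write it
    let data = arr2(&[
        [1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0],
        [7.0, 8.0, 9.0],
    ]);
    dataset.write(&data)?;

    // Add a string attribute. &str is not an HDF5 type, so the text is
    // stored as a variable-length UTF-8 string (VarLenUnicode) scalar.
    let desc: hdf5::types::VarLenUnicode = "A simple 3x3 matrix"
        .parse()
        .expect("attribute text must not contain NUL bytes");
    dataset
        .new_attr::<hdf5::types::VarLenUnicode>()
        .create("description")?
        .write_scalar(&desc)?;

    Ok(())
}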

// Example 2: read an HDF5 file
fn read_file() -> Result<()> {
    // Open the file
    let file = File::open("data.h5")?;

    // Read the dataset into a dynamically dimensioned array
    let dataset = file.dataset("my_matrix")?;
    let data: ArrayD<f64> = dataset.read()?;

    println!("Data shape: {:?}", data.shape());
    println!("Data:\n{:?}", data);

    // Read the string attribute as a variable-length UTF-8 scalar
    if let Ok(attr) = dataset.attr("description") {
        let desc: hdf5::types::VarLenUnicode = attr.read_scalar()?;
        println!("Description: {}", desc.as_str());
    }

    Ok(())
}

// Example 3: work with a large dataset (chunked storage)
fn large_dataset() -> Result<()> {
    let file = File::create("large.h5")?;

    // Create a chunked dataset (1000x1000 matrix, 100x100 chunks)
    file.new_dataset::<f64>()
        .shape([1000, 1000])
        .chunk([100, 100])
        .create("large_matrix")?;

    Ok(())
}
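
To make the chunk layout in Example 3 concrete, here is a std-only sketch of the arithmetic behind it: how many chunks a shape is split into, how large one uncompressed chunk is, and which chunk holds a given element. The helper names (`chunk_count`, `chunk_bytes`, `chunk_coords`) are hypothetical and not part of hdf5.

```rust
// Number of chunks needed to cover `shape`; partial edge chunks count too.
pub fn chunk_count(shape: &[usize], chunk: &[usize]) -> usize {
    assert_eq!(shape.len(), chunk.len());
    shape
        .iter()
        .zip(chunk)
        .map(|(&len, &c)| (len + c - 1) / c) // ceiling division per axis
        .product()
}

// Uncompressed size of one full chunk in bytes.
pub fn chunk_bytes(chunk: &[usize], elem_size: usize) -> usize {
    chunk.iter().product::<usize>() * elem_size
}

// Grid coordinates of the chunk that contains the element at `index`.
pub fn chunk_coords(index: &[usize], chunk: &[usize]) -> Vec<usize> {
    assert_eq!(index.len(), chunk.len());
    index.iter().zip(chunk).map(|(&i, &c)| i / c).collect()
}

fn main() {
    let shape = [1000, 1000];
    let chunk = [100, 100];
    println!("chunks:          {}", chunk_count(&shape, &chunk));
    println!("bytes per chunk: {}", chunk_bytes(&chunk, 8)); // f64 = 8 bytes
    println!("chunk of (250, 999): {:?}", chunk_coords(&[250, 999], &chunk));
}
```

Reading a single element means the whole chunk containing it has to be fetched (and decompressed, if a filter is set), so chunk shapes are usually chosen to match the expected access pattern.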

// Example 4: groups and compound data structures

// Compound record type. Deriving H5Type on a #[repr(C)] struct maps it
// to an HDF5 compound datatype.
#[derive(hdf5::H5Type, Clone, Debug)]
#[repr(C)]
struct Measurement {
    timestamp: f64,
    value: f32,
    valid: bool,
}

fn groups_and_compound() -> Result<()> {
    let file = File::create("compound.h5")?;

    // Create a group
    let group = file.create_group("experiment")?;

    // Create a dataset of compound records
    group
        .new_dataset::<Measurement>()
        .shape([10]) // 10 records
        .create("measurements")?;

    Ok(())
}

Advanced examples

Parallel access (requires the mpio feature)

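// In Cargo.toml (the HDF5 library itself must be built with parallel support):
// hdf5 = { version = "0.8", features = ["mpio"] }
// The mpi-sys crate is also needed for the MPI_Comm type.

#[cfg(feature = "mpio")]
use hdf5::{File, Result};

// `comm` is assumed to come from an MPI environment that the caller has
// already initialized.
#[cfg(feature = "mpio")]
fn parallel_example(comm: mpi_sys::MPI_Comm) -> Result<()> {
    // Select the MPI-IO driver through the file access property list
    let file = File::with_options()
        .with_fapl(|fapl| fapl.mpio(comm, None))
        .open_rw("parallel.h5")?;

    // Parallel reads and writes on `file` would go here

    Ok(())
}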

Compressed datasets

fn compressed_dataset() -> Result<()> {
    let file = File::create("compressed.h5")?;

    file.new_dataset::<f64>()
        .shape([100, 100])
        .chunk([10, 10])
        .deflate(6)  // compression level (0-9)
        .create("compressed_data")?;

    Ok(())
}

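Key points

Type matching matters: use hdf5::Datatype::from_type::<T>() to check that a Rust type is compatible with an HDF5 type
Propagate errors with the ? operator, since most operations return a Result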
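No extra feature is needed for multithreaded use: the crate serializes HDF5 calls behind a reentrant mutex, so user code is thread-safe even if the underlying library was built without the thread-safe option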

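Performance tips:

Use chunked storage for large datasets
Consider enabling compression
Batched writes are more efficient than many small writes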
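
The batching tip can be sketched without touching hdf5 at all. The following std-only example collects values in memory and hands them to a sink in groups; `BatchBuffer` and `write_batched` are hypothetical names, and in real code the sink would be whatever call writes a block of rows to the dataset.

```rust
// Accumulates items and passes them to a sink `capacity` at a time.
pub struct BatchBuffer<T> {
    capacity: usize,
    pending: Vec<T>,
    flushes: usize,
}

impl<T> BatchBuffer<T> {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0);
        Self { capacity, pending: Vec::with_capacity(capacity), flushes: 0 }
    }

    // Queue one item; flush automatically once the buffer is full.
    pub fn push(&mut self, item: T, sink: &mut impl FnMut(&[T])) {
        self.pending.push(item);
        if self.pending.len() == self.capacity {
            self.flush(sink);
        }
    }

    // Hand all queued items to the sink in a single call.
    pub fn flush(&mut self, sink: &mut impl FnMut(&[T])) {
        if self.pending.is_empty() {
            return;
        }
        sink(&self.pending);
        self.pending.clear();
        self.flushes += 1;
    }

    pub fn flushes(&self) -> usize {
        self.flushes
    }
}

// Push every row through a buffer of size `batch`; returns the number of
// sink calls and everything the sink received, in order.
pub fn write_batched(rows: &[f64], batch: usize) -> (usize, Vec<f64>) {
    let mut written = Vec::new();
    let mut buf = BatchBuffer::new(batch);
    let mut sink = |block: &[f64]| written.extend_from_slice(block);
    for &row in rows {
        buf.push(row, &mut sink);
    }
    buf.flush(&mut sink); // do not forget the final partial batch
    (buf.flushes(), written)
}

fn main() {
    let rows: Vec<f64> = (0..10).map(|i| i as f64).collect();
    let (calls, written) = write_batched(&rows, 4);
    println!("{} sink calls for {} rows", calls, written.len());
}
```

Ten rows with a batch size of four reach the sink in three calls instead of ten, and the explicit final `flush` covers the leftover partial batch.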