Using file-format, a Rust library for file format detection: efficient identification of many file types and lookup of their metadata
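A crate for determining the file format of a given file or stream.
It provides a variety of functions for identifying a wide range of file formats, including ZIP, Compound File Binary (CFB), Extensible Markup Language (XML), and more.
It checks the file's signature to determine its format and, where available, uses specific readers for accurate identification. If the signature is not recognized, the crate falls back to the default file format, Arbitrary Binary Data (BIN).
Examples
Determine from a file: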
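The signature check with a BIN fallback described above can be sketched in plain Rust. This is only an illustration of the idea, not the crate's implementation: `Signature`, `SIGNATURES`, and `detect` are hypothetical names, and the table lists just three well-known magic numbers.

```rust
/// Hypothetical table entry: magic bytes expected at offset 0 and a short name.
struct Signature {
    magic: &'static [u8],
    short_name: &'static str,
}

/// Illustrative table with three well-known signatures.
const SIGNATURES: &[Signature] = &[
    Signature { magic: b"%PDF-", short_name: "PDF" },
    Signature { magic: &[0xFF, 0xD8, 0xFF], short_name: "JPEG" },
    Signature { magic: b"\x89PNG\r\n\x1a\n", short_name: "PNG" },
];

/// Returns the short name of the first signature that prefixes `bytes`,
/// or "BIN" (arbitrary binary data) when nothing matches.
fn detect(bytes: &[u8]) -> &'static str {
    SIGNATURES
        .iter()
        .find(|sig| bytes.starts_with(sig.magic))
        .map(|sig| sig.short_name)
        .unwrap_or("BIN")
}
```

Calling `detect(b"%PDF-1.7")` yields `"PDF"`, while unrecognized input falls through to `"BIN"`, mirroring the fallback behaviour the description mentions.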
use file_format::{FileFormat, Kind};
let fmt = FileFormat::from_file("fixtures/document/sample.pdf")?;
assert_eq!(fmt, FileFormat::PortableDocumentFormat);
assert_eq!(fmt.name(), "Portable Document Format");
assert_eq!(fmt.short_name(), Some("PDF"));
assert_eq!(fmt.media_type(), "application/pdf");
assert_eq!(fmt.extension(), "pdf");
assert_eq!(fmt.kind(), Kind::Document);
Determine from bytes:
use file_format::{FileFormat, Kind};
let fmt = FileFormat::from_bytes(&[0xFF, 0xD8, 0xFF]);
assert_eq!(fmt, FileFormat::JointPhotographicExpertsGroup);
assert_eq!(fmt.name(), "Joint Photographic Experts Group");
assert_eq!(fmt.short_name(), Some("JPEG"));
assert_eq!(fmt.media_type(), "image/jpeg");
assert_eq!(fmt.extension(), "jpg");
assert_eq!(fmt.kind(), Kind::Image);
Complete example code
// Add the dependency to Cargo.toml:
// [dependencies]
// file-format = "0.28"
use file_format::{FileFormat, Kind};
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    // Example 1: identify the format of a file
    let file_path = "sample.pdf";
    let fmt = FileFormat::from_file(file_path)?;
    println!("File: {}", file_path);
    println!("Format name: {}", fmt.name());
    println!("Short name: {:?}", fmt.short_name());
    println!("Media type: {}", fmt.media_type());
    println!("Extension: {}", fmt.extension());
    println!("Kind: {:?}", fmt.kind());

    // Example 2: identify the format of a byte slice
    let jpeg_bytes = [0xFF, 0xD8, 0xFF, 0xE0, 0x00, 0x10, 0x4A, 0x46, 0x49, 0x46];
    let fmt_bytes = FileFormat::from_bytes(&jpeg_bytes);
    println!("\nIdentified from bytes:");
    println!("Format name: {}", fmt_bytes.name());
    println!("Short name: {:?}", fmt_bytes.short_name());
    println!("Media type: {}", fmt_bytes.media_type());
    println!("Extension: {}", fmt_bytes.extension());
    println!("Kind: {:?}", fmt_bytes.kind());

    // Example 3: check for a specific format
    if fmt == FileFormat::PortableDocumentFormat {
        println!("\nThis is a PDF file!");
    }

    // Example 4: branch on the kind of file
    match fmt.kind() {
        Kind::Document => println!("\nThis is a document file"),
        Kind::Image => println!("\nThis is an image file"),
        Kind::Audio => println!("\nThis is an audio file"),
        Kind::Video => println!("\nThis is a video file"),
        Kind::Archive => println!("\nThis is an archive file"),
        Kind::Executable => println!("\nThis is an executable file"),
        _ => println!("\nThis is some other kind of file"),
    }
    Ok(())
}
Usage
Add this to your Cargo.toml:
[dependencies]
file-format = "0.28"
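Crate features
All of the features below are disabled by default.
Reader features
These features enable the detection of file formats that require a specific reader for identification.
- reader - Enables all reader features.
- reader-asf - Enables detection of file formats based on Advanced Systems Format (ASF).
- reader-cfb - Enables detection of file formats based on Compound File Binary (CFB).
- reader-ebml - Enables detection of file formats based on Extensible Binary Meta Language (EBML).
- reader-exe - Enables detection of file formats based on MS-DOS Executable (EXE).
- reader-id3v2 - Enables detection of file formats based on ID3v2 (ID3).
- reader-mp4 - Enables detection of file formats based on MPEG-4 Part 14 (MP4).
- reader-pdf - Enables detection of file formats based on Portable Document Format (PDF).
- reader-rm - Enables detection of file formats based on RealMedia (RM).
- reader-sqlite3 - Enables detection of file formats based on SQLite 3.
- reader-txt - Enables detection of Plain Text (TXT) file formats.
- reader-xml - Enables detection of file formats based on Extensible Markup Language (XML).
- reader-zip - Enables detection of file formats based on ZIP.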
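To illustrate why a signature alone is not enough for container-based formats, the sketch below looks inside the first ZIP local file header. EPUB and OpenDocument packages conventionally store an uncompressed `mimetype` entry first, so its content can be read directly from the header region. The function name `zip_mimetype` is hypothetical, and this is not a description of how the `reader-zip` feature is implemented; it only shows the general idea of reading past the magic number.

```rust
/// Returns the content of a leading, stored (uncompressed) `mimetype` entry
/// in a ZIP archive, if there is one.
fn zip_mimetype(bytes: &[u8]) -> Option<&str> {
    // A local file header starts with "PK\x03\x04" and has 30 fixed bytes.
    if bytes.len() < 30 || !bytes.starts_with(b"PK\x03\x04") {
        return None;
    }
    // Little-endian field readers; offsets follow the ZIP local file header layout.
    let u16_at = |i: usize| u16::from_le_bytes([bytes[i], bytes[i + 1]]) as usize;
    let u32_at = |i: usize| {
        u32::from_le_bytes([bytes[i], bytes[i + 1], bytes[i + 2], bytes[i + 3]]) as usize
    };
    let method = u16_at(8); // 0 means "stored"
    let compressed_size = u32_at(18);
    let name_len = u16_at(26);
    let extra_len = u16_at(28);

    let name = bytes.get(30..30 + name_len)?;
    if method != 0 || name != b"mimetype" {
        return None;
    }
    // The entry data follows the file name and the extra field.
    let start = 30 + name_len + extra_len;
    let end = start.checked_add(compressed_size)?;
    let data = bytes.get(start..end)?;
    std::str::from_utf8(data).ok()
}
```

An EPUB file and a plain ZIP archive both begin with `PK\x03\x04`; only a look at the first entry, as above, tells them apart.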
Supported file formats
Archive
- 7-Zip (7Z)
- ACE
- ALZ
- Archived by Robert Jung (ARJ)
- Cabinet (CAB)
- Extensible Archive (XAR)
- LArc (LZS)
- LHA
- Mozilla Archive (MAR)
- Multi Layer Archive (MLA)
- PMarc (PMA)
- Roshal Archive (RAR)
- SeqBox (SBX)
- Squashfs
- StuffIt (SIT)
- StuffIt X (SITX)
- Tape Archive (TAR)
- UNIX archiver (archiver)
- Windows Imaging Format (WIM)
- ZIP
- ZPAQ
- cpio
- zoo
Audio
- 8-Bit Sampled Voice (8SVX)
- Adaptive Multi-Rate (AMR)
- Advanced Audio Coding (AAC)
- Apple iTunes Audio (M4A)
- Apple iTunes Audiobook (M4B)
- Apple iTunes Protected Audio (M4P)
- Au
- Audio Codec 3 (AC-3)
- Audio Interchange File Format (AIFF)
- Audio Visual Research (AVR)
- Creative Voice (VOC)
- FastTracker 2 Extended Module (XM)
- Flash MP4 Audio (F4A)
- Flash MP4 Audiobook (F4B)
- Free Lossless Audio Codec (FLAC)
- Impulse Tracker Module (IT)
- MPEG-1/2 Audio Layer 2 (MP2)
- MPEG-1/2 Audio Layer 3 (MP3)
- MPEG-4 Part 14 Audio (MP4)
- Matroska Audio (MKA)
- Monkey’s Audio (APE)
- Musepack (MPC)
- Musical Instrument Digital Interface (MIDI)
- Ogg FLAC (OGA)
- Ogg Opus (Opus)
- Ogg Speex (Speex)
- Ogg Vorbis (Vorbis)
- Qualcomm PureVoice (QCP)
- Quite OK Audio (QOA)
- RealAudio (RA)
- Scream Tracker 3 Module (S3M)
- Sony DSD Stream File (DSF)
- SoundFont 2 (SF2)
- Ultimate Soundtracker Module (MOD)
- WavPack (WV)
- Waveform Audio (WAV)
- Windows Media Audio (WMA)
Compressed
- BZip3 (BZ3)
- LZ4
- Lempel-Ziv Finite State Entropy (LZFSE)
- Lempel-Ziv-Markov chain algorithm (LZMA)
- Long Range ZIP (LRZIP)
- Snappy
- UNIX compress (compress)
- XZ
- Zstandard (zstd)
- bzip (BZ)
- bzip2 (BZ2)
- gzip (GZ)
- lzip (LZ)
- lzop (LZO)
- rzip (RZ)
Database
- Microsoft Access 2007 Database (ACCDB)
- Microsoft Access Database (MDB)
- Microsoft Works Database (WDB)
- OpenDocument Database (ODB)
- SQLite 3
Diagram
- Circuit Diagram Document (CDDX)
- Microsoft Visio Drawing (VSD)
- Office Open XML Drawing (VSDX)
- StarChart (SDS)
- draw.io (DRAWIO)
Disk
- Amiga Disk File (ADF)
- Apple Disk Image (DMG)
- ISO 9660 (ISO)
- Microsoft Virtual Hard Disk (VHD)
- Microsoft Virtual Hard Disk 2 (VHDX)
- QEMU Copy On Write (QCOW)
- Virtual Machine Disk (VMDK)
- VirtualBox Virtual Disk Image (VDI)
Document
- AbiWord (ABW)
- AbiWord Template (AWT)
- Adobe InDesign Document (INDD)
- DjVu
- InDesign Markup Language (IDML)
- LaTeX (TeX)
- Microsoft Publisher Document (PUB)
- Microsoft Word Document (DOC)
- Microsoft Works Word Processor (WPS)
- Microsoft Write (WRI)
- Office Open XML Document (DOCX)
- OpenDocument Text (ODT)
- OpenDocument Text Master (ODM)
- OpenDocument Text Master Template (OTM)
- OpenDocument Text Template (OTT)
- OpenXPS (OXPS)
- Portable Document Format (PDF)
- PostScript (PS)
- Rich Text Format (RTF)
- StarWriter (SDW)
- Sun XML Writer (SXW)
- Sun XML Writer Global (SGW)
- Sun XML Writer Template (STW)
- Uniform Office Format Text (UOT)
- WordPerfect Document (WPD)
Ebook
- Broad Band eBook (BBeB)
- Electronic Publication (EPUB)
- FictionBook (FB2)
- FictionBook ZIP (FBZ)
- Microsoft Reader (LIT)
- Mobipocket (MOBI)
Executable
- Commodore 64 Program (PRG)
- Common Object File Format (COFF)
- Dalvik Executable (DEX)
- Dynamic Link Library (DLL)
- Executable and Linkable Format (ELF)
- Java Class
- LLVM Bitcode (BC)
- Linear Executable (LE)
- Lua Bytecode
- MS-DOS Executable (EXE)
- Mach-O
- New Executable (NE)
- Nintendo Switch Executable (NSO)
- Optimized Dalvik Executable (DEY)
- Portable Executable (PE)
- WebAssembly Binary (Wasm)
- Xbox 360 Executable (XEX)
- Xbox Executable (XBE)
Font
- BMFont ASCII (FNT)
- BMFont Binary (FNT)
- Embedded OpenType (EOT)
- FIGlet Font (FLF)
- Glyphs
- OpenType (OTF)
- TrueType (TTF)
- TrueType Collection (TTC)
- Web Open Font Format (WOFF)
- Web Open Font Format 2 (WOFF2)
Formula
- Mathematical Markup Language (MathML)
- OpenDocument Formula (ODF)
- OpenDocument Formula Template (OTF)
- StarMath (SMF)
- Sun XML Math (SXM)
Geospatial
- Flexible and Interoperable Data Transfer (FIT)
- GPS Exchange Format (GPX)
- Geography Markup Language (GML)
- Keyhole Markup Language (KML)
- Keyhole Markup Language ZIP (KMZ)
- Shapefile (SHP)
- Training Center XML (TCX)
Image
- AV1 Image File Format (AVIF)
- AV1 Image File Format Sequence (AVIFS)
- Adaptable Scalable Texture Compression (ASTC)
- Adobe Illustrator Artwork (AI)
- Adobe Photoshop Document (PSD)
- Animated Portable Network Graphics (APNG)
- Apple Icon Image (ICNS)
- Better Portable Graphics (BPG)
- Canon Raw (CRW)
- Canon Raw 2 (CR2)
- Canon Raw 3 (CR3)
- Cineon (CIN)
- Digital Picture Exchange (DPX)
- Encapsulated PostScript (EPS)
- Enhanced Metafile (EMF)
- Experimental Computing Facility (XCF)
- Figma Design (FIG)
- Free Lossless Image Format (FLIF)
- Fujifilm Raw (RAF)
- Graphics Interchange Format (GIF)
- High Efficiency Image Coding (HEIC)
- High Efficiency Image Coding Sequence (HEICS)
- High Efficiency Image File Format (HEIF)
- High Efficiency Image File Format Sequence (HEIFS)
- JPEG 2000 Codestream (J2C)
- JPEG 2000 Part 1 (JP2)
- JPEG 2000 Part 2 (JPX)
- JPEG 2000 Part 6 (JPM)
- JPEG Extended Range (JXR)
- JPEG Network Graphics (JNG)
- JPEG XL (JXL)
- JPEG-LS (JLS)
- Joint Photographic Experts Group (JPEG)
- Khronos Texture (KTX)
- Khronos Texture 2 (KTX2)
- Magick Image File Format (MIFF)
- Microsoft DirectDraw Surface (DDS)
- Multiple-image Network Graphics (MNG)
- Nikon Electronic File (NEF)
- Olympus Raw Format (ORF)
- OpenDocument Graphics (ODG)
- OpenDocument Graphics Template (OTG)
- OpenEXR (EXR)
- OpenRaster (ORA)
- Panasonic Raw (RW2)
- Picture Exchange (PCX)
- Portable Arbitrary Map (PAM)
- Portable BitMap (PBM)
- Portable FloatMap (PFM)
- Portable GrayMap (PGM)
- Portable Network Graphics (PNG)
- Portable PixMap (PPM)
- Quite OK Image (QOI)
- Radiance HDR (HDR)
- Scalable Vector Graphics (SVG)
- Silicon Graphics Image (SGI)
- Sketch
- Sketch 43
- StarDraw (SDA)
- Sun XML Draw (SXD)
- Sun XML Draw Template (STD)
- Tag Image File Format (TIFF)
- WebP
- Windows Animated Cursor (ANI)
- Windows Bitmap (BMP)
- Windows Cursor (CUR)
- Windows Icon (ICO)
- Windows Metafile (WMF)
- WordPerfect Graphics (WPG)
- X PixMap (XPM)
- farbfeld (FF)
Metadata
- Android Binary XML (AXML)
- BitTorrent (Torrent)
- CD Audio (CDA)
- ID3v2 (ID3)
- Meta Information Encapsulation (MIE)
- TASTy
- Windows Shortcut (LNK)
- macOS Alias
Model
- 3D Manufacturing Format (3MF)
- 3D Studio (3DS)
- 3D Studio Max (MAX)
- Additive Manufacturing Format (AMF)
- AutoCAD Drawing (DWG)
- Autodesk 123D (123DX)
- Autodesk Alias (WIRE)
- Autodesk Inventor Assembly (IAM)
- Autodesk Inventor Drawing (IDW)
- Autodesk Inventor Part (IPT)
- Autodesk Inventor Presentation (IPN)
- Blender (BLEND)
- Cinema 4D (C4D)
- Collaborative Design Activity (COLLADA)
- Design Web Format (DWF)
- Design Web Format XPS (DWFX)
- Drawing Exchange Format ASCII (DXF)
- Drawing Exchange Format Binary (DXF)
- Extensible 3D (X3D)
- Filmbox (FBX)
- Fusion 360 (F3D)
- GL Transmission Format Binary (GLB)
- Google Draco (Draco)
- Initial Graphics Exchange Specification (IGES)
- Inter-Quake Export (IQE)
- Inter-Quake Model (IQM)
- MagicaVoxel (VOX)
- Maya ASCII (MA)
- Maya Binary (MB)
- Model 3D ASCII (A3D)
- Model 3D Binary (M3D)
- Polygon ASCII (PLY)
- Polygon Binary (PLY)
- SketchUp (SKP)
- SolidWorks Assembly (SLDASM)
- SolidWorks Drawing (SLDDRW)
- SolidWorks Part (SLDPRT)
- SpaceClaim Document (SCDOC)
- Standard for the Exchange of Product model data (STEP)
- Stereolithography ASCII (STL)
- Universal 3D (U3D)
- Universal Scene Description ASCII (USDA)
- Universal Scene Description Binary (USDC)
- Universal Scene Description ZIP (USDZ)
- Virtual Reality Modeling Language (VRML)
- openNURBS (3DM)
Other
- ActiveMime (MSO)
- Advanced Systems Format (ASF)
- Android Resource Storage Container (ARSC)
- Apache Arrow Columnar (Arrow)
- Apache Avro (Avro)
- Apache Parquet (Parquet)
- Arbitrary Binary Data (BIN)
- Atom
- Clojure Script
- Compound File Binary (CFB)
- DER Certificate (DER)
- Digital Imaging and Communications in Medicine (DICOM)
- Empty
- Extensible Binary Meta Language (EBML)
- Extensible Markup Language (XML)
- Extensible Stylesheet Language Transformations (XSLT)
- Flash CS5 Project (FLA)
- Flash Project (FLA)
- Flexible Image Transport System (FITS)
- HyperText Markup Language (HTML)
- ICC Profile (ICC)
- JSON Feed
- Java KeyStore (JKS)
- Lua Script
- MPEG-4 Part 14 (MP4)
- MS-DOS Batch (Batch)
- Microsoft Compiled HTML Help (CHM)
- Microsoft Project Plan (MPP)
- Microsoft Visual Studio Solution (SLN)
- MusicXML
- MusicXML ZIP (MXL)
- Ogg Multiplexed Media (OGX)
- PCAP Dump (PCAP)
- PCAP Next Generation Dump (PCAPNG)
- PEM Certificate (PEM)
- PEM Certificate Signing Request (PEM)
- PEM Private Key
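1 reply
A guide to using file-format, the Rust file format detection library
Overview
file-format is an efficient Rust library for identifying file formats. It recognizes more than 300 common file types, including documents, images, audio, video, and archives. It identifies file types quickly and accurately by checking file signatures and, where needed, internal structure.
Key features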
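- Identifies 300+ file formats
- Zero-dependency design
- Efficient file signature matching
- Exposes format details such as name, media type, and extension
- Can be combined with custom signature checks in application code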
Installation
Add the dependency to Cargo.toml:
[dependencies]
file-format = "0.28"
Basic usage
1. Identifying a file type
use file_format::FileFormat;

fn main() {
    let data = std::fs::read("example.pdf").unwrap();
    let format = FileFormat::from_bytes(&data);
    println!("File format: {}", format.name());
    println!("Media type: {}", format.media_type());
    println!("Extension: {}", format.extension());
}
2. Identifying files in bulk
use file_format::FileFormat;
use std::path::Path;

fn identify_files_in_directory(dir: &Path) {
    for entry in std::fs::read_dir(dir).unwrap() {
        let path = entry.unwrap().path();
        if path.is_file() {
            let data = std::fs::read(&path).unwrap();
            let format = FileFormat::from_bytes(&data);
            println!("{}: {}", path.display(), format.name());
        }
    }
}
3. Checking for a specific format
use file_format::FileFormat;

fn is_pdf_file(data: &[u8]) -> bool {
    let format = FileFormat::from_bytes(data);
    format == FileFormat::PortableDocumentFormat
}

fn is_image_file(data: &[u8]) -> bool {
    let format = FileFormat::from_bytes(data);
    format.media_type().starts_with("image/")
}
4. Error handling example
use file_format::FileFormat;

fn safe_file_identification(path: &str) -> Result<String, Box<dyn std::error::Error>> {
    let data = std::fs::read(path)?;
    let format = FileFormat::from_bytes(&data);
    // Unrecognized input falls back to Arbitrary Binary Data (BIN)
    if format == FileFormat::ArbitraryBinaryData {
        Err("unrecognized file format".into())
    } else {
        Ok(format.name().to_string())
    }
}
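Advanced usage
Custom format identification
use file_format::FileFormat;

// Note: file-format does not appear to expose an API for registering custom
// formats at runtime, so the custom signature is checked in application code
// before falling back to the crate's built-in detection.
const CUSTOM_SIGNATURE: &[u8] = &[0x89, 0x43, 0x55, 0x53, 0x54, 0x4F, 0x4D]; // file signature

fn identify_with_custom_format(data: &[u8]) -> String {
    if data.starts_with(CUSTOM_SIGNATURE) {
        "My Custom Format".to_string()
    } else {
        FileFormat::from_bytes(data).name().to_string()
    }
}

fn setup_custom_format() {
    // Sample buffer that starts with the custom signature
    let sample: [u8; 8] = [0x89, 0x43, 0x55, 0x53, 0x54, 0x4F, 0x4D, 0x00];
    println!("Detected: {}", identify_with_custom_format(&sample));
}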
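Performance optimization example
use file_format::FileFormat;
use std::io::Read;

fn efficient_identification(path: &str) -> FileFormat {
    // Read only the start of the file for identification
    let file = std::fs::File::open(path).unwrap();
    let mut buffer = Vec::with_capacity(512);
    // take(512) caps the read at 512 bytes and, unlike read_exact,
    // does not fail when the file is shorter than that
    file.take(512).read_to_end(&mut buffer).unwrap();
    FileFormat::from_bytes(&buffer)
}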
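One caveat when reading a fixed-size prefix: `read_exact` returns an error if the input holds fewer bytes than the buffer, so small files would never be identified. A sketch of a more forgiving prefix read using only the standard library follows; `read_prefix` is a hypothetical helper name, and the bytes it returns could then be passed to `FileFormat::from_bytes`.

```rust
use std::io::{self, Read};

/// Reads at most `limit` bytes from `reader`.
/// Returns fewer bytes, without an error, when the input is shorter.
fn read_prefix<R: Read>(reader: R, limit: u64) -> io::Result<Vec<u8>> {
    let mut buffer = Vec::new();
    // `take` wraps the reader so that it reports end-of-input after `limit` bytes.
    reader.take(limit).read_to_end(&mut buffer)?;
    Ok(buffer)
}
```

Because the helper is generic over `Read`, it works the same way for a `std::fs::File`, a network stream, or an in-memory slice.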
Practical use cases
File upload validation
use file_format::FileFormat;

fn validate_uploaded_file(data: &[u8]) -> Result<(), String> {
    let format = FileFormat::from_bytes(data);
    match format {
        FileFormat::PortableDocumentFormat
        | FileFormat::JointPhotographicExpertsGroup
        | FileFormat::PortableNetworkGraphics => Ok(()),
        _ => Err("unsupported file format".to_string()),
    }
}
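Upload validation often also compares the extension claimed in the file name with the detected one. A minimal sketch under that assumption: `extension_matches` is a hypothetical helper, `detected` stands for the string returned by `extension()`, and the alias table is illustrative only.

```rust
use std::path::Path;

/// Returns true when the extension in `file_name` agrees with `detected`,
/// ignoring ASCII case and treating listed aliases as equal.
fn extension_matches(file_name: &str, detected: &str) -> bool {
    // Illustrative alias pairs: (claimed extension, canonical extension).
    const ALIASES: &[(&str, &str)] = &[("jpeg", "jpg")];

    // Files without an extension never match.
    let claimed = match Path::new(file_name).extension().and_then(|e| e.to_str()) {
        Some(ext) => ext.to_ascii_lowercase(),
        None => return false,
    };
    // Map an alias to its canonical form, or keep the claimed extension as-is.
    let canonical = ALIASES
        .iter()
        .find(|(alias, _)| *alias == claimed)
        .map(|(_, canonical)| *canonical)
        .unwrap_or(claimed.as_str());
    canonical.eq_ignore_ascii_case(detected)
}
```

A mismatch, such as a file named `invoice.pdf` whose detected extension is `exe`, is a reasonable signal to reject the upload or flag it for review.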
File categorizer
use file_format::FileFormat;

struct FileCategorizer;

impl FileCategorizer {
    fn categorize(data: &[u8]) -> &'static str {
        let format = FileFormat::from_bytes(data);
        match format.media_type() {
            s if s.starts_with("image/") => "image file",
            s if s.starts_with("audio/") => "audio file",
            s if s.starts_with("video/") => "video file",
            s if s.starts_with("text/") => "text file",
            _ => "other file",
        }
    }
}
Complete example demo
use file_format::FileFormat;
use std::io::Read;
use std::path::Path;

fn main() {
    // Example 1: basic file identification
    println!("=== Basic file identification ===");
    match std::fs::read("example.pdf") {
        Ok(data) => {
            let format = FileFormat::from_bytes(&data);
            println!("File format: {}", format.name());
            println!("Media type: {}", format.media_type());
            println!("Extension: {}", format.extension());
        }
        Err(e) => println!("Failed to read file: {}", e),
    }

    // Example 2: bulk file identification
    println!("\n=== Bulk file identification ===");
    let current_dir = Path::new(".");
    identify_files_in_directory(current_dir);

    // Example 3: specific format checks
    println!("\n=== Specific format checks ===");
    if let Ok(data) = std::fs::read("example.jpg") {
        println!("Is a PDF file: {}", is_pdf_file(&data));
        println!("Is an image file: {}", is_image_file(&data));
    }

    // Example 4: custom format
    println!("\n=== Custom format ===");
    setup_custom_format();

    // Example 5: performance optimization
    println!("\n=== Performance optimization ===");
    if std::fs::File::open("example.txt").is_ok() {
        let format = efficient_identification("example.txt");
        println!("Optimized identification result: {}", format.name());
    }

    // Example 6: file upload validation
    println!("\n=== File upload validation ===");
    if let Ok(data) = std::fs::read("example.png") {
        match validate_uploaded_file(&data) {
            Ok(()) => println!("File validation passed"),
            Err(e) => println!("File validation failed: {}", e),
        }
    }

    // Example 7: file categorization
    println!("\n=== File categorization ===");
    if let Ok(data) = std::fs::read("example.mp3") {
        let category = FileCategorizer::categorize(&data);
        println!("File category: {}", category);
    }
}
// Bulk file identification
fn identify_files_in_directory(dir: &Path) {
    if let Ok(entries) = std::fs::read_dir(dir) {
        for entry in entries {
            if let Ok(entry) = entry {
                let path = entry.path();
                if path.is_file() {
                    if let Ok(data) = std::fs::read(&path) {
                        let format = FileFormat::from_bytes(&data);
                        println!("{}: {}", path.display(), format.name());
                    }
                }
            }
        }
    }
}
// PDF check
fn is_pdf_file(data: &[u8]) -> bool {
    let format = FileFormat::from_bytes(data);
    format == FileFormat::PortableDocumentFormat
}

// Image check
fn is_image_file(data: &[u8]) -> bool {
    let format = FileFormat::from_bytes(data);
    format.media_type().starts_with("image/")
}
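// Custom format check. file-format does not appear to expose an API for
// registering custom formats at runtime, so the custom signature is checked
// in application code before falling back to the built-in detection.
const CUSTOM_SIGNATURE: &[u8] = &[0x89, 0x43, 0x55, 0x53, 0x54, 0x4F, 0x4D];

fn identify_with_custom_format(data: &[u8]) -> String {
    if data.starts_with(CUSTOM_SIGNATURE) {
        "My Custom Format".to_string()
    } else {
        FileFormat::from_bytes(data).name().to_string()
    }
}

fn setup_custom_format() {
    // Sample buffer that starts with the custom signature
    let sample: [u8; 8] = [0x89, 0x43, 0x55, 0x53, 0x54, 0x4F, 0x4D, 0x00];
    println!("Detected: {}", identify_with_custom_format(&sample));
}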
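// Efficient identification: read at most the first 512 bytes
fn efficient_identification(path: &str) -> FileFormat {
    let file = match std::fs::File::open(path) {
        Ok(file) => file,
        // Fall back to the crate's default format when the file cannot be opened
        Err(_) => return FileFormat::ArbitraryBinaryData,
    };
    let mut buffer = Vec::with_capacity(512);
    // take(512) also handles files shorter than 512 bytes, unlike read_exact
    if file.take(512).read_to_end(&mut buffer).is_ok() {
        FileFormat::from_bytes(&buffer)
    } else {
        FileFormat::ArbitraryBinaryData
    }
}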
// File upload validation
fn validate_uploaded_file(data: &[u8]) -> Result<(), String> {
    let format = FileFormat::from_bytes(data);
    match format {
        FileFormat::PortableDocumentFormat
        | FileFormat::JointPhotographicExpertsGroup
        | FileFormat::PortableNetworkGraphics => Ok(()),
        _ => Err("unsupported file format".to_string()),
    }
}
// File categorizer
struct FileCategorizer;

impl FileCategorizer {
    fn categorize(data: &[u8]) -> &'static str {
        let format = FileFormat::from_bytes(data);
        match format.media_type() {
            s if s.starts_with("image/") => "image file",
            s if s.starts_with("audio/") => "audio file",
            s if s.starts_with("video/") => "video file",
            s if s.starts_with("text/") => "text file",
            _ => "other file",
        }
    }
}
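Notes
- For very large files, prefer FileFormat::from_reader so that the whole file does not have to be loaded into memory.
- Some formats have overlapping signatures; the library returns the most likely match.
- A custom signature check performed in your own code before calling the library takes precedence over the built-in formats.
- The library is meant for identification, not for deep parsing of file contents.
This library gives file-handling applications solid format identification and suits file managers, security scanners, data processing pipelines, and similar uses.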