Deep Dive into Rust: Building DedCore, a High-Performance Deduplication Tool
Jun 25, 2025 · 9 min · Rust, Performance, Systems Programming, CLI Tools
Introduction
June 2025 was my “Rust deep dive” month. After months of contributing to Rust projects and reading about the language, I committed to building something substantial: DedCore, a smart file deduplication tool. This post chronicles the journey of building a production-ready CLI tool in Rust, covering everything from project structure to performance optimization.
Why DedCore?
The Problem
Modern storage is cheap, but that doesn’t mean we should waste it. I had:
- 500GB+ of duplicate photos across backup drives
- Multiple downloads of the same files
- Old project directories with copied dependencies
- No easy way to identify what was actually duplicated
Existing tools were either:
- Too slow (scanning takes hours)
- Too simple (just compare file sizes)
- Too dangerous (auto-delete without verification)
- Too inflexible (limited hash algorithms)
The Solution: DedCore
A high-performance CLI tool that:
- Scans directories in parallel
- Uses multiple hashing algorithms (blake3, xxhash, sha256)
- Filters intelligently (by size, extension, age)
- Provides a TUI for safe review and deletion
- Never deletes without explicit confirmation
Rust Learning Curve
Week 1: Fighting the Borrow Checker
Every Rust developer’s rite of passage:
// My first attempt - doesn't compile
fn scan_directory(path: &Path) -> Vec<File> {
let mut files = Vec::new();
for entry in fs::read_dir(path).unwrap() {
let entry = entry.unwrap();
let file = File::from(entry);
        files.push(file); // `file` is moved into the Vec here
        process_file(&file); // ❌ error[E0382]: borrow of moved value: `file`
}
files
}
// After understanding ownership
fn scan_directory(path: &Path) -> Vec<File> {
let mut files = Vec::new();
for entry in fs::read_dir(path).unwrap() {
let entry = entry.unwrap();
let file = File::from(entry);
process_file(&file); // ✅ Borrow, don't move
files.push(file); // ✅ Now we can move
}
files
}
Lesson: The borrow checker is teaching you to write correct code. Embrace it.
Week 2: Error Handling the Rust Way
Coming from Python/JavaScript, Rust’s error handling was a revelation:
use anyhow::{Context, Result};
use std::fs::File;
use std::io::Read;
use std::path::Path;
// Before: Panic-driven development
fn read_file(path: &Path) -> String {
let mut file = File::open(path).unwrap(); // ❌ Crashes on error
let mut contents = String::new();
file.read_to_string(&mut contents).unwrap(); // ❌ More crashes
contents
}
// After: Proper error handling
fn read_file(path: &Path) -> Result<String> {
let mut file = File::open(path)
.context("Failed to open file")?; // ✅ Propagate with context
let mut contents = String::new();
file.read_to_string(&mut contents)
.context("Failed to read file contents")?; // ✅ More context
Ok(contents)
}
// Using it
match read_file(path) {
Ok(contents) => println!("Read {} bytes", contents.len()),
Err(e) => eprintln!("Error: {:#}", e), // Prints full error chain
}
Lesson: Result<T, E> forces you to handle errors. The ? operator makes it ergonomic.
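For intuition, ? is roughly sugar for an early return (a simplified sketch; the real desugaring goes through the Try trait plus an error conversion):
// `let mut file = File::open(path).context("Failed to open file")?;`
// behaves roughly like:
let mut file = match File::open(path).context("Failed to open file") {
    Ok(f) => f,
    Err(e) => return Err(e.into()), // propagate the error to the caller
};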
Week 3: Concurrency Without Fear
Rust’s ownership system makes concurrent programming safe:
use rayon::prelude::*;
use std::path::PathBuf;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
// Parallel file scanning
fn scan_parallel(paths: Vec<PathBuf>) -> Vec<FileInfo> {
paths
.par_iter() // Parallel iterator
.filter_map(|path| {
// Each thread gets its own work
match analyze_file(path) {
Ok(info) => Some(info),
Err(e) => {
eprintln!("Error analyzing {:?}: {}", path, e);
None
}
}
})
.collect()
}
// Shared state with Arc (Atomic Reference Counting)
fn process_with_progress(files: Vec<PathBuf>) -> Result<Stats> {
let progress = Arc::new(AtomicUsize::new(0));
let total = files.len();
let results: Vec<_> = files
.par_iter()
.map(|file| {
let result = process_file(file);
// Safe to share across threads
            let done = progress.fetch_add(1, Ordering::Relaxed) + 1; // fetch_add returns the old value
            print_progress(done, total);
result
})
.collect();
Ok(combine_results(results))
}
Lesson: “Fearless concurrency” isn’t just marketing; the compiler prevents data races.
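To see what the compiler actually rules out, here is a small illustrative sketch (not DedCore code): pushing to a shared Vec from rayon threads is rejected at compile time, while wrapping it in a Mutex makes the same pattern legal.
use rayon::prelude::*;
use std::sync::Mutex;

fn sizes_over_threshold(sizes: &[u64]) -> Vec<u64> {
    // let mut big = Vec::new();
    // sizes.par_iter().for_each(|s| big.push(*s)); // ❌ rejected:
    // the closure would need `&mut big` on many threads at once
    let big = Mutex::new(Vec::new());
    sizes.par_iter().for_each(|s| {
        if *s > 1024 {
            big.lock().unwrap().push(*s); // ✅ the lock serializes access
        }
    });
    big.into_inner().unwrap()
}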
Building DedCore: Architecture
Project Structure
dedcore/
├── Cargo.toml # Dependencies and metadata
├── src/
│ ├── main.rs # CLI entry point
│ ├── lib.rs # Library exports
│ ├── scanner/
│ │ ├── mod.rs # Module definition
│ │ ├── walker.rs # Directory traversal
│ │ └── filter.rs # File filtering
│ ├── hasher/
│ │ ├── mod.rs
│ │ ├── blake3.rs # blake3 hashing
│ │ ├── xxhash.rs # xxhash implementation
│ │ └── sha256.rs # sha256 fallback
│ ├── detector/
│ │ ├── mod.rs
│ │ └── duplicates.rs # Duplicate detection logic
│ ├── tui/
│ │ ├── mod.rs
│ │ ├── app.rs # TUI application state
│ │ └── widgets.rs # Custom widgets
│ └── config.rs # Configuration management
├── tests/
│ ├── integration/ # Integration tests
│ └── fixtures/ # Test data
└── benches/
└── performance.rs # Benchmarks
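Not shown in the tree: lib.rs just stitches the modules together. A minimal sketch (the re-exports are illustrative, not the actual file):
// src/lib.rs
pub mod config;
pub mod detector;
pub mod hasher;
pub mod scanner;
pub mod tui;

// Re-export the types callers touch most often
pub use detector::{DuplicateDetector, DuplicateGroup};
pub use hasher::FileHasher;
pub use scanner::{ScanConfig, Scanner};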
Core Modules
1. Scanner: Walking the Directory Tree
use anyhow::Result;
use rayon::prelude::*;
use std::path::Path;
use walkdir::{DirEntry, WalkDir};
pub struct Scanner {
config: ScanConfig,
}
pub struct ScanConfig {
pub min_size: u64, // Ignore tiny files
pub max_depth: Option<usize>, // Directory depth limit
pub extensions: Option<Vec<String>>,
pub exclude_patterns: Vec<String>,
}
impl Scanner {
pub fn scan(&self, root: &Path) -> Result<Vec<FileEntry>> {
// Parallel directory walk
let entries: Vec<_> = WalkDir::new(root)
.max_depth(self.config.max_depth.unwrap_or(usize::MAX))
.into_iter()
.par_bridge() // Convert to parallel iterator
.filter_map(|e| e.ok())
.filter(|e| e.file_type().is_file())
.filter(|e| self.should_include(e))
.map(|e| FileEntry::from(e))
.collect();
Ok(entries)
}
fn should_include(&self, entry: &DirEntry) -> bool {
// Size filter
if let Ok(metadata) = entry.metadata() {
if metadata.len() < self.config.min_size {
return false;
}
}
// Extension filter
if let Some(ref exts) = self.config.extensions {
if let Some(ext) = entry.path().extension() {
if !exts.contains(&ext.to_string_lossy().to_string()) {
return false;
}
} else {
return false; // No extension
}
}
// Exclude patterns
let path = entry.path().to_string_lossy();
for pattern in &self.config.exclude_patterns {
if path.contains(pattern) {
return false;
}
}
true
}
}
2. Hasher: Fast Content Hashing
use blake3::Hasher as Blake3Hasher;
use rayon::prelude::*;
use std::fs::File;
use std::io::{BufReader, Read};
use std::path::{Path, PathBuf};
use std::sync::Arc;
pub trait FileHasher {
fn hash_file(&self, path: &Path) -> Result<Hash>;
}
pub struct Blake3HashImpl;
impl FileHasher for Blake3HashImpl {
fn hash_file(&self, path: &Path) -> Result<Hash> {
let file = File::open(path)?;
let mut reader = BufReader::new(file);
let mut hasher = Blake3Hasher::new();
// Stream the file in chunks to avoid loading entire file into memory
let mut buffer = [0u8; 8192];
loop {
let bytes_read = reader.read(&mut buffer)?;
if bytes_read == 0 {
break;
}
hasher.update(&buffer[..bytes_read]);
}
Ok(Hash::from(hasher.finalize().as_bytes()))
}
}
// Parallel hashing with rayon
pub fn hash_files_parallel(
files: Vec<PathBuf>,
    hasher: Arc<dyn FileHasher + Send + Sync>, // Send + Sync so worker threads can share it
) -> Vec<(PathBuf, Hash)> {
files
.par_iter()
.filter_map(|path| {
match hasher.hash_file(path) {
Ok(hash) => Some((path.clone(), hash)),
Err(e) => {
eprintln!("Failed to hash {:?}: {}", path, e);
None
}
}
})
.collect()
}
3. Detector: Finding Duplicates
use std::collections::HashMap;
use std::path::PathBuf;
pub struct DuplicateDetector {
files: Vec<FileInfo>,
}
#[derive(Debug, Clone)] // Clone is needed because find_duplicates clones entries
pub struct FileInfo {
pub path: PathBuf,
pub size: u64,
pub hash: Hash,
}
#[derive(Debug)]
pub struct DuplicateGroup {
pub hash: Hash,
pub files: Vec<FileInfo>,
pub total_size: u64,
pub wasted_space: u64, // (n-1) * file_size
}
impl DuplicateDetector {
pub fn find_duplicates(&self) -> Vec<DuplicateGroup> {
// Group by hash
let mut groups: HashMap<Hash, Vec<FileInfo>> = HashMap::new();
for file in &self.files {
groups.entry(file.hash.clone())
.or_insert_with(Vec::new)
.push(file.clone());
}
// Keep only groups with duplicates
groups
.into_iter()
.filter(|(_, files)| files.len() > 1)
.map(|(hash, files)| {
let size = files[0].size;
let count = files.len() as u64;
DuplicateGroup {
hash,
total_size: size * count,
wasted_space: size * (count - 1),
files,
}
})
.collect()
}
pub fn total_wasted_space(&self) -> u64 {
self.find_duplicates()
.iter()
.map(|g| g.wasted_space)
.sum()
}
}
4. TUI: Interactive Review
use anyhow::Result;
use crossterm::{
    event::{self, Event, KeyCode},
    terminal::{disable_raw_mode, enable_raw_mode},
};
use ratatui::{
    backend::CrosstermBackend,
    style::{Color, Style},
    widgets::{Block, Borders, List, ListItem},
    Frame, Terminal,
};
use std::collections::HashSet;
use std::io;
use std::path::PathBuf;
pub struct App {
duplicate_groups: Vec<DuplicateGroup>,
selected_group: usize,
selected_file: usize,
marked_for_deletion: HashSet<PathBuf>,
}
impl App {
pub fn run(&mut self) -> Result<()> {
enable_raw_mode()?;
let mut terminal = Terminal::new(CrosstermBackend::new(io::stdout()))?;
loop {
terminal.draw(|f| self.draw(f))?;
if let Event::Key(key) = event::read()? {
match key.code {
KeyCode::Char('q') => break,
KeyCode::Up => self.select_previous(),
KeyCode::Down => self.select_next(),
KeyCode::Char(' ') => self.toggle_marked(),
KeyCode::Char('d') => self.delete_marked()?,
_ => {}
}
}
}
disable_raw_mode()?;
Ok(())
}
fn draw(&self, frame: &mut Frame) {
// Draw groups list
let groups: Vec<ListItem> = self.duplicate_groups
.iter()
.enumerate()
.map(|(i, group)| {
let style = if i == self.selected_group {
Style::default().bg(Color::DarkGray)
} else {
Style::default()
};
ListItem::new(format!(
"{} duplicates - {} wasted",
group.files.len(),
format_bytes(group.wasted_space)
)).style(style)
})
.collect();
// Render...
}
}
Performance Optimizations
1. Smarter Size-First Grouping
// Don't hash everything - only files with matching sizes
fn optimized_duplicate_detection(files: Vec<FileInfo>) -> Vec<DuplicateGroup> {
// Group by size first (no I/O needed)
let mut size_groups: HashMap<u64, Vec<FileInfo>> = HashMap::new();
for file in files {
size_groups.entry(file.size)
.or_insert_with(Vec::new)
.push(file);
}
// Only hash files that have size duplicates
let to_hash: Vec<FileInfo> = size_groups
.into_iter()
.filter(|(_, files)| files.len() > 1) // Skip unique sizes
.flat_map(|(_, files)| files)
.collect();
// Now hash and group
hash_and_group(to_hash)
}
Result: ~60% reduction in files hashed on typical filesystems.
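The hash_and_group step above isn’t shown; here is a minimal sketch of what it could look like, assuming a free hash_file helper that wraps one of the hashers (both names are hypothetical):
// Hypothetical helper: hash the size-filtered files in parallel,
// then bucket them by content hash.
fn hash_and_group(files: Vec<FileInfo>) -> Vec<DuplicateGroup> {
    let hashed: Vec<(Hash, FileInfo)> = files
        .into_par_iter()
        .filter_map(|mut f| match hash_file(&f.path) {
            Ok(h) => {
                f.hash = h.clone();
                Some((h, f))
            }
            Err(_) => None, // unreadable files are simply skipped
        })
        .collect();

    let mut groups: HashMap<Hash, Vec<FileInfo>> = HashMap::new();
    for (hash, file) in hashed {
        groups.entry(hash).or_default().push(file);
    }

    groups
        .into_iter()
        .filter(|(_, files)| files.len() > 1) // keep only real duplicates
        .map(|(hash, files)| {
            let size = files[0].size;
            let n = files.len() as u64;
            DuplicateGroup { hash, total_size: size * n, wasted_space: size * (n - 1), files }
        })
        .collect()
}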
2. Skipping Small Files
// Files under 1KB are rarely worth the I/O cost
const MIN_SIZE: u64 = 1024;
// Configuration
pub struct ScanConfig {
pub min_size: u64, // User configurable
}
Result: ~30% faster scans, skipping thousands of tiny config files.
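The threshold pays off because it is applied before any hashing I/O ever happens. As a one-line sketch, assuming files is the scanner’s Vec<FileInfo>:
// Metadata-only filter: drop tiny files before any content hashing
files.retain(|f| f.size >= config.min_size);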
3. Parallel Everything
use rayon::prelude::*;
use std::collections::HashMap;
// Parallel scanning and hashing; grouping stays sequential
let hashed: Vec<(Hash, FileInfo)> = directories
    .par_iter() // Parallel scan
    .flat_map(|dir| scan_directory(dir))
    .filter_map(|file| hash_file(&file.path).ok().map(|h| (h, file))) // Parallel hash
    .collect();
// Grouping is sequential: building one HashMap from many threads would
// need synchronization, and this step is cheap compared to hashing
let mut groups: HashMap<Hash, Vec<FileInfo>> = HashMap::new();
for (hash, file) in hashed {
    groups.entry(hash).or_default().push(file);
}
let duplicates: Vec<Vec<FileInfo>> =
    groups.into_values().filter(|g| g.len() > 1).collect();
Result: Near-linear speedup with CPU cores.
Testing Strategy
Unit Tests
#[cfg(test)]
mod tests {
use super::*;
use tempfile::tempdir;
#[test]
fn test_duplicate_detection() {
let dir = tempdir().unwrap();
// Create test files
std::fs::write(dir.path().join("file1.txt"), "content").unwrap();
std::fs::write(dir.path().join("file2.txt"), "content").unwrap();
std::fs::write(dir.path().join("file3.txt"), "different").unwrap();
        let scanner = Scanner::new(ScanConfig::default());
        let files = scanner.scan(dir.path()).unwrap();
        let detector = DuplicateDetector::new(files); // assumes hashes were computed during scan
let duplicates = detector.find_duplicates();
assert_eq!(duplicates.len(), 1);
assert_eq!(duplicates[0].files.len(), 2);
}
}
Benchmarks
use criterion::{black_box, criterion_group, criterion_main, Criterion};
fn bench_blake3_hashing(c: &mut Criterion) {
let test_file = create_test_file(10 * 1024 * 1024); // 10MB
c.bench_function("blake3 10MB", |b| {
b.iter(|| {
let hasher = Blake3HashImpl;
hasher.hash_file(black_box(&test_file))
})
});
}
criterion_group!(benches, bench_blake3_hashing);
criterion_main!(benches);
Results:
- blake3: ~3GB/s
- xxhash: ~8GB/s
- sha256: ~500MB/s
Lessons Learned
1. Rust Makes You Think About Performance
The language design nudges you toward efficient patterns:
- Zero-cost abstractions
- Explicit memory layout
- No hidden allocations
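To make the “no hidden allocations” point concrete, here is a small illustrative sketch (assuming files is a Vec<FileInfo> as in the detector above):
// Growing a Vec element by element may reallocate several times...
let mut hashes = Vec::new();
for file in &files {
    hashes.push(file.hash.clone());
}

// ...while pre-sizing allocates exactly once, and you can see it in the code.
let mut hashes = Vec::with_capacity(files.len());
for file in &files {
    hashes.push(file.hash.clone());
}

// Iterator chains compile down to the same loop: zero-cost abstraction in practice.
let hashes: Vec<Hash> = files.iter().map(|f| f.hash.clone()).collect();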
2. The Ecosystem is Excellent
Amazing crates:
- rayon for parallelism
- walkdir for filesystem traversal
- ratatui for the TUI
- anyhow for error handling
- clap for CLI parsing
3. Compilation Time Matters
Large Rust projects take time to compile. Strategies:
- Use cargo check for fast feedback
- Modularize to improve incremental builds
- Use sccache for caching
- Consider the mold linker for faster linking
What’s Next
July plans for DedCore:
- Advanced Filtering: Regex patterns, date ranges
- Export Formats: JSON, CSV reports
- Dry-Run Mode: Preview changes safely
- Compression Analysis: Find candidates for compression
- GUI Version: Desktop app with Tauri
Try It Yourself
DedCore is open source! Check it out at dedcore.live
# Install
cargo install dedcore
# Scan directory
dedcore scan ~/Documents
# Interactive TUI
dedcore tui ~/Documents
# Generate report
dedcore report --format json ~/Documents > report.json
What’s your experience with Rust? What projects are you building? Let me know!