Data Compression

Advanced LZMA compression for efficient storage and fast queries.

Overview

AVA uses LZMA-based compression to significantly reduce storage requirements while maintaining query performance. Data is compressed in blocks, allowing selective decompression for optimal query speed.

  • Storage Reduction: 70-90%
  • Query Overhead: <5%
  • Block Size: 1MB

How It Works

Block-Based Compression

Data is divided into blocks (default 1MB) and compressed independently. This allows:

  • Selective decompression - only needed blocks are decompressed
  • Parallel processing - multiple blocks can be processed concurrently
  • Efficient updates - only modified blocks need recompression
  • Fast queries - blocks that a filter rules out are skipped without being decompressed
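
The block scheme described above can be illustrated with a minimal Python sketch. It is not AVA's implementation: it uses Python's standard lzma module, and the function names compress_blocks and read_range are hypothetical. It shows how independently compressed blocks let a reader decompress only the blocks that overlap a requested byte range.

```python
import lzma

BLOCK_SIZE = 1024 * 1024  # 1 MB, matching the documented default


def compress_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Split data into fixed-size blocks and compress each one independently."""
    return [
        lzma.compress(data[start:start + block_size])
        for start in range(0, len(data), block_size)
    ]


def read_range(blocks: list[bytes], offset: int, length: int,
               block_size: int = BLOCK_SIZE) -> bytes:
    """Return data[offset:offset + length], decompressing only overlapping blocks."""
    if length <= 0:
        return b""
    first = offset // block_size
    last = (offset + length - 1) // block_size
    # Only blocks first..last are decompressed; every other block stays compressed.
    chunk = b"".join(lzma.decompress(block) for block in blocks[first:last + 1])
    start = offset - first * block_size
    return chunk[start:start + length]
```

Because each block is a self-contained LZMA stream, the same layout also allows blocks to be compressed in parallel and lets an update recompress only the block it touches.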

Automatic Background Compression

AVA features a background compression thread that automatically compresses idle tables:

  • Background Thread: Runs continuously, checking for idle tables every 60 seconds
  • Idle Detection: Tables idle for more than 5 minutes (300 seconds) are automatically compressed
  • Access Tracking: Last access time is tracked for all tables to determine idleness
  • Smart Detection: Only data operations (SELECT, INSERT, UPDATE, DELETE) count as access
  • Metadata Exclusion: Metadata operations (.schema, .compression_tables, .compression_status) do not update the last access time
  • Transparent Decompression: Compressed data is automatically decompressed when accessed
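
The idle check can be sketched as a small pure function. This is an illustration under assumptions, not AVA's code: find_idle_tables, the last_access dictionary, and the compressed set are hypothetical names, and only the 300-second threshold comes from the description above.

```python
import time

IDLE_THRESHOLD_SECONDS = 300  # 5 minutes, per the description above


def find_idle_tables(last_access, compressed, now=None):
    """Return the names of uncompressed tables idle for more than the threshold.

    last_access maps table name -> last access time in seconds since the epoch.
    compressed is the set of table names that are already compressed.
    """
    now = time.time() if now is None else now
    return sorted(
        name
        for name, accessed_at in last_access.items()
        if name not in compressed and now - accessed_at > IDLE_THRESHOLD_SECONDS
    )
```

Passing now explicitly keeps the function deterministic, which makes the threshold logic easy to test without waiting five minutes.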

Monitoring & Management

Manual Compression

Use the .compress command to manually compress a table:

.compress table_name

Compression Status

View overall compression statistics:

.compression_status

Table Details with Last Access Time

View detailed per-table compression information including last access time:

.compression_tables

This command displays:

  • Table name and compression status
  • Uncompressed and compressed sizes
  • Compression ratio achieved
  • Last access timestamp (YYYY-MM-DD HH:MM:SS format)

Implementation Details

Background Thread Architecture

The automatic compression system uses a dedicated background thread in the PDBEngine:

  • Thread Lifecycle: Started automatically when PDBEngine initializes, stopped cleanly on shutdown
  • Wake Interval: Thread wakes every 60 seconds to check for idle tables
  • Idle Threshold: Tables must be idle for 300 seconds (5 minutes) before compression
  • Thread Safety: All operations are protected by mutex locks for concurrent access
  • Error Handling: Exceptions in the background thread are caught and silently ignored so that a compression failure cannot crash the engine
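
The thread behavior listed above can be sketched in Python as follows. This is a hedged illustration rather than PDBEngine's actual code: the class name BackgroundCompressor and the check_fn callback are hypothetical, and only the wake interval, the mutex, and the swallow-all error handling are taken from the list.

```python
import threading


class BackgroundCompressor:
    """Periodically runs check_fn on a background thread until stopped."""

    def __init__(self, check_fn, wake_interval=60.0):
        self._check_fn = check_fn            # hypothetical: finds and compresses idle tables
        self._wake_interval = wake_interval  # seconds between checks (60 in the text)
        self._stop = threading.Event()
        self.lock = threading.Lock()         # would be shared with the query path
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        # Setting the event wakes the thread immediately, so shutdown is prompt.
        self._stop.set()
        self._thread.join()

    def _run(self):
        # Event.wait returns False on timeout (time to check) and True after stop().
        while not self._stop.wait(self._wake_interval):
            try:
                with self.lock:
                    self._check_fn()
            except Exception:
                # Mirrors the documented behavior: errors are ignored, the loop continues.
                pass
```

Waiting on an Event instead of sleeping means stop() does not have to wait out the remainder of a 60-second interval.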

Access Time Tracking Design

Each table maintains a last access timestamp that is updated only by operations that touch table data:

  • Data Operations Update: SELECT, INSERT, UPDATE, DELETE operations update the timestamp
  • Metadata Operations Excluded: .schema, .compression_tables, .compression_status do not update timestamp
  • Timestamp Format: Stored as time_t (seconds since epoch), displayed in human-readable format
  • Implementation: Conditional parameter in getTable() method controls whether to update access time
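
The conditional update can be sketched as below. The real getTable() signature is not shown in this document, so the TableRegistry class, the update_access parameter name, and the helper methods are assumptions made for illustration only.

```python
import time


class TableRegistry:
    """Holds tables and a last access time (seconds since the epoch) for each."""

    def __init__(self):
        self._tables = {}
        self._last_access = {}

    def add_table(self, name, table):
        self._tables[name] = table
        self._last_access[name] = int(time.time())

    def get_table(self, name, update_access=True):
        """Return a table; data operations leave update_access at True."""
        table = self._tables[name]
        if update_access:
            self._last_access[name] = int(time.time())
        return table

    def last_access_display(self, name):
        """Format the stored timestamp as YYYY-MM-DD HH:MM:SS in local time."""
        stamp = time.localtime(self._last_access[name])
        return time.strftime("%Y-%m-%d %H:%M:%S", stamp)
```

In this sketch a SELECT would call get_table("orders"), while a metadata command such as .compression_tables would call get_table("orders", update_access=False) so that inspecting a table does not reset its idle timer.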

Configuration

# Enable compression for a table
ALTER TABLE large_table SET COMPRESSION = 'LZMA';

# Set compression level (1-9, default 6)
ALTER TABLE large_table SET COMPRESSION_LEVEL = 9;

# Set block size in bytes (default 1MB; 2097152 = 2MB)
ALTER TABLE large_table SET BLOCK_SIZE = 2097152;

# Disable compression
ALTER TABLE large_table SET COMPRESSION = 'NONE';

Compression Levels

  • Levels 1-3: Fast compression, lower ratio (good for frequently updated data)
  • Levels 4-6: Balanced (recommended for most use cases)
  • Levels 7-9: Maximum compression, slower (archival data)
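
To get a feel for the speed and ratio trade-off on your own data, a rough measurement like the following may help. It uses the presets of Python's standard lzma module; whether AVA's levels map one-to-one onto those presets is an assumption, and measure_preset and the sample data are made up for this sketch.

```python
import lzma
import time


def measure_preset(data: bytes, preset: int):
    """Compress data at the given LZMA preset; return (compressed size, seconds)."""
    start = time.perf_counter()
    compressed = lzma.compress(data, preset=preset)
    elapsed = time.perf_counter() - start
    return len(compressed), elapsed


# Synthetic, repetitive log-like rows; substitute a sample of real table data.
sample = b"".join(
    b"2024-01-01 12:00:%02d,sensor-%d,OK\n" % (i % 60, i % 7)
    for i in range(20000)
)

for preset in (1, 6, 9):
    size, seconds = measure_preset(sample, preset)
    print(f"level {preset}: {size} bytes in {seconds:.3f}s")
```

Results depend heavily on the data, so it is worth running this on a representative sample before changing COMPRESSION_LEVEL.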

Best Practices

✓ Use compression for:

  • Large historical datasets
  • Archival data
  • Read-heavy workloads
  • Text and string-heavy data

✗ Avoid compression for:

  • Small tables (< 10MB)
  • Frequently updated data
  • Real-time streaming data
  • Already compressed data
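
The last point can be checked with a short experiment. The sketch below uses Python's standard lzma module rather than AVA itself; random bytes stand in for already compressed data, and space_saved is a hypothetical helper.

```python
import lzma
import os


def space_saved(data: bytes) -> float:
    """Fraction of space saved by LZMA: 1 - compressed size / original size."""
    return 1 - len(lzma.compress(data)) / len(data)


# Repetitive text compresses well.
text_like = b"status=OK;region=eu-west;user=alice\n" * 5000

# Random bytes have no redundancy left, much like already compressed data.
random_like = os.urandom(len(text_like))

print(f"text-like data:   {space_saved(text_like):.1%} saved")
print(f"random-like data: {space_saved(random_like):.1%} saved")
```

For the random input the result is zero or slightly negative, because the container overhead is added without any reduction in payload size.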