1 comments

  • skuznetsov37 14 hours ago

      Author here. pg_sorted_heap is a PostgreSQL table AM extension that does two things:
    
      1. Keeps data physically sorted by primary key with per-page zone maps. At 100M rows a point query reads 1 buffer page (0.045ms) vs 8 for btree (0.5ms) vs 520K for seq scan (1.2s).
    
      2. Built-in IVF-PQ vector search — no pgvector dependency.
    
      The vector search part is where it gets interesting. The physical clustering by PK prefix IS the inverted file index. You set partition_id (IVF cluster assignment) as the leading PK column, sorted_heap
      clusters rows by it, and the zone map skips irrelevant partitions at the I/O level. No separate index structure, no 800 MB HNSW graph.
    
      Numbers (103K × 2880-dim vectors, 1 Gi k8s pod):
    
        pgvector HNSW:  97% R@1, 14ms, 806 MB index, max 2,000 dims
        IVF-PQ:         97% R@1, 22ms,  27 MB index, max 32,000 dims
    
      Two vector types: svec (float32, 16K dims) and hsvec (float16, 32K dims). We tested float16 vs float32 on the same dataset — no measurable recall difference. PQ quantization is the bottleneck, not storage
      precision.
    
      The 2,000-dim limit matters: models like Nomic embed v2 output 2880 dims, and pgvector simply can't build HNSW or IVFFlat indexes for them.
    
      PostgreSQL 17 + 18, Apache 2.0. Full crash recovery, online compaction, TOAST, pg_upgrade all tested.
    
      Vector search docs: https://skuznetsov.github.io/pg_sorted_heap/vector-search
    
      Happy to answer questions about the architecture, benchmarks, or trade-offs vs pgvector.