Data Strategy — Storage, Streaming & Retrieval

Three-Tier Data Model

Tier | Location | Content | Persistence
Hot (Active) | 256 GB SSD | Model weights, active project state, vector index shards, OS, services, active LoRA adapters | Persistent; rotation managed by project lifecycle
Warm (Archive) | 5 TB HDD | Completed outputs, full ChromaDB, curated datasets, model checkpoints, versioned adapter archive | Persistent; compressed and indexed
Stream (Live) | RAM only | Real-time web content, live patents, GitHub, papers | Never written to disk; analyzed in RAM, raw content discarded
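
The persistence rules of the three tiers can be sketched as a small policy table with a guard that refuses disk writes for the Stream tier. This is a minimal illustration only: the names Tier, TierPolicy, POLICIES, and assert_writable are hypothetical and do not come from the HIIE codebase.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    HOT = "hot"        # Hot (Active)
    WARM = "warm"      # Warm (Archive)
    STREAM = "stream"  # Stream (Live)


@dataclass(frozen=True)
class TierPolicy:
    location: str
    persistent: bool


# Locations and persistence flags mirror the tier table; the structure is assumed.
POLICIES = {
    Tier.HOT: TierPolicy(location="256 GB SSD", persistent=True),
    Tier.WARM: TierPolicy(location="5 TB HDD", persistent=True),
    Tier.STREAM: TierPolicy(location="RAM only", persistent=False),
}


def assert_writable(tier: Tier) -> None:
    """Raise PermissionError if data in this tier must never reach disk."""
    if not POLICIES[tier].persistent:
        raise PermissionError(f"{tier.name} tier content must not be written to disk")
```

A storage layer could call assert_writable before every write, so that raw Stream content is rejected at a single choke point rather than by convention.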

ChromaDB Collection Architecture

ChromaDB performs Approximate Nearest Neighbour (ANN) search natively via HNSW. HIIE partitions ChromaDB into three explicit, non-overlapping collections:

  • Active RAG Collection: hot vector shards for currently active projects. The index resides on the SSD, with sub-millisecond retrieval as the target, and is migrated to the Archive Collection at project close.
  • Training Partition: labeled output pairs from completed projects, consumed in batches by the ANE background fine-tuning process. It is kept separate so that live retrieval queries cannot contaminate the training data distribution.
  • Archive Collection: embeddings from completed projects, retained for cross-project retrieval and long-term pattern analysis. Stored on the HDD and queried only on explicit demand.
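
The non-overlapping split can be sketched as a routing rule that sends each record to exactly one collection, plus a query scope that never includes the training partition. This is an illustrative sketch, not the HIIE implementation: the functions route and searchable_collections are hypothetical, the name "active_rag" is assumed, and the other two names are borrowed from the /chromadb/ directory layout.

```python
from enum import Enum


class Collection(Enum):
    ACTIVE_RAG = "active_rag"        # assumed name for the Active RAG Collection
    TRAINING = "training_partition"  # matches /chromadb/training_partition/
    ARCHIVE = "project_embeddings"   # matches /chromadb/project_embeddings/


def route(project_closed: bool, is_labeled_pair: bool) -> Collection:
    """Pick the single collection a record belongs to."""
    if not project_closed:
        # Everything produced while a project is open stays in the hot collection.
        return Collection.ACTIVE_RAG
    if is_labeled_pair:
        # Labeled output pairs from completed projects feed fine-tuning only.
        return Collection.TRAINING
    return Collection.ARCHIVE


def searchable_collections(include_archive: bool = False) -> list[Collection]:
    """Collections a live retrieval query may read; never the training partition."""
    scope = [Collection.ACTIVE_RAG]
    if include_archive:
        # The archive is queried only on explicit demand.
        scope.append(Collection.ARCHIVE)
    return scope
```

Because route returns a single value, a record cannot land in two collections, which is one simple way to keep the partitions disjoint.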

All embeddings carry four mandatory metadata fields: project_id (UUID), agent_role (Enum), fscore (Float [0–100]), and adapter_version (String).
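
One way to enforce the four mandatory fields is to validate them before an embedding is written. The sketch below is an assumption-laden illustration: the AgentRole members are placeholders (only an Ethics Officer role is named elsewhere in this document), and EmbeddingMetadata and to_flat_dict are hypothetical names. The flattening step reflects the fact that ChromaDB metadata values must be scalars (string, integer, float, or boolean).

```python
import uuid
from dataclasses import dataclass
from enum import Enum


class AgentRole(Enum):
    # Placeholder roles for illustration; the real enum is not specified here.
    ENGINEER = "engineer"
    ETHICS_OFFICER = "ethics_officer"


@dataclass(frozen=True)
class EmbeddingMetadata:
    project_id: uuid.UUID
    agent_role: AgentRole
    fscore: float
    adapter_version: str

    def __post_init__(self) -> None:
        if not isinstance(self.project_id, uuid.UUID):
            raise TypeError("project_id must be a uuid.UUID")
        if not isinstance(self.agent_role, AgentRole):
            raise TypeError("agent_role must be an AgentRole")
        # NaN fails this comparison too, so it is rejected as out of range.
        if not 0.0 <= self.fscore <= 100.0:
            raise ValueError("fscore must lie in [0, 100]")
        if not self.adapter_version:
            raise ValueError("adapter_version must be a non-empty string")

    def to_flat_dict(self) -> dict:
        """Return scalar-only values suitable for a vector store's metadata."""
        return {
            "project_id": str(self.project_id),
            "agent_role": self.agent_role.value,
            "fscore": float(self.fscore),
            "adapter_version": self.adapter_version,
        }
```

Constructing the object at the ingestion boundary means a record with a missing or out-of-range field fails early instead of surfacing later as a retrieval filter that silently matches nothing.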

HDD Directory Structure

HDD ROOT: /hiie-archive/

/projects/{id}_{client}_{date}/
    /outputs/       -- final deliverable package
    /specs/         -- spec sheets and BOMs
    /cad/           -- FreeCAD, KiCad, SVG files
    /patents/       -- patent drafts and prior art records
    /ethics/        -- immutable Ethics Officer reports
    /checkpoints/   -- serialized agent state snapshots
    metadata.json   -- project manifest and F_score summary

/adapters/{agent_role}/
    /{version}_{date}/   -- versioned LoRA adapter files for this agent role
    active               -- symlink to the currently deployed version

/chromadb/
    /training_partition/   -- labeled pairs for ANE fine-tuning
    /project_embeddings/   -- archive collection
    /archive/              -- compressed snapshots
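
The project and adapter layout could be created with a few lines of standard-library code. The sketch below rests on stated assumptions: the function names, the manifest keys, and the date format are hypothetical, and the active symlink is assumed to sit next to the version directories inside /adapters/{agent_role}/. The symlink swap relies on POSIX rename semantics.

```python
import json
import os
from pathlib import Path

# Subdirectories listed under /projects/{id}_{client}_{date}/ in the layout.
PROJECT_SUBDIRS = ("outputs", "specs", "cad", "patents", "ethics", "checkpoints")


def create_project_dir(root, project_id: str, client: str, date: str) -> Path:
    """Create the project directory skeleton and a minimal manifest."""
    project = Path(root) / "projects" / f"{project_id}_{client}_{date}"
    for name in PROJECT_SUBDIRS:
        (project / name).mkdir(parents=True, exist_ok=True)
    manifest = project / "metadata.json"
    if not manifest.exists():
        # Manifest keys are illustrative; the real schema is not specified here.
        payload = {"project_id": project_id, "client": client, "date": date}
        manifest.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return project


def activate_adapter(root, agent_role: str, version_dir: str) -> Path:
    """Point <role>/active at version_dir, replacing any previous link."""
    role_dir = Path(root) / "adapters" / agent_role
    (role_dir / version_dir).mkdir(parents=True, exist_ok=True)
    link = role_dir / "active"
    tmp = role_dir / "active.tmp"
    if tmp.is_symlink():
        tmp.unlink()  # clear a leftover from an interrupted earlier swap
    # Relative target, so the archive can be moved or remounted as a whole.
    tmp.symlink_to(version_dir, target_is_directory=True)
    os.replace(tmp, link)  # rename over the old link in a single step
    return link
```

For example, activate_adapter("/hiie-archive", "ethics_officer", "v2_2025-02-15") would repoint the active link for that role without deleting the previously deployed version directory, which keeps rollback a matter of repointing the link.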

Data Sources by Domain

Domain | Sources
Patents | USPTO, EPO Espacenet, Google Patents, WIPO PatentScope, JPO
Research Papers | arXiv, PubMed Central, IEEE Xplore, ChemRxiv, bioRxiv
Hardware & Standards | NIST, IPC, JEDEC, Mouser/Digikey datasheets, OSHWA
Code | GitHub public repos, crates.io, PyPI, npm documentation
AI Research | Hugging Face papers, arXiv cs.AI, Semantic Scholar
Materials | Matweb, ASM International, NIST Materials Data Repository
Manufacturing | JLCPCB/PCBWay capability specs, industrial supplier databases
Environmental | EPA databases, carbon accounting frameworks, lifecycle databases