Data Strategy | HIIE Whitepaper

Three-Tier Data Model

Tier	Location	Content	Persistence
Hot (Active)	256 GB SSD	Model weights, active project state, vector index shards, OS, services, active LoRA adapters	Persistent; contents rotated according to project lifecycle
Warm (Archive)	5 TB HDD	Completed outputs, full ChromaDB, curated datasets, model checkpoints, versioned adapter archive	Persistent — compressed and indexed
Stream (Live)	RAM only	Real-time web content, live patents, GitHub, papers	Never written to disk — analyzed in RAM, raw content discarded
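
The Stream tier's rule (analyze in RAM, discard the raw content) can be illustrated with a minimal sketch. The names StreamFinding and analyze_in_memory, and the simple term-matching "analysis", are hypothetical placeholders and not part of HIIE; the point is only that the function never opens a file and returns derived results rather than the payload.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamFinding:
    """Derived result kept after analysis (hypothetical shape).

    The raw payload is deliberately not a field of this record.
    """
    source_url: str
    word_count: int
    matched_terms: tuple


def analyze_in_memory(source_url: str, raw: bytes, terms: list) -> StreamFinding:
    """Analyze a fetched payload without touching the filesystem."""
    text = raw.decode("utf-8", errors="replace").lower()
    matched = tuple(t for t in terms if t.lower() in text)
    # `raw` and `text` go out of scope on return; only the finding survives.
    return StreamFinding(
        source_url=source_url,
        word_count=len(text.split()),
        matched_terms=matched,
    )
```

Only the returned finding would be eligible for persistence in the Hot or Warm tier; what a real analysis step extracts is outside the scope of this sketch.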

ChromaDB Collection Architecture

ChromaDB performs Approximate Nearest Neighbour (ANN) search natively via HNSW. HIIE partitions ChromaDB into three explicit, non-overlapping collections:

Active RAG Collection — hot vector shards for currently active projects. Index resides on SSD for sub-millisecond retrieval. Migrated to archive at project close.
Training Partition — labeled output pairs from completed projects, batch-consumed by the ANE background fine-tuning process. Partitioned separately to prevent live retrieval queries from contaminating the training data distribution.
Archive Collection — embeddings from completed projects retained for cross-project retrieval and long-term pattern analysis. Stored on HDD; queried only on explicit demand.
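
As a rough sketch of the three-collection layout and the migration at project close, the code below works against any client object that exposes ChromaDB's get_or_create_collection, and collections that expose get, add, and delete. The collection name active_rag is an assumption (the other two names are borrowed from the HDD directory listing), and the function names are hypothetical, not HIIE's actual code.

```python
# Collection names: "active_rag" is assumed; the other two mirror the
# /chromadb/ directory names in the archive layout.
COLLECTION_NAMES = ("active_rag", "training_partition", "project_embeddings")


def ensure_collections(client):
    """Create or open the three non-overlapping collections."""
    return {
        name: client.get_or_create_collection(name=name)
        for name in COLLECTION_NAMES
    }


def migrate_project_to_archive(collections, project_id: str) -> int:
    """Move one project's embeddings from the active to the archive collection.

    Returns the number of records moved.
    """
    active = collections["active_rag"]
    archive = collections["project_embeddings"]

    # Select every record tagged with this project via a metadata filter.
    batch = active.get(
        where={"project_id": project_id},
        include=["embeddings", "metadatas", "documents"],
    )
    ids = batch["ids"]
    if not ids:
        return 0  # nothing to move; avoids calling add() with empty lists

    # Copy first, delete second, so an interruption never loses data.
    archive.add(
        ids=ids,
        embeddings=batch["embeddings"],
        metadatas=batch["metadatas"],
        documents=batch["documents"],
    )
    active.delete(ids=ids)
    return len(ids)
```

With the real library, a client created by chromadb.PersistentClient(path=...) could be passed to ensure_collections. Where each collection's storage physically lives (SSD versus HDD) is a deployment detail this sketch does not model.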

All embeddings carry four mandatory metadata fields: project_id (UUID), agent_role (Enum), fscore (Float [0–100]), and adapter_version (String).
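
One way to enforce the four mandatory fields before an embedding is written is a small validating record, sketched below. The AgentRole members are hypothetical examples (this section does not enumerate the roles), and the class and method names are assumptions rather than HIIE's actual schema code.

```python
import uuid
from dataclasses import dataclass
from enum import Enum


class AgentRole(str, Enum):
    # Hypothetical role names for illustration only.
    ETHICS_OFFICER = "ethics_officer"
    DESIGNER = "designer"


@dataclass(frozen=True)
class EmbeddingMetadata:
    project_id: uuid.UUID
    agent_role: AgentRole
    fscore: float          # must lie in [0, 100]
    adapter_version: str

    def __post_init__(self):
        if not isinstance(self.project_id, uuid.UUID):
            raise TypeError("project_id must be a uuid.UUID")
        if not isinstance(self.agent_role, AgentRole):
            raise TypeError("agent_role must be an AgentRole member")
        if not 0.0 <= float(self.fscore) <= 100.0:
            raise ValueError("fscore must be within [0, 100]")
        if not self.adapter_version:
            raise ValueError("adapter_version must be a non-empty string")

    def to_chroma(self) -> dict:
        """Flatten to scalar values, since ChromaDB metadata values must be
        plain scalars (str, int, float, or bool)."""
        return {
            "project_id": str(self.project_id),
            "agent_role": self.agent_role.value,
            "fscore": float(self.fscore),
            "adapter_version": self.adapter_version,
        }
```

The dictionary returned by to_chroma() is the shape that would be passed in the metadatas argument when adding an embedding, so a record with a missing or out-of-range field is rejected before it reaches the store.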

HDD Directory Structure

HDD ROOT: /hiie-archive/

/projects/{id}_{client}_{date}/
    /outputs/       -- final deliverable package
    /specs/         -- spec sheets and BOMs
    /cad/           -- FreeCAD, KiCad, SVG files
    /patents/       -- patent drafts and prior art records
    /ethics/        -- immutable Ethics Officer reports
    /checkpoints/   -- serialized agent state snapshots
    metadata.json   -- project manifest and F_score summary

/adapters/{agent_role}/
    /{version}_{date}/   -- versioned LoRA adapter files for this agent role
    active               -- symlink to the currently deployed version

/chromadb/
    /training_partition/   -- labeled pairs for ANE fine-tuning
    /project_embeddings/   -- archive collection
    /archive/              -- compressed snapshots
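
The layout above could be created with a short script along these lines. This is a sketch under stated assumptions: the function names are hypothetical, the manifest written here is a bare placeholder (the real metadata.json also carries the F_score summary), and the active symlink is assumed to sit directly under /adapters/{agent_role}/. The symlink is swapped with a rename so readers never see a missing link on POSIX systems.

```python
import json
import os
from pathlib import Path

PROJECT_SUBDIRS = ("outputs", "specs", "cad", "patents", "ethics", "checkpoints")


def create_project_tree(root: Path, project_id: str, client: str, date: str) -> Path:
    """Create /projects/{id}_{client}_{date}/ with its standard subdirectories."""
    project_dir = root / "projects" / f"{project_id}_{client}_{date}"
    for sub in PROJECT_SUBDIRS:
        (project_dir / sub).mkdir(parents=True, exist_ok=True)

    manifest = project_dir / "metadata.json"
    if not manifest.exists():
        # Placeholder manifest; the field names are assumptions.
        manifest.write_text(
            json.dumps({"project_id": project_id, "client": client, "date": date}, indent=2)
        )
    return project_dir


def activate_adapter(root: Path, agent_role: str, version: str, date: str) -> Path:
    """Point adapters/{agent_role}/active at the given version directory."""
    role_dir = root / "adapters" / agent_role
    target = role_dir / f"{version}_{date}"
    target.mkdir(parents=True, exist_ok=True)

    link = role_dir / "active"
    tmp = role_dir / "active.tmp"
    if tmp.is_symlink() or tmp.exists():
        tmp.unlink()  # clear any leftover from an interrupted earlier run

    # Relative target keeps the link valid if the archive root is remounted.
    tmp.symlink_to(target.name, target_is_directory=True)
    os.replace(tmp, link)  # rename over the old link in one step
    return link
```

For example, activate_adapter(Path("/hiie-archive"), "designer", "v2", "2025-02-01") would leave earlier version directories in place and only repoint active, which is what makes rollback a single call with the previous version.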

Data Sources by Domain

Domain	Sources
Patents	USPTO, EPO Espacenet, Google Patents, WIPO PatentScope, JPO
Research Papers	arXiv, PubMed Central, IEEE Xplore, ChemRxiv, bioRxiv
Hardware & Standards	NIST, IPC, JEDEC, Mouser/Digikey datasheets, OSHWA
Code	GitHub public repos, crates.io, PyPI, npm documentation
AI Research	Hugging Face papers, arXiv cs.AI, Semantic Scholar
Materials	Matweb, ASM International, NIST Materials Data Repository
Manufacturing	JLCPCB/PCBWay capability specs, industrial supplier databases
Environmental	EPA databases, carbon accounting frameworks, lifecycle databases
