Data Strategy — Storage, Streaming & Retrieval
Three-Tier Data Model
| Tier | Location | Content | Persistence |
|---|---|---|---|
| Hot (Active) | 256 GB SSD | Model weights, active project state, vector index shards, OS, services, active LoRA adapters | Persistent — managed rotation by project lifecycle |
| Warm (Archive) | 5 TB HDD | Completed outputs, full ChromaDB, curated datasets, model checkpoints, versioned adapter archive | Persistent — compressed and indexed |
| Stream (Live) | RAM only | Real-time web content, live patents, GitHub, papers | Never written to disk — analyzed in RAM, raw content discarded |
ChromaDB Collection Architecture
ChromaDB performs Approximate Nearest Neighbour (ANN) search natively via HNSW. HIIE partitions ChromaDB into three explicit, non-overlapping collections:
- Active RAG Collection — hot vector shards for currently active projects. Index resides on SSD for sub-millisecond retrieval. Migrated to archive at project close.
- Training Partition — labeled output pairs from completed projects, batch-consumed by the ANE background fine-tuning process. Partitioned separately to prevent live retrieval queries from contaminating the training data distribution.
- Archive Collection — embeddings from completed projects retained for cross-project retrieval and long-term pattern analysis. Stored on HDD; queried only on explicit demand.
All embeddings carry four mandatory metadata fields: project_id (UUID), agent_role (Enum), fscore (Float [0–100]), and adapter_version (String).
HDD Directory Structure
HDD ROOT: /hiie-archive/
/projects/{id}_{client}_{date}/
/outputs/ -- final deliverable package
/specs/ -- spec sheets and BOMs
/cad/ -- FreeCAD, KiCad, SVG files
/patents/ -- patent drafts and prior art records
/ethics/ -- immutable Ethics Officer reports
/checkpoints/ -- serialized agent state snapshots
metadata.json -- project manifest and F_score summary
/adapters/{agent_role}/{version}_{date}/
versioned LoRA adapter files per agent role
active -> symlink to current deployed version
/chromadb/
/training_partition/ -- labeled pairs for ANE fine-tuning
/project_embeddings/ -- archive collection
/archive/ -- compressed snapshotsData Sources by Domain
| Domain | Sources |
|---|---|
| Patents | USPTO, EPO Espacenet, Google Patents, WIPO PatentScope, JPO |
| Research Papers | arXiv, PubMed Central, IEEE Xplore, ChemRxiv, bioRxiv |
| Hardware & Standards | NIST, IPC, JEDEC, Mouser/Digikey datasheets, OSHWA |
| Code | GitHub public repos, crates.io, PyPI, npm documentation |
| AI Research | Hugging Face papers, arXiv cs.AI, Semantic Scholar |
| Materials | Matweb, ASM International, NIST Materials Data Repository |
| Manufacturing | JLCPCB/PCBWay capability specs, industrial supplier databases |
| Environmental | EPA databases, carbon accounting frameworks, lifecycle databases |