DEC-0027: Online-Source Ingest Hardening and Retry Strategy
Decision
Treat the online-source audit as a discovery success whose results are bottlenecked at ingest conversion.
The next online-source tranche should prioritize:
- hardening the ingest pipeline against the concrete blocker classes exposed by wave 1
- retrying the blocked wave 1 books before broadening to a larger wave 2
- expecting new essays and deeper outputs only from books that ingest
successfully and cross the activation threshold into kg-relevant
Context
Perplexity's audit in PR-0146 found a substantial set of free texts,
and the operationalization pass converted that set into a real ingest backlog.
But the first execution tranche in PR-0150 showed that discovery coverage and
ingest readiness are not the same thing:
- 107 free texts were identified in the audit
- 99 books were normalized into ingest candidates
- 20 books were selected for wave 1
- only 4 of those 20 were successfully ingested
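The funnel above can be made explicit with a few lines of arithmetic. This is a minimal sketch that uses only the four counts listed; the stage labels and the `stage_rates` helper are illustrative names, not part of the existing toolchain. The last stage is the one this decision targets: 4 of 20 wave 1 books, or 20%.

```python
# Funnel counts taken from the audit and wave 1 figures listed above.
funnel = [
    ("identified in audit", 107),
    ("normalized into candidates", 99),
    ("selected for wave 1", 20),
    ("successfully ingested", 4),
]


def stage_rates(stages):
    """Return (label, count, fraction of the previous stage) for each later stage."""
    rates = []
    for (_, prev_count), (label, count) in zip(stages, stages[1:]):
        rates.append((label, count, count / prev_count))
    return rates


for label, count, rate in stage_rates(funnel):
    print(f"{label}: {count} ({rate:.1%} of previous stage)")
```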
The main failures were not editorial. They were adapter and source-shape problems:
- Sacred Texts request blocking
- malformed or truncated normalized URLs
- Internet Archive viewer/detail pages mistaken for text sources
- partial chapter traversal on Rudolf Steiner Archive
- missing PDF extraction support
- transient network failures
Two of the four successful ingests became kg-relevant and correctly produced
full essay pipelines. The other two remained searchable-internal, which is why
they produced activation briefs but not essays.
Strategy
First priority: adapter hardening
The pipeline should learn the source shapes it already encountered rather than assuming more research alone will solve them.
The immediate hardening targets are:
- canonical Internet Archive text/download resolution
- Sacred Texts fetch behavior or deterministic fallback-source selection
- Rudolf Steiner Archive full-book traversal from index pages
- PDF extraction for public-domain sources where HTML is unavailable
- URL normalization and validation before a candidate reaches execution
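As one illustration of the last target, a preflight check could reject malformed URLs and Internet Archive pages before a candidate reaches execution. This is a sketch under stated assumptions, not the pipeline's actual validator: the `preflight` function and its messages are hypothetical, and it assumes that archive.org paths starting with `/details/` or `/stream/` are item or viewer pages while `/download/` paths point at files.

```python
from urllib.parse import urlsplit


def preflight(url):
    """Return a list of problems; an empty list means the URL may proceed."""
    try:
        parts = urlsplit(url.strip())
        host = (parts.hostname or "").lower()
    except ValueError:
        return ["unparseable URL"]

    problems = []
    if parts.scheme not in ("http", "https"):
        problems.append("missing or unsupported scheme")
    if "." not in host:
        problems.append("missing or malformed host")

    # Assumption: /details/ and /stream/ are pages, not text sources.
    if host == "archive.org" or host.endswith(".archive.org"):
        segments = [s for s in parts.path.split("/") if s]
        if segments and segments[0] in ("details", "stream"):
            problems.append("Internet Archive viewer/detail page, not a text file")
    return problems


# "example-item" is a placeholder identifier, not a real wave 1 book.
print(preflight("https://archive.org/details/example-item"))
print(preflight("https://archive.org/download/example-item/example-item_djvu.txt"))
```

Returning a list of problems rather than a boolean lets the same check feed a later report of blocker causes.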
Second priority: bounded retry
Retry the blocked wave 1 set after the pipeline is improved.
Do not jump straight to wave 2 until the system can convert a higher fraction of known-good candidates than the 4 of 20 achieved in wave 1. Otherwise the backlog will grow faster than it can be ingested.
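The gate described here could be expressed as a simple conversion-rate check. This sketch is only illustrative: `ready_for_next_wave` is a hypothetical helper, and the 0.5 threshold is an assumed placeholder, since this record does not set a target fraction.

```python
def ready_for_next_wave(ingested, attempted, min_rate):
    """Return True when the retry conversion rate meets the chosen threshold."""
    if attempted <= 0:
        raise ValueError("attempted must be positive")
    return ingested / attempted >= min_rate


# Wave 1 baseline from this record: 4 of 20 ingested.
# The 0.5 threshold is an assumed example value.
print(ready_for_next_wave(4, 20, min_rate=0.5))
```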
Third priority: output expectations
Do not assume that every sourced text should immediately yield essays.
The expected output ladder is:
- sourced candidate
- successful corpus ingest
- activation level assignment
- downstream prompts only when the activation level justifies them
Essays should continue to emerge from kg-relevant or more active sources, not
from every public-domain text found on the web.
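The ladder can be sketched as an ordered activation level that gates downstream outputs. The two level names come from this record; their numeric ordering, the `Activation` enum, and the `downstream_outputs` function are assumptions made for illustration. More active levels, if the system defines them, would sit above `KG_RELEVANT`.

```python
from enum import IntEnum


class Activation(IntEnum):
    # Level names come from this record; the numeric ordering is assumed.
    SEARCHABLE_INTERNAL = 1
    KG_RELEVANT = 2


def downstream_outputs(level):
    """Map an activation level to the outputs wave 1 produced at that level."""
    if level >= Activation.KG_RELEVANT:
        return ["essay pipeline"]
    return ["activation brief"]


for level in Activation:
    print(level.name, downstream_outputs(level))
```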
Consequences
Positive
- The repo stops confusing "found online" with "operationally ingested."
- Retry work becomes measurable and bounded instead of diffuse.
- Future essay expectations become structurally honest.
Tradeoffs
- Wave 2 breadth is deferred in favor of reliability.
- Some books may still require manual source replacement or research backfill.
- The ingest toolchain becomes more capable and therefore somewhat more complex.
Implementation Priority
- patch the source adapters and preflight validation
- retry the 16 blocked wave 1 books
- record a retry report with resolved vs. still-blocked causes
- only then generate the next ingest wave
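A retry report that separates resolved from still-blocked causes might be tallied as follows. This is a sketch only: `RetryResult`, `summarize`, the cause labels, and the book names are hypothetical placeholders, not actual wave 1 results.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class RetryResult:
    book: str
    original_cause: str
    still_blocked_by: Optional[str] = None  # None means the retry ingested


def summarize(results):
    """Count resolved original causes and remaining blocker causes."""
    resolved = Counter(
        r.original_cause for r in results if r.still_blocked_by is None
    )
    blocked = Counter(
        r.still_blocked_by for r in results if r.still_blocked_by is not None
    )
    return {"resolved": dict(resolved), "still_blocked": dict(blocked)}


# Placeholder entries for illustration; not real retry data.
sample = [
    RetryResult("book-a", "malformed-url"),
    RetryResult("book-b", "pdf-extraction", still_blocked_by="pdf-extraction"),
    RetryResult("book-c", "malformed-url"),
]
print(summarize(sample))
```

Keeping the original cause alongside the remaining blocker makes it possible to see which hardening targets actually paid off on retry.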