
When I started building Nuculair, the OSINT platform I discuss on my portfolio page, the initial scope was modest: aggregate a few dozen public data sources into a searchable interface for investigators. Eighteen months later, the platform integrates 312 data sources, resolves entities across them using a Neo4j graph database with 47 million relationship edges, and processes approximately 2.3 million data points per day. Nearly every architectural decision from that first month turned out to be wrong and had to be rebuilt. Here is what I learned.

The Problem with Modern OSINT

Open source intelligence gathering has fundamentally changed in the past five years. The volume of publicly available data has exploded, but so has the fragmentation. Information about a single entity -- a person, organization, or domain -- might be scattered across social media platforms, corporate registries, DNS records, certificate transparency logs, court filings, breach databases, dark web forums, cryptocurrency ledgers, and hundreds of other sources. Each source has its own data format, access method, rate limits, and reliability characteristics.

The investigator's challenge is not finding data. It is connecting data. A phone number from a WHOIS record that matches a number in a social media profile that links to an email address that appears in a data breach that is associated with a cryptocurrency wallet. This chain of connections is invisible if you query each source independently. The relationships only emerge when you bring everything together in a unified data model.

Existing tools at the time fell into two categories. Commercial platforms like Maltego and SpiderFoot offered broad source integration but with rigid data models that lost nuance during normalization. They treated everything as a flat entity-relationship graph with minimal metadata on the relationships themselves. Open source tools were typically single-purpose: Sherlock for username enumeration, theHarvester for email discovery, Shodan for infrastructure scanning. Stitching 15 separate tools together with bash scripts was the standard "platform" for most investigators.

I wanted something different: a platform where data sources are treated as first-class citizens with their own reliability scores, freshness timestamps, and provenance chains, where relationships carry as much information as the entities they connect, and where the entire system could run on an investigator's own hardware without sending query patterns to any external service.

Architecture Decision 1: The Data Normalization Layer

The first and most consequential decision was how to normalize data from 300+ sources into a common model. My initial approach was a rigid schema: define entity types (Person, Organization, Domain, IP, Email, Phone, etc.) with fixed attribute sets, and force all incoming data to conform.

This failed spectacularly within the first month. Source A provides a person's name as a single string. Source B splits it into first/middle/last. Source C uses a name object with given names as an array (to handle cultures with multiple given names). Source D provides it in native script alongside a romanized version. A rigid schema means losing data at ingestion time, and in OSINT, the data you throw away is often the data that solves the case.

The solution was a flexible normalization layer built on three principles:

Preserve original data alongside normalized form. Every data point stores both the raw value exactly as received from the source and a normalized representation. An email address from source A might arrive as "JOHN.DOE@GMAIL.COM " (with trailing spaces and uppercase). The normalized form is john.doe@gmail.com, but the raw form is preserved because sometimes the capitalization pattern or the trailing spaces are the investigative signal.
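A minimal sketch of what such a dual representation might look like. The names `DataPoint` and `normalize_email` are illustrative, not Nuculair's actual API:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class DataPoint:
    """One observed value, with the raw form preserved forever."""
    raw: str            # exactly as received from the source
    normalized: str     # canonical form used for matching
    source: str         # source identifier, for provenance
    observed_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

def normalize_email(raw: str, source: str) -> DataPoint:
    # Lowercase and trim for matching, but never discard the original
    # bytes: capitalization quirks and stray whitespace can themselves
    # be investigative signals.
    return DataPoint(raw=raw, normalized=raw.strip().lower(), source=source)

dp = normalize_email("JOHN.DOE@GMAIL.COM ", source="source_a")
```

Matching and graph traversal run on `normalized`; the investigator's detail view always shows `raw`.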

Use confidence scores, not binary truth. When two sources provide conflicting information about the same entity, the system does not pick a winner. Both values are stored with confidence scores derived from source reliability, data freshness, and corroboration count. A phone number confirmed by 4 independent sources with an average freshness of 3 months has a higher confidence than one appearing in a single source that was last updated 2 years ago. The investigator sees all values and their confidence rankings.
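One simple way to combine those three inputs is multiplicatively, with freshness decaying exponentially and each independent corroboration shrinking the remaining doubt. This is a hand-rolled sketch of the idea, not Nuculair's actual scoring function; the half-life value is an assumption:

```python
from datetime import datetime, timedelta, timezone

def confidence(source_reliability: float,
               last_verified: datetime,
               corroborating_sources: int,
               half_life_days: float = 180.0) -> float:
    """Illustrative confidence score in [0, 1)."""
    # Freshness halves every half_life_days: a 6-month-old record
    # keeps half its freshness weight under the default.
    age_days = max((datetime.now(timezone.utc) - last_verified).days, 0)
    freshness = 0.5 ** (age_days / half_life_days)
    # Each independent corroborating source halves the remaining doubt.
    corroboration = 1.0 - 0.5 ** corroborating_sources
    return source_reliability * freshness * corroboration
```

Under this model, a number confirmed by 4 sources with 3-month average freshness scores several times higher than the same number from a single source last updated 2 years ago, which matches the ranking behavior described above.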

Treat schema as emergent, not prescribed. Instead of defining entity types upfront, the normalization layer uses a property graph model where any entity can have any property. Common properties (name, email, phone) have standardized normalization functions, but uncommon properties pass through without modification. When a cryptocurrency exchange API returns a kyc_verification_level field that no other source uses, the system does not discard it. It stores it as a property on the entity with full provenance metadata.
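In code, the emergent-schema idea reduces to: known property keys get a normalizer, unknown keys pass through untouched, and every value carries its provenance. A toy version, with illustrative names and only two normalizers:

```python
NORMALIZERS = {
    "email": lambda v: v.strip().lower(),
    "phone": lambda v: "".join(c for c in v if c.isdigit() or c == "+"),
}

def ingest_properties(entity: dict, incoming: dict, source: str) -> dict:
    """Merge source properties onto an entity without a fixed schema."""
    for key, raw in incoming.items():
        # Unknown keys (like kyc_verification_level) pass through unmodified.
        normalize = NORMALIZERS.get(key, lambda v: v)
        entity.setdefault(key, []).append({
            "raw": raw,
            "normalized": normalize(raw),
            "source": source,   # provenance travels with every value
        })
    return entity

entity = ingest_properties(
    {},
    {"email": " Jane@Example.COM", "kyc_verification_level": "2"},
    source="exchange_api")
```

Each property holds a list of attributed values rather than a single slot, which is what makes the conflicting-values-with-confidence-scores approach above possible.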

This flexible approach increased storage requirements by approximately 3.2x compared to the rigid schema design, but it eliminated data loss at ingestion and preserved investigative signals that a rigid model would destroy.

Architecture Decision 2: Neo4j as the Primary Database

Choosing Neo4j as the primary database rather than a relational database was the most debated decision of the early architecture phase. The arguments against were familiar: PostgreSQL with recursive CTEs can do graph traversal, graph databases are harder to administer, the talent pool is smaller, and a second database engine adds operational complexity. The counterargument that won: in OSINT, the relationships ARE the product.

An OSINT investigation is fundamentally a graph traversal problem. The investigator starts with a known entity (an email address, a phone number, a company name) and traverses relationships to discover connected entities. The query is rarely "find me this specific person." It is "show me everything connected to this email address within 3 hops that has a confidence score above 0.6 and was last updated within 12 months."

In Neo4j, that query is a native Cypher traversal that returns in 12-80ms on our production dataset of 47 million relationship edges. The equivalent PostgreSQL query with recursive CTEs and self-joins took 3-8 seconds on the same data, and that is after aggressive indexing optimization. When an investigator is doing iterative exploration, clicking on entities and expanding connections in real-time, the difference between 40ms and 4 seconds is the difference between a usable tool and a frustrating one.
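That "everything within 3 hops" query maps almost directly onto a Cypher pattern. Here is a sketch of what it might look like, wrapped in a small Python helper so the bounds are parameterized; the labels and property names (`Email`, `confidence`, `last_updated`) are illustrative, not Nuculair's actual schema:

```python
def connected_entities_query(max_hops: int = 3,
                             min_confidence: float = 0.6,
                             max_age_months: int = 12) -> str:
    """Build a bounded, confidence-filtered Cypher expansion query."""
    return f"""
    MATCH (seed:Email {{normalized: $email}})
    MATCH path = (seed)-[rels*1..{max_hops}]-(connected)
    WHERE ALL(r IN rels WHERE
              r.confidence >= {min_confidence}
              AND r.last_updated >= datetime() - duration({{months: {max_age_months}}}))
    RETURN DISTINCT connected, length(path) AS hops
    ORDER BY hops
    """

query = connected_entities_query()
```

The variable-length pattern `-[rels*1..3]-` and the `ALL(...)` predicate over the relationship list are what the recursive-CTE equivalent has to reconstruct by hand, which is where the multi-second PostgreSQL timings came from.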

The graph model also naturally handles a problem that is extremely awkward in relational databases: many-to-many relationships with metadata. A person entity might be connected to a company entity through multiple relationships: employee (with start and end dates), shareholder (with percentage), director (with appointment date), and defendant in litigation (with case number). Each relationship carries its own properties, confidence scores, and source provenance. In Neo4j, these are first-class relationship objects. In PostgreSQL, they require junction tables with variable schemas or JSON columns, and every query becomes a multi-join operation.

The operational complexity is real, though. Neo4j clustering is less mature than PostgreSQL. Backup and restore procedures require more careful planning. And the Cypher query language, while elegant for graph patterns, is less familiar to most developers. We mitigate this by keeping PostgreSQL in the stack for non-graph data: user accounts, audit logs, API key management, and session state. The dual-database architecture adds a synchronization concern but gives us the best tool for each job.

Architecture Decision 3: Source Abstraction and Rate Management

Integrating 312 data sources is not primarily a technical challenge. It is an operational one. Each source has its own API format, authentication method, rate limits, error codes, pagination scheme, and behavioral quirks. Some sources throttle based on request frequency. Others throttle based on unique query patterns. A few will silently return partial results when overloaded rather than returning an error. At least one source I will not name returns HTTP 200 with a valid-looking JSON body that contains completely fabricated data when you exceed their unpublished rate limit.

The source abstraction layer wraps every data source in a standardized interface with six methods: search(), enrich(), validate(), monitor(), health(), and metadata(). Each wrapper handles the translation between the platform's internal query format and the source's specific API. Rate limiting is managed per-source with token bucket algorithms that respect both published and empirically determined limits.
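The six-method interface might be expressed as an abstract base class along these lines; the method names come from the description above, but the signatures are my illustrative guesses:

```python
from abc import ABC, abstractmethod

class Source(ABC):
    """Standardized wrapper that every data source integration implements."""

    @abstractmethod
    def search(self, query: dict) -> list[dict]:
        """Find entities matching a platform-format query."""

    @abstractmethod
    def enrich(self, entity: dict) -> dict:
        """Fetch additional attributes for an already-known entity."""

    @abstractmethod
    def validate(self, value: str) -> bool:
        """Check whether a value is still live/accurate at this source."""

    @abstractmethod
    def monitor(self, entity: dict) -> None:
        """Register the entity for ongoing change detection."""

    @abstractmethod
    def health(self) -> dict:
        """Report current availability and observed error rates."""

    @abstractmethod
    def metadata(self) -> dict:
        """Describe rate limits, reliability score, and data coverage."""
```

Everything above this interface (the query planner, the rate manager, the cache) talks only to these six methods, so a source's behavioral quirks stay contained inside its wrapper.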

The rate management system maintains a profile for each source that includes: maximum requests per minute, maximum requests per hour, maximum unique queries per day, minimum inter-request delay, retry behavior (exponential backoff parameters), and a trust score that decreases when the source returns inconsistent results. These profiles are initially configured from documentation and then automatically tuned based on observed behavior. If a source starts returning errors at 45 requests per minute despite documenting a 60 RPM limit, the system adapts within one adjustment cycle (typically 5 minutes).
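The token bucket at the heart of that per-source profile is a few lines of state. A minimal sketch (the adaptive tuning that lowers `rate` when errors appear is omitted here):

```python
import time

class TokenBucket:
    """Per-source limiter: refills at `rate` tokens/sec up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate            # sustained tokens per second
        self.capacity = capacity    # burst allowance
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self, tokens: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False

# A source documented at 60 requests/minute -> 1 token/second,
# with a small burst allowance on top:
bucket = TokenBucket(rate=1.0, capacity=5.0)
```

When the tuner observes errors below the documented limit, it only has to shrink `rate` for that one source's bucket; nothing else in the pipeline changes.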

Crucially, the abstraction layer also handles source degradation gracefully. When a source goes down -- and with 312 sources, some source is always experiencing issues -- the system transparently falls back to cached results (with a reduced confidence score and a freshness penalty) and queues a retry. The investigator sees a subtle indicator that some results are from cache, but their workflow is not interrupted by a single source failure.
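The fallback logic is conceptually simple: try the live source, and on failure serve the cached answer with its confidence discounted. A sketch under assumptions (the penalty value, the dict-backed cache, and the omitted retry queue are all illustrative):

```python
def query_with_fallback(source, cache: dict, query: dict,
                        confidence_penalty: float = 0.3):
    """Try the live source; fall back to cache with reduced confidence.

    Returns (results, from_cache). A real implementation would also
    enqueue a retry and apply a freshness penalty.
    """
    key = repr(query)  # toy cache key; a real system would hash canonically
    try:
        results = source.search(query)
        cache[key] = results
        return results, False
    except Exception:  # broad on purpose: any source failure triggers fallback
        cached = [dict(r) for r in cache.get(key, [])]
        for r in cached:
            r["confidence"] = r.get("confidence", 1.0) * (1 - confidence_penalty)
        return cached, True
```

The `from_cache` flag is what drives the subtle "stale results" indicator in the UI without blocking the investigator's workflow.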

Architecture Decision 4: Investigator Privacy

This was non-negotiable from day one and it shaped every other decision. An investigator's query patterns are themselves intelligence. If an adversary can observe what an investigator is searching for -- which names, domains, email addresses, and IP addresses they are querying -- that adversary gains significant operational intelligence. They know they are being investigated. They know which of their identities have been discovered. They know the investigation's direction.

The platform runs entirely on the investigator's own infrastructure. No telemetry. No cloud dependency for core functionality. Every external API call (to data sources) is routed through a configurable proxy layer that supports Tor, residential proxy rotation, and direct connections per-source. The investigator can configure different proxy strategies for different source categories: route dark web queries through Tor, route social media queries through residential proxies, and route corporate registry queries directly (since those are legitimate public records).

DNS resolution for source APIs happens through DNS-over-HTTPS to prevent DNS-level monitoring. The TLS client fingerprint (the ClientHello parameters used for JA3-style identification) is randomized per session to frustrate passive traffic analysis. And the system generates cover traffic -- benign queries to non-relevant sources -- to obscure the actual investigation pattern from any network-level observer.

These privacy measures add approximately 200-400ms of latency to each source query (primarily from Tor routing and proxy overhead). For an interactive investigation session, this is noticeable but acceptable. For bulk enrichment jobs, the system parallelizes across multiple proxy paths to maintain throughput.

Entity Resolution: The Hard Problem

The technically hardest problem in the entire platform is entity resolution: determining when two data points from different sources refer to the same real-world entity. Is "John Smith" at Gmail from source A the same person as "J. Smith" in a corporate filing from source B? Sometimes yes, sometimes no. Getting this wrong in either direction is costly. False merges contaminate investigations with irrelevant data. False splits fragment an entity's profile across multiple nodes, hiding connections that should be visible.

The resolution engine uses a multi-signal approach. For each potential entity match, it evaluates: name similarity (using Jaro-Winkler distance with culture-aware normalization), contact overlap (shared email, phone, or address), temporal correlation (did both records appear in similar time windows), network proximity (are the entities connected to common third entities), and behavioral fingerprinting (do the entities exhibit similar patterns across platforms).

Each signal produces a score between 0 and 1. The scores are combined using a weighted model that was trained on 50,000 manually labeled entity pairs. The combined score determines the resolution action: scores above 0.85 trigger automatic merge with a provenance record. Scores between 0.5 and 0.85 create a "suggested merge" for investigator review. Scores below 0.5 are kept as separate entities.
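The threshold logic can be shown as a small decision function. The weights below are made up for illustration; the real combination was learned from the labeled pairs, and only the 0.85 and 0.5 cutoffs come from the text above:

```python
# Illustrative weights -- the production model was trained on
# 50,000 manually labeled entity pairs, not hand-tuned.
WEIGHTS = {
    "name_similarity": 0.30,
    "contact_overlap": 0.30,
    "temporal_correlation": 0.10,
    "network_proximity": 0.20,
    "behavioral_fingerprint": 0.10,
}

def resolve(signals: dict[str, float]) -> str:
    """Map per-signal scores in [0, 1] to a resolution action."""
    score = sum(WEIGHTS[k] * signals.get(k, 0.0) for k in WEIGHTS)
    if score >= 0.85:
        return "auto_merge"       # merged, with a provenance record
    if score >= 0.5:
        return "suggest_merge"    # queued for investigator review
    return "keep_separate"
```

Missing signals default to 0.0 rather than being skipped, which biases the combined score downward -- consistent with the precision-over-recall stance described next.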

The 0.85 automatic merge threshold was set conservatively after testing showed that lowering it to 0.75 increased false merges by 340% while only catching 12% more true matches. In OSINT, a false merge -- incorrectly linking an innocent person to a target's criminal network -- has severe consequences. The threshold favors precision over recall.

Lessons from Nuculair

After 18 months of development and deployment with real investigators, a few hard-won lessons stand out:

Freshness metadata is more valuable than the data itself. Investigators care less about what a record says and more about when it was last verified. A phone number from 6 months ago is actionable. The same number from 3 years ago is a starting point for re-verification. Building freshness tracking into the core data model from day one was one of the few things we got right initially.

Graph visualization is necessary but not sufficient. Every OSINT tool has a node-and-edge graph visualization. Most of them are useless beyond 50 nodes because they become unreadable hairballs. The breakthrough was giving investigators multiple views: a graph view for relationship exploration (limited to the investigator's current focus area), a timeline view for temporal analysis, a map view for geographic correlation, and a tabular view for bulk data review. Each view is a lens on the same underlying graph data.

Source reliability changes over time. A source that was excellent in January might start returning stale data by June because they changed their data pipeline. The health monitoring system that tracks response consistency, data freshness trends, and correlation with other sources was added after we discovered that a source we heavily relied on had been serving cached data for 4 months without any API-level indication.

Investigators do not want to understand graph theory. The query interface must abstract away Cypher and graph traversal concepts. "Show me everything connected to this email address" is the right level of abstraction. "Execute a variable-depth BFS traversal with confidence-weighted edge scoring" is not. The natural language query layer, powered by local LLMs, translates investigator intent into optimized Cypher queries and translates results back into investigative narratives.

Building an OSINT platform at this scale is an exercise in managing complexity across every layer: data ingestion, normalization, storage, entity resolution, privacy, and presentation. The technical challenges are significant, but the operational challenges -- keeping 312 sources healthy, reliable, and privacy-safe -- are where the real engineering effort lives.
