Blockchain Data: Foundation and Frontiers
Blockchains are not merely networks for value transfer—they are also vast, decentralized data ecosystems. Every transaction, smart contract execution, and protocol update generates structured yet fragmented on-chain data. However, the true potential of this data remains unrealized without the underlying infrastructure that enables its storage, indexing, analysis, and exchange.
This three-part research series examines the growing intersection of Web3 and data infrastructure across five critical layers: data acquisition, storage, indexing & querying, analytics, and marketplaces. This first installment provides a high-level overview of the current landscape, identifying key players, technological developments, and emerging challenges. Subsequent pieces will delve deeper into each segment, assessing their technical architectures, economic models, and long-term viability.
As blockchain adoption expands beyond speculative trading into enterprise and institutional use cases, the efficiency and reliability of data infrastructure will play an increasingly pivotal role. This series aims to provide a structured framework for understanding how these systems operate today—and where they are headed next.
The Data Moment
In today’s digital age, data is often referred to as the new gold. It is easy to see why: data has become the foundation of decision-making, innovation, and economic growth. According to research from S&P Global, 90% of enterprise businesses reported that data is becoming increasingly important to their overall business. The global big data market was valued at $327.3 billion in 2023 and is projected to grow at a CAGR of 14.9%, reaching $862.3 billion by 2030.
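As a quick sanity check on the market projection, the sketch below compounds the 2023 figure at the stated growth rate through 2030. The function name is illustrative and not taken from the cited research; the only inputs are the numbers quoted in the text.

```python
def project_value(start_value: float, cagr: float, years: int) -> float:
    """Compound start_value at a constant annual growth rate for `years` years."""
    return start_value * (1 + cagr) ** years


# Figures quoted in the text: $327.3B in 2023, 14.9% CAGR, horizon 2030.
projected_2030 = project_value(327.3, 0.149, 2030 - 2023)
print(f"Projected 2030 market size: ${projected_2030:.1f}B")
```

The result lands at roughly $865 billion, within a fraction of a percent of the quoted $862.3 billion; the small gap is consistent with rounding in the published growth rate.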
Since 2022, the private equity industry has spent an estimated $170 billion or more acquiring or investing in over 450 data center companies. The breakdown by year shows that private deal activity for data centers accelerated sharply in 2024, passing the $70 billion mark.
As the value of data continues to grow, so does the overlap between data and web3. The blockchain industry market size is estimated to reach $470 billion by 2030. With this growth comes greater demand for high-quality, fast, and transparent data both on-chain and off-chain. The global, decentralized, and composable nature of blockchain technology, combined with cryptographic incentives and zero-knowledge technology, opens a new frontier for data generation, access, and utilization.
According to PitchBook data, blockchain data infrastructure startups raised approximately $2.1 billion in H1 2024, a 40% increase over H1 2023.
Key drivers for the increased investment volume include:
- Integration of AI with blockchain data (e.g., decentralized AI training datasets).
- Demand for institutional-grade compliance tools (e.g., FATF Travel Rule solutions).
- Growth of real-world asset (RWA) tokenization requiring reliable oracles and storage.
The blockchain data landscape has evolved into a complex ecosystem with distinct segments serving various stages of the data value chain. The framework below follows how data flows through the blockchain data stack, from initial capture through processing, analysis, and ultimately monetization:

Anatomy of the Stack
1. Data Acquisition
Blockchain nodes and node operators are generally seen as the entry point of on-chain data extraction. However, the data acquisition we focus on here refers more to capturing both on-chain and off-chain (real-life) datasets, especially hard-to-access data. Traditionally, such data is scraped from online sources or gathered through interactions with particular applications, and it often sits in the hands of large corporations such as social media platforms and banks. It is also often collected and shared without user consent.
Web3 improves data acquisition in two ways:
- Incentivizing user-provided data. Token incentives and DePIN (Decentralized Physical Infrastructure Networks) enable users to share offline behaviors (e.g., habits, financial history), democratizing access.
- Enhancing privacy. Web3 technologies can give users greater control over their personal information. Through privacy-preserving technologies like zero-knowledge proofs (ZK-proofs) and multi-party computation (MPC), users are able to share specific insights without exposing raw data. This enables secure and trustless data sharing, addressing privacy concerns while still enabling data monetization. The examples mentioned below all draw on such technologies to preserve user control and privacy over data access:
- Vana is a decentralized data marketplace that enables users to monetize their personal data while maintaining privacy. Its key differentiator is its focus on structured, user-owned datasets that can be directly integrated into AI models, making it a crucial bridge between web3 data economies and machine learning applications.
- Grass is similarly a web3-native data acquisition network that collects browsing data through a permissioned opt-in model. Unlike Vana, which focuses on structured datasets, Grass specializes in real-time, high-volume web traffic data, making it a valuable resource for AI models, ad-tech firms, and market intelligence platforms.
- ORO (@getoro_xyz) is a new data acquisition company that focuses on proprietary financial and health data. Its value-add lies in gaining access to proprietary data, to provide clients with new data sources and insights.
2. Data Storage
Data storage refers to the warehousing of acquired data. The current centralized model relies on large physical data centers, which can be costly and vulnerable to attacks since the risks are concentrated.
Web3 mitigates these risks by offering decentralized storage. In a distributed storage network, data is spread across multiple nodes, which improves redundancy and uptime, removes single points of failure, enhances security, and can reduce costs. By leveraging blockchain technology and peer-to-peer storage solutions, web3 platforms can create a more resilient and efficient data storage framework. The approach is also easier to sustain as demand grows, since the network can scale by adding nodes instead of building additional physical data centers.
Examples of web3 data storage projects include Arweave, Filecoin, and Storj:
- Arweave is a decentralized, permanent data storage network designed for long-term archiving. Unlike traditional cloud storage solutions, Arweave uses a “pay once, store forever” model, ensuring data remains accessible indefinitely without recurring costs. Its unique value lies in permanence and immutability, making it ideal for archiving important records and dApps.
- Filecoin is a decentralized storage marketplace that incentivizes users to rent out spare storage capacity. Unlike Arweave, which focuses on permanent storage, Filecoin operates more like a decentralized cloud storage system, allowing users to store and retrieve large datasets efficiently. Its competitive advantage is its scalable, cost-efficient marketplace that dynamically adjusts pricing based on supply and demand, making it suitable for enterprise-scale data storage.
- Storj is a decentralized cloud storage platform that distributes encrypted, sharded files across a global network of nodes, offering enterprise-grade security with end-to-end encryption, S3 compatibility, and pay-as-you-go pricing. Unlike Filecoin (auction-based long-term storage) or Arweave (permanent one-time payment), Storj focuses on fast, scalable storage for enterprises and Web3 apps, with no data egress fees and seamless AWS S3 integration. It aims to provide privacy-first, enterprise-ready decentralized storage focused on institutional clients.
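The core idea behind these networks, splitting data into pieces and spreading the pieces across independent nodes, can be illustrated with a toy sketch. It is not how Storj, Filecoin, or Arweave actually place data (production networks add encryption, erasure coding, and economic incentives); the shard size, replica count, node names, and function names are all assumptions made for illustration.

```python
import hashlib


def split_into_shards(data: bytes, shard_size: int) -> list[bytes]:
    """Cut data into fixed-size shards (the last one may be shorter)."""
    return [data[i:i + shard_size] for i in range(0, len(data), shard_size)]


def place_shards(shards: list[bytes], nodes: list[str], replicas: int) -> dict[str, list[str]]:
    """Map each shard's SHA-256 digest to `replicas` distinct nodes.

    Placement is derived from the digest, so any client can recompute it.
    """
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for shard in shards:
        digest = hashlib.sha256(shard).hexdigest()
        start = int(digest, 16) % len(nodes)
        placement[digest] = [nodes[(start + k) % len(nodes)] for k in range(replicas)]
    return placement


# Hypothetical node names and payload.
nodes = ["node-a", "node-b", "node-c", "node-d"]
payload = b"example payload for a decentralized storage network"
shards = split_into_shards(payload, 16)
placement = place_shards(shards, nodes, replicas=2)
for digest, holders in placement.items():
    print(digest[:12], holders)
```

Because every shard is held by two distinct nodes, losing any single node still leaves at least one copy of each shard, which is the redundancy property the section describes.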
3. Data Indexing and Querying
Data indexing and querying involve organizing and retrieving stored data. Many current systems rely on centralized providers. A centralized approach gives the firm greater control over the entire process, which supports stability and reliability in the data provided, but the trade-off is added latency and high fixed costs. This is because the firm has to sift through its entire data catalog to obtain the necessary information, leading to inefficiencies and delays.
With web3, data can be stored across multiple nodes, and this enables parallel querying, reducing retrieval times and computational demands. Additionally, decentralized indexing solutions can eliminate the need for intermediaries, allowing for more direct and transparent data access. This streamlines application development and fosters a more open data ecosystem.
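The parallel-querying idea can be sketched with a thread pool that fans one lookup out to several partitions and merges the partial results. The shards, field names, and addresses here are hypothetical in-memory stand-ins; in a real network each call would be a request to a separate node rather than a local list scan.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions of an event log, each held by a different node.
SHARDS = [
    [{"block": 1, "address": "0xaaa", "value": 5}, {"block": 2, "address": "0xbbb", "value": 7}],
    [{"block": 3, "address": "0xaaa", "value": 2}],
    [{"block": 4, "address": "0xccc", "value": 9}, {"block": 5, "address": "0xaaa", "value": 1}],
]


def query_shard(shard: list[dict], address: str) -> list[dict]:
    """Scan one shard for events touching the address."""
    return [event for event in shard if event["address"] == address]


def query_all(shards: list[list[dict]], address: str) -> list[dict]:
    """Query every shard concurrently and merge the results in block order."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda shard: query_shard(shard, address), shards))
    merged = [event for partial in partials for event in partial]
    return sorted(merged, key=lambda event: event["block"])


result = query_all(SHARDS, "0xaaa")
print([event["block"] for event in result])
```

Each shard only scans its own slice of the data, so the slowest single shard, rather than the size of the whole catalog, bounds the response time.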
Examples of web3 data indexing and querying protocols include The Graph, SonarX, SQD.AI, and Space and Time:
- The Graph is a decentralized indexing protocol that enables efficient querying of blockchain data. Instead of relying on centralized APIs, The Graph allows developers to create and use subgraphs - specialized data feeds that make blockchain data easily searchable. It is a well-established solution that has gained widespread adoption across Ethereum and other chains, making it the backbone of data accessibility for web3 projects.
- SonarX is a blockchain data infrastructure platform that provides structured, indexed on-chain data from genesis to tip across 100+ chains. Designed for institutions, enterprises, and blockchain-based businesses, it consolidates fragmented data pipelines into a unified platform, delivering data via warehouses (Snowflake, Databricks, BigQuery), flat files (JSON, Parquet), Kafka streams, or direct S3 transfers. The platform supports historical analytics, real-time monitoring, reconciliation, staking rewards, transaction reporting, accounting, tax, and audit operations. SonarX is SOC 2 certified and GDPR compliant, meeting institutional security and regulatory standards. Its data quality framework, multi-chain coverage, and engineering expertise have made it the preferred provider for major enterprises, including FTSE/LSEG, AWS, and leading crypto analytics firms like Messari, Glassnode, and Kaiko.
- SQD.AI is a decentralized data lake and indexing protocol optimized for efficient querying and data access. It offers a toolkit that allows developers to create custom data pipelines and obtain the necessary information at high speeds. It also supports local running and batch event processing, so that data requests can be broken down and processed in blocks if needed, making it cost-effective and scalable. Its composable approach opens up opportunities for decentralized projects and AI agents to leverage raw and structured on-chain data.
- Space and Time is a decentralized data indexing and querying solution. Its core value proposition lies in enabling verifiable and trustless query execution across both on-chain and off-chain data. It allows developers to run SQL-based analytics with cryptographic proofs, ensuring data integrity without relying on centralized databases.
4. Data Analytics
Data analytics transforms raw data into actionable intelligence. Today’s analytics tools primarily focus on off-chain data, often overlooking on-chain insights.
With web3, companies can access on-chain data to gain a deeper understanding of user behaviors, their social communities, and their overall financial transaction history and spending preferences. These insights can then be leveraged for risk assessment, trading strategies, and community engagement.
Many platforms are also now incorporating artificial intelligence to enhance predictive modeling and automate data-driven decision-making. This convergence of AI and web3 data analytics is unlocking new possibilities for forecasting trends and identifying opportunities.
Examples of web3 data analytics platforms include Nansen, Dune, and Chainalysis:
- Nansen is a blockchain analytics platform specializing in on-chain data intelligence and wallet tracking. Unlike general-purpose analytics tools, Nansen focuses on labeling and clustering blockchain addresses, allowing users to track fund movements, whale activity, and smart money trends.
- Dune is a community-driven data analytics platform that enables users to create and share SQL-based queries on blockchain data. Dune allows users to build their own custom dashboards and visualizations using raw blockchain data, enabling flexible blockchain data exploration.
- Chainalysis is a blockchain forensics and compliance platform focused on tracking illicit transactions and providing regulatory insights. Chainalysis serves law enforcement agencies, financial institutions, and governments by providing risk assessments and transaction monitoring tools. Its focus is on tracing suspicious activity across multiple chains, making it a crucial player in crypto compliance, anti-money laundering, and fraud detection.
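To make the SQL-based style of on-chain analysis concrete, here is a minimal sketch using an in-memory SQLite database. The `transfers` table, the addresses, and the whale threshold are all assumptions invented for illustration; none of them reflects the schema or labeling rules of Dune, Nansen, or Chainalysis.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transfers (sender TEXT, receiver TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO transfers VALUES (?, ?, ?)",
    [
        ("0xaaa", "0xbbb", 1200.0),
        ("0xaaa", "0xccc", 300.0),
        ("0xbbb", "0xccc", 50.0),
        ("0xddd", "0xaaa", 25.0),
    ],
)

WHALE_THRESHOLD = 1000.0  # assumed cutoff, purely illustrative

# Aggregate outgoing volume per address and attach a simple label.
rows = conn.execute(
    """
    SELECT sender,
           SUM(amount) AS total_sent,
           CASE WHEN SUM(amount) >= ? THEN 'whale' ELSE 'retail' END AS label
    FROM transfers
    GROUP BY sender
    ORDER BY total_sent DESC
    """,
    (WHALE_THRESHOLD,),
).fetchall()

for sender, total_sent, label in rows:
    print(sender, total_sent, label)
```

Real platforms layer far richer heuristics on top (address clustering, entity attribution, risk scoring), but the basic pattern of aggregating raw transfers and then labeling addresses is the same.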
5. Data Marketplaces
Data marketplaces enable businesses and individuals to monetize their data. Buyers often struggle to discover datasets, negotiate access, and securely transfer data, creating friction in the data economy. In many cases, access to high-quality datasets is restricted to a few well-funded entities, limiting innovation and inclusivity.
With web3, datasets can be tokenized and listed on marketplaces to allow for easy discovery, seamless transactions, and secondary trading. This enables peer-to-peer data exchanges, creating more equitable access to information. The rise of data marketplaces also allows users to monetize their personal data directly, shifting power away from centralized corporations and toward data owners themselves.
Examples of web3 data marketplaces include Ocean Protocol, Fetch.ai, and Datum:
- Ocean Protocol connects data providers and consumers through a decentralized marketplace. It helps to tokenize datasets into ERC-20 tokens, allowing data providers to monetize their data while maintaining control and privacy. Consumers can choose to purchase or rent these datasets, enabling secure and efficient data access.
- Fetch.ai is an AI-powered decentralized marketplace that facilitates autonomous data sharing and transactions using AI agents. Its infrastructure allows for the creation and deployment of AI agents that can interact with and process data across different platforms in real-time, allowing for autonomous workflows (including the discovery, acquisition and processing of datasets) without the need for direct human intervention.
- Datum is a blockchain-based marketplace focused on personal data monetization, allowing users to control and sell their own data directly to businesses and researchers. Datum prioritizes individual data sovereignty, and its key advantage lies in its privacy-centric model, where users retain ownership and earn rewards for sharing anonymized personal data.
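The tokenized-access model these marketplaces share can be sketched as a toy in-memory ledger: the data owner mints access tokens, tokens can change hands, and a holder spends one to obtain access. This is not an ERC-20 implementation and not Ocean Protocol's contract logic; the class, method names, and dataset identifier are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetToken:
    """Toy fungible access token for one dataset (not an ERC-20 implementation)."""
    dataset_id: str
    owner: str
    balances: dict = field(default_factory=dict)

    def mint(self, caller: str, to: str, amount: int) -> None:
        """Only the data owner may create new access tokens."""
        if caller != self.owner:
            raise PermissionError("only the data owner can mint access tokens")
        self.balances[to] = self.balances.get(to, 0) + amount

    def transfer(self, sender: str, to: str, amount: int) -> None:
        """Move tokens between holders, e.g. after a sale or a resale."""
        if amount <= 0 or self.balances.get(sender, 0) < amount:
            raise ValueError("invalid amount or insufficient balance")
        self.balances[sender] -= amount
        self.balances[to] = self.balances.get(to, 0) + amount

    def redeem_access(self, consumer: str) -> str:
        """Spend one token and return a placeholder access grant."""
        if self.balances.get(consumer, 0) < 1:
            raise PermissionError("consumer holds no access token")
        self.balances[consumer] -= 1
        return f"access granted to {self.dataset_id} for {consumer}"


token = DatasetToken("weather-2024", owner="alice")
token.mint("alice", "alice", 3)      # owner issues three access tokens
token.transfer("alice", "bob", 1)    # bob buys one
grant = token.redeem_access("bob")   # bob spends it to read the dataset
print(grant, token.balances)
```

Because access is a transferable token rather than a bilateral contract, discovery, sale, and secondary trading all reduce to ordinary token operations.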
The Future to Come
The blockchain data landscape is poised for significant evolution in the coming years, driven by a confluence of technological advancements, regulatory developments, and increasing institutional adoption. As blockchain technology matures, several key trends are emerging that will shape the future of data management, analysis, and monetization in this space.
1. The Rise of AI-Driven web3 Data Solutions
AI is becoming a key enabler in the web3 data space, enhancing data analysis, risk assessment, and automation. There are a few possible applications of AI-driven web3 solutions:
- Efficient request routing: web3 data indexing and querying solutions can use AI to more efficiently route queries across their nodes based on their typical response time and locations, allowing for faster and more accurate data retrieval. SQD.AI has already incorporated this in their data solutions.
- AI-powered smart contracts: decentralized data marketplaces can consider implementing such smart contracts to help automate the data monetization process, based on the demand and supply of specific datasets.
- Personalized insights: web3 data analytics platforms can integrate AI in their models, so that more personalized insights can be derived from the analysis, enabling more customizable and relevant data-driven applications to be created for users. Lens Protocol uses AI to analyze on-chain interactions and curate content feeds tailored to each user's interests and behaviors.
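The request-routing idea from the first bullet can be illustrated with a much simpler heuristic than a learned model: track a smoothed response time per node and send each query to the currently fastest one. This is a stand-in for illustration only and makes no claim about how SQD.AI or any other provider routes traffic; the class name, node names, and smoothing factor are assumptions.

```python
class LatencyRouter:
    """Route each query to the node with the lowest smoothed response time."""

    def __init__(self, nodes, alpha=0.3):
        self.alpha = alpha  # weight given to the newest observation
        self.avg_latency_ms = {node: None for node in nodes}

    def pick(self):
        # Unmeasured nodes are tried first so every node gets sampled.
        for node, avg in self.avg_latency_ms.items():
            if avg is None:
                return node
        return min(self.avg_latency_ms, key=self.avg_latency_ms.get)

    def record(self, node, latency_ms):
        """Fold a new measurement into the node's exponential moving average."""
        previous = self.avg_latency_ms[node]
        if previous is None:
            self.avg_latency_ms[node] = latency_ms
        else:
            self.avg_latency_ms[node] = self.alpha * latency_ms + (1 - self.alpha) * previous


router = LatencyRouter(["eu-1", "us-1", "ap-1"])
for node, ms in [("eu-1", 120.0), ("us-1", 80.0), ("ap-1", 200.0)]:
    router.record(node, ms)
first_choice = router.pick()

router.record("us-1", 500.0)  # us-1 slows down sharply
second_choice = router.pick()
print(first_choice, second_choice)
```

A production router would also weigh node location, load, and data availability, which is where a learned model could plausibly outperform a fixed rule like this one.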
2. Specialized Data Set Services
Specialized data set services are emerging as a critical component of the blockchain data landscape. They leverage blockchain's inherent properties like immutability, transparency, and security to improve compliance, reduce costs and enhance decision-making:
- Healthcare data management: companies like BurstIQ offer blockchain platforms that enable secure management of sensitive health information. Each smart contract gives users full data ownership, allowing them to decide how their data is used in personalized healthcare products, services, and marketplaces. This ensures privacy, security, and compliance with regulations like HIPAA.
- Financial services: blockchain data providers like Coin Metrics and Digital Asset Research offer specialized data sets for financial institutions. These include real-time market data, historical price data, and trading volumes, which are crucial for risk management, compliance, and market analysis.
- Supply chain and logistics: companies like Maersk and FedEx have used blockchain for real-time visibility and data sharing across global logistics. Blockchain-based platforms improve transparency and traceability and reduce fraud by providing an immutable record of goods movement.
3. Cross-Chain Data Standardization
Blockchain data is still somewhat fragmented today. With the proliferation of decentralized data sources, standardization of data across chains is the key to interoperability.
- Data normalization: standardizing data formats across chains is essential for accurate analysis. This trend is already evident with companies like Covalent attempting to "standardize blockchain data under a single schema", though challenges remain with non-EVM chains requiring separate schemas.
- Data warehousing: aggregated data from multiple blockchains is often stored in a warehouse, where it's organized for query and analysis. SonarX, for example, pairs its blockchain data indexing with a data warehouse solution, so data is ready for SQL-based analytics without additional processing.
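The normalization step can be sketched as a pair of adapters that map chain-specific records into one shared schema. The raw field names below are hypothetical and do not reproduce Covalent's or SonarX's actual schemas; the only chain facts relied on are the unit conversions (10^18 wei per ether, 10^9 lamports per SOL).

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class NormalizedTransfer:
    """One transfer expressed in a chain-agnostic schema."""
    chain: str
    tx_id: str
    sender: str
    receiver: str
    amount: Decimal  # whole units of the chain's native asset


def from_evm(raw: dict, chain: str) -> NormalizedTransfer:
    """Adapter for a hypothetical EVM-style record with the value in wei."""
    amount = Decimal(raw["value"]) / Decimal(10**18)
    return NormalizedTransfer(chain, raw["hash"], raw["from"], raw["to"], amount)


def from_solana_like(raw: dict) -> NormalizedTransfer:
    """Adapter for a hypothetical non-EVM record with the value in lamports."""
    amount = Decimal(raw["lamports"]) / Decimal(10**9)
    return NormalizedTransfer("solana", raw["signature"], raw["source"], raw["destination"], amount)


evm_row = from_evm(
    {"hash": "0xabc", "from": "0x1", "to": "0x2", "value": "1500000000000000000"},
    chain="ethereum",
)
sol_row = from_solana_like(
    {"signature": "5sig", "source": "Src1", "destination": "Dst1", "lamports": 2500000000}
)
print(evm_row)
print(sol_row)
```

Once every chain has an adapter of this kind, downstream queries can treat transfers uniformly; the remaining difficulty, as noted above, is that non-EVM chains often need their own adapters and sometimes their own schemas.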
4. Enhanced Privacy-Preserving Analytics
As regulatory scrutiny increases, technologies that enable analytics while preserving privacy will become more important for both security and ethical reasons. The future of privacy-preserving blockchain data analytics will be shaped by a combination of advanced cryptographic techniques, regulatory compliance, user-centric controls, and innovative blockchain architectures:
- Privacy-focused blockchain architectures: layer 2 scaling solutions like sidechains or off-chain computation can enhance privacy by processing transactions or data off the main blockchain, reducing the visibility of sensitive information.
- View keys and selective disclosure: blockchain networks will increasingly support view keys or selective disclosure mechanisms, allowing users to control who can access their data. This empowers users to share information on a need-to-know basis, enhancing privacy.
- Confidential smart contracts: future smart contracts will incorporate privacy-preserving features, allowing for complex operations to be executed without revealing sensitive data. This could involve integrating ZKPs or homomorphic encryption into smart contract logic.
- Cross-chain privacy solutions: as blockchain interoperability becomes more important, privacy solutions that work across multiple blockchain networks will become increasingly valuable. This could involve standardized privacy protocols or cross-chain privacy oracles.
- Decentralized data marketplaces: these platforms will evolve to offer privacy-enhanced data trading, where data providers can monetize their information while ensuring that buyers receive only the insights they need without accessing raw data.
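Selective disclosure can be illustrated with salted hash commitments, a far simpler technique than the zero-knowledge proofs, view keys, or homomorphic encryption mentioned above, and one that only supports revealing whole fields. The record fields and function names are hypothetical; the sketch shows the general shape of "publish commitments, reveal one field on request", not any specific network's mechanism.

```python
import hashlib
import hmac
import secrets


def commit(value: str, salt: bytes) -> str:
    """Return a SHA-256 commitment to value under a random salt."""
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()


def commit_record(record: dict) -> tuple:
    """Commit to each field separately; publish commitments, keep salts private."""
    salts = {key: secrets.token_bytes(16) for key in record}
    commitments = {key: commit(value, salts[key]) for key, value in record.items()}
    return commitments, salts


def verify_disclosure(commitments: dict, key: str, value: str, salt: bytes) -> bool:
    """Check a single revealed field against its published commitment."""
    return hmac.compare_digest(commitments[key], commit(value, salt))


# Hypothetical user record: only the commitments would be published.
record = {"country": "SG", "age_band": "30-39", "balance_tier": "gold"}
commitments, salts = commit_record(record)

# The user reveals just one field (value plus its salt) to a verifier.
honest = verify_disclosure(commitments, "country", "SG", salts["country"])
forged = verify_disclosure(commitments, "country", "US", salts["country"])
print(honest, forged)
```

The verifier learns the disclosed field and nothing about the others, since the remaining salts stay private. Proving a statement about a field without revealing it (for example, "balance above a threshold") is where zero-knowledge proofs become necessary.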
Stay Tuned
The convergence of Web3 and data represents one of the most dynamic frontiers in technology, driving unprecedented innovation across decentralized infrastructure, AI-powered analytics, and trustless data economies. Ocular is actively mapping this rapidly evolving landscape, engaging with the visionaries and builders who are redefining how data is owned, verified, and leveraged in an on-chain future.
In the next installments of this series, we’ll take a closer look at the key verticals and companies shaping decentralized data.
Stay tuned.