BlockDataTrust Benchmark: A Structured Dataset for Evaluating Blockchain Data Trust
The BlockDataTrust Benchmark Dataset (BlockDataTrust) is a comprehensive synthetic dataset developed to support research on data trustworthiness in blockchain-integrated systems. It was created as part of ongoing research at University College Dublin, specifically to address the growing need for structured, machine learning–ready data to evaluate trust in decentralized environments.
Unlike existing blockchain datasets that typically focus on isolated issues such as transaction fraud or smart contract bugs, BlockDataTrust spans three dimensions of trust:
Data source–related features (e.g., sensor certification, update frequency, false data reports)
Blockchain-internal behaviors (e.g., transaction success rates, node reputation, smart contract consistency)
External entity interactions (e.g., cross-chain validations, crowdsourced trust scores, third-party verification)
The dataset includes both benign and malicious samples, with adversarial patterns modeled using domain-informed statistical distributions. Simulated attacks include Sybil identities, compromised devices, smart contract misbehavior, and collusive validation, among others. To label ambiguous benign data at scale, Snorkel weak supervision was applied, which yields probabilistic labels rather than relying solely on hardcoded heuristics.
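To illustrate the idea behind weak supervision, the sketch below combines several noisy labeling functions into a single label with a rough confidence value. It is not the pipeline used to build BlockDataTrust: it uses a plain majority vote in place of Snorkel's learned label model, and the field names (sensor_certified, false_data_reports, node_reputation) and thresholds are hypothetical, chosen only to echo the feature groups listed above.

```python
from collections import Counter

ABSTAIN, BENIGN, MALICIOUS = -1, 0, 1

# Hypothetical labeling functions: field names and thresholds are
# illustrative assumptions, not the dataset's actual schema or rules.
def lf_uncertified_sensor(row):
    return MALICIOUS if not row["sensor_certified"] else ABSTAIN

def lf_many_false_reports(row):
    return MALICIOUS if row["false_data_reports"] >= 3 else ABSTAIN

def lf_high_node_reputation(row):
    return BENIGN if row["node_reputation"] >= 0.8 else ABSTAIN

LABELING_FUNCTIONS = [
    lf_uncertified_sensor,
    lf_many_false_reports,
    lf_high_node_reputation,
]

def weak_label(row):
    """Return (label, confidence) from a majority vote over non-abstaining functions."""
    votes = [lf(row) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        # No labeling function fired, so the row stays unlabeled.
        return ABSTAIN, 0.0
    counts = Counter(votes)
    malicious_share = counts[MALICIOUS] / len(votes)
    if malicious_share == 0.5:
        # A tie carries no signal; leave the row unlabeled.
        return ABSTAIN, 0.5
    label = MALICIOUS if malicious_share > 0.5 else BENIGN
    confidence = max(malicious_share, 1.0 - malicious_share)
    return label, confidence
```

A learned label model, as in Snorkel, would additionally estimate how accurate each labeling function is and weight its votes accordingly, instead of treating all functions equally as this sketch does.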
BlockDataTrust is also designed to be modular and extensible, so it can be combined with additional data from synthetic or real-world blockchain applications. It is provided in both a balanced version and a realistic version (5% attack samples), which supports robustness testing under different threat densities.
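For readers who want to test other threat densities, the following sketch shows one way to derive a low-prevalence split from a balanced set of records. It assumes each record is a dict with a binary field named "label" (0 = benign, 1 = attack); that field name and the function itself are illustrative and are not taken from the dataset files.

```python
import random

def subsample_to_attack_ratio(rows, attack_ratio=0.05, label_key="label", seed=0):
    """Keep all benign rows and sample attack rows to reach the target attack share."""
    if not 0.0 < attack_ratio < 1.0:
        raise ValueError("attack_ratio must be strictly between 0 and 1")
    benign = [r for r in rows if r[label_key] == 0]
    attacks = [r for r in rows if r[label_key] == 1]
    # Solve n_attack / (n_attack + n_benign) = attack_ratio for n_attack.
    target = round(attack_ratio * len(benign) / (1.0 - attack_ratio))
    if target > len(attacks):
        raise ValueError("not enough attack rows to reach the requested ratio")
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    result = benign + rng.sample(attacks, target)
    rng.shuffle(result)
    return result
```

For example, starting from 950 benign and 950 attack records, a 5% target keeps all 950 benign records and samples 50 attack records, giving 1,000 records in total.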
Download the dataset from Kaggle:
https://www.kaggle.com/datasets/rashmiratnayake/blockdatatrust-benchmark-dataset
DOI: https://doi.org/10.34740/kaggle/dsv/12159772
This dataset helps address the lack of structured, domain-informed datasets for trust assessment in blockchain environments and serves as a valuable resource for researchers working on blockchain security and trust modeling using machine learning.