Every day, organizations access and utilize data gathered from a variety of sources, some of which has never left the secure confines of their networks and some that was collected (or purchased) externally. While there are plenty of security measures in place to protect systems and services from malicious payloads like malware and viruses, these defenses don’t always assess the authenticity of the data itself. Failing to identify falsified or manipulated data can lead to significant negative consequences for an organization.
Put in the simplest terms, data provenance is the historical record associated with a piece of data. It documents where the data originated, where it has moved over time, what changes have been made to it, and who has made those changes. The concept actually comes from the art world, where complex systems of authentication were required to prove that a piece of art was indeed produced by a specific artist. Data provenance is often used interchangeably with the term data lineage, which can create confusion: lineage typically describes where data has moved and how it has been transformed as it flows through systems, while provenance also encompasses the data’s origin and the full record of who has handled it.
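The elements of a provenance record described above can be sketched as a simple data structure. This is a minimal illustration, not a standard schema; all class and field names here are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ProvenanceEvent:
    """One entry in a piece of data's historical record."""
    actor: str          # who made the change
    action: str         # what was done (e.g., "created", "normalized")
    location: str       # system or store where the data resided
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

@dataclass
class ProvenanceRecord:
    """Where the data originated, plus everything that happened to it since."""
    origin: str
    events: list = field(default_factory=list)

    def record(self, actor: str, action: str, location: str) -> None:
        self.events.append(ProvenanceEvent(actor, action, location))

# Example: data collected from a sensor network, then transformed downstream.
record = ProvenanceRecord(origin="sensor-network")
record.record("etl-service", "normalized", "data-warehouse")
```

Answering the questions provenance raises — who changed this, and when — then becomes a matter of reading the event log rather than reconstructing history after the fact.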
Determining data provenance is incredibly important for many industries where the integrity of records is critical to business success. Supply chains, scientific research, video and audio services, and identity verification systems are all areas where the manipulation of data could create serious problems for an organization.
As internet and networking services continue to expand, organizations must find ways to deal with more data than ever before. The rapid proliferation of Internet of Things (IoT) devices has turbocharged the data-collection efforts of computing networks. By 2020, the bits of data in existence are projected to outnumber the stars in the observable universe. Of course, not all of that data is considered useful. Organizations devote substantial computing resources to managing and sifting through massive amounts of unstructured data in search of meaningful insights that can inform business decisions and teach them something about their customers. In an era of data breaches and misinformation, evaluating data provenance in big data is more important than ever.
Just as an appraiser might subject a painting to a range of tests and expert scrutiny to determine its provenance, today’s companies must evaluate their data rigorously to ensure that it hasn’t been manipulated or replaced somewhere along the way. While cybersecurity experts have long focused on preventing data from being stolen, some now believe that data manipulation and falsification pose an even greater threat.
If hackers gain access to an organization’s network, they could insert false or inaccurate data into existing databases or automated systems. This “weaponized data” could drive companies to make major strategic decisions based on fabricated information or introduce disruptive patterns into machine learning programs. While this form of cyberattack would require much greater expertise and offer less obvious financial benefits than straightforward data theft, it could be an attractive strategy for nation-states looking to destabilize peer adversaries with disinformation or anyone looking to damage an organization’s reputation.
The ability to evaluate and verify the origin and transmission history of data will be an essential component of cybersecurity strategies in the future. Fortunately, many IT experts are already familiar with how to evaluate data provenance due to its role in debugging systems. When an error occurs in a system, that error has to be traced back to determine when a problem occurred, what caused it, and what effects resulted from it.
Some questions that must be asked when assessing data provenance include:

- Where did the data originate?
- Where has it moved over time, and through which systems?
- Who has accessed or modified it?
- What changes were made, and when?
With so much data to manage, however, determining data provenance can be a complex and time-consuming process of documentation. Data generated and stored within a secure system may not pose too many difficulties, but the tide of unstructured data flowing in from the edge of networks courtesy of billions of IoT devices can quickly become overwhelming. Many organizations are looking to develop and implement tools that help them evaluate the provenance of big data more easily by examining its metadata and other identifying characteristics. Some of these tools focus on establishing provenance rather than determining it, marking data when it enters their networks to create documentation that accounts for it going forward.
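Marking data at the point of entry can be as simple as attaching a tag that records the source, the ingest time, and a cryptographic hash of the content. The sketch below is one way such a tool might work, assuming SHA-256 fingerprinting; the function names and tag fields are illustrative, not a real product’s API.

```python
import hashlib
from datetime import datetime, timezone

def tag_on_ingest(payload: bytes, source: str) -> dict:
    """Attach a provenance tag to data as it enters the network.

    The tag records origin, ingest time, and a content hash so that
    later modifications can be detected by re-hashing the payload.
    """
    return {
        "source": source,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(payload).hexdigest(),
    }

def verify(payload: bytes, tag: dict) -> bool:
    """Re-hash the payload and compare it to the hash taken at ingest."""
    return hashlib.sha256(payload).hexdigest() == tag["sha256"]

# Example: a reading arrives from a (hypothetical) IoT sensor.
tag = tag_on_ingest(b'{"reading": 21.5}', source="iot-sensor-17")
```

A payload that has been altered after ingestion will no longer match its tag, so `verify` returns False for manipulated data while the unmodified original still passes.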
Blockchain technology, which is largely built upon the concept of data provenance, may also provide organizations with an effective means of authenticating data. As a distributed ledger (a shared database replicated across many independent nodes) that provides a verified “chain” of transaction records, it makes it relatively easy to trace where a block of data within the chain has been. And since that data can’t be modified or erased without approval from the majority of the nodes within the ledger, blockchains provide resilience against potential attacks or attempts to manipulate the data stored within them.
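The “chain” property described above comes from each block including a hash of the block before it, so altering any past record invalidates every hash that follows. A minimal sketch of that mechanism (record contents and function names are hypothetical, and real blockchains add consensus among nodes on top of this):

```python
import hashlib

GENESIS = "0" * 64  # placeholder hash for the first block's predecessor

def block_hash(prev_hash: str, data: str) -> str:
    """Hash a block's contents together with the previous block's hash."""
    return hashlib.sha256((prev_hash + data).encode()).hexdigest()

def build_chain(records: list) -> list:
    """Link each record to its predecessor via hashes."""
    chain, prev = [], GENESIS
    for data in records:
        h = block_hash(prev, data)
        chain.append({"data": data, "prev": prev, "hash": h})
        prev = h
    return chain

def is_valid(chain: list) -> bool:
    """Recompute every hash; any tampering breaks the chain."""
    prev = GENESIS
    for block in chain:
        if block["prev"] != prev or block_hash(prev, block["data"]) != block["hash"]:
            return False
        prev = block["hash"]
    return True

chain = build_chain(["created", "transferred", "updated"])
```

Rewriting one block’s data (say, changing "transferred" after the fact) causes `is_valid` to fail, which is why tampering with a distributed ledger also requires convincing a majority of nodes to accept the rewritten history.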
As organizations seek to protect themselves from the dangers of misinformation and ensure the accuracy of their data, tools and processes that emphasize data provenance will become a more important aspect of cybersecurity strategies. Without a system in place for identifying where data originated and determining whether it was manipulated along the way, malicious actors could potentially undermine decision-making and create significant disruption within network systems.