In 2015, I came across a statistic claiming that we had created more data in the previous two years than in the entire prior history of the human race. We are living in a time where we are generating incredible amounts of data. That leads to incredible opportunities for growth and advancement. But it also leads to huge problems.
One of the biggest problems we see is the rise of inaccurate information, more commonly known as “fake news”, though the problem is not limited to news. It is very easy for someone to post untruthful or nonfactual data online (via social media, blogging, news articles, research studies, etc.). That seems innocuous at first, but when others incorrectly assume the data is true and re-post or upvote it, our systems are designed to give that data preferential treatment. The more often data is liked or shared, the more it takes on the appearance of truth, and that creates environments where bad data can influence real-world events.
There are many people, all around the world, working to find solutions to “fake news” because it is an existential threat. After all, if you don’t know what to believe, how can you make a correct, informed decision?
Many of the solutions are based on artificial intelligence and detecting when something is not truthful. This approach looks at things from the consumers’ perspective: if we can sift through the garbage, we can deliver something meaningful to the end user.
This approach is flawed for a number of reasons. The key ones are:
- As data continues to grow and expand, there is more bad data than good and our systems have to find the needle in the haystack. Systems have to grow bigger, faster, and more efficient at finding the good data at a faster rate than the growth of our data creation. It’s like fighting a fire that keeps expanding, so you have to throw more and more equipment at it.
- All it takes is one bad piece of data slipping through the system for the problem to resurface. You could eliminate 99.99% of the bacteria in the world, but if you happen to catch the one strain that remains, you are still going to get sick.
I believe the “fix” we need is to approach the problem from the data creators’ perspective rather than the data consumers’. What if we could get all the legitimate data creators to agree to produce data in a particular format, and gave preferential treatment only to data that met all the requirements? What if a legitimate data creator were penalized monetarily for bad data they propagated, with the penalty scaled by the number of end users who consumed that bad data?
I’m calling for a voluntary Data Validation Standard (analogous to the voluntary ISO9001 Quality Standard in manufacturing) that can apply to all data that is being created, whether that is news, blogs, web sites, or social media posts. A data creator agrees to do certain things to validate their data and ensure its veracity. For instance, prior to publishing a research paper, the authors must obtain independent replication studies that validate the paper’s results. Journalists must follow the SPJ Code of Ethics, which clearly calls on them to “Seek Truth and Report It, Minimize Harm, Act Independently, and Be Accountable and Transparent”, and must format pieces in a manner that highlights the facts and clearly delineates the commentary or opinions used to provide context. A web site creator must provide a certification, or a validated reason, supporting their claim to be a subject matter expert on the topic about which they’ve written.
The benefit to the data-consuming public is obvious: this provides a clear way to separate the legitimate data sources from those that could potentially be wrong. The benefit to data creators is a competitive advantage: it distinguishes them from the rest of the creators and establishes a measure of trust with the consumers who must choose which data to use.
In order for this to be successful, there must be additional measures:
- There must be an independent auditing system that monitors the mechanisms that each data creator is using to validate their data.
- There must be a means for the public to identify and report inaccuracies by data creators that claim to be following the standard.
- There must be penalties for creators that violate the standard, especially repeatedly, including suspension or revocation of their status as a legitimate data creator. Such penalties should be scaled by the impact of the violation, which can readily be measured by reads and shares.
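To make the impact-scaled penalty concrete, here is a minimal sketch of one way it could be computed. This is purely illustrative: the linear model, the base fine, and the per-read and per-share weights are all assumptions of mine, not part of any existing standard.

```python
# Hypothetical sketch: scale a base fine by how widely the bad data spread.
# The weights below are illustrative assumptions, not proposed values.

def violation_penalty(base_penalty: float, reads: int, shares: int,
                      read_weight: float = 0.01,
                      share_weight: float = 0.05) -> float:
    """Return a penalty that grows with the number of consumers reached.

    Shares are weighted more heavily than reads because each share
    exposes the inaccurate data to a new audience.
    """
    impact = reads * read_weight + shares * share_weight
    return base_penalty * (1 + impact)

# A violation seen by 10,000 readers and shared 500 times is penalized
# far more heavily than one that was caught early.
print(violation_penalty(100.0, reads=10_000, shares=500))  # 12600.0
print(violation_penalty(100.0, reads=100, shares=0))       # 200.0
```

Any real scheme would need to debate the shape of the curve (linear, logarithmic, capped) and who sets the weights, but the principle is simply that reach multiplies responsibility.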
Since this will be a voluntary standard for data creators, it does not limit free speech. Creators who see no benefit in following the standard can still generate any data they want, as they see fit. End users will still be able to consume any data they want, but they will now be able to see which data sources have taken the extra effort to validate their data, and can make better decisions about their sources.
This approach lends itself to scalability because the onus is placed on the data creator to do the work. Rather than sifting through every piece of data when looking for truthful information, we can more easily filter down to just the data creators that have established themselves as verified sources. It becomes a more manageable problem.