Proper Training: Why Machine Learning Still Needs Human Education

The concept of Garbage In, Garbage Out (GIGO) is almost as old as computing itself. The phrase has been traced back to the 1950s, and it means that if you start with bad information, you get faulty results. It’s a simple concept that remains at the core of computing.

As machine learning takes off in the marketplace, GIGO issues have become even more pronounced. In truth, the garbage is everywhere, and we need to be careful not to train our systems to emulate the wrong human behaviors. This is at the heart of a story in Wired Magazine that points out how photos were helping some machines learn sexist behaviors. A pair of researchers began to notice that some images, like those of kitchens, were more associated with women than with men. As they looked deeper, they realized the problem wasn’t in the algorithm or in the core of machine learning, but in the images used as the base datasets designed to train the system.

The issue turned out to be even worse. As Tom Simonite writes: “Machine-learning software trained on the datasets didn’t just mirror those biases, it amplified them. If a photo set generally associated women with cooking, software trained by studying those photos and their labels created an even stronger association.”

In other words, when our systems are trained by humans, they take on the traits of those humans, even when those traits are not something that we want the systems to emulate. Researchers can counteract these behaviors, but it takes human intervention and documented ethical protocols to make it work; it’s not something the system can do automatically.

This keeps coming up in the world of machine learning and, by extension, artificial intelligence: the machines can only be as intelligent as the information we give them. Properly training systems isn’t trivial, and it becomes even more important when it comes to AI-driven enterprise search. It is key to getting the most value out of an enterprise search system, and it’s how we at Attivio make sure that our clients have the best possible results.

Making Machine Learning More Accurate 

Training a machine-learning relevancy model is hard, and annotating information in a meaningful way is time consuming. To spare customers the burden of generating an initial training set and maintaining it over time, we provide an API that captures signals data. By calling our Signals API, customers can easily send users’ behavior to the platform without doing any data collection or curation themselves. The machine learning relevancy model can be configured to ignore old signals data, so that stale data doesn’t find its way into the model.
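To make the idea of ignoring old signals concrete, here is a minimal sketch in Python. It is not the Signals API or Attivio’s implementation; the Signal record, the drop_stale_signals function, and the 90-day window are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Signal:
    """One hypothetical click signal: who clicked which document, and when."""
    user_id: str
    doc_id: str
    timestamp: datetime


def drop_stale_signals(signals, now, max_age):
    """Return only the signals whose age is at most max_age (illustrative helper)."""
    cutoff = now - max_age
    return [s for s in signals if s.timestamp >= cutoff]


# Example data: one recent click and one click from more than a year earlier.
now = datetime(2019, 6, 1, tzinfo=timezone.utc)
signals = [
    Signal("alice", "doc-1", now - timedelta(days=2)),
    Signal("bob", "doc-2", now - timedelta(days=400)),
]

# Assumed 90-day window; a real deployment would choose its own cutoff.
fresh = drop_stale_signals(signals, now, max_age=timedelta(days=90))
```

With this window, only the two-day-old click would be passed on to training; the 400-day-old click would be left out.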

Since a machine learning model is only as good as its data, we understand that passing the raw data “as is” to the machine learning algorithm is not sufficient to ensure the model is not biased towards a specific user’s behavior. We put restrictions in place to ensure only good data is passed to the algorithm. For example, if a user clicks 1,000 times on the same document, we don’t want the platform to conclude that this document is more important than other documents based on one user’s behavior alone. We guard against this by prepping the input data before generating a relevancy model.
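One simple way to picture this kind of prepping is to cap how much any single user can contribute to a document’s click count. The Python sketch below is only an assumption about how such a guard could look, not a description of the platform’s actual logic; the function name count_clicks_per_document and the per_user_cap parameter are hypothetical.

```python
from collections import Counter


def count_clicks_per_document(clicks, per_user_cap=1):
    """Count clicks per document, capping each user's contribution.

    clicks is an iterable of (user_id, doc_id) pairs. Each user adds at most
    per_user_cap clicks to any one document (hypothetical rule for illustration).
    """
    per_user = Counter(clicks)  # (user_id, doc_id) -> raw click count
    totals = Counter()
    for (user_id, doc_id), raw_count in per_user.items():
        totals[doc_id] += min(raw_count, per_user_cap)
    return totals


# One user clicks doc-1 1,000 times; two different users each click doc-2 once.
clicks = [("alice", "doc-1")] * 1000 + [("bob", "doc-2"), ("carol", "doc-2")]
totals = count_clicks_per_document(clicks)
```

Under this rule, the 1,000 repeated clicks count once, so doc-2, which was clicked by two different users, ends up with more weight than doc-1.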

In addition to using the Signals API, we also provide a way to upload a Golden Set and use that as a training set. A Golden Set is more carefully curated than raw signals data, which helps ensure the training data is of higher quality.

As part of our Relevancy Management UI, we provide tools to analyze the signals data and the reasoning behind every relevancy score, so it’s clear why decisions were made. This way our users can make educated decisions on how to fine-tune the relevancy model.

We’re taking the same approach for Question & Answer (Q&A). We give users tools to enter different types of questions along with the desired answers. Customers can immediately see how this will affect their machine learning model.

We’re also working closely with customers that use Attivio for different use cases, so that we can train models that are generic enough to work across use cases rather than being biased towards just one. For example, questions about a person can be structured in very different ways. We want to capture as many of these variations as possible, so we can provide an answer no matter how the question is phrased.

Machine learning alone won’t change how people do business and make decisions. It takes the right human hand to help it achieve the best results.
