ML Data Preparation Demands a Big Toolbox

By Aaron Bianchi
May 11, 2021

Part of the challenge of building machine learning models is that no two are the same. Train the same machine learning algorithm against different sets of data, and you end up with a different model.

If the raw data is high quality and the training data is sampled well, the models shouldn't vary much… but those are big "if"s. That is why data preprocessing, the actual work of preparing the data, is critically important.

A Forbes survey revealed that data scientists spend nearly 80% of their time on data preparation: roughly a quarter of that on data collection, and the remaining three quarters on data cleaning. Other survey results indicate that real-world data science isn't everything these practitioners thought it would be; clearly, data collection and data cleaning are not how they imagined spending their working hours.

Data preparation is so time consuming because it is so important. The adage – or, more appropriately in this setting, the admonition – "Garbage In, Garbage Out" very much applies to data preparation for machine learning, which, in extreme cases, spans the entire lifecycle from data collection to data cleaning and feature engineering. Missteps at any point in this process result in low-confidence model predictions, or even a model that simply performs poorly.

Beyond their importance, training data sets for machine learning algorithms are also voluminous – many millions of data items in the case of complex problem spaces – and much of the data prep work demands human involvement (though that work is often repetitive and requires only contextual training to perform).

Finally, data preprocessing usually involves a variety of technologies, both for doing the actual work of preparing the data and for managing quality in the context of volume. If the problem space is simple – say, structured data with duplicates, null values and some lack of standardization – the technology needn’t be complex. But complex problem spaces – say, identifying and tracking video objects with complex taxonomies – can require specialized technology, much of it open source, with very particular feature sets.
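For the "simple" end of that spectrum, the work is mechanical enough to sketch in a few lines. Below is a minimal, hypothetical example of cleaning structured records – de-duplication, null handling, and standardization – using plain Python. The field names (`name`, `email`) and the choice of email as the de-dupe key are illustrative assumptions, not part of any particular tool.

```python
# A minimal sketch of "simple" structured-data cleaning:
# de-duplication, null handling, and standardization.
# Field names and the de-dupe key are hypothetical.

def clean_records(records):
    """De-dupe, drop rows missing required fields, and standardize text."""
    seen = set()
    cleaned = []
    for row in records:
        # Standardization: trim whitespace, normalize casing.
        name = (row.get("name") or "").strip().title()
        email = (row.get("email") or "").strip().lower()
        # Null handling: skip records missing a required field.
        if not name or not email:
            continue
        # De-duplication: treat the normalized email as the record key.
        if email in seen:
            continue
        seen.add(email)
        cleaned.append({"name": name, "email": email})
    return cleaned

raw = [
    {"name": "  ada lovelace ", "email": "Ada@Example.com"},
    {"name": "Ada Lovelace", "email": "ada@example.com "},  # duplicate
    {"name": "Grace Hopper", "email": None},                # missing field
]
print(clean_records(raw))
# → [{'name': 'Ada Lovelace', 'email': 'ada@example.com'}]
```

Real pipelines at volume would, of course, use purpose-built tooling rather than hand-rolled loops; the point is that for simple structured data, the logic itself is not complex.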

In recent years, numerous solution providers have sprung up to fill the human and technological gaps that data scientists confront in preparing quality training data at scale. Some offload the human labor side of data prep. Others provide technology for cleaning and labeling training data sets. Still others provide both.

Data science teams would be well advised to choose carefully when evaluating data preparation partners. DDD has "inherited" more than a few customers whose first vendor fell short, invariably on one or both of two dimensions:

Something less than full lifecycle data prep. Recall that the Forbes survey indicated 60% of data scientists’ time is spent in data cleaning. Most of today’s data preparation vendors emphasize training data labeling and annotation. They presume that they will be given data that has already been cleaned.

If your data needs cleaning – i.e., if:

  • it has not been de-duped

  • it is missing information

  • it is inconsistently presented across different sources

  • it requires entity resolution

  • it is image data of uneven quality or perspective

  • it is handwritten and requires transcription

– then these vendors are not a good fit for you. The same holds if you don't have data, or don't have enough data, and need data collected or created.

Reliance on a single technology platform. Every ML project is unique, with unique data. No single set of proprietary tools can possibly match every machine learning algorithm and training data set. Data science teams need to know that whoever is doing their data preparation is technology-platform agnostic: that the partner has the flexibility and freedom to choose the best tool(s) for the project at hand, rather than shoehorning data into an inappropriate tool or jury-rigging a third-party tool onto their own platform for the sake of keeping the project there.
