“Start where you are. Use what you have. Do what you can.” – Arthur Ashe
It’s R&D time. The product guys have dreamed up some new features, and now you have to see if it’s possible to deliver them. If it is, you’ll need to build them.
Maybe it’s a consumer product, and it needs to determine exactly what a customer is doing while they’re wearing, holding or using the thing. Maybe it’s a piece of equipment or an automotive component, and it has to predict when the wingamajigger needs its thingamabob replaced. Perhaps it’s something robotic and you need to sense aspects of the surrounding environment.
You’re going to need sensors – and you’re going to need some kind of complex, non-linear model that makes sense of their output. You’re probably going to need machine learning.
So how do you begin?
It all starts with data
The first thing you’re going to need is data – at least enough to establish basic feasibility. Then you’re going to need a lot more. So the first thing you’ll do is think about where that data will come from, how you’ll get it, and how much of it you’ll need. Unless you’re fortunate enough to be working on a problem for which there is ample data available in open source or for purchase, you’re going to have to collect data yourself or work with a company that can collect data for you.
Data collection and curation is easily the most expensive, most time-consuming aspect of any machine learning project, and it’s also the first opportunity to ensure failure.
“Garbage in, garbage out” is just as true with machine learning as with any other data-driven effort. But there are also other things to consider when planning for data collection. Here are four key items we think are particularly important:
- Coverage planning
- (Over)instrumentation and Rich Data
- Curation and labeling
- Readiness assessment
Coverage planning: Get the data you need
Data coverage refers to the distribution of data across target classes or values, as well as across other variables that are significant contributors to variation. Target classes are just the things you are looking to detect or predict. For example, if you’re building a model that will detect whether a machine is behaving normally or is exhibiting a particular kind of fault, then “Normal” and “Fault” are the target classes.
But other variables may also be important. If you have good reason to expect that “Normal” and “Fault” will look different when the machine is operating in different modes, those modes are contributors to variation and should be tracked as metadata variables. Different hardware setups, for example, are almost always significant contributors to variation.
The purpose of coverage planning is to make sure you collect enough data to capture and overcome the variation inherent in what you’re trying to measure. To do that you’ll need a statistically significant number of observations taken across a reasonable number of combinations of different target classes and metadata variables (contributors to variation).
So in our example above, we have two target classes and a couple of different modes. Laid out as a grid, with target classes along one axis and modes along the other, they form a data coverage matrix, which allows us to plan for the data we need to collect. At the beginning, since we don’t yet know whether it’s possible to detect fault conditions in different types of equipment with the same machine learning model, we probably want a separate matrix for each equipment type – making this more of a data cube than a 2×3 matrix.
For basic feasibility, we generally recommend collecting at least 25 different observations in as many cells of this matrix as possible.
In the next stage, where you are trying to prove target accuracy across a larger number of observations, figure you’ll need 40-50 observations per cell.
Finally, to get your machine learning model to the point where it’s ready for field testing, you’ll want to collect several hundred to several thousand observations. Depending on the degree of variation encountered and the cost of collecting data, make sure the observations you collect cover as broad a range as possible of the variation expected within each cell.
We call these the Three Stages of Proof of Concept for Machine Learning with Sensors:
1 – Prove Feasibility. Start with a little data – see if this works at all and justifies spending more time and money to collect more/better data.
2 – Prove Accuracy. Add more data – now approaching true statistical significance – and prove what level of accuracy is achievable.
3 – Prove Generalization. Add a lot more data. At this stage, you are making sure that the model is able to deal with a wide range of circumstances beyond its original training set, and setting the stage for extensive field testing.
Each of these stages requires increasing amounts of data and better coverage within the matrix.
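To make the planning concrete, here is a minimal sketch of a coverage check in Python. It counts observations per cell of the matrix and reports which cells fall short of a stage's minimum. The mode names, the record layout, and the function names are hypothetical, and the 300-per-cell figure is only our stand-in for "several hundred"; the 25 and 40 figures come from the recommendations above.

```python
from collections import Counter
from itertools import product

# Minimum observations per cell for each stage. 25 and 40 follow the text;
# 300 is an ASSUMED placeholder for "several hundred" and should be tuned.
STAGE_MINIMUMS = {"feasibility": 25, "accuracy": 40, "generalization": 300}


def coverage_matrix(observations, classes, modes):
    """Count observations in every (class, mode) cell, including empty cells."""
    counts = Counter((obs["label"], obs["mode"]) for obs in observations)
    return {cell: counts[cell] for cell in product(classes, modes)}


def undercovered_cells(matrix, stage):
    """Return {cell: shortfall} for cells below the minimum for `stage`."""
    minimum = STAGE_MINIMUMS[stage]
    return {cell: minimum - n for cell, n in matrix.items() if n < minimum}


# Hypothetical example: two target classes and three made-up operating modes.
classes = ["Normal", "Fault"]
modes = ["idle", "low_load", "high_load"]
observations = (
    [{"label": "Normal", "mode": "idle"}] * 30
    + [{"label": "Normal", "mode": "low_load"}] * 25
    + [{"label": "Fault", "mode": "idle"}] * 10
)

matrix = coverage_matrix(observations, classes, modes)
gaps = undercovered_cells(matrix, "feasibility")
for cell, shortfall in sorted(gaps.items()):
    print(cell, "needs", shortfall, "more observations")
```

Since the goal is to fill as many cells as possible rather than necessarily all of them, treat the shortfall report as a planning aid, not a pass/fail gate. For the per-equipment "data cube," one option is to keep a separate matrix like this for each equipment type.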
There are other considerations too – balance being a key one. For more info on balance and its importance, as well as for tests you can run to see whether insufficient data is a problem for your model, see our blog post “What’s Wrong with My Machine Learning Model.”
(Over)Instrumentation and Rich Data
Now that the coverage planning is done, the next step is to design the data collection rig. Our advice: Overdo it.
Since the labor and expense associated with data collection are usually very high, we generally suggest over-instrumenting the test rig in the earliest collects, and then using software to zero in on the most cost-effective rig for larger collection efforts later:
Use too many sensors: If you’re not sure what the right number or placement of sensors is, put them everywhere you think they might be able to go. Not sure whether sound or vibration will be more useful? Do both. Take pictures and videos of the rig in action. Collect data from all sensors simultaneously. Later, in the model building stage, you can exclude data channels selectively to determine the optimum combination.
Use too high a sample rate: Collect at the highest sample rate possible in the early stages, even if you think it’s implausible in production due to processing requirements or power consumption. Downsampling in software during the analysis/model building stage is easy and cheap. Going back and recollecting data is difficult and expensive.
Use rich data: Collect the least preprocessed signal data possible from your sensors. For accelerometers or vibration sensors, keep the time waveform if possible. (See our blog “Rich Data, Poor Data” to understand why.) For sound, collect uncompressed WAV files. Compression algorithms often discard exactly the information you need to detect a signature, and what’s worse, they do so unpredictably. Again, you can always compute features from the time waveform or try different compression methods in software later to see what works best. But going back to collect more data is time-consuming and expensive.
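All three pieces of advice rely on narrowing things down in software afterwards, so here is a rough sketch of what that can look like, using only the Python standard library. Everything in it is illustrative: the channel names and sample values are made up, block averaging is a deliberately crude stand-in for a proper anti-aliasing filter (a real pipeline would more likely use something like `scipy.signal.decimate`), and RMS, peak and crest factor are just three common waveform features, not a recommended feature set.

```python
import math
from itertools import combinations


def downsample(samples, factor):
    """Crude downsampling: average each run of `factor` consecutive samples.

    The averaging acts as a simple low-pass filter before the rate reduction.
    Trailing samples that do not fill a whole block are dropped.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    usable = len(samples) - len(samples) % factor
    return [sum(samples[i:i + factor]) / factor for i in range(0, usable, factor)]


def waveform_features(samples):
    """Compute RMS, peak and crest factor from a raw time waveform."""
    if not samples:
        raise ValueError("samples must not be empty")
    rms = math.sqrt(sum(x * x for x in samples) / len(samples))
    peak = max(abs(x) for x in samples)
    crest = peak / rms if rms > 0 else 0.0
    return {"rms": rms, "peak": peak, "crest_factor": crest}


def channel_subsets(channels):
    """Yield every non-empty combination of channel names, smallest first."""
    names = sorted(channels)
    for size in range(1, len(names) + 1):
        yield from combinations(names, size)


# Hypothetical over-instrumented recording: three channels, same length.
recording = {
    "accel_x": [0.0, 1.0, 0.0, -1.0] * 4,
    "accel_y": [0.5, 0.5, -0.5, -0.5] * 4,
    "mic": [1.0, -1.0] * 8,
}

candidates = []
for subset in channel_subsets(recording):
    for factor in (1, 2):  # full rate vs. half rate
        features = {
            name: waveform_features(downsample(recording[name], factor))
            for name in subset
        }
        # A real study would train and score a model on `features` here.
        candidates.append((subset, factor, features))
```

Note how the half-rate version of the hypothetical `mic` channel averages out to all zeros: the signal alternates at the highest frequency the original rate can carry, so it vanishes once the rate is halved. That is a small illustration of why it pays to record fast first and decide on the production rate later.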