The Complete Guide to Machine Learning
for Sensors and Signal Data
Machine learning for sensors and signal data is becoming easier than ever: hardware is becoming smaller and sensors are getting cheaper, making IoT devices widely available for a variety of applications ranging from predictive maintenance to user behavior monitoring.
Whether you are using sounds, vibrations, images, electrical signals, accelerometer data, or other kinds of sensor data, you can build richer analytics by teaching a machine to detect and classify events happening in real time, at the edge, using an inexpensive microcontroller for processing - even with noisy, high-variation data.
Go beyond the Fast Fourier Transform (FFT). This definitive guide to machine learning for high sample-rate sensor data is packed with tips from our signal processing and machine learning experts.
Download the full version of the e-book to read it at your own pace, or click on a section title to read the article.
Rich Data, Poor Data:
Getting the most out of Sensors
Accelerometers and vibration sensors are having their day. As prices have come down drastically, we are seeing more and more companies instrumenting all kinds of devices and equipment. Industrial, automotive and consumer products use cases are proliferating almost as fast as startups with “AI” in their names.
In many cases, particularly in industrial applications, the purpose of the new instrumentation is to monitor machines in new ways to improve uptime and reduce cost by predicting maintenance problems before they occur. Vibration sensors are an obvious go-to here, as vibration analysis has a long history in industrial circles for machine diagnosis.
At Reality AI, we see our industrial customers trying to get results from all kinds of sensor implementations. Many of these implementations are carefully engineered to provide reliable, controlled, ground-truthed, rich data. And many are not.
Working with accelerometers and vibrations
In vibration data, there are certainly things you can detect by just looking at how much something shakes. To see how much something is shaking, one generally looks at the amplitudes of the movement and calculates the amount of energy in the movement. Most often, this means using measures of vibration intensity such as RMS and “Peak-to-Peak”. Looking at changes in these kinds of measures can usually determine whether the machine is seriously out of balance, for instance, or whether it has been subject to an impact.
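As a minimal sketch of these two intensity measures, assuming NumPy is available and using a made-up 50 Hz test vibration (the signals and the names `healthy` and `unbalanced` are hypothetical, not taken from any customer data):

```python
import numpy as np

def rms(x):
    """Root-mean-square amplitude of a 1-D signal."""
    x = np.asarray(x, dtype=float)
    return float(np.sqrt(np.mean(x ** 2)))

def peak_to_peak(x):
    """Difference between the largest and smallest sample."""
    x = np.asarray(x, dtype=float)
    return float(np.max(x) - np.min(x))

# Hypothetical example: one second of a 50 Hz vibration sampled at 1 kHz
fs = 1000
t = np.arange(fs) / fs
healthy = 1.0 * np.sin(2 * np.pi * 50 * t)
unbalanced = 3.0 * np.sin(2 * np.pi * 50 * t)  # same motion, three times the amplitude

print("RMS:", rms(healthy), rms(unbalanced))
print("Peak-to-peak:", peak_to_peak(healthy), peak_to_peak(unbalanced))
```

Both measures scale with how hard the machine shakes, which is why they can flag an imbalance or an impact, but neither says anything about how the shaking is distributed across frequencies.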
For more subtle kinds of conditions, like identifying wear and maintenance issues, just knowing that a machine is shaking more isn’t enough. You need to know whether it’s shaking differently. That requires much richer information than a simple RMS energy. Higher sample rates are often required, and different measures. Trained vibration analysts would generally go to the Fast Fourier Transform (FFT) to calculate how much energy is present in different frequency bands, typically looking for spectral peaks at different multiples of the rotational frequency of the machine (for rotating equipment, that is; other kinds of equipment are more difficult with Fourier analysis). Other tools, like Reality AI, do more complex transforms based on the actual multidimensional time-waveforms captured directly from the accelerometer.
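The classic analyst's approach of looking for energy at multiples of the rotational frequency might be sketched as follows. This is an illustration under stated assumptions, not Reality AI's method: the function name `harmonic_energies`, the 30 Hz shaft speed, and the 2 Hz band half-width are all hypothetical choices.

```python
import numpy as np

def harmonic_energies(x, fs, rotation_hz, n_harmonics=3, half_width_hz=2.0):
    """Sum spectral power in a narrow band around each multiple of rotation_hz."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                      # drop the constant (gravity) offset
    window = np.hanning(len(x))           # taper the record to reduce spectral leakage
    power = np.abs(np.fft.rfft(x * window)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    energies = []
    for k in range(1, n_harmonics + 1):
        band = np.abs(freqs - k * rotation_hz) <= half_width_hz
        energies.append(float(power[band].sum()))
    return energies

# Hypothetical rotating machine: 30 Hz shaft speed with a weaker second harmonic
fs = 2000
t = np.arange(2 * fs) / fs
shaft_hz = 30.0
signal = np.sin(2 * np.pi * shaft_hz * t) + 0.5 * np.sin(2 * np.pi * 2 * shaft_hz * t)

e1, e2, e3 = harmonic_energies(signal, fs, shaft_hz)
print(e1, e2, e3)
```

Tracking how these per-harmonic energies change over time is one way to notice that a machine is shaking differently, not just more.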
Figure 1 - This example shows a time series of data from an accelerometer attached to a machine in a manufacturing facility. X, Y and Z components of the acceleration vector are averaged over one second. There is very little information in this data – in fact, just about all it can tell us is which direction gravity points. This data was provided from an actual customer implementation, and is basically useless for anomaly detection, condition monitoring, or predictive maintenance.
Figure 2 - This example shows vibration data pre-processed through a Fast Fourier Transform (FFT) at high frequency resolution. The X-axis is frequency and the Y-axis is intensity. This data is much more useful than Figure 1 - the spikes occurring at multiples of the base rotation frequency give important information about what’s happening in the machine, which makes this view most useful for rotating equipment. FFT data can be good for many applications, but it discards a great deal of information from the time domain. It shows only a snapshot in time – this entire chart is an expansion of a single data point from Figure 1.
Figure 3 - Raw time-waveform data as sampled directly from the accelerometer. This data is information-dense – being the raw data from which both the simple averages in Figure 1 and the FFT in Figure 2 were computed. Here we have frequency information at much higher resolution than the FFT, coupled with important time information such as transients and phase. We also see all of the noise, however, which can make it more difficult for human analysts to use. But data-driven algorithms like those used by Reality AI extract maximum value from this kind of data. It holds important signatures of conditions, maintenance issues, and anomalous behavior.
But rich data brings rich problems – more expensive sensors, difficulty in interrupting the line to install instrumentation, bandwidth requirements for getting data off the local node. Many just go with the cheapest possible sensor packages, limit themselves to simple metrics like RMS and Peak-to-Peak, and basically discard almost all of the information contained in those vibrations. Others use sensor packages that sample at higher rates and compute FFTs locally with good frequency resolution, and tools like Reality AI can make good use of this kind of data. Some, however, make the investment in sensors that can capture the original time-waveform itself at high sample rates, and work with tools like Reality AI to get as much out of their data as possible.
It’s not overkill
But I hear you asking “Isn’t that overkill?”
Do I really need high sample rates and time-waveforms, or at least high-resolution FFT? Maybe you do.
Are you trying to predict bearing wear in advance of a failure? Then you do.
Are you trying to identify subtle anomalies that aren’t manifested by large movements and heavy shaking? Then you do too.
Is the environment noisy? With a good bit of variation both in target and background? Then you really, really do.
Rich data, Poor data
Time waveform and high-resolution FFT are what we describe as “rich data.” There’s a lot of information in there, and they give analytical tools like ours which look for signatures and detect anomalies a great deal to work with. They make it possible to tell that, even though a machine is not vibrating “more” than it used to, it is vibrating “differently.”
RMS and Peak-to-Peak kinds of measures, on the other hand, are “poor data.” They don’t tell you much, and discard much of the information necessary to make the judgements that you most want to make. They’re basically just high-level descriptive statistics that discard almost all the essential signature information you need to find granular events and conditions that justify the value of the sensor implementation in the first place. And as this excellent example from another domain shows, descriptive statistics just don’t let you see the most interesting things.
Figure 4 – Why basic statistics are never enough. All of these plots have the same X and Y means, the same X and Y standard deviations, and the same X:Y correlation. With just the averages, you’d never see any of these patterns in your data. (source: https://www.autodeskresearch.com/publications/samestats)
In practical terms for vibration analysis, what does that mean? It means that by relying only on high-level descriptive statistics (poor data) rather than the time and frequency domains (rich data), you will miss anomalies, fail to detect signatures, and basically sacrifice most of the value that your implementation could potentially deliver. Yes, it may be more complicated to implement. It may be more expensive. But it can deliver exponentially higher value.
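To make the "vibrating differently, not more" point concrete, here is a small sketch, assuming NumPy and two invented test signals, in which the RMS is identical but the frequency content is not:

```python
import numpy as np

fs = 1000
t = np.arange(fs) / fs

# Two hypothetical one-second vibration records with the same amplitude
baseline = np.sin(2 * np.pi * 40 * t)    # energy concentrated at 40 Hz
changed = np.sin(2 * np.pi * 120 * t)    # same amplitude, energy moved to 120 Hz

def rms(x):
    """Root-mean-square amplitude ("poor data")."""
    return float(np.sqrt(np.mean(x ** 2)))

def dominant_frequency(x, fs):
    """Frequency in Hz of the largest peak in the magnitude spectrum ("rich data")."""
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    return float(freqs[np.argmax(spectrum)])

print("RMS:", rms(baseline), rms(changed))
print("Dominant frequency:", dominant_frequency(baseline, fs), dominant_frequency(changed, fs))
```

A monitor that only tracks RMS would report no change between these two records, while even a simple spectral measure separates them immediately.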
It's all about the Features
We’re an AI company, so people always ask about our algorithms. If we could get a dollar for every time we’re asked about which flavor of machine learning we use –convolutional neural nets, K-means, or whatever – we would never need another dollar of VC investment ever again.
But the truth is that algorithms are not the most important thing for building AI solutions -- data is. Algorithms aren’t even #2. People in the trenches of machine learning know that once you have the data, it’s really all about “features.”
In machine learning parlance, features are the specific variables that are used as input to an algorithm. Features can be selections of raw values from input data, or can be values derived from that data. With the right features, almost any machine learning algorithm will find what you’re looking for. Without good features, none will. And that's especially true for real-world problems where data comes with lots of inherent noise and variation.
My colleague Jeff (the other Reality AI co-founder) likes to use this example: Suppose I’m trying to detect when my wife comes home. I’ll take a sensor, point it at the doorway and collect data. To use machine learning on that data, I’ll need to identify a set of features that help distinguish my wife from anything else that the sensor might see. What would be the best feature to use? One that indicates, “There she is!” It would be perfect -- one bit with complete predictive power. The machine learning task would be rendered trivial.
If only we could figure out how to compute better features directly from the underlying data… Deep Learning accomplishes this trick with layers of convolutional neural nets, but that carries a great deal of computational overhead. There are other ways.
At Reality AI, where our tools create classifiers and detectors based on high sample rate signal inputs (accelerometer, vibration, sound, electrical signals, etc) that often have high levels of noise and natural variation, we focus on discovering features that deliver the greatest predictive power with the lowest computational overhead. Our tools follow a mathematical process for discovering optimized features from the data before worrying about the particulars of algorithms that will make decisions with those features. The closer our tools get to perfect features, the better end results become. We need less data, use less training time, are more accurate, and require less processing power. It's a very powerful method.
Features for signal classification
For an example, let’s look at feature selection in high-sample-rate (50 Hz and up) IoT signal data, like vibration or sound. In the signal processing world, the engineer’s go-to for feature selection is usually frequency analysis. The usual approach to machine learning on this kind of data would be to take a signal input, run a Fast Fourier Transform (FFT) on it, and consider the peaks in those frequency coefficients as inputs for a neural network or some other algorithm.
Why this approach? Probably because it’s convenient, since all the tools these engineers use support it. Probably because they understand it, since everyone learns the FFT in engineering school. And probably because it’s easy to explain, since the results are easily relatable back to the underlying physics. But the FFT rarely provides an optimal feature set, and it often blurs important time information that could be extremely useful for classification or detection in the underlying signals.
Take for example this early test comparing our optimized features to the FFT on a moderately complex, noisy group of signals. In the first graph below we show a time-frequency plot of FFT results on this particular signal input (this type of plot is called a spectrogram). The vertical axis is frequency, and the horizontal axis is time, over which the FFT is repeatedly computed for a specified window on the streaming signal. The colors are a heat-map, with the warmer colors indicating more energy in that particular frequency range.
Time-frequency plot showing features based on FFT
Time-frequency plot showing features based on Reality AI
Compare that chart to one showing optimized features for this particular classification problem generated using our methods. On this plot you can see what is happening with much greater resolution, and the facts become much easier to visualize. Looking at this chart it’s crystal clear that the underlying signal consists of a multi-tone low background hum accompanied by a series of escalating chirps, with a couple of other transient things going on. The information is de-blurred, noise is suppressed, and you don’t need to be a signal processing engineer to understand that the detection problem has just been made a whole lot easier.
There’s another key benefit to optimizing features from the get go – the resulting classifier will be significantly more computationally efficient. Why is that important? It may not be if you have unlimited, free computing power at your disposal. But if you are looking to minimize processing charges, or are trying to embed your solution on the cheapest possible hardware target, it is critical. For embedded solutions, memory and clock cycles are likely to be your most precious resources, and spending time to get the features right is your best way to conserve them.
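For readers who want to reproduce the FFT-based time-frequency plot described in this section, the sketch below repeatedly computes a windowed FFT over a streaming signal. It covers only the conventional FFT baseline, not our optimized features. The window size, hop, and test signal are hypothetical choices, and NumPy is assumed.

```python
import numpy as np

def spectrogram(x, fs, window_size=256, hop=128):
    """Repeated windowed FFTs: rows are time frames, columns are frequency bins."""
    x = np.asarray(x, dtype=float)
    window = np.hanning(window_size)
    starts = range(0, len(x) - window_size + 1, hop)
    frames = np.stack([x[s:s + window_size] * window for s in starts])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    freqs = np.fft.rfftfreq(window_size, d=1.0 / fs)
    times = (np.array(list(starts)) + window_size / 2) / fs  # frame centers in seconds
    return times, freqs, power

# Hypothetical test signal: a steady 200 Hz hum, plus a 1000 Hz tone that starts at t = 1 s
fs = 4000
t = np.arange(2 * fs) / fs
hum = 0.5 * np.sin(2 * np.pi * 200 * t)
tone = np.where(t < 1.0, 0.0, 1.0) * np.sin(2 * np.pi * 1000 * t)

times, freqs, power = spectrogram(hum + tone, fs)
print(power.shape)  # (number of time frames, number of frequency bins)
```

Plotting `power` as a heat map, with `times` on the horizontal axis and `freqs` on the vertical axis, gives the familiar spectrogram. Note the trade-off built into the window size: longer windows sharpen frequency resolution but blur timing, which is one reason the FFT can smear transient events.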
Deep Learning and Feature Discovery
At Reality AI, we have our own methods for discovering optimized features in signal data (read more about our Technology), but ours are not the only way.
As mentioned above, Deep Learning (DL) also discovers features, though they are rarely optimized. Still, DL approaches have been very successful with certain kinds of problems using signal data, including object recognition in images and speech recognition in sound. It can be a highly effective approach for a wide range of problems, but DL requires a great deal of training data, is not very computationally efficient, and can be difficult for a non-expert to use. There is often a sensitive dependence of classifier accuracy on a large number of configuration parameters, leading many of those who work with DL to focus heavily on tweaking previously used networks rather than focusing on finding the best features for each new problem. Learning happens “automatically”, so why worry about it?
My co-founder Jeff (the mathematician) explains that DL is basically “a generalized non-linear function mapping – cool mathematics, but with a ridiculously slow convergence rate compared to almost any other method.” Our approach, on the other hand, is tuned to signals but delivers much faster convergence with less data. On applications for which Reality AI is a good fit, this kind of approach will be orders of magnitude more efficient than DL.
The very public successes of Deep Learning in products like Apple’s Siri, the Amazon Echo, and the image tagging features available on Google and Facebook have led the community to over-focus a little on the algorithm side of things. There has been a tremendous amount of exciting innovation in ML algorithms in and around Deep Learning. But let's not forget the fundamentals.
It’s really all about the features.
Machine Learning: the lab vs the real world
Not long ago, TechCrunch ran a story reporting on Carnegie Mellon research showing that an “Overclocked smartwatch sensor uses vibrations to sense gestures, objects and locations.” These folks at the CMU Human-Computer Interaction Institute had apparently modified a smartwatch OS to capture 4 kHz accelerometer waveforms (most wearable devices capture at rates up to 0.1 kHz), and discovered that with more data you could detect a lot more things. They could detect specific hand gestures, and could even tell what kind of thing a person was touching or holding based on vibrations communicated through the human body. (“Is that an electric toothbrush, a stapler, or the steering wheel of a running automobile?”)
To those of us working in the field, including those at Carnegie Mellon, this was no great revelation. “Duh! Of course, you can!” It was a nice-but-limited academic confirmation of what many people already know and are working on. TechCrunch, however, in typical breathless fashion, reported as if it were news. Apparently, the reporter was unaware of the many commercially available products that perform gesture recognition (among them Myo from Thalmic Labs, using its proprietary hardware, or some 20 others offering smartwatch tools). It seems he was also completely unaware of commercially available toolkits for identifying very subtle vibrations and accelerometry to detect machine conditions in noisy, complex environments (like our own Reality AI for Industrial Equipment Monitoring), or to detect user activity and environment in wearables (Reality AI for Consumer Products).
But my purpose is not to air sour grapes over lazy reporting. Rather, I’d like to use this case to illustrate some key issues about using machine learning to make products for the real world: Generalization vs Overtraining, and the difference between a laboratory trial (like that study) and a real-world deployment.
Generalization and Overtraining
Generalization refers to the ability of a classifier or detector, built using machine learning, to correctly identify examples that were not included in the original training set. Overtraining refers to a classifier that has learned to identify with high accuracy the specific examples on which it was trained, but does poorly on similar examples it hasn't seen before. An overtrained classifier has learned its training set “too well” – in effect memorizing the specifics of the training examples without the ability to spot similar examples again in the wild. That’s ok in the lab when you’re trying to determine whether something is detectable at all, but an overtrained classifier will never be useful out in the real world.
Typically, the best guard against overtraining is to use a training set that captures as much of the expected variation in target and environment as possible. If you want to detect when a type of machine is exhibiting a particular condition, for example, include in your training data many examples of that type of machine exhibiting that condition, and exhibiting it under a range of operating conditions, loads, etc.
It also helps to be very skeptical of “perfect” results. Accuracy nearing 100% on small sample sets is a classic symptom of overtraining.
It’s impossible to be sure without looking more closely at the underlying data, model, and validation results, but this CMU study shows classic signs of overtraining. Both the training and validation sets contain a single example of each target machine collected under carefully controlled conditions. And to validate, they appear to use a group of 17 subjects holding the same single examples of each machine. In a nod to capturing variation, they have each subject stand in different rooms when holding the example machines, but it's a far cry from the full extent of real-world variability. Their result has most objects hitting 100% accuracy, with a couple of objects showing a little lower.
Small sample sizes. Reuse of training objects for validation. Limited variation. Very high accuracy... Classic overtraining.
Illustration from the CMU study using vibrations captured with an overclocked smartwatch to detect what object a person is holding.
Detect overtraining and predict generalization
It is possible to detect overtraining and estimate how well a machine learning classifier or detector will generalize. At Reality AI, our go-to diagnostic is the K-fold Validation, generated routinely by our tools.
K-fold validation involves repeatedly 1) holding out a randomly selected portion of the training data (say 10%), 2) training on the remainder (90%), 3) classifying the holdout data using the 90% trained model, and 4) recording the results. Generally, hold-outs do not overlap, so, for example, 10 independent trials would be completed for a 10% holdout. Holdouts may be balanced across groups and validation may be averaged over multiple runs, but the key is that in each iteration the classifier is tested on data that was not part of its training. The accuracy will almost certainly be lower than what you compute by applying the model to its training data (a stat we refer to as “class separation”, rather than accuracy), but it will be a much better predictor of how well the classifier will perform in the wild – at least to the degree that your training set resembles the real world.
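The four steps above can be sketched in a few lines. This is a generic illustration, not the implementation inside our tools: the nearest-centroid classifier is a deliberately simple stand-in, the two-class data is synthetic, and every function name here is hypothetical. NumPy is assumed.

```python
import numpy as np

def fit_centroids(X, y):
    """Train a nearest-centroid classifier: one mean vector per class."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict_centroids(model, X):
    """Assign each row of X to the class with the closest centroid."""
    classes, centroids = model
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(distances, axis=1)]

def k_fold_accuracy(X, y, k=10, seed=0):
    """Mean holdout accuracy over k non-overlapping, randomly selected folds."""
    order = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(order, k)
    scores = []
    for i in range(k):
        test_idx = folds[i]                                   # 1) hold out one fold
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit_centroids(X[train_idx], y[train_idx])     # 2) train on the rest
        predicted = predict_centroids(model, X[test_idx])     # 3) classify the holdout
        scores.append(np.mean(predicted == y[test_idx]))      # 4) record the result
    return float(np.mean(scores))

# Synthetic two-class feature data, for illustration only
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(3.0, 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

accuracy = k_fold_accuracy(X, y, k=10)
print(f"10-fold accuracy: {accuracy:.3f}")
```

With k=10, each holdout is 10% of the data and no example is ever scored by a model that saw it during training.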
Counter-intuitively, classifiers with weaker class separation often hold up better in K-fold. It is not uncommon that a near perfect accuracy on the training data drops precipitously in K-fold while a slightly weaker classifier maintains excellent generalization performance. And isn’t that what you’re really after? Better performance in the real world on new observations?
Getting high class separation, but low K-fold? You have a model that has been overtrained, with poor ability to generalize. Back to the drawing board. Maybe select a less aggressive machine learning model, or revisit your feature selection. Reality AI does this automatically.
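A toy comparison shows the pattern. In this sketch (synthetic, overlapping classes, hypothetical function names, NumPy assumed), a 1-nearest-neighbor classifier memorizes its training set and scores perfectly on it, while a simpler nearest-centroid classifier scores about the same on training data as in K-fold:

```python
import numpy as np

def one_nn_predict(X_train, y_train, X):
    """1-nearest-neighbor: effectively memorizes every training example."""
    d = np.linalg.norm(X[:, None, :] - X_train[None, :, :], axis=2)
    return y_train[np.argmin(d, axis=1)]

def centroid_predict(X_train, y_train, X):
    """Nearest-centroid: a much less aggressive model."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(d, axis=1)]

def k_fold_accuracy(predict, X, y, k=10, seed=0):
    """Mean accuracy over k non-overlapping holdouts."""
    order = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(order, k)
    scores = []
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        predicted = predict(X[train], y[train], X[folds[i]])
        scores.append(np.mean(predicted == y[folds[i]]))
    return float(np.mean(scores))

# Synthetic, heavily overlapping two-class data (illustration only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (150, 2)), rng.normal(1.5, 1.0, (150, 2))])
y = np.array([0] * 150 + [1] * 150)

results = {}
for name, predict in [("1-NN", one_nn_predict), ("centroid", centroid_predict)]:
    separation = float(np.mean(predict(X, y, X) == y))  # accuracy on its own training data
    kfold = k_fold_accuracy(predict, X, y)
    results[name] = (separation, kfold)
    print(f"{name}: class separation={separation:.2f}, k-fold={kfold:.2f}")
```

The gap between the two numbers for a given model is the warning sign: a large gap suggests memorization, while a small gap suggests the training score is a fair preview of performance on new data.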
Be careful, though, because the converse is not true: A good K-fold does not guarantee a deployable classifier. The only way to know for sure what you've missed in the lab is to test in the wild. Not perfect? No problem: collect more training data capturing more examples of underrepresented variation. A good development tool (like ours) will make it easy to support rapid, iterative improvements of your classifiers.
Lab Experiments vs Real World Products
Lab experiments like this CMU study don’t need to care much about generalization – they are constructed to illustrate a very specific point, prove a concept, and move on. Real-world products, on the other hand, must perform a useful function in a variety of unforeseen circumstances. For machine learning classifiers used in real-world products, the ability to generalize is critical.
But it's not the only thing. Deployment considerations matter too. Can it run in the cloud, or is it destined for a processor-, memory- and/or power-constrained environment? (To the CMU guys – good luck getting acceptable battery life out of an overclocked smartwatch!) How computationally intensive is the solution, and can it be run in the target environment with the memory and processing cycles available to it? What response-time or latency is acceptable? These issues must be factored into a product design, and into the choice of machine-learning model supporting that product.
Tools like Reality AI can help. R&D engineers use Reality AI Tools to create machine learning-based signal classifiers and detectors for real-world products, including wearables and machines, and can explore connections between sample rate, computational intensity, and accuracy. They can train new models and run k-fold diagnostics (among others) to guard against overtraining and predict the ability to generalize. And when they’re done, they can deploy to the cloud, or export code to be compiled for their specific embedded environment.
R&D engineers creating real-world products don’t have the luxury of controlled environments – overtraining leads to a failed product. Lab experiments don’t face that reality. Neither do TechCrunch reporters.
Model-Driven vs Data-Driven methods for working with Sensors and Signals
There are two main paradigms for solving classification and detection problems in sensor data: Model-driven, and Data-driven.
Model-Driven is the way everybody learned to do it in Engineering School.
Start with a solid idea of how the physical system works -- and by extension, how it can break. Consider the states or events you want to detect and generate a hypothesis about what aspects of that might be detectable from the outside and what the target signal will look like. Collect samples in the lab and try to confirm a correlation between what you record and what you are trying to detect. Then engineer a detector by hand to find those hard-won features out in the real world, automatically.
Data-Driven is a new way of thinking, enabled by machine learning. Find an algorithm that can spot connections and correlations that you may not even know to suspect. Turn it loose on the data. Magic follows. But only if you do it right.
Both of these approaches have their pluses and minuses:
Model-Driven approaches limit complexity
Model-driven approaches are powerful because they rely on a deep understanding of the system or process, and can benefit from scientifically established relationships. But models can’t accommodate infinite complexity and generally must be simplified. They have trouble accounting for noisy data and non-included variables. At some level they’re limited by the amount of complexity their inventors can hold in their heads.
Model-Driven is expensive and takes time
Who builds models? The engineers that understand the physical, mechanical, electronic, data flow, or other appropriate details of the complex system -- in-house experts or consultants that work for a company and develop its products or operational machinery. These are generally experienced experts, very busy, and are both scarce and expensive resources.
Furthermore, modeling takes time. It is inherently a trial-and-error approach, rooted in the old scientific method of theory-based hypothesis formation and experiment-based testing. Finding a suitable model and refining it until it produces the desired results is often a lengthy process.
Data-Driven is Data Hungry
Data-Driven approaches based on machine learning require a good bit of data to get decent results. AI tools that discover features and train-up classifiers learn from examples, and there need to be enough examples to cover the full range of expected variation and null cases. Some tools (like our Reality AI) are powerful enough to generalize from limited training data and discover viable feature sets and decision criteria on their own, but many machine learning approaches require truly Big Data to get meaningful results and some demand their own type of experts to set them up.
Reality AI tools are data-driven machine learning tools optimized for sensors and signals.
To learn more about our data-driven methods visit our Technology page and download our technical white-paper.
What is a Sensor, anyway?
Sensor: a device that detects or measures a physical property and records, indicates, or otherwise responds to it.
We all know what a sensor is, right? A sensor makes "sense" of a physical property -- it turns something about the physical world into data upon which a system can act. Traditionally, sensors have filled well defined, single-purpose roles: A thermostat, a pressure switch, a motion detector, an oxygen sensor, a knock detector, a smoke detector, a voltage arrestor. Measure one thing, and transmit a very simple message about that one thing. This thinking stems from several hundred years of physical engineering of devices and persists today in part because of the convenience of modular thinking in system design.
But this is changing. Fundamentally. Software is becoming the new sensor.
Consider your smartphone: it has sound and image capability, along with a multi-axis accelerometer, 3-axis gyroscope, magnetic compass, air pressure, light levels, touch, you name it. The sensor suite on current generations of smartphone would completely outclass many sensor packages flown by the US military not long ago. Some of these phone-based sensors still use dedicated hardware to reduce transduced data to information, but increasingly it's all done in software: acceleration, gyro and other data is reduced to a screen orientation, to a "phone-to-ear" detector, and to navigational inputs. Instead of the old-school sensor design, chips capable of capturing highly granular physical inputs at high-frequency sample rates feed software run in local memory on a local processor, reducing that data stream into specific inputs needed for a variety of different purposes by the OS and by apps. Even the radio components are becoming a software function.
Because the decision engine no longer owns the transducer, the underlying data is also available in its raw form. This means a smartphone app can increasingly leverage the same sensor data to make its own decisions in ways never specifically intended by the hardware designer. Am I running, walking, or standing in line? Am I on the bus or in the car? How's my driving? How’s my workout going? Is it getting dark out? Is the ambient crowd noise loud enough that I should turn up the volume? What's the gender and age of the speaker?
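As a toy illustration of a software-defined sensor, the sketch below reduces a window of raw 3-axis accelerometer samples to a screen-orientation decision. It is not how any particular phone OS does it. The axis convention, the orientation labels, and the function name are all assumptions made for this example, and NumPy is assumed.

```python
import numpy as np

def screen_orientation(accel_xyz):
    """Classify device orientation from a window of accelerometer samples (units of g).

    Assumed axis convention for this sketch: +x points toward the right edge of
    the screen, +y toward the top edge, +z out of the screen, and a device at
    rest reads about +1 g along whichever axis points up.
    """
    # Averaging the window acts as a crude low-pass filter that isolates gravity
    gx, gy, gz = np.asarray(accel_xyz, dtype=float).mean(axis=0)
    if abs(gz) > max(abs(gx), abs(gy)):
        return "flat"
    if abs(gy) >= abs(gx):
        return "portrait" if gy > 0 else "portrait (inverted)"
    return "landscape (x up)" if gx > 0 else "landscape (x down)"

# Hypothetical noisy readings: 50 samples with the device held upright, then lying on a table
rng = np.random.default_rng(0)
upright = np.array([0.0, 1.0, 0.0]) + 0.05 * rng.normal(size=(50, 3))
on_table = np.array([0.0, 0.0, 1.0]) + 0.05 * rng.normal(size=(50, 3))

print(screen_orientation(upright), screen_orientation(on_table))
```

The same raw samples could feed a step counter, a "phone-to-ear" detector, or a driving-behavior monitor. Only the software changes.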
A home security system based on similar thinking, with a microphone and a suitable microcontroller, can do much with a software-defined audio processing capability: A glass break detector, a footstep detector, a heartbeat counter, a doorknob rattle detector, a dog bark or shout detector, a trip and fall sensor for grandma, an unauthorized teenager party alarm, all of these sensors defined in software within the same, flexible hardware box. No longer the "one box, one answer", traditional security sensor design.
Industrial IoT applications are numerous, and many are already in production. Dedicated physical thermocouples and vibration-limit-switches are being replaced with digital temperature probes and accelerometers attached to embedded microcontrollers. New software-defined sensing can now employ AI and predictive analytics (like ours) to intervene before a problem happens. We can now alert operators to a pending issue or needed maintenance with time for critical, high-value-processes to be spooled down in a controlled, planned fashion – or resolved during the next scheduled downtime so no interruption is necessary at all. Manufacturers and insurers can be kept in the loop regarding equipment field issues and parts needs, and can perform post-event forensics after critical failures.
Automotive sensors are also driving this way: decision making functions are trending away from hard wired, end-point transducers and toward onboard computers. Makers know this places increasing flexibility and software-adaptive capability into the hands of the system designers.
Modularity is increasingly moving from the physical layer to a network layer, in which modules are connected on a peer-to-peer network, exchanging packetized information. While this network layer begins as a digital substitute for individual electrical circuits, with increasing bandwidth capacity it can also provide the flexibility for devices to share underlying data as well as local yes/no decisions. This creates an unprecedented opportunity both to integrate information across modalities and to add brand new capability, ad hoc, in the form of software sensors.
This shift in thinking also opens the door to incorporating more complex, AI-based algorithms, rather than just simple condition thresholds. Sensor information can be integrated in ever more complex ways, and even the innocuous electrical panel circuit breaker is becoming a micro-controller powered, software sensing device.
This only makes sense.
5 tips for collecting Machine Learning data from high-sample-rate Sensors
Machine learning is hard. Programming for embedded environments, where processing cycles, memory and power are all in short supply, is really hard. Deploying machine learning to embedded targets, well… That’s been pretty close to impossible for all but the simplest scenarios… Until now.
Modern tools, including ours from Reality AI, are making it possible to use highly sophisticated machine learning models in embedded solutions like never before. But there are some important considerations.
What is high-sample-rate data?
But let’s start by being clear about what we’re talking about: high-sample-rate sensor data includes things like sound (8 kHz – 44 kHz), accelerometry (25 Hz and up), vibration (100 Hz on up to MHz), voltage and current, biometrics, and any other kind of physical-world data that you might think of as a waveform. With this kind of data, you are generally out of the realm of the statistician, and firmly in the territory of the signal processing engineer.
Machine logs and slower time series (eg pressure and temperature once per minute) can be analyzed effectively using both statistical and machine learning methods intended for time series data. But these higher-sample-rate datasets are much more complex, and these basic tools just won’t work.
One second of sound captured at 14.4kHz contains 14,400 data points, and the information it contains is more than just a statistical time series of pressure readings. It's a physical wave, with all of the properties that come along with physical waves, including oscillations, envelopes, phase, jitter, transients, and so on.
Figure: One second of data captured at 300 Hz. At this rate, it becomes possible to see the underlying vibration (a fan turning at around 70 revs per second).
Figure: The same one second of data sampled at 50 Hz.
Figure: The same one second of data captured at 10 Hz.
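The difference between these three captures can be estimated with a little arithmetic. The sketch below is illustrative only: it assumes the fan produces a pure tone at exactly 70 Hz (the caption says only "around 70 revs per second"), and apparent_frequency is a hypothetical helper name, not part of any library. It folds the true frequency into the range each sample rate can actually represent (zero up to half the sample rate).

```python
def apparent_frequency(true_hz, sample_rate_hz):
    """Frequency at which a pure tone appears after sampling.

    Anything above half the sample rate (the Nyquist limit) folds back
    into the range 0 .. sample_rate_hz / 2.
    """
    folded = true_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)


FAN_HZ = 70  # assumption: treat "around 70 revs per second" as exactly 70 Hz

for rate_hz in (300, 50, 10):
    seen_hz = apparent_frequency(FAN_HZ, rate_hz)
    print(f"sampled at {rate_hz} Hz -> tone appears at {seen_hz} Hz")
```

Under these assumptions the 70 Hz tone is represented faithfully at 300 Hz, shows up as a false 20 Hz component at 50 Hz, and collapses to a constant value at 10 Hz.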
It’s all about the features
For machine learning, this kind of data presents another problem: high dimensionality. That one second of sound with 14,400 points, if used raw, is treated by most machine learning methods as a single vector with 14,400 columns. Given thousands, let alone tens of thousands, of such observations, most machine learning algorithms will choke. Deep Learning (DL) methods offer a way of dealing with this high dimensionality, but the need to stream real-time, high-sample-rate data to a deep learning cloud service makes DL impractical for many applications.
So to apply machine learning, we compute “features” from the incoming data that reduce the large number of incoming data points to something more tractable. For data like sound or vibration, most engineers would probably try selecting peaks from a Fast Fourier Transform (FFT), a process that reduces raw waveform data to a set of coefficients, each representing the amount of energy contained in a slice of the frequency spectrum. But there is a wide array of options available for feature selection, each better suited to different circumstances. For more on features, see our blog post “It’s all about the features”.
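As a rough illustration of the spectral-peak approach, the sketch below reduces a 300-sample window to two (frequency, magnitude) pairs. It is a minimal example under stated assumptions, not the feature pipeline used by any particular tool: the signal is synthetic, the function names are hypothetical, and a naive O(n²) discrete Fourier transform stands in for a real FFT so the code needs only the standard library. "Peaks" here simply means the strongest bins, not true local maxima.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitude of each non-negative frequency bin (naive DFT, not a fast FFT)."""
    n = len(signal)
    magnitudes = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(signal))
        magnitudes.append(abs(acc) / n)
    return magnitudes


def top_peaks(signal, sample_rate_hz, count=3):
    """Return the `count` strongest (frequency_hz, magnitude) pairs, skipping DC."""
    magnitudes = dft_magnitudes(signal)
    bin_hz = sample_rate_hz / len(signal)  # width of one frequency bin
    ranked = sorted(range(1, len(magnitudes)),
                    key=lambda k: magnitudes[k], reverse=True)
    return [(k * bin_hz, magnitudes[k]) for k in ranked[:count]]


# Synthetic one-second record: a 70 Hz tone plus a weaker 140 Hz harmonic.
SAMPLE_RATE_HZ = 300
signal = [math.sin(2 * math.pi * 70 * i / SAMPLE_RATE_HZ)
          + 0.3 * math.sin(2 * math.pi * 140 * i / SAMPLE_RATE_HZ)
          for i in range(SAMPLE_RATE_HZ)]

features = top_peaks(signal, SAMPLE_RATE_HZ, count=2)
for freq_hz, magnitude in features:
    print(f"{freq_hz:.0f} Hz  magnitude {magnitude:.2f}")
```

Three hundred raw values become four numbers. That is convenient for a classifier, but everything not captured by those bins (phase, transients, envelope shape) is gone.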
But this is about collecting data
But this post is really about collecting data – in particular about collecting data from high-sample-rate sensors for use with machine learning. In our experience, data collection is the most expensive, most time-consuming part of any project. So, it makes sense to do it right, right from the beginning. Don't forget to check our Data Collection and Labeling services.
Here are our five top suggestions for data collection to make your project successful:
1. Collect rich data
Though it may be difficult to work with directly, the raw, fully detailed, time-domain input collected by your sensor is extremely valuable. Don’t discard it after you’ve computed an FFT and RMS; keep the original sampled signal. The best machine learning tools available can make sense of it and extract the maximum information content. For more on why this is important, see our blog post on “Rich Data, Poor Data”.
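A small sketch shows why a summary statistic alone is not enough. The two synthetic tones below (hypothetical test signals, chosen only for illustration) have the same RMS even though they are completely different waveforms, so a system that stored only RMS could never tell them apart afterwards.

```python
import math

SAMPLE_RATE_HZ = 300  # assumed sample rate for this illustration


def tone(freq_hz):
    """One second of a unit-amplitude sine wave at `freq_hz`."""
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE_HZ)
            for i in range(SAMPLE_RATE_HZ)]


def rms(signal):
    """Root-mean-square level of a sampled signal."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


slow = tone(20)
fast = tone(70)

print(round(rms(slow), 4), round(rms(fast), 4))  # prints: 0.7071 0.7071
print(slow == fast)                              # prints: False
```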
2. Use the maximum sample rate available, at least at first
A higher sample rate takes more bandwidth to transmit and more space to store, but it's much easier to downsample in software than to go back and re-collect data to see if a higher sample rate will help improve accuracy. Really great tools for working with sensor data will let you use software to downsample repeatedly and explore the relationship between sample rate and model accuracy. If you do this with your early data, once you have a preliminary model in place you can optimize your rig and design the most cost-effective solution for broader deployment later, knowing that you’re making the right call.
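A sample-rate sweep of this kind might be sketched as follows. Everything here is an assumption made for illustration: the recording is placeholder data, the 70 Hz target frequency is hypothetical, and block averaging is only a crude stand-in for a proper anti-aliasing filter (a real pipeline would more likely use something like scipy.signal.decimate). The comment in the loop marks where model training and scoring would go.

```python
def downsample(signal, factor):
    """Average each run of `factor` samples and keep one value per run.

    Averaging acts as a crude low-pass filter before the rate is reduced;
    any leftover samples at the end that do not fill a run are dropped.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    usable = len(signal) - len(signal) % factor
    return [sum(signal[i:i + factor]) / factor
            for i in range(0, usable, factor)]


ORIGINAL_RATE_HZ = 300
TARGET_HZ = 70  # hypothetical frequency the model needs to see
recording = [float(i % 7) for i in range(ORIGINAL_RATE_HZ)]  # placeholder data

sweep = []
for factor in (1, 2, 3, 6):
    reduced = downsample(recording, factor)
    rate_hz = ORIGINAL_RATE_HZ / factor
    target_representable = TARGET_HZ < rate_hz / 2  # below the Nyquist limit?
    # A real study would train and score a model on `reduced` here.
    sweep.append((factor, rate_hz, len(reduced), target_representable))

for factor, rate_hz, length, ok in sweep:
    print(f"factor {factor}: {rate_hz:.0f} Hz, {length} samples, target visible: {ok}")
```

Because the sweep runs entirely in software on data already collected, it costs nothing to repeat with different factors, which is the argument for collecting at the highest rate first.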
3. Don’t over-engineer your rig
Do what’s easiest first, and what’s more expensive once you know it’s worth it. If one is available to support your use case, try a prototyping device for your early data collects to explore both project feasibility and the real requirements for your data collection rig before you commit.
4. Plan your data collect to cover all sources of variation
Successful real-world machine learning is an exercise in overcoming variation with data. Variation can be related to the target (what you are trying to detect), to the background (noise, different environments and conditions) and to the collection equipment (different sensors, placement, variations in mounting). Minimize any unnecessary variation (variation in the equipment is usually the easiest to eliminate or control) and make sure your data captures as much of the likely real-world target variation, in as many different backgrounds, as possible. The better you cover the gamut of background, target and equipment variation, the more successful your machine learning project will be, meaning the more accurately it will predict in the real world.
5. Collect iteratively
Machine learning works best as an iterative process. Start off by collecting just enough data to build a bare-bones model that proves the effectiveness of the technique, even if not yet for the full range of variation expected in the real world, and then use those results to fine-tune your approach. Engage with the analytical tools early – right from the beginning – and use them to judge your progress. Take the next data you get from the field and test your bare-bones model against it to get an accuracy benchmark. Take note of specific areas where it performs well and performs poorly. Retrain using the new data and test again. Use this to chart your progress and also to guide your data collection – circumstances where the model performs poorly are circumstances where you’ll want to collect more data. When you get to the point where you’re getting acceptable accuracy on new data coming in – what we call “generalizing” – you’re just about done. Now you can focus on model optimization and tweaking to get the best possible performance.
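The test-then-retrain loop described above can be sketched in a few lines. This is a toy illustration, not a recommended model: the nearest-centroid classifier, the single numeric feature, the "normal"/"fault" labels and the three collection rounds are all made up for the example. The point is the order of operations: each new batch is scored with the existing model before it is folded into the training set.

```python
def train(samples):
    """Fit a nearest-centroid classifier: the mean feature value per label."""
    sums, counts = {}, {}
    for value, label in samples:
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(model, value):
    """Return the label whose centroid is closest to `value`."""
    return min(model, key=lambda label: abs(model[label] - value))


def accuracy(model, samples):
    """Fraction of samples the model labels correctly."""
    hits = sum(predict(model, value) == label for value, label in samples)
    return hits / len(samples)


# Hypothetical collection rounds of (feature value, label) pairs.
rounds = [
    [(0.1, "normal"), (0.2, "normal"), (0.9, "fault"), (1.0, "fault")],
    [(0.3, "normal"), (0.6, "fault"), (0.5, "fault"), (0.25, "normal")],
    [(0.35, "normal"), (0.55, "fault"), (0.15, "normal"), (0.8, "fault")],
]

collected = list(rounds[0])
model = train(collected)  # bare-bones model from the first collect
history = []

for new_batch in rounds[1:]:
    history.append(accuracy(model, new_batch))  # benchmark on unseen data first
    collected.extend(new_batch)                 # then add it to the training set
    model = train(collected)                    # and retrain

print(history)
```

Tracking the benchmark per round, and looking at which samples were missed, is what tells you where to aim the next data collect.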