What would it take to build an open-source sensor-data collection and analysis platform?
I’ve been thinking about a form of this question for several months. I’ve always been intrigued by the notion of push-button analysis that could be applied generically to a data set. The goal of this platform wouldn’t be to provide meaning or understanding to data; that’s left as an exercise for the end user. The broader goal is to make time series data useful more quickly by removing the boilerplate.
Back to the original question: what would it take to build an open-source sensor-data collection and analysis platform? That’s actually an umbrella question. The specific questions (and some answers) include, but are not limited to, the following:
What capabilities need to be present in such a platform?
- Sensor data may be sent as individual events or, more likely, in batches.
- A batch would come either from a single sensor (e.g. once per hour) or from a group of sensors (via a local collection device) sending at predetermined intervals.
- Data would be made available via multiple views:
  - Simple aggregations (think of dashboards and basic reports)
  - Advanced analytical algorithms
  - Ad-hoc exploration via Hive (perhaps)
- Basic alerting functionality (tied to one or more metrics)
- Ability to schedule analysis functions on an ongoing basis, e.g. each week, compute a regression over the past 7 days
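The scheduled-analysis capability (a weekly regression over the past 7 days) could look roughly like the sketch below. It is only an illustration: the function names `trailing_window`, `linear_trend`, and `weekly_trend` are hypothetical, readings are assumed to be `(timestamp, value)` pairs, and a plain least-squares line stands in for whatever regression the platform would actually offer.

```python
from datetime import datetime, timedelta


def trailing_window(readings, now, days=7):
    """Keep (timestamp, value) pairs that fall within the last `days` days."""
    cutoff = now - timedelta(days=days)
    return [(ts, v) for ts, v in readings if cutoff <= ts <= now]


def linear_trend(readings):
    """Ordinary least-squares fit of value against elapsed hours.

    Returns (slope_per_hour, intercept), or None if a line cannot be fitted
    (fewer than two points, or all points share one timestamp).
    """
    if len(readings) < 2:
        return None
    t0 = min(ts for ts, _ in readings)
    xs = [(ts - t0).total_seconds() / 3600.0 for ts, _ in readings]
    ys = [v for _, v in readings]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return None
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def weekly_trend(readings, now):
    """The job a scheduler would run once a week (hypothetical entry point)."""
    return linear_trend(trailing_window(readings, now))
```

A scheduler (cron, for instance) would call `weekly_trend` with the stored readings and the current time, then save the result for the dashboard to display.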
What are the necessary hardware and software components?
There’s the ideal version:
- Multiple endpoints fronted by a load balancer to accept and queue up incoming data
- Queue to hold incoming data
- Consumer to pull data from queue and insert into database
- User-facing app to display dashboards/reports, schedule analysis, view analysis results
- App provisions temporary compute nodes to analyze data (using Hadoop)
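The endpoint, queue, and consumer pieces of the ideal version can be sketched in a single process to show how they fit together. This is a toy under stated assumptions: Python's in-memory `queue.Queue` stands in for a real queuing service, a list stands in for the database, and `endpoint`, `consumer`, and `run_demo` are hypothetical names.

```python
import queue
import threading

STOP = object()  # sentinel that tells the consumer to shut down


def endpoint(q, batch):
    """What an ingest endpoint would do: reject empty input, enqueue, return fast."""
    if not batch:
        return False
    q.put(batch)
    return True


def consumer(q, insert):
    """Pull batches off the queue and hand each one to `insert` until STOP arrives."""
    while True:
        batch = q.get()
        if batch is STOP:
            break
        insert(batch)


def run_demo():
    q = queue.Queue(maxsize=1000)  # bounded, so a slow consumer applies back-pressure
    stored = []                    # stand-in for the database
    worker = threading.Thread(target=consumer, args=(q, stored.extend))
    worker.start()
    endpoint(q, [("temp-1", 1700000000.0, 21.5), ("temp-1", 1700003600.0, 21.9)])
    endpoint(q, [("hum-1", 1700000000.0, 40.2)])
    q.put(STOP)
    worker.join()
    return stored
```

The point of the queue is that the endpoint never waits on the database: it accepts the batch and returns, and the consumer drains the queue at whatever rate the database allows.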
And the minimally-viable version:
- Simple app/endpoint to accept incoming data
- Endpoint inserts data into database
- User-facing app to display dashboards/reports, view analysis results
- App provisions temporary compute nodes to analyze data (using Python)
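For the minimally-viable version, the first two items (an endpoint that accepts data and inserts it into a database) fit in a few dozen lines of standard-library Python. The sketch below is not a recommendation: SQLite is used purely as a stand-in database, and the JSON payload shape (`{"readings": [{"sensor_id", "ts", "value"}]}`), the table layout, and the names `open_db`, `store_batch`, `make_handler`, and `serve` are all assumptions made for illustration.

```python
import json
import sqlite3
from http.server import BaseHTTPRequestHandler, HTTPServer


def open_db(path=":memory:"):
    """Open the database and make sure the readings table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings ("
        "sensor_id TEXT NOT NULL, ts REAL NOT NULL, value REAL NOT NULL)"
    )
    return conn


def store_batch(conn, body):
    """Parse a JSON batch and insert it; returns the number of rows written."""
    payload = json.loads(body)
    rows = [
        (r["sensor_id"], float(r["ts"]), float(r["value"]))
        for r in payload["readings"]
    ]
    with conn:  # commits on success, rolls back if an insert fails
        conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)
    return len(rows)


def make_handler(conn):
    """Build a request handler class bound to one database connection."""

    class IngestHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            try:
                count = store_batch(conn, self.rfile.read(length))
            except (ValueError, KeyError, TypeError):
                self.send_error(400, "malformed batch")
                return
            self.send_response(201)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(json.dumps({"stored": count}).encode())

    return IngestHandler


def serve(db_path="readings.db", port=8080):
    """Run the single-threaded ingest endpoint until interrupted."""
    HTTPServer(("127.0.0.1", port), make_handler(open_db(db_path))).serve_forever()
```

Calling `serve()` starts the endpoint. A collection device would then POST a batch to it, and the dashboard app would read from the same table.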
Is any of this available off-the-shelf?
Several useful components are available in pre-packaged form. Still, most of these are disparate pieces; what we’re building here is “a lot of glue”. Several AWS services are a great fit (compute, storage, MapReduce, queuing), but the cloud is not always the best choice. Another approach is OpenStack: several of the core components exist (compute, storage), and the platform is open and being extended.
Some components that will definitely be useful to start:
- AWS (see above)
- Pandas, SciPy, NumPy
- MongoDB (well suited to storing time series data in a pre-aggregated form)
- Arduinos and ZigBee to create a simple wireless sensor network.
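To make "pre-aggregated form" concrete, here is one possible document shape: one document per sensor per hour, holding running totals plus the individual per-minute values. The sketch uses plain Python dictionaries rather than a MongoDB driver, and the field names (`count`, `total`, `minutes`) and the helpers `bucket_key` and `record` are hypothetical. In MongoDB itself, the same update would presumably be an upsert using the `$inc` and `$set` operators.

```python
from datetime import datetime


def bucket_key(sensor_id, ts):
    """One bucket per sensor per hour."""
    return (sensor_id, ts.replace(minute=0, second=0, microsecond=0))


def record(buckets, sensor_id, ts, value):
    """Fold one reading into its hourly document, creating the document if needed."""
    key = bucket_key(sensor_id, ts)
    doc = buckets.setdefault(key, {
        "sensor_id": sensor_id,
        "hour": key[1],
        "count": 0,     # number of readings seen this hour
        "total": 0.0,   # running sum, so the hourly mean is total / count
        "minutes": {},  # minute of the hour -> latest value for that minute
    })
    doc["count"] += 1
    doc["total"] += value
    doc["minutes"][str(ts.minute)] = value
    return doc
```

With this layout a dashboard showing hourly averages reads one small document per hour instead of scanning every raw reading.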
What sorts of analysis algorithms make sense for sensor data?
This pretty much captures it: the Time series article on Wikipedia.
What are the next steps?
- Prototype individual components of the minimally-viable approach
- Prototype some networks that will send data
- Consider scalability: where the system can be extended to support more sensors, more analysis, and higher throughput