Monday, August 9, 2010

What’s Next for Data Analysis? Part II

In my last post I focused on the emerging trends that will drive the next generation of data analysis. I cited four substantial shifts in both the technologies and customer uses that will be amplified in the next several years. I also mentioned that these trends and technologies are surely influencing our road map and plans at Jaspersoft.

For this blog post, I’ll describe which technologies will likely fuel these changing usage patterns and some product categories that will, therefore, get a boost.

Analytic Databases
These are data stores that use sophisticated indexing, compression, columnar storage, and/or other technologies to deliver fast querying for large data sets. Increasingly, newer entrants in this category are less expensive than their enterprise data warehouse and OLTP counterparts. Although these databases natively require structured data formats, they provide a tremendous new capability to deal with large data volumes affordably and with greater processing power. When combined with a sophisticated analytic tool (such as a ROLAP engine or in-memory analysis techniques), an analytic database can deliver speed, volume, and sophisticated multi-dimensional insight – a powerful combination. For more on this product category, check out this prior post.

Distributed Data Processing via Hadoop
Large volumes of distributed data, typically generated through web activity and transactions, are the fastest-growing data type. This data is commonly unstructured, semi-structured or complex, and holds great promise for delivering keen business insight if tapped properly. With the open source project Hadoop, and some upstart open source companies working to commercialize it, that previously untapped information capital is now ready to be unlocked. By enabling massive sets of complex data to be manipulated in parallel processes, Hadoop gives businesses a powerful new tool to perform “big data” analysis, find trends, and act on data that was previously out of reach. Increasingly, big data will be a big deal and this is an important area to watch.
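To make the "parallel processes" idea concrete, here is a minimal sketch of the map, shuffle, and reduce pattern that Hadoop is built around. It is plain Python, not Hadoop's API, and the log lines, the chunking, and the function names are all made up for illustration. Each chunk stands in for a block of data that a real cluster would process on a separate machine.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical web-log lines in the form "<user> <url>", already split
# into chunks the way a distributed file system would split a large file.
LOG_CHUNKS = [
    ["alice /home", "bob /pricing", "alice /pricing"],
    ["carol /home", "bob /home"],
]

def map_chunk(lines):
    """Map step: emit a (url, 1) pair for every log line in one chunk."""
    return [(line.split()[1], 1) for line in lines]

def shuffle(mapped_chunks):
    """Shuffle step: group all emitted values by their key (the URL)."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_groups(groups):
    """Reduce step: sum the counts collected for each URL."""
    return {key: sum(values) for key, values in groups.items()}

def count_hits(chunks):
    """Run the map step on every chunk concurrently, then shuffle and reduce."""
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_chunk, chunks))
    return reduce_groups(shuffle(mapped))

if __name__ == "__main__":
    print(count_hits(LOG_CHUNKS))
```

The point of the pattern is that the map step needs no coordination between chunks, which is what lets the work spread across many machines as the data grows.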

Complex Event Processing
On their own, large data volumes already create difficult analytic challenges. When that data is being created and updated rapidly (faster than humans can perceive), a different approach to analysis is required. CEP tools monitor streaming data, looking for events that help identify otherwise imperceptible patterns. I’ve referred to this technological concept elsewhere as the converse of traditional ad hoc analysis, where the data persists and the queries are dynamic. With CEP, in a sense, the query persists and the data is dynamic. You can expect CEP-based, dynamic data analysis functionality to become more interesting and capable across a wider variety of uses each year.
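The "query persists, data is dynamic" idea can be sketched in a few lines. This is only an illustration under stated assumptions, not how any particular CEP product works: the StandingQuery class, the event names, and the thresholds are hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque

class StandingQuery:
    """A persistent query: fires when at least `threshold` matching
    events have arrived within the last `window` seconds."""

    def __init__(self, predicate, threshold, window):
        self.predicate = predicate
        self.threshold = threshold
        self.window = window
        self.hits = deque()  # timestamps of recent matching events

    def on_event(self, timestamp, event):
        """Feed one event through the query; return True if the pattern fires."""
        if not self.predicate(event):
            return False
        self.hits.append(timestamp)
        # Forget matches that have slid out of the time window.
        while timestamp - self.hits[0] > self.window:
            self.hits.popleft()
        return len(self.hits) >= self.threshold

# Hypothetical stream of (timestamp_in_seconds, event_name) pairs.
STREAM = [
    (0, "login_failed"),
    (2, "login_ok"),
    (3, "login_failed"),
    (4, "login_failed"),
    (30, "login_failed"),
]

def find_alerts(stream):
    """Return the timestamps at which three failures occurred within 10 seconds."""
    query = StandingQuery(lambda e: e == "login_failed", threshold=3, window=10)
    return [ts for ts, event in stream if query.on_event(ts, event)]

if __name__ == "__main__":
    print(find_alerts(STREAM))
```

Notice that nothing is stored except the handful of timestamps the query still cares about. The data flows past and the question stays put, which is the reverse of loading everything into a table and asking questions afterward.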

In-Memory Analysis
Simpler, integrated, multi-dimensional views of data should not be available only to those who spent two weeks in a special class (think ROLAP or MOLAP). They should exist alongside your favorite bar or line chart and tabular view of data. The analysis should also be constructed for you by the server, persist in memory as long as you need it (and no longer), and then get out of your way when finished. Interacting with it should be as straightforward as navigating a hyperlinked report and pivot table, although a variety of cross-tab types, charts, maps, gauges and widgets should be available for you to do so.

Statistical Analysis
Ever since IBM acquired SPSS, statistical modeling has been cool again (since when is IBM cool, btw?). The truth is that the natural progression when analyzing past data is to project it forward. With the need to deal with larger volumes of data and at lower latency, it stands to reason that predicting future results becomes more important. This is why I believe the R revolution is here to stay (R is the open source statistical analysis tool used by many in the academic and scientific world). I predict a growing commercial need for this open source juggernaut, and by this I mean a growing demand for tools based on R with more robust features and a commercial business model – and a few software companies are delivering.
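As a small illustration of "projecting past data forward," here is a sketch that fits a straight-line trend to a series and extends it a couple of periods. It is written in plain Python rather than R, uses ordinary least squares as a stand-in for whatever model a real analysis would choose, and the monthly sales figures are invented for the example.

```python
def fit_line(ys):
    """Ordinary least squares fit of y = intercept + slope * x,
    where x is the period index 0, 1, 2, ... (needs at least two points)."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def project(ys, periods_ahead):
    """Extend the fitted trend line `periods_ahead` periods past the last observation."""
    intercept, slope = fit_line(ys)
    last = len(ys) - 1
    return [intercept + slope * (last + k) for k in range(1, periods_ahead + 1)]

# Hypothetical monthly sales figures.
MONTHLY_SALES = [100, 110, 120, 130]

if __name__ == "__main__":
    print(project(MONTHLY_SALES, 2))
```

A straight line is the simplest possible forecast, and real data rarely behaves this well, which is exactly why richer statistical tools earn their keep.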

If you follow the Open Book on BI, you know I’m a big fan of mash-up dashboards. I expect these flexible, web-based constructs to deliver the most pervasive set of contextually relevant data, gaining broader use and enabling better decisions even without fancy predictive tools (although the output from a statistical model should be embeddable within a mashboard, maintaining its link back to the model and data source along with any relevant filters). Earlier this year, I wrote an article about making better, faster decisions through the clever use of mashboards. Making those good decisions is about understanding the past and recognizing current patterns, all while understanding the proper context. These relevant visual data elements should come together in a single, navigable view. Perfect for a mashboard.

So, this is my short list of business intelligence product categories and technologies that stand to gain substantially in the next few years. Surely I’ve not covered them all, so your comments and feedback are encouraged.

Brian Gentile
Chief Executive Officer