Friday, July 27, 2012

Big Data: Approaches, Myths & Skills


Last month, my 18-year-old daughter asked me about Big Data. That was a sure sign that the technology has reached a fever pitch in the hype cycle. Ironically, I found that as I explained this enterprise IT topic to my daughter, our conversation and the questions she asked did not vary greatly from the many conversations I've had with CEOs, journalists, financial analysts and industry colleagues. Despite how widely Big Data is being covered these days, it appears to me that Big Data is a big mystery to many.


At the risk of being labeled a cynic, I have three big worries about Big Data:

1. My biggest worry is the low percentage of Big Data projects that will succeed as we too quickly throw these new technologies at a wide variety of prospective projects in the enterprise,
2. The low success rate of Big Data projects will be amplified by the current hype and the misconceptions that follow from it, and
3. This low project success rate could persist over time because of the relative dearth of knowledgeable, data-savvy technology and business professionals ready for a world where data are plentiful and analytic skills are not.

Successful Big Data Projects

As organizations race to evaluate and pilot Big Data tools and technologies in search of answers to their Big Data opportunities, I've seen evidence that architectural steps are being skipped in favor of speed. Sometimes, speed is good. In the case of Big Data, building the right data and platform architecture is critical to actually solving the business problem, which means the right amount of thoughtful planning should occur in advance. Many missteps could be avoided simply by being clear up-front on the business problem (or opportunity) to be solved and how quickly the data must be used to enable a solution (i.e., how much latency is acceptable?).

Recently, I’ve tried to do my part to help explain successful Big Data (technical) architectures by starting with three simple, latency-driven approaches.  The specifics, including an architectural diagram, are described in my recent E-Commerce Times article, entitled “Match the Big Data Job to the Big Data Solution.” We’ve also posted additional graphics and explanation to the Big Data section of the Jaspersoft website.

Big Data Misconceptions (or Myths)
To reduce the hype, first we must overcome the misconceptions. My many conversations on the topic of Big Data yield equally many misconceptions and misunderstandings. Some examples of the most common myths: Big Data is all unstructured, Big Data means Hadoop, and Big Data is just for sentiment analysis. Of course, each of these myths is only partially true, and it takes a deeper understanding of the technologies and their potential uses to gain real clarity.


I’ve recently offered a brief article, published last month on Mashable, that seeks to dispel the “Top 5 Myths About Big Data.” The article has garnered some great comments, the most complete of which was written by IBM’s James Kobielus. James improves and amplifies several of my major points. I hope you’ll join the conversation.

Analytic Skills Shortage
Worldwide digital content will grow 48% in 2012 (according to IDC), reaching 2.7 zettabytes by the end of the year. As a result, Big Data expertise is fast becoming the “must-have” skill in every organization. At the same time, in its 2011 research report, titled “Big Data: The Next Frontier for Innovation, Competition, and Productivity,” McKinsey offered the following grim statistic:

“By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.”

Without the solid analytic skills needed to support a growing array of Big Data projects, the risk potential grows rapidly.  Anyone in or near data science should take the coming skills shortage as a call-to-arms.  Every college and university should be building data analytics coursework into compulsory classes across a wide variety of disciplines and subject areas. Because of its importance, I’ll save this Big Data skills topic as the thesis for a future post.

Despite these primary worries, I remain hopeful (even energized) about the enormous Big Data opportunity ahead of us. My hope is that, armed with good information and good technology, more Big Data customers and projects will succeed, and succeed more quickly.

Brian Gentile
Chief Executive Officer
Jaspersoft Corporation

Wednesday, March 21, 2012

Cloud BI Progress & Pitfalls

In my on-going effort to uncover and discuss key BI industry trends, I recently authored a new article for my TDWI column (called “The BI Revolution”), under the same headline as this post. In that article, I focused on the big market that will emerge for BI in the cloud. Even more importantly, I shed light on the definitional and technological pitfalls that are confusing this market as it seeks to deliver more efficient cloud-based business intelligence.

Rather than address my main points here, I encourage you to read my post at the TDWI website and then add your comments and thoughts here.

Cloud BI = BI for SaaS + BI for PaaS
I note that the cloud as a transformational infrastructure will drive big use of BI for SaaS (on-demand analytical applications) and BI for PaaS (application development and deployment in the cloud). I am less bullish on SaaS BI (on-demand, general-purpose BI in the cloud) because I believe growth will continue to be fueled by BI embedded in data-driven applications, rather than delivered in any standalone use.

We’re constantly tuning the Jaspersoft website on this topic, building out content that seeks to explain and amplify the technological and business benefits of BI in the Cloud. One important point left out of my TDWI post is Jaspersoft’s focus on and success in BI for PaaS (platform-as-a-service).

Recently, Jaspersoft has been very active in BI for PaaS. We are working with all the major PaaS providers to ensure our BI platform is available within these new cloud-based development and deployment environments. Just last month, Jaspersoft announced an important partnership with Red Hat, making our BI server available immediately in the OpenShift (public cloud) and CloudForms (private cloud) environments. Then, Jaspersoft produced a blog post and video to highlight its support of VMware’s Cloud Foundry PaaS environment, with a more formal announcement pending. Overall, our head of Product & Alliances summed it up best:

“Jaspersoft’s intention is to be the de facto standard in BI for PaaS, enabling the broadest community of software developers to use our tools in their favorite cloud environment,” said Karl Van den Bergh, Vice President of Product & Alliances at Jaspersoft. “We are uniquely positioned to capitalize on this shift of application development to the cloud with our modern architecture, the world’s largest BI community building data-driven applications, and our open source model.”

Through my recent TDWI article and this post, my goal is to clarify the cloudy definitions around Cloud BI, call out the important pitfalls already witnessed, and point to the progress that gives us reason for optimism about what will be a bright Cloud BI future.

Brian Gentile
Chief Executive Officer
Jaspersoft

Thursday, March 1, 2012

Got Big Data?

If competing based on time and information really will drive the next major economic era, then Big Data is real and represents a huge opportunity. If you’re a business analyst or technologist responsible for mapping data to decisions, then the variety, velocity, and volume of data available to you today have never been richer. And your responsibility has never been greater.

I’ve previously discussed the different classes of data source technologies that can legitimately be used to harness (or tame) Big Data. Hadoop is one of those technologies, and the most popular software framework associated with this rising trend. Others include NoSQL databases, MPP data stores and even ETL/data integration approaches (for moving Big Data by the batch into some more usable format). Each of these technologies aligns with an appropriate use case, which helps make sense of the variety of products emerging in this world of Big Data.

For simplicity, I like to talk about three popular approaches to connecting to and making use of Big Data for business intelligence reporting and analysis.

Interactive Exploration – the most dynamic because it involves native connectivity directly from the BI tool to the Big Data source and can offer results in near-real-time. Hadoop HBase, Hadoop HDFS, and MongoDB are just three of the most popular data sources to which direct connection would be an advantage.

Direct Batch Reporting – an important and mainstream approach (especially in this early market of Big Data) that relies on tried-and-true SQL access to Big Data. Hadoop Hive is the best known example, but Cassandra offers CQL access that delivers similar results and functionality.

Batch ETL – using extract, transform and load techniques to create a more usable subset of the Big Data is also popular, especially when the insight being sought is less urgent, probably on the order of hours or days after data capture. Nearly every ETL tool has now been improved to connect to and transform Big Data. Some even integrate nicely with underlying Hadoop technologies (like Pig), making the data steward’s life potentially simpler.
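To make the third approach concrete, here is a minimal sketch of a batch-ETL step in Python: raw, semi-structured records are reduced to just the fields a report needs. The field names and records are invented for illustration; a real pipeline would use a proper ETL tool with a Big Data connector.

```python
import json

def extract_subset(raw_lines, fields):
    """Keep only the fields needed for reporting, skipping malformed records."""
    rows = []
    for line in raw_lines:
        try:
            record = json.loads(line)
        except ValueError:
            continue  # malformed input is dropped, not loaded
        rows.append({f: record.get(f) for f in fields})
    return rows

raw = [
    '{"user": "ada", "action": "login", "ts": 1330560000, "payload": "..."}',
    'not json at all',
    '{"user": "bob", "action": "purchase", "ts": 1330560042, "payload": "..."}',
]
subset = extract_subset(raw, ["user", "action"])
# → [{'user': 'ada', 'action': 'login'}, {'user': 'bob', 'action': 'purchase'}]
```

The point of the transform is exactly this reduction: a smaller, regularly shaped subset that a BI tool can report against hours or days after capture.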

Sometime last year, it occurred to me that Jaspersoft is in a unique position with regard to Big Data. Because of Jaspersoft’s data-agnostic architecture, we’ve quickly offered a broad variety of native Big Data connectors, many of which have been available for more than one year (for free download) . . . and because of our large, growing community of developers (we have more than 260,000 registered community members, growing at about 6,000/month at the time of this writing), we have important data about Big Data. This realization led us to the Big Data Index.

Big Data Index

We’ve tracked the downloads of our Big Data connectors over the last year, charting the ups and downs with each, corresponding to the relative rise and fall of their popularity. Over this time, we’ve seen more than 15,000 downloads, so our view is pretty good. Here’s a static version of the latest data for the four most popular Big Data connector downloads:



During the course of the past year, the Hadoop technologies (HBase & Hive combined) proved the most popular. The fastest growing and the leader at the moment is MongoDB (from 10gen). Cassandra holds a solid and consistent fourth position (which should benefit DataStax, the commercial company behind Cassandra). Many other Big Data connectors are tracked as well, with a dynamic chart updated monthly.

As interest in Big Data grows, so will the potential uses for these technologies that are designed to map this data to decisions and insights. At the moment, I’m just content knowing I have a front-row seat via the Big Data Index.

We’re at the very beginning of this era, which will surely rely on more data than we could have fathomed just ten years ago. This is why your thoughts and comments on this topic are appreciated.

Brian Gentile
Chief Executive Officer
Jaspersoft

Thursday, January 26, 2012

The New Factors of Production and The Rise of Data-Driven Applications

For the last ten years, I’ve been partially obsessed with the notion that the formula for creating economic value needs to be updated. I’ve worked in the technology industry for 26 years and I’ve seen information systems radically change the landscape of competition and value creation. My most recent article on this topic appears in Forbes under the same title as this post.

Because this article represents just a fraction of my thoughts on this matter, I’d like to revisit the basic premise, which is captured in the excerpt below, and then describe how some of my current experiences at Jaspersoft corroborate this newly posited IT-driven economic theory.

“Classical economic theory describes three primary factors, or inputs, to the production of any good or service: land, labor, and capital. These factors facilitate production, but do not become part of the end product (as a raw material would). While these three factors have been much discussed and extended at different points in economic evolution, I believe that they, in any of the advanced economies of the world today, are vastly antiquated.

Sometime even prior to this new millennium, the primary factors of production assuredly became: Time, Information and Capital. I submit that the primary relevance of land and labor has diminished, not completely but measurably, from their prominence during agrarian and industrial economic times. In a sense, owning land and employing lots of people no longer highly correlate to a valuable and successful enterprise, although in certain industries these two factors will remain prominent (think mining and energy production, for example). By and large, land and labor have yielded to two more important factors – time and information.”

I was very pleased when Silicon Angle asked to speak with me about my background and the thoughts that led to this newly posited IT-driven economic theory as well as the contributions Jaspersoft is making to this new economic landscape. I discussed how Jaspersoft’s mission is precisely to help its customers compete on the basis of time and information.

“From its very start, Jaspersoft was determined to build and advance the industry’s most modern, flexible, and scalable Business Intelligence (BI) software. To do this, we consciously chose the open source model of development and distribution, believing that the power and principles of community involvement and broad usage would prove continually more valuable (and it has). We knew time would be important to our business model to rapidly compete in a crowded software category.”

Jaspersoft focuses on delivering its modern BI software to those who are best suited to create value from it. We call these individuals “BI Builders” because they possess a powerful confluence of knowledge about data, analytics, and business (process, function, industry, etc.) that truly yields new value from insight. The result is thousands of commercially available software applications that include Jaspersoft technology. These software applications power the world and deliver faster, more effective insight into data. Jaspersoft’s open source model affords these applications very high quality reporting and analytic capabilities at a very low cost, so our customers create new economic value, arguably, where it could not have been created in the past.

In many ways, the BI Builder is the real hero in the equation that determines how companies can compete more effectively based on time and information. Jaspersoft simply becomes their partner and enabler.

Here’s a chance to continue this dialog at the intersection of economic theory and information technology. I offer an open invitation for comments. Your thoughts are appreciated.

Brian Gentile
Chief Executive Officer
Jaspersoft

Tuesday, November 8, 2011

Making Sense of it All

I’ve been writing about how important it is to build and deliver Big Data projects that can succeed, because the opportunity to do so has never been better and the business reasons to do so have never been more compelling. It seems like each week, more tools and products become available to make big, complex data types useful for a variety of business purposes.


But, what about the unforgiving worlds of natural language and semi-structured data sources? Is there any hope to generate insight from them, even in this new big data world?


It’s one thing to make sense of more traditionally structured big data sources; it’s quite another to parse natural language and complex, industry-specific data types. To quickly understand the difficulties of these data environments, I recommend Brett Sheppard’s excellent blog post on this topic.


Informatica’s HParser to the Rescue

Enter Informatica’s HParser, announced last week. Now, accessing and then making sense of practically any data type has become far simpler. You can learn more about this important new Informatica product here. HParser is a parsing technology that can run inside a MapReduce job, allowing users to structure unstructured or semi-structured data in Hadoop and ready it for analysis. This removes much of the complexity of the custom scripts that developers must otherwise write today. HParser is available in both a community and a commercial edition and features a visual development environment that, when combined with its myriad out-of-the-box parsers for semi-structured industry-standard data, can eliminate up to 80% of the time it takes to turn this data into insight.
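To illustrate what “structuring the unstructured” means in practice, here is a hand-rolled Python sketch that turns a semi-structured log line into a structured record. To be clear, this is not HParser’s API, and the log format is invented; a tool like HParser replaces exactly this kind of brittle custom script with visual design and pre-built parsers.

```python
import re

# Toy pattern for a semi-structured access-log line. Both the pattern and the
# sample line are invented for illustration only.
LINE = re.compile(
    r'(?P<ip>\S+) \[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d{3})'
)

def parse(line):
    """Return a structured record (dict) for a matching line, else None."""
    m = LINE.match(line)
    return m.groupdict() if m else None

row = parse('10.0.0.1 [08/Nov/2011:10:15:32] "GET /index.html" 200')
# → {'ip': '10.0.0.1', 'date': '08/Nov/2011:10:15:32',
#    'request': 'GET /index.html', 'status': '200'}
```

Every new data format means a new pattern and new edge cases, which is why eliminating this scripting work is where a dedicated parsing product earns its keep.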


Integration with Jaspersoft

I’m thrilled that Jaspersoft has collaborated with Informatica to deliver rich reporting and analysis of natural language and semi-structured data, working directly with Informatica’s new HParser. Through integration with Jaspersoft’s BI server, creating any variety of reports and analyses is drag-and-drop easy. You can learn more about our work together through this brief video.


In short, we’ve worked with Informatica to ensure the Jaspersoft BI platform can provide analytic access to Hadoop for anyone who needs to access and understand data – whether it’s an executive who wants a summarized dashboard or a manager who needs a detailed operational report. And our BI platform can handle both batch processing (through Hive) and direct, ad hoc, near real-time access to this data, which we uniquely provide through direct HBase access. That should satisfy even the most analytic end user.
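The batch half of that pattern is plain SQL-style reporting. As a minimal sketch of the idea, the example below uses Python’s built-in sqlite3 as a stand-in for a SQL-over-Hadoop interface like Hive’s; the table, column names, and figures are invented for illustration.

```python
import sqlite3

# sqlite3 stands in for a Hive connection here; the reporting pattern is the
# same: load (or point at) the data, then run an aggregate query by the batch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (region TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("east", 100.0), ("west", 250.0), ("east", 50.0)],
)

report = conn.execute(
    "SELECT region, SUM(revenue) FROM events GROUP BY region ORDER BY region"
).fetchall()
# → [('east', 150.0), ('west', 250.0)]
```

The trade-off, as noted above, is latency: a batch query like this answers yesterday’s question well, while direct HBase access serves the near real-time case.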


Now there’s no reason not to consider any big data source. Toward the goal of genuinely harnessing the opportunity all this new (big) data represents, it’s good to see Informatica and Jaspersoft help lead the way. Your comments are appreciated.


Brian Gentile

Chief Executive Officer

Jaspersoft

Tuesday, October 25, 2011

Too Big (Data) to Fail

Will we look back at 2011 and think of it as “the year of Big Data”? This does feel like the year when organizations can genuinely take advantage of the opportunity presented by big data – harnessing its volume, variety and velocity – in both concept and implementation.


The venture capital and investment community has been betting with its wallet. During the first three quarters of 2011, several high-profile acquisitions occurred and at least one dozen new, early stage investments were made. The key theme is big data analytics and the goal is big insights that drive new business decisions (presumably, decisions that couldn’t have been made without leveraging that big data).


The problem is that, currently, big data analytics is fraught with far too much data and far too little analytics. Should this continue without a more intelligent way to connect to and actually use all this data, the result will often be project failure.


The next generation of big data connectors must be more intelligent, providing views into these vast swaths of data, so the opportunity for big insights can be more commonly realized. Jaspersoft’s recent work and announcement this week with IBM and its InfoSphere BigInsights product take a major step in this direction.


IBM InfoSphere BigInsights

Building on the Apache Hadoop open source framework, IBM InfoSphere BigInsights adds administrative, workflow, provisioning, and security features, along with best-in-class analytical capabilities. The IBM software package comes in a Basic Edition (freely downloadable) and an Enterprise Edition. The Basic Edition includes the complete Apache Hadoop install, a web-based management console, and pre-built integration with IBM InfoSphere Warehouse, IBM Smart Analytics System, and DB2. The Enterprise Edition goes on to include text analytics capabilities with a rules engine, a spreadsheet-like browser-based tool (called BigSheets) for data exploration and job creation, a metric-driven scheduler, large scale indexing, a JDBC connector, LDAP support, and a query language that enables analysis of structured and non-traditional data types (called Jaql).


Finding insight from within all the data can be challenging. The BigInsights toolset is made far more useful with a modern, powerful BI server out in front of it. So, IBM’s partnership with Jaspersoft provides this critical component of a complete Big Data analytics solution.


2nd Generation “Intelligent” Connectors

Connecting to a Hadoop-class data source is only useful if done intelligently. Running a query that returns millions of rows (and columns) of data probably won’t answer the business question being posed. Intelligently interrogating the data structure during the query is necessary. To accomplish this, Jaspersoft has delivered a 2nd generation connector for the IBM InfoSphere BigInsights platform. This connector builds incrementally on providing data access via Hive and it builds exponentially on allowing direct and intelligent access to HBase. The Jaspersoft connector supports filters, delivers greater performance and usability, and enables yet unseen flexibility for interacting with Big Data.

1. Filters: Because HBase has no native query language, there's no automatic filtering capability. But there are filtering APIs. The new Jaspersoft connector not only supports simple filters (e.g., StartRow and EndRow) but also a wide array of complex filters (like RowFilter, FamilyFilter, ValueFilter, SkipFilter, and so on). In fact, the universe of supported Apache HBase filters is listed here.

2. Performance & Usability: In addition to the systems monitoring and management niceties provided by IBM, a Jaspersoft HBase query can specify exactly the ColumnFamilies and/or Qualifiers that are to be returned. This is particularly helpful for query performance tuning and usability, in that some HBase users have very wide tables, so accessing just the necessary fields offers a much faster and more usable solution.

3. Flexibility: To unpack data from HBase and make sense of it within a reporting tool, Jaspersoft’s connector supports a deserialization engine framework. The connector automatically understands HBase's shell and Java default serializations. Then, a customer can plug in existing or customized Java deserializers so the connector will automatically convert from HBase's raw bytes into meaningful data types. This delivers flexible support for the widest array of data within Hadoop’s HBase environment.
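The deserialization idea in point 3 can be sketched in a few lines. This is a Python illustration of a concept the connector implements in Java, and the column names and registry design are invented; the one grounded detail is that the 8-byte big-endian encoding shown matches how HBase's Bytes utility serializes a Java long.

```python
import struct

# A pluggable deserializer registry: HBase hands back raw bytes, so each
# column needs a function that converts bytes into a meaningful typed value.
DESERIALIZERS = {
    "metrics:count": lambda b: struct.unpack(">q", b)[0],  # 8-byte big-endian long
    "info:name": lambda b: b.decode("utf-8"),              # UTF-8 string
}

def deserialize_row(raw_row):
    """Convert a dict of {column: raw_bytes} into typed values.
    Unknown columns fall back to passing the raw bytes through unchanged."""
    return {col: DESERIALIZERS.get(col, bytes)(val) for col, val in raw_row.items()}

raw = {
    "metrics:count": struct.pack(">q", 42),   # as HBase would store a long
    "info:name": "widget".encode("utf-8"),
}
row = deserialize_row(raw)
# → {'metrics:count': 42, 'info:name': 'widget'}
```

Plugging a custom function into such a registry is what lets one connector handle whatever application-specific encodings a customer has stored in HBase.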


We’ve truly come a long way from the earliest days of Apache Hadoop, moving beyond the technical elite, on to the IT team (thanks to IBM) and now on to the business user (thanks to Jaspersoft). The result of Jaspersoft’s integration with IBM InfoSphere BigInsights is a complete Big Data solution, including the ability to manage and process large volumes of data and the ability to extract key information using flexible and easy-to-use reporting, dashboard and analytic views in one integrated solution. There’s plenty more to learn about Jaspersoft’s integration with IBM InfoSphere BigInsights.


The fastest path toward uncovering real analytic insight from Hadoop comes through a combination of proven, best-in-class software. Just in time, because the untapped potential for bold new insight from within the growing volumes of data is too big to fail.


Brian Gentile

Chief Executive Officer

Jaspersoft