A Small Step for Man, A Big Data Leap for Mankind

Interview with Jordan Tigani, founding member of the BigQuery team and author of the first book on Google BigQuery Analytics

How do you transform a Big Idea into a Big Product?

“You know, big ideas start off just as small ideas do. It’s the way they grow into projects that change the way we think and do things that makes them big ideas. Back in 2011, Brian - the site director in the Seattle office - came to the engineering team: ‘Guys, I’ve got this idea about Big Data…’”

Jordan goes even further back in time, to the roots of the conceptual work:

“At Google, "Big Data" is just called "data", because we've got these amazing tools, like Dremel, that can work with data of any size. In 2010, there was a realization that these tools could be useful to people outside of Google who had more data than they could handle easily.

“We got a group of engineers together and brainstormed about how we could make these internal tools useful for external users.”

So, what was the path from idea to version 1.0?

“I must admit - it came out pretty different from what we were asked to build in the first place (laughs), but our project manager confirmed to us that there was a market emerging for this product. And that’s it, this is the story of how BigQuery came to life.” He smiles subtly as he briefly recalls the first days of the project.

The Small Story of Big Data at Google

Jordan Tigani is a founding member of the BigQuery team and, for the last 5 years, has been one of the main designers and implementers of BigQuery, Google’s breakthrough Big Data technology. He has lived through and taken part in historic moments in the evolution of the Big Data field.

The story of Big Data at Google starts before 2004 with the increased use of the MapReduce programming model for processing and generating large data sets and continues with the Dremel paper in 2010. Jordan recalls that there was an increasing internal demand for the Dremel technology in Google offices:

“Virtually every Google locale has a Dremel cell that processes petabytes of sales, advertising, and technical data every day. Dashboards enable people to visualize at a glance what is happening with their products. Google has always been a data-driven company. This is why at meetings at Google you'll often hear, ‘That argument sounds reasonable, but can you back it up with data?’”

Then, in 2010, a prototype of BigQuery was unveiled at Google I/O. After two more years of brainstorming, building and testing, the engineering team revealed the publicly available version of BigQuery at GigaOm’s Structure:Data 2012 conference. The technology lets companies that work on large datasets and run major business analytics and intelligence projects tap into the immense compute power of Google. BIME was the first BI application to partner with Google at the time to test and implement BigQuery for its customers, creating and delivering business dashboards based on more than 1 billion rows of data.

The Neverending Book

In 2013, Jordan and his colleague Siddartha Naidu, also a founding member of the BigQuery team, were approached by Wiley about writing a book on Big Data. The publishing house was searching for a relevant topic in the increasingly large field of Big Data that would explain both the core science behind it and its medium- and long-term potential. For Jordan and Siddartha, the answer came quickly: analytics done with BigQuery. Nine months later, chapter after chapter, the first book on Google BigQuery Analytics was published. Jordan’s neighborhood bookstore, Queen Anne Book Company, which doesn’t usually stock technical books, has been a big supporter, selling autographed copies and even planning a launch party with the authors.

Google BigQuery Analytics is a guide for business and data analysts who want the latest tips on running complex queries and writing code to communicate with the BigQuery API. In addition to the mechanics of BigQuery, the book also covers the architecture of the underlying Dremel query engine, providing a thorough understanding that leads to better query results.

Jordan is adapting quickly to the multiple roles he now has to play: tech lead of the BigQuery storage and data team, evangelist of the technology within major online communities such as Stack Overflow, speaker at the most recent Google I/O conference and, now, acclaimed author.

Being an active participant in the BigQuery community on Stack Overflow has helped Jordan understand not only how to explain the technology to the various types of users who are learning to fix or enhance their queries, but also how to develop the product itself and, eventually, how to structure the book.

No two days pass without Jordan answering a question in the community: “As you are trying to fix the problems of the users, when running complex queries such as the ones for financial modelling tasks, you immerse yourself in their cases and you understand more about the technology you created. Each case is proof that we can still make improvements, that we can make things even easier, that we can write better documentation and clearer error and guidance messages.”

The community has been an important feedback mechanism and an inspiration for Jordan. He explains: “There were certain features that people were taking advantage of less than expected, so we made sure to outline those in the book to truly enhance the usage of BigQuery.”

And even though we are talking about a business tool, people do not remain mere users of the technology: “We have our BigQuery fans. For example, there is this one guy - Javier Ramirez - who uses BigQuery all the time, does presentations on BigQuery, blogs about it. And he’s not a Googler.”

How Can You Read a Terabyte in a Second?

Back in March, during its Cloud Platform event, Google made a series of important announcements about its cloud services, and the press immediately highlighted the price cuts, which ranged across Compute Engine, App Engine and BigQuery, with BigQuery receiving the largest cut at 85%.

But what Jordan recalls is the major technological leap that BigQuery made. Until then, the service had a limited streaming feature that could ingest up to 1,000 rows per second, per table. In March, this jumped to 100,000 rows of real-time data per second, a hundredfold increase.
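To put those two figures in perspective, here is a rough back-of-the-envelope sketch in Python. It only does arithmetic on the two rates quoted above and does not call BigQuery at all. The one-billion-row workload and the function name are my own illustrative assumptions, and the sketch assumes the limit can be sustained continuously on a single table.

```python
# Illustrative arithmetic only; the two rates are the figures quoted in the article.
OLD_ROWS_PER_SECOND = 1_000      # streaming limit before March
NEW_ROWS_PER_SECOND = 100_000    # streaming limit after March


def seconds_to_ingest(row_count: int, rows_per_second: int) -> float:
    """Return the minimum time needed to stream row_count rows at a fixed rate."""
    if rows_per_second <= 0:
        raise ValueError("rows_per_second must be positive")
    return row_count / rows_per_second


# Hypothetical workload (an assumption): one billion rows streamed into one table.
ROWS = 1_000_000_000

old_seconds = seconds_to_ingest(ROWS, OLD_ROWS_PER_SECOND)
new_seconds = seconds_to_ingest(ROWS, NEW_ROWS_PER_SECOND)
speedup = NEW_ROWS_PER_SECOND // OLD_ROWS_PER_SECOND

print(f"speed-up: {speedup}x")
print(f"old limit: {old_seconds / 86_400:.1f} days")
print(f"new limit: {new_seconds / 3_600:.1f} hours")
```

Under these assumptions, a stream that would have needed more than eleven days at the old limit fits in under three hours at the new one.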

I challenged Jordan to think even further and imagine analytics in the near future: “You know, we still have to figure out the I/O dilemma each time. And it is in the scalability of ingesting data streams that the massive Google clusters, and the living organism that the Google data centers constitute, will prove their essential role.”

Behind Competitor Lines

In this context, live, fast and scalable are, in his view, the words that will dominate the market in the coming years and define the differences from emerging competitors:

“When we started writing the book, there was nothing else like BigQuery. Impala was a rumor, Redshift was just coming out, but all these competing technologies keep us on our toes. If you have no competitors, it may well mean that you are not working on the right thing. The innovation of BigQuery was and still is the capacity to scale much better to arbitrary data and query sizes. But I believe that companies and developers who are building on top of Google's current Big Data technologies will continue to drive the trends in this field.”

In the book, the authors even introduce the idea of Analytics as a Service (AaaS) and, since BigQuery uses the SQL language, Jordan thinks that both IT and business users will increasingly be able to handle the technology. “I've been excited to see people doing cool stuff in SQL without being engineers. Furthermore, cloud business analytics apps such as BIME cut the learning curve significantly. They create a drag-and-drop environment where users can easily build dashboards and data visualizations without learning SQL.”

“BIME was one of my favorite tools for visualizing Big Data sets. I wish I had more time in the publication schedule to explore its advanced features. I was pleasantly surprised how natural the integration felt with BigQuery and how it could get good performance without any client-side software. It really shows the power of cloud-based visualization software.”
- Jordan Tigani, founding member of the Google BigQuery team

Query All Things

When it was launched in 2012, BigQuery was promoted as a way to process the huge amounts of data coming from global advertising and marketing streams, but the technology can now be used for a potentially bigger challenge: IoT data streams, the flows of data recorded from mobile devices, tracking devices, sensors and utilitarian robots. In the book, the authors describe mobile and sensor applications that were built on BigQuery.

But what is truly the next benchmark to be surpassed? “With BigQuery, we can get to the level of petabyte queries today. The Open Datasets movement is exciting and we'd love to see more of them publicly available in BigQuery. The recently launched GDELT large dataset is an example, but there are lots of other possibilities. In parallel with these developments, I'd love simpler cloud-based tools to design dashboards and visualize these types of datasets. And, as the datasets demand more processing power, the technology will be able to grow to handle them.”

A Small Step for Man, A Big Data Leap for Mankind

With the technologies changing constantly, how often do you think you will have to update the book? Jordan replies immediately: “We try hard to make sure all of our changes are backwards compatible, so the code in the book will continue to work for the foreseeable future. Everything that has been added to BigQuery so far has been included as a pure addition so that existing code and recipes still continue to work as new features are added.”

After two giant leaps in technology (the MapReduce model and the Dremel-based development of BigQuery) in only 10 years, one would think that we should wait for another comet to pass Earth before a new breakthrough in the Big Data field. Jordan thinks otherwise: “There is a lot of research going on right now within Google. Even the Dremel technology may work very differently than how I described it in the book pretty soon. And... on top of all this, I believe that Google Cloud Dataflow, the technology meant to bring together batch and stream processing, will prove to be another major breakthrough in the Big Data field.”

For more information about Big Data trends, you can contact me through Twitter - @bimeanalytics and @tiberiu_iacomi.