9:10 am to 10:10 am
SESSION CHAIR: Cristina Videira Lopes
Lessons from the Jungle of Open Source Big Data Development
Co-Founder and Technical Fellow
Keynote Abstract

The most exciting part of working as a software engineer is that the field evolves so rapidly that we are always learning new skills. Because the field is so dynamic, researchers are well positioned to help practitioners with their problems. My talk will use stories from my 11 years of working in Apache’s big data projects (mostly Hadoop, Hive, and ORC) to show what kinds of tools would have made a big impact for us. For example, when we originally developed the ORC file format as part of Hive, it was only lightly integrated. However, over the next few years, it became deeply entangled with the rest of Hive. We decided to factor it out into a separate project, but it took a lot of effort because the tool support was not very good. Another example is that while open source is great for allowing small teams to accomplish a lot efficiently, there are also dangers involved if the project owners have a significantly different development model. In particular, Google has released some great open source software like ProtoBuf and Guava. However, Google’s release engineering process rebuilds the entire world every night. That means that they have very different requirements for compatibility, which creates a lot of pain for the rest of us. My talk will include these examples and others to encourage collaboration between the open source community and researchers.

About the Keynote

Owen O'Malley is a co-founder and technical fellow at Hortonworks, a rapidly growing company (25 to 750 employees in 4 years), which develops the completely open source Hortonworks Data Platform (HDP). HDP includes Hadoop and the large ecosystem of big data tools that enterprises need for their data analytics. Owen has been working on Hadoop since the beginning of 2006 at Yahoo, was the first committer added to the project, and used Hadoop to set the Gray sort benchmark in 2008 and 2009. In the last 8 years, he has been the architect of MapReduce, Security, and now Hive. Recently he has been driving the development of the ORC file format and adding ACID transactions to Hive. Before working on Hadoop, he worked on Yahoo Search's WebMap project, which was the original motivation for Yahoo to work on Hadoop. Prior to Yahoo, he wandered between testing (UCI), static analysis (Reasoning), configuration management (Sun), and software model checking (NASA). He received his PhD in Software Engineering from the University of California, Irvine.