The Airflow workflow system is Airbnb’s answer to monitoring the progress of jobs and ensuring that batches run properly in Hadoop.
Airbnb has become a big user of Hadoop — so much so that it found the few workflow tools available for it were inadequate for its needs. So it produced its own, which it’s dubbed Airflow, and announced that it is now available as open source code.
Airbnb is a thriving online business that has grown up using Hadoop, so it’s significant that it developed its own management software for the open source platform. Founded in 2008, Airbnb quickly went from a US phenomenon to a global one. It now operates in 191 countries and has just started offering rooms in Cuba, Airbnb data scientist Lisa Qian said Tuesday, the opening day of the Hadoop Summit in San Jose.
If there are few places that Airbnb doesn’t reach, then there’s also little data about travel and room rentals in which it is not interested. A year ago, Airbnb switched its orientation from encouraging homeowners to list their spare rooms for rent to encouraging people to fulfill their travel wishes. Its remade website emphasizes weekend getaways and exotic travel destinations rather than guidance to homeowners on how to rent a room. It currently has 1.2 million rooms available in its listings.
Airbnb has served 25 million travelers as a room reservation and rental service since the company was founded. Much of the unstructured information on rooms, room owners, room locations, and Airbnb customers is sorted and analyzed on Hadoop. The company runs a Hive data warehouse holding 1.5 PB of data on top of Hadoop. It is also attempting to put analysis tools into the hands of marketing and other employees, increasing the pressure to get a high number of Hadoop jobs done on a daily basis, Airbnb officials previously told InformationWeek.
[Want to learn about another tool that Airbnb created for Hadoop? See Airbnb Boosts Presto SQL Query Engine For Hadoop.]
Airflow is designed for the batch processing side of Hadoop, where many jobs are waiting to be executed. Airflow ensures that they are assigned the right resources, run in the correct order, and, when completed, don’t get inadvertently repeated. Airflow also monitors the progress of jobs as Hadoop is pressed into service to provide results for a large number of business processes.
Maxime Beauchemin, data engineer and veteran of big data jobs at Yahoo, Ubisoft, and Facebook, spoke at the Hadoop Summit Wednesday about the need for a Hadoop workflow system.
At the end of his talk, titled “Airflow: An Open Source Platform to Author and Monitor Data Pipelines,” one questioner asked him with some impatience why Airbnb didn’t simply extend one of two existing tools, Apache Oozie, created at Yahoo, or Azkaban, created at LinkedIn. (The questioner appeared to be a user of one or both.)
Beauchemin responded, “We looked into the code bases of both and decided either one was a bad choice for us.” Earlier in his talk, he had explained, “Airbnb was outgrowing its workflow.” The firm needs to process 5,000 to 6,000 Hadoop tasks a day. It was having trouble keeping them in the right order and coordinating the results. It also needed to constantly add new data pipelines, which produce additional data sets that must be added into the system. “We knew we needed something much better,” he said.
Any Hadoop workflow engine attempts to bring order to the somewhat chaotic process of scheduling Hadoop “jobs,” as Azkaban calls them; “actions,” as Oozie calls them; or “tasks,” as Airflow calls them. Some tasks can’t be processed until the results of others are available. It’s the workflow engine’s job to keep the dependencies straight and determine which tasks must run before others.
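To make the dependency idea concrete, here is a minimal Python sketch of how an engine might derive a valid run order from a set of dependencies. It uses the standard library's `graphlib` module (Python 3.9 or later) and does not reflect Airflow's actual scheduler; the task names are invented for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical task names. Each key maps to the set of tasks
# whose results it needs before it can start.
dependencies = {
    "load_bookings": set(),
    "load_listings": set(),
    "join_bookings_listings": {"load_bookings", "load_listings"},
    "daily_report": {"join_bookings_listings"},
}


def execution_order(deps):
    """Return one valid order in which every task follows its dependencies."""
    # static_order() raises graphlib.CycleError if the dependencies form a loop.
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

The two load tasks may come out in either order, since neither depends on the other, but the join always follows both and the report always comes last.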
Airflow goes beyond mere scheduling, however. It can show how many jobs are running, what resources they’re using, how many jobs have been completed, and where jobs with errors may have stalled a multi-job workflow. It can also track who the job owner is and produce charts showing when jobs start and finish.
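The kind of status reporting described above can be illustrated with a small Python sketch. The `TaskRun` record, its fields, the state names, and the sample data are all assumptions made for this example; they are not Airflow's data model.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class TaskRun:
    """Hypothetical record of one task's latest run (not an Airflow class)."""
    name: str
    owner: str
    state: str            # assumed states: "running", "success", "failed", "waiting"
    upstream: tuple = ()  # names of tasks this one depends on


# Invented sample data.
runs = [
    TaskRun("load_bookings", "alice", "success"),
    TaskRun("load_listings", "bob", "failed"),
    TaskRun("join_tables", "alice", "waiting",
            upstream=("load_bookings", "load_listings")),
    TaskRun("daily_report", "carol", "waiting", upstream=("join_tables",)),
]


def summarize(task_runs):
    """Count how many tasks are in each state."""
    return Counter(run.state for run in task_runs)


def blocked_by_failures(task_runs):
    """List tasks that have at least one directly upstream task in the failed state."""
    failed = {run.name for run in task_runs if run.state == "failed"}
    return [run.name for run in task_runs if failed.intersection(run.upstream)]


summary = summarize(runs)
blocked = blocked_by_failures(runs)
print(dict(summary))
print(blocked)
```

This sketch checks only direct dependencies; a fuller version would follow the chain downstream to find every task stalled by a single failure.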
Airflow was designed to be a programmable workflow system. It is built in Python, “the language of data,” Beauchemin said. At Airbnb, it is hosted on six nodes on Amazon Web Services. The nodes are c3.8xlarge instances, among the largest virtual servers Amazon offers, to ensure plenty of headroom for Airbnb workflow operations.
“It’s a tool built by data engineers for data people … focused on authoring (new) data pipelines,” he said. Instead of a drag-and-drop user interface, Airbnb needed a Python language interface so its users could define new classes of data, dictate how to manage them, and write “for loops,” Python statements that repeat a block of code once for each item in a sequence.
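As a rough illustration of why a code interface matters, the following Python sketch uses a for loop to generate the same pair of tasks for several tables. The `Pipeline` class and all task and table names are hypothetical stand-ins written for this example, not Airflow's actual API.

```python
class Pipeline:
    """Hypothetical stand-in for a workflow definition; not Airflow's API."""

    def __init__(self, name):
        self.name = name
        self.tasks = {}  # task name -> list of upstream task names

    def add_task(self, name, upstream=()):
        """Register a task and the tasks it must wait for; return its name."""
        self.tasks[name] = list(upstream)
        return name


pipeline = Pipeline("daily_metrics")
tables = ["bookings", "listings", "reviews"]  # invented table names

load_tasks = []
for table in tables:
    # The same extract-then-load pattern is stamped out once per table.
    extract = pipeline.add_task(f"extract_{table}")
    load = pipeline.add_task(f"load_{table}", upstream=[extract])
    load_tasks.append(load)

# One final task waits for every load to finish.
pipeline.add_task("build_report", upstream=load_tasks)

for task, upstream in pipeline.tasks.items():
    print(task, "<-", upstream)
```

Adding a fourth table means adding one entry to the list, where a drag-and-drop tool would require drawing two more boxes and their connections by hand.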
Beauchemin said Airbnb has tried to make Airflow easy enough to install in less than a day, “if you’re familiar with Python.” He claimed that a data engineer who started to implement Oozie on Wednesday might still be working on it at the end of the week.
Five companies already use Airflow in production, Beauchemin said, though he did not name them. Airbnb wants to see a successful open source community form around Airflow so that a wider base of developers will continue its advancement. “We’re super excited to share Airflow. We’re part of the sharing economy,” he said.