Gaia in the UK

Taking the Galactic Census

The Data Processing Centre (DPCI)

Hadoop cluster at DPCI

The DPCI Hadoop cluster: seven racks containing the 108 Hadoop nodes.

The Data Processing Centre at the IoA (DPCI) is one of five computing centres that process the data coming from Gaia.

DPCI will run the CU5 system known as PhotPipe, which performs the photometric processing of the photometric and low-resolution spectroscopic data from the satellite. This includes the Source Environment Analysis. DPCI also gives the science alert pipeline convenient access to the incoming data.

The PhotPipe outputs will include: internally calibrated accumulated photometry (i.e. statistics of mean photometry per source) from Sky Mapper (SM), Astrometric Field (AF), and integrated Blue Photometer (BP) and Red Photometer (RP); flux external calibration, to be applied by users to the internally calibrated data; internally calibrated mean BP/RP spectra per source; spectra external calibration, to be applied by users to the internally calibrated data; internally calibrated epoch spectrum shape coefficients and wavenumber from BP/RP data; internally calibrated epoch photometry from SM, AF, and integrated BP and RP; and internally calibrated epoch BP/RP spectra.

DPCI Hadoop - a close-up

Redundant network cables (yellow for the management network, blue for Ethernet and black for InfiniBand), plus remotely controlled power strips on the left.

The first prototype cluster was installed in February 2008. It used a hub-and-spoke architecture: the hub was a central Oracle database holding all the data, and the spokes were processing nodes that pulled data out of the database, processed it, and sent the results (updates) back. An extensive test campaign made it clear that this architecture could not meet the requirements of CU5 data processing. The new architecture is based on Hadoop, which combines a filesystem designed and optimised for resilient, massively distributed bulk data processing with Map/Reduce, an application programming interface (API) for developing bulk processing applications (for more information about Map/Reduce, see the Map/Reduce tutorial). A subsequent test campaign, followed by a hectic prototyping schedule, proved the DPCI bet on this young technology a success. Since then (March 2010), DPCI has been developing the software infrastructure to build an integrated CU5 pipeline, PhotPipe, based on Hadoop and Map/Reduce.
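To give a flavour of the Map/Reduce programming model, the sketch below computes a mean flux per source and band from a handful of epoch observations, loosely echoing the "mean photometry per source" output listed above. It is a minimal single-process illustration in plain Python, not PhotPipe code and not the Hadoop API: the record layout, the sample values and the function names (map_phase, shuffle, reduce_phase, run_job) are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical epoch observations: (source_id, band, flux).
# The identifiers and flux values are invented for illustration.
OBSERVATIONS = [
    (1001, "BP", 10.0),
    (1001, "BP", 12.0),
    (1001, "RP", 20.0),
    (1002, "BP", 5.0),
    (1002, "BP", 7.0),
    (1002, "BP", 9.0),
]


def map_phase(record):
    """Emit one ((source_id, band), flux) key/value pair per observation."""
    source_id, band, flux = record
    yield (source_id, band), flux


def shuffle(pairs):
    """Group values by key; a real framework does this between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, fluxes):
    """Collapse all fluxes for one key into (number of epochs, mean flux)."""
    return key, (len(fluxes), mean(fluxes))


def run_job(records):
    """Run map, shuffle and reduce in sequence inside a single process."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    for key, stats in sorted(run_job(OBSERVATIONS).items()):
        print(key, stats)
```

In a distributed setting the map calls would run on the nodes that hold the input blocks and each reducer would see only the keys routed to it, which is what lets this pattern avoid the single central database of the hub-and-spoke design.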

In April 2012 a powerful cluster was purchased and installed, comprising 108 Hadoop nodes, almost 1 PB of Hadoop Distributed File System (HDFS) disc space, data management nodes and an InfiniBand network. InfiniBand is an input/output architecture that allows low-latency, high-bandwidth data transmission (see the InfiniBand Trade Association's website for more information). Ongoing DPAC operation rehearsals are run on the new cluster.

Read more about PhotPipe.

Page last updated: 06 November 2023