Harsh J

Memoirs of a QWERTY Keyboard

Archive for the ‘MapReduce’ tag

Cloudera Desktop and Hadoop Distribution on ArchLinux

It sure is easy to get the Cloudera Hadoop packages up and running on Debian- and RPM-based distributions. All you need to do is add the repositories and issue an install instruction to your package manager!

Since ArchLinux, the distribution of choice for the wise, rollin’ and version-worry-free, has no AUR packages for installing the same, I thought I might treat you to a more manual approach to setting up Cloudera Hadoop and running the beautiful Cloudera Desktop on it. Actually, the article caters to almost any other Linux too, but I love ArchLinux and like getting sued. Anyway, let’s get on with it.

It took me a couple of minutes to hunt down Cloudera’s archive site, which gives away source packages (not source RPMs; those wouldn’t be what we need, at least not what I need). You can find Cloudera’s CDH2 releases here. Navigate up and away for other releases if you want to grow older or to bleed till you feel like stemming it.

Download their Hadoop and Desktop archives (versions 0.20.1 and 0.3.0 respectively, as of this article’s writing date).

Unpack Cloudera-Hadoop and configure it as you like, format the namenode, and start it using the provided bin/ scripts. Configuration help may be found on Hadoop’s site.

Unpack Cloudera-Desktop, build it using make, and install it using make install (use a PREFIX if you like). Next, follow this article (from 1.5 on; the README of the desktop package helps too) to pour the special sauce into Cloudera-Hadoop so that the Desktop integrates smoothly. Run the desktop using cloudera-desktop/bin/supervisor. This runs a server-like process, so make sure you don’t kill it by accident: start it within screen or with a trailing &.

Connect to your new (and hopefully working, if Hadoop is) Cloudera Desktop at http://localhost:8088 and enjoy using the simple Job Designer tool, amongst others.
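If you would rather check from a script than from a browser, something like the following rough sketch can tell you whether anything is answering on that address. The function name `desktop_is_up` is made up for illustration and is not part of Cloudera Desktop; the only thing taken from above is the URL. It uses Python 3's standard library only.

```python
import urllib.error
import urllib.request


def desktop_is_up(url="http://localhost:8088", timeout=3):
    """Return True if an HTTP server answers at `url`, False otherwise.

    Hypothetical helper: it only checks that something speaks HTTP
    there, not that it is really Cloudera Desktop.
    """
    # Ignore proxy environment variables; the target is a local server.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        opener.open(url, timeout=timeout).close()
        return True
    except urllib.error.HTTPError:
        # The server replied, just with an error status (404, 500, ...).
        return True
    except OSError:
        # Connection refused, timed out, name not resolved, and so on.
        return False


if __name__ == "__main__":
    print("Desktop reachable:", desktop_is_up())
```

A False here usually means the supervisor process is not running (or is listening on some other port).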

I leave the daemon user setup and other finer cluster-related tuning to your tastes. This guide serves well for a pseudo-cluster setup run from $HOME.

Written by Harsh

June 1st, 2010 at 10:13 pm

Writing and reading AVRO Data Files using Python

Avro is a data serialization format with rich features such as support for data structures and RPC, and it does not require code generation to read or write its files. From 1.4.0 upwards you can also use Avro from within Hadoop’s MapReduce (only the Java implementation supports that, though).

Here is a sample code snippet that shows how one can serialize (or write, in human terms) a ‘Record’ data type of Avro using its Python module (installable via `easy_install avro`), and then read it back.

# Import the schema, datafile and io submodules
# from avro (easy_install avro)
from avro import schema, datafile, io

OUTFILE_NAME = 'sample.avro'

SCHEMA_STR = """{
    "type": "record",
    "name": "sampleAvro",
    "namespace": "AVRO",
    "fields": [
        {   "name": "name"   , "type": "string"   },
        {   "name": "age"    , "type": "int"      },
        {   "name": "address", "type": "string"   },
        {   "name": "value"  , "type": "long"     }
    ]
}"""

SCHEMA = schema.parse(SCHEMA_STR)

def write_avro_file():
    # Let's generate our data
    data = {}
    data['name']    = 'Foo'
    data['age']     = 19
    data['address'] = '10, Bar Eggs Spam'
    data['value']   = 800

    # Create a 'record' (datum) writer
    rec_writer = io.DatumWriter(SCHEMA)

    # Create a 'data file' (avro file) writer
    df_writer = datafile.DataFileWriter(
                    # The file to contain
                    # the records
                    open(OUTFILE_NAME, 'wb'),
                    # The 'record' (datum) writer
                    rec_writer,
                    # Schema, if writing a new file
                    # (aka not 'appending')
                    # (Schema is stored into
                    # the file, so not needed
                    # when you want the writer
                    # to append instead)
                    writers_schema = SCHEMA,
                    # An optional codec name
                    # for compression
                    # ('null' for none)
                    codec = 'deflate'
                )

    # Write our data
    # (You can call append multiple times
    # to write more than one record, of course)
    df_writer.append(data)

    # Close to ensure writing is complete
    df_writer.close()

def read_avro_file():
    # Create a 'record' (datum) reader
    # You can pass the schema you expect as the
    # reader's schema if you want it to be strict
    # (a 'readers_schema' kwarg in older releases
    # of the Python module; check yours)
    rec_reader = io.DatumReader()

    # Create a 'data file' (avro file) reader
    df_reader = datafile.DataFileReader(
                    # Avro data files are binary,
                    # so open in 'rb' mode
                    open(OUTFILE_NAME, 'rb'),
                    rec_reader
                )

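    # Read all records stored inside
    for record in df_reader:
        # This print form works on both Python 2 and 3
        print('%s %s' % (record['name'], record['age']))
        print('%s %s' % (record['address'], record['value']))
        # Do whatever read-processing you wanna do
        # for each record here ...

    # Close the reader (and the underlying file)
    df_reader.close()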

if __name__ == '__main__':
    # Write an AVRO file first
    write_avro_file()

    # Now, read it
    read_avro_file()


I hope the snippet explains enough to understand how one could write and read Avro Data Files. The same technique would work for Java or Ruby as well, although they may have certain other abstractions. Comment if anything needs to be corrected or bettered.
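Since an Avro schema is plain JSON, you can also sanity-check a record against it before handing it to the writer. The sketch below does that with nothing but Python 3's standard library. The `check_record` helper and the `PRIMITIVES` table are hypothetical names invented for this illustration, and the sketch only understands a few primitive field types. The avro module does its own (stricter) validation when appending, so treat this as a way to see what shape a datum must have, not as a replacement.

```python
import json

# The same schema as in the snippet above
SCHEMA_STR = """{
    "type": "record",
    "name": "sampleAvro",
    "namespace": "AVRO",
    "fields": [
        {"name": "name",    "type": "string"},
        {"name": "age",     "type": "int"},
        {"name": "address", "type": "string"},
        {"name": "value",   "type": "long"}
    ]
}"""

# Assumed mapping from Avro primitive type names to Python types
# (illustrative only, not taken from the avro package)
PRIMITIVES = {
    "string": str,
    "int": int,
    "long": int,
    "float": float,
    "double": float,
    "boolean": bool,
}


def check_record(schema_str, record):
    """Return a list of problems found in `record` (empty if none)."""
    schema = json.loads(schema_str)
    problems = []
    for field in schema["fields"]:
        name, type_name = field["name"], field["type"]
        # Complex types (unions, nested records, ...) are skipped here
        expected = PRIMITIVES.get(type_name) if isinstance(type_name, str) else None
        if name not in record:
            problems.append("missing field: %s" % name)
        elif expected is not None and not isinstance(record[name], expected):
            problems.append("wrong type for field: %s" % name)
    return problems


if __name__ == "__main__":
    good = {"name": "Foo", "age": 19, "address": "10, Bar Eggs Spam", "value": 800}
    print(check_record(SCHEMA_STR, good))
```

A record with a missing or wrongly typed field comes back with one message per problem, in the order the fields appear in the schema.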

Written by Harsh

April 25th, 2010 at 10:09 pm

Posted in Personal

Summing Reducers

When you need to sum all the mapped groups of (Key, Int) or (Key, Long) type while reducing, don’t implement your own Reducer class if you can avoid it. Use the IntSumReducer and LongSumReducer classes that ship with Hadoop instead and save some time.
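For anyone unsure what such a summing reducer actually computes, here is a plain-Python illustration of the idea: group the mapped (key, number) pairs by key and add up each group. This is not Hadoop code and uses none of its API; the `sum_reduce` name is made up for the example, and the sort stands in for the shuffle/sort that Hadoop performs between map and reduce.

```python
from itertools import groupby
from operator import itemgetter


def sum_reduce(pairs):
    """Sum the values of (key, number) pairs per key.

    Returns a list of (key, total) tuples ordered by key.
    """
    # Sorting brings equal keys together, like Hadoop's shuffle/sort
    ordered = sorted(pairs, key=itemgetter(0))
    return [
        (key, sum(value for _, value in group))
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


if __name__ == "__main__":
    mapped = [("spam", 1), ("eggs", 2), ("spam", 3)]
    print(sum_reduce(mapped))
```

If your reduce step does nothing more than this, the ready-made classes already cover it.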

Written by Harsh

April 19th, 2010 at 11:01 pm