Harsh J

Memoirs of a QWERTY Keyboard

Writing and reading AVRO Data Files using Python


Avro is a data serialization format with rich features such as support for complex data structures and RPC, and it does not require code generation to read or write its files. From 1.4.0 upwards you can also use Avro from within Hadoop’s MapReduce (though only Java supports that).

Here is a sample code snippet that shows how one can serialize (or write, in human terms) a ‘Record’ data type of Avro using its Python module (installable via `easy_install avro`), and then read it back.

# Import the schema, datafile and io submodules
# from avro (easy_install avro)
from avro import schema, datafile, io

OUTFILE_NAME = 'sample.avro'

SCHEMA_STR = """{
    "type": "record",
    "name": "sampleAvro",
    "namespace": "AVRO",
    "fields": [
        {   "name": "name"   , "type": "string"   },
        {   "name": "age"    , "type": "int"      },
        {   "name": "address", "type": "string"   },
        {   "name": "value"  , "type": "long"     }
    ]
}"""

SCHEMA = schema.parse(SCHEMA_STR)

def write_avro_file():
    # Let's generate our data
    data = {}
    data['name']    = 'Foo'
    data['age']     = 19
    data['address'] = '10, Bar Eggs Spam'
    data['value']   = 800

    # Create a 'record' (datum) writer
    rec_writer = io.DatumWriter(SCHEMA)

    # Create a 'data file' (avro file) writer
    df_writer = datafile.DataFileWriter(
                    # The file to contain
                    # the records
                    open(OUTFILE_NAME, 'wb'),
                    # The 'record' (datum) writer
                    rec_writer,
                    # Schema, if writing a new file
                    # (aka not 'appending')
                    # (Schema is stored into
                    # the file, so not needed
                    # when you want the writer
                    # to append instead)
                    writers_schema = SCHEMA,
                    # An optional codec name
                    # for compression
                    # ('null' for none)
                    codec = 'deflate'
                )

    # Write our data
    # (You can call append multiple times
    # to write more than one record, of course)
    df_writer.append(data)

    # Close to ensure writing is complete
    df_writer.close()

def read_avro_file():
    # Create a 'record' (datum) reader
    # You can pass an 'expected=SCHEMA' kwarg
    # if you want it to expect a particular
    # schema (Strict)
    rec_reader = io.DatumReader()

    # Create a 'data file' (avro file) reader
    df_reader = datafile.DataFileReader(
                    # Open in binary mode for reading
                    open(OUTFILE_NAME, 'rb'),
                    rec_reader
                )

    # Read all records stored inside
    for record in df_reader:
        print record['name'], record['age']
        print record['address'], record['value']
        # Do whatever read-processing you wanna do
        # for each record here ...

    # Close the reader once done
    df_reader.close()

if __name__ == '__main__':
    # Write an AVRO file first
    write_avro_file()

    # Now, read it
    read_avro_file()


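As the comments above note, the schema (and the codec) are stored in the data file itself, so appending more records to an existing file does not need the schema passed in again. Below is a minimal sketch of that, under the assumption that opening the file in ‘ab+’ (read-plus-append, binary) mode and omitting writers_schema lets the writer pick the schema back up from the file header; the function name and the record values are just illustrative.

def append_avro_file():
    # Create a 'record' (datum) writer with no schema;
    # in append mode the DataFileWriter fills it in
    # from the existing file's metadata
    rec_writer = io.DatumWriter()

    # Open the existing file for reading and appending
    # ('ab+'), and omit writers_schema so the writer
    # appends instead of starting a new file
    df_writer = datafile.DataFileWriter(
                    open(OUTFILE_NAME, 'ab+'),
                    rec_writer
                )

    # Append one more record
    df_writer.append({
        'name'   : 'Baz',
        'age'    : 23,
        'address': '42, Ham Eggs Spam',
        'value'  : 900
    })

    # Close to ensure writing is complete
    df_writer.close()

Calling read_avro_file() again after an append like this should print every record written so far.
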
I hope the snippet explains enough about how one can write and read Avro data files. The same technique works for Java and Ruby as well, although those libraries have slightly different abstractions. Leave a comment if anything needs to be corrected or improved.

Written by Harsh

April 25th, 2010 at 10:09 pm

Posted in Personal
