Writing and reading AVRO Data Files using Python

Avro is a data serialization format with rich features such as support for data structures and RPC, and it does not require code generation to read or write its files. From version 1.4.0 onwards you can also use Avro from within Hadoop’s MapReduce (only the Java implementation supports that, though).
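To illustrate the "no code generation" point: an Avro schema is plain JSON, so it can be loaded at runtime and a record can be an ordinary dict. The sketch below uses only Python's standard `json` module, not the avro library. The names `SCHEMA_JSON`, `PY_TYPES`, `field_types` and `matches_schema` are made up for this illustration, and the type check is a rough approximation, not the validation the avro module itself performs.

```python
import json

# Illustrative schema: the same four fields used in the snippet in this post.
SCHEMA_JSON = """{
    "type": "record",
    "name": "sampleAvro",
    "namespace": "AVRO",
    "fields": [
        {"name": "name",    "type": "string"},
        {"name": "age",     "type": "int"},
        {"name": "address", "type": "string"},
        {"name": "value",   "type": "long"}
    ]
}"""

# Rough mapping from Avro primitive type names to Python types.
# This is an assumption made for the sketch; the avro library has
# its own, stricter checks (for example, integer ranges).
PY_TYPES = {"string": str, "int": int, "long": int}


def field_types(schema_json):
    """Return a {field name: Avro type name} dict for a record schema."""
    parsed = json.loads(schema_json)
    return dict((f["name"], f["type"]) for f in parsed["fields"])


def matches_schema(datum, schema_json):
    """Check that a dict has exactly the schema's fields, with matching types."""
    types = field_types(schema_json)
    if set(datum) != set(types):
        return False
    return all(isinstance(datum[n], PY_TYPES[t]) for n, t in types.items())
```

No class had to be generated for `sampleAvro` here: the schema text alone is enough to describe and check a record.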

Here is a sample code snippet that shows how to serialize (or write, in human terms) a ‘Record’ data type of Avro using its Python module (installable via `easy_install avro`).

# Import the schema, datafile and io submodules
# from avro (easy_install avro)
from avro import schema, datafile, io

OUTFILE_NAME = 'sample.avro'

# The record schema, written as a JSON string
SCHEMA_STR = """{
    "type": "record",
    "name": "sampleAvro",
    "namespace": "AVRO",
    "fields": [
        {   "name": "name"   , "type": "string"   },
        {   "name": "age"    , "type": "int"      },
        {   "name": "address", "type": "string"   },
        {   "name": "value"  , "type": "long"     }
    ]
}"""

SCHEMA = schema.parse(SCHEMA_STR)

def write_avro_file():
    # Lets generate our data
    data = {}
    data['name']    = 'Foo'
    data['age']     = 19
    data['address'] = '10, Bar Eggs Spam'
    data['value']   = 800

    # Create a 'record' (datum) writer
    rec_writer = io.DatumWriter(SCHEMA)

    # Create a 'data file' (avro file) writer
    df_writer = datafile.DataFileWriter(
                    # The file to contain
                    # the records
                    open(OUTFILE_NAME, 'wb'),
                    # The 'record' (datum) writer
                    rec_writer,
                    # Schema, if writing a new file
                    # (aka not 'appending')
                    # (Schema is stored into
                    # the file, so not needed
                    # when you want the writer
                    # to append instead)
                    writers_schema = SCHEMA,
                    # An optional codec name
                    # for compression
                    # ('null' for none)
                    codec = 'deflate'
                )

    # Write our data
    # (You can call append multiple times
    # to write more than one record, of course)
    df_writer.append(data)

    # Close to ensure writing is complete
    df_writer.close()

def read_avro_file():
    # Create a 'record' (datum) reader
    # You can pass an 'expected=SCHEMA' kwarg
    # if you want it to expect a particular
    # schema (Strict)
    rec_reader = io.DatumReader()

    # Create a 'data file' (avro file) reader
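    df_reader = datafile.DataFileReader(
                    # The file to read, opened in binary mode
                    open(OUTFILE_NAME, 'rb'),
                    # The 'record' (datum) reader
                    rec_reader
                )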

    # Read all records stored inside
    for record in df_reader:
        print record['name'], record['age']
        print record['address'], record['value']
        # Do whatever read-processing you wanna do
        # for each record here ...

if __name__ == '__main__':
    # Write an AVRO file first
    write_avro_file()

    # Now, read it
    read_avro_file()

I hope the snippet explains enough to understand how one could write/read Avro Data Files. The same technique would work for Java/Ruby also, although they may have certain other abstractions. Leave a comment if anything needs to be corrected or improved.
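As a quick sanity check on a file written by the snippet above, one can look at its first four bytes. The Avro specification defines the data file header as starting with ASCII 'O', 'b', 'j' followed by the byte 1. The sketch below uses only the standard library; the names `AVRO_MAGIC` and `looks_like_avro_file` are made up for this illustration and are not part of the avro module.

```python
# Per the Avro specification, an Avro data file starts with these
# four bytes: ASCII 'O', 'b', 'j', followed by the byte 1.
AVRO_MAGIC = b"Obj\x01"


def looks_like_avro_file(path):
    """Return True if the file at `path` starts with the Avro magic bytes."""
    # Binary mode matters here: the file is not text, and text mode
    # may alter the bytes or try to decode them.
    with open(path, "rb") as handle:
        return handle.read(len(AVRO_MAGIC)) == AVRO_MAGIC
```

After running the snippet above, `looks_like_avro_file('sample.avro')` should return True. It only inspects the header, so it says nothing about whether the records inside are intact.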

Published by


Harsh, also known to some as 'Qwerty' or 'QwertyManiac' online, is a Customer Operations Engineer at Cloudera, Inc. Harsh is a fan of trance and electronic music, distributed systems and GUI programming, and loves to troubleshoot and hack on code in his free time. Formerly a KDE committer, he is now a committer on the Apache Hadoop and Apache Oozie projects and is a great fan of all Open Source Software.

13 thoughts on “Writing and reading AVRO Data Files using Python”

  1. Hey, nice article. I am an absolute beginner in Python, but I want to learn the language along with C++. Any good books/links that would help me get on the fast track to learning Python?

  2. For Python I always recommend using ‘Core Python Programming‘ by Wesley J. Chun.

    In case you want something free and quick (and assuming you know to program well in C), you can go for ‘A Byte of Python‘ by Swaroop C.H.

    I haven’t got clues on the best book for C++, having learnt that almost entirely by trial, error, practice and Google. Sites dedicated to explain C++ standard libraries’ use help a lot too.

  3. Why, develop programs to become a developer of course!

    Start with simple math programs, say a matrix suite which does pretty basic but complete computations like addition/multiplication/determinant/etc. Then move on to fairly complex math/algorithms, like generating prime numbers efficiently (one such technique is called a sieve) or others. Once you get comfortable using conditional and looping constructs (that’s almost all there is to basic programming), try the problems presented to all at Project Euler. In my opinion that’s the best way to get strong in any language of your choice.

  4. “‘Core Python Programming‘ by Wesley J. Chun.” this book is good for beginners like me also right?
    thanks will go about doing what you just said ! :)

  5. Not many big differences, but Core Python Programming does not cover any of 3.x; it’s a fairly old book now. You can learn new things in 3.x easily via the Python docs online after mastering 2.5+ (2.5 is what’s used in production today, 3.x will still take a year or more).

  6. Hi,
    I am trying to run this code using python 2.7.2 and getting this error :
    return unicode(self.read_bytes(), "utf-8")
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xae in position 9: invalid start byte

    The error is thrown from below line:

    df_reader = datafile.DataFileReader(

    Any idea why?

