Harsh J

Memoirs of a QWERTY Keyboard

Writing and reading AVRO Data Files using Python


Avro is a data serialization format with rich features such as support for data structures and RPC, and it does not require generated code to read or write its files. From 1.4.0 onwards you can also use Avro from within Hadoop’s MapReduce (though only the Java implementation supports that).

Here is a sample code snippet that shows how to serialize (or write, in human terms) an Avro ‘Record’ data type using its Python module (installable via `easy_install avro`).

# Import the schema, datafile and io submodules
# from avro (easy_install avro)
from avro import schema, datafile, io

OUTFILE_NAME = 'sample.avro'

SCHEMA_STR = """{
    "type": "record",
    "name": "sampleAvro",
    "namespace": "AVRO",
    "fields": [
        {   "name": "name"   , "type": "string"   },
        {   "name": "age"    , "type": "int"      },
        {   "name": "address", "type": "string"   },
        {   "name": "value"  , "type": "long"     }
    ]
}"""

SCHEMA = schema.parse(SCHEMA_STR)

def write_avro_file():
    # Let's generate our data
    data = {}
    data['name']    = 'Foo'
    data['age']     = 19
    data['address'] = '10, Bar Eggs Spam'
    data['value']   = 800

    # Create a 'record' (datum) writer
    rec_writer = io.DatumWriter(SCHEMA)

    # Create a 'data file' (avro file) writer
    df_writer = datafile.DataFileWriter(
                    # The file to contain
                    # the records
                    open(OUTFILE_NAME, 'wb'),
                    # The 'record' (datum) writer
                    rec_writer,
                    # Schema, if writing a new file
                    # (aka not 'appending')
                    # (Schema is stored into
                    # the file, so not needed
                    # when you want the writer
                    # to append instead)
                    writers_schema = SCHEMA,
                    # An optional codec name
                    # for compression
                    # ('null' for none)
                    codec = 'deflate'
                )

    # Write our data
    # (You can call append multiple times
    # to write more than one record, of course)
    df_writer.append(data)

    # Close to ensure writing is complete
    df_writer.close()

def read_avro_file():
    # Create a 'record' (datum) reader
    # You can pass a 'readers_schema=SCHEMA' kwarg
    # if you want it to expect a particular
    # schema (Strict)
    rec_reader = io.DatumReader()

    # Create a 'data file' (avro file) reader
    df_reader = datafile.DataFileReader(
                    # Open in binary mode; Avro
                    # data files are not text
                    open(OUTFILE_NAME, 'rb'),
                    rec_reader
                )

    # Read all records stored inside
    for record in df_reader:
        print record['name'], record['age']
        print record['address'], record['value']
        # Do whatever read-processing you wanna do
        # for each record here ...

    # Close the reader (and the underlying file)
    df_reader.close()

if __name__ == '__main__':
    # Write an AVRO file first
    write_avro_file()

    # Now, read it
    read_avro_file()
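
A note on file modes: Avro data files are binary, and each one starts with the four bytes 'Obj' followed by 0x01, so they should be opened with 'wb'/'rb' rather than in text mode. Here is a minimal sketch that checks for that header using only the standard library. The helper name `looks_like_avro` is my own invention for illustration; it is not part of the avro module.

```python
# First four bytes of every Avro object container file
AVRO_MAGIC = b'Obj\x01'

def looks_like_avro(path):
    """Return True if the file at path starts with the Avro magic bytes.

    Hypothetical helper for illustration; not part of the avro module.
    """
    # 'rb' matters: text mode may translate or try to decode the bytes
    with open(path, 'rb') as f:
        return f.read(len(AVRO_MAGIC)) == AVRO_MAGIC
```

Calling `looks_like_avro('sample.avro')` after running the writer above is a quick way to confirm that the file on disk is an Avro container before handing it to a reader.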


I hope the snippet explains enough to understand how one could write/read Avro Data Files. The same technique should work for Java/Ruby as well, although they may have certain other abstractions. Comment if anything needs to be corrected or improved.
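
Since an Avro schema is plain JSON, you can also sanity-check a record against it before writing. The following is only a rough sketch using the standard library, assuming the same schema as in the snippet above. `check_record` and the `PY_TYPES` mapping are hypothetical names of mine, and the avro module performs its own (stricter) checks when writing, so treat this as an illustration of how the schema's fields line up with the dict keys rather than a replacement.

```python
import json

SCHEMA_STR = """{
    "type": "record",
    "name": "sampleAvro",
    "namespace": "AVRO",
    "fields": [
        {"name": "name",    "type": "string"},
        {"name": "age",     "type": "int"},
        {"name": "address", "type": "string"},
        {"name": "value",   "type": "long"}
    ]
}"""

# Python types accepted for each Avro primitive used above.
# (Illustrative mapping only; it does not range-check int vs long.)
try:
    INT_TYPES = (int, long)      # Python 2
    STR_TYPES = (basestring,)
except NameError:
    INT_TYPES = (int,)           # Python 3
    STR_TYPES = (str,)

PY_TYPES = {'string': STR_TYPES, 'int': INT_TYPES, 'long': INT_TYPES}

def check_record(schema_str, datum):
    """Return a list of problems found in datum (empty if none)."""
    fields = json.loads(schema_str)['fields']
    problems = []
    for field in fields:
        name, avro_type = field['name'], field['type']
        if name not in datum:
            problems.append('missing field: %s' % name)
        elif not isinstance(datum[name], PY_TYPES[avro_type]):
            problems.append('bad type for %s: expected %s' % (name, avro_type))
    return problems
```

For the `data` dict built in `write_avro_file()` this returns an empty list; a dict that lacks `value`, or that stores `age` as a string, would get one entry per problem.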

Written by Harsh

April 25th, 2010 at 10:09 pm

Posted in Personal


13 Responses to 'Writing and reading AVRO Data Files using Python'


  1. Hey, nice article. I am an absolute beginner in Python but I want to learn the language along with C++. Any good books/links that would help me get on the fast track to learning Python?

    aash

    26 Apr 10 at 6:56 pm

  2. For Python I always recommend using ‘Core Python Programming’ by Wesley J. Chun.

    In case you want something free and quick (and assuming you know how to program well in C), you can go for ‘A Byte of Python’ by Swaroop C.H.

    I haven’t got a clue about the best book for C++, having learnt it almost entirely by trial, error, practice and Google. Sites dedicated to explaining the use of the C++ standard libraries help a lot too.

    Harsh

    26 Apr 10 at 7:55 pm

  3. I wouldn’t say I am a programmer but I surely do want to become one. What do I do to become one?

    aash

    26 Apr 10 at 9:05 pm

  4. Why, develop programs to become a developer of course!

    Start with simple math programs, say a matrix suite which does pretty basic but complete computations like addition/multiplication/determinant/etc. Move on to fairly complex math/algorithms like generating prime numbers efficiently (sieves are one such technique) or others. Once you get comfortable using conditional and looping constructs (that’s almost all there is to basic programming), try the problems presented to all at Project Euler. In my opinion that’s the best way to get strong in any language of your choice.

    Harsh

    26 Apr 10 at 9:47 pm

  5. “‘Core Python Programming’ by Wesley J. Chun.” This book is good for beginners like me too, right?
    Thanks, will go about doing what you just said! :)

    aash

    26 Apr 10 at 10:01 pm

  6. Yep, it’s great for beginners. Remember to get the latest edition (2nd, I think) though, so it covers Python 2.5 upwards.

    Harsh

    26 Apr 10 at 10:39 pm

  7. does it cover 3.1 too?

    aash

    27 Apr 10 at 5:26 am

  8. and are there any big differences between 2.5 and 3?

    aash

    27 Apr 10 at 5:28 am

  9. Not many big differences, but Core Python Programming does not cover any of 3.x; it’s a fairly old book now. You can learn the new things in 3.x easily via the Python docs online after mastering 2.5+ (2.5 is what’s used in production today; 3.x will still take a year or more).

    Harsh

    27 Apr 10 at 9:57 am

  10. how much is that book?

    aash

    27 Apr 10 at 10:15 am

  11. ^^^ redundant question u can delete it :P

    aash

    27 Apr 10 at 1:36 pm

  12. aash: I’ve found http://diveintopython3.org to be an excellent resource for a beginner wanting to learn Python 3.x.

    Ethan

    5 Nov 10 at 10:41 pm

  13. Hi,
    I am trying to run this code using Python 2.7.2 and getting this error:
    return unicode(self.read_bytes(), "utf-8")
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xae in position 9: invalid start byte

    The error is thrown from below line:

    df_reader = datafile.DataFileReader(
    open("outfile"),
    rec_reader
    )

    Any idea why?

    Thanks

    curious

    18 Nov 11 at 4:50 am
