Parsing Record Jar Formatted Files

Record Jar

The record-jar format is useful for storing multiple records containing name/value pairs. The example given in TAOUP will serve us well:

files/planets.records

Planet: Mercury
Orbital-Radius: 57,910,000 km
Diameter: 4,880 km
Eccentricity: 0.2
Mass: 3.30e23 kg
%%
Planet: Venus
Orbital-Radius: 108,200,000 km
Diameter: 12,103.6 km
Eccentricity: 0.007
Mass: 4.869e24 kg
%%
Planet: Earth
Orbital-Radius: 149,600,000 km
Diameter: 12,756.3 km
Eccentricity: 0.0167
Mass: 5.972e24 kg
Moons: Luna

Individual records are separated by a single line containing %%, and each record contains several name/value pairs. Depending on the application, some or all of the names may be required, while others may be optional.

Here’s one way to read records into a Python dict:

record-jar/rjar_reader.py

#!/usr/bin/env python3

def print_planet(planet):
	print("{name} has an orbital radius of {orbital_radius}".format(
		name=planet['Planet'], orbital_radius=planet['Orbital-Radius']))

if __name__ == '__main__':
	import sys

	planet = {}
	for line in sys.stdin:
		try:
			key, value = line.split(':', 1)  # split on the first colon only, so values may contain ':'
			planet[key.strip()] = value.strip()
		except ValueError as e:
			if line.strip() == '%%':
				print_planet(planet)
				planet = {}
		
	print_planet(planet)

 $ record-jar/rjar_reader.py <files/planets.records
 Mercury has an orbital radius of 57,910,000 km
 Venus has an orbital radius of 108,200,000 km
 Earth has an orbital radius of 149,600,000 km

Looking at the __main__ block, is it clear what the purpose of the program is? Not really:

Parsing logic is intermingled with the high-level program logic (print information for each planet).
The number of loop iterations (one per input line) differs from the number of planets printed, so we need to mentally step through the loop logic to determine when we actually have a complete planet.

Generators and the `yield` keyword

A generator is a kind of routine that makes it easy to implement complex iterators for use in loops. Python’s yield keyword makes it ridiculously easy to write generators, but keep in mind that this is an abstract programming concept that can be implemented in most other common programming languages, including C++.

record-jar/planet_reader.py

#!/usr/bin/env python3

def planet_reader(flo):
	planet = {}
	for line in flo:
		if line.strip() == '%%':
			yield planet
			planet = {}
		else:
			try:
				key, value = line.split(':', 1)  # split on the first colon only
				planet[key.strip()] = value.strip()
			except ValueError as e:
				pass #TODO: decide what we want to do on an 
	yield planet

def print_planet_info(planet):
	print("{name} has an orbital radius of {orbital_radius}".format(
		name=planet['Planet'], 
		orbital_radius=planet['Orbital-Radius'])
	     )


if __name__ == '__main__':
	import sys
	
	for planet in planet_reader(sys.stdin):
		print_planet_info(planet)

The yield keyword works somewhat like return: at that point in the code, the current value stored in planet is handed back and control returns to the caller. The difference is that when the caller asks for the next value (here, on the next iteration of the for loop), control picks up at the line immediately after the yield that last suspended the generator, with its local variables intact.
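To make the suspend-and-resume behaviour concrete, here is a minimal sketch using a toy generator. The names count_up_to and trace are made up for this illustration and are not part of the record-jar code; trace simply records the order in which the lines run.

```python
trace = []  # records the order in which the generator body executes

def count_up_to(limit):
    """Hypothetical toy generator, used only to show how yield resumes."""
    n = 1
    while n <= limit:
        trace.append("yielding {0}".format(n))
        yield n  # hand n to the caller and pause right here
        trace.append("resumed after {0}".format(n))
        n += 1

gen = count_up_to(2)    # nothing runs yet; we only get a generator object
first = next(gen)       # runs the body up to the first yield
second = next(gen)      # resumes after the first yield, runs to the second
remaining = list(gen)   # resumes once more; the loop ends, nothing is left

for step in trace:
    print(step)
```

Each call to next() runs the body only as far as the next yield. A for loop does the same thing behind the scenes, which is why the planet_reader loop sees one complete planet per iteration.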

If we look closely, we’ll note that, other than the variable names, the function planet_reader doesn’t contain any code specific to the data content; it will work for ANY data stored in the record-jar format.
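As a quick check of that claim, the sketch below feeds non-planet data through a copy of the planet_reader approach. The book records are invented sample data, and io.StringIO stands in for a real file so the example runs on its own.

```python
import io

def planet_reader(flo):
    # Mirrors planet_reader above, repeated here so the sketch is self-contained.
    planet = {}
    for line in flo:
        if line.strip() == '%%':
            yield planet
            planet = {}
        else:
            try:
                key, value = line.split(':', 1)
                planet[key.strip()] = value.strip()
            except ValueError:
                pass  # ignore lines that are not name/value pairs
    yield planet

# Hypothetical sample data: nothing here is about planets.
books = io.StringIO(
    "Title: Dune\n"
    "Author: Frank Herbert\n"
    "%%\n"
    "Title: Emma\n"
    "Author: Jane Austen\n"
)

titles = [record['Title'] for record in planet_reader(books)]
print(titles)
```

The only planet-specific things left are the names planet_reader and planet, which is a good hint that the function deserves a more general name.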

As a module

Let’s separate the record-jar parsing code from the specifics of the planet content so that we can re-use our parser for any data in the record-jar format.

record-jar/record_jar_reader.py

#!/usr/bin/env python3

def record_reader(flo):
        """read a stream containing record-jar formatted data, yield a
        complete record

        """

        record = {}
        for line in flo:
                if line.startswith('%%'):
                        yield record
                        record = {}
                elif line.strip():
                        # skip blank lines; split on the first colon only
                        key, value = line.split(':', 1)
                        record[key.strip()] = value.strip()
                        
        yield record

def load_records(flo):
        """read a complete stream containing record-jar formatted data and
        load all records into memory
        """

        return list(record_reader(flo))

if __name__ == '__main__':
	import sys
	
	for record in record_reader(sys.stdin):
		print("record: {0}\n".format(record))

Note that I include a simple __main__ block to test the record-jar reader; it does not depend on any particular data being in the records. If I invoke record_jar_reader.py from the command line, the code in the __main__ block will run, but if I import the file as a module, this block will not run. This is useful for providing test code for modules that you write.
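One possible way to take the test-code idea a step further is to have the __main__ block check a known input instead of just printing records. This is only a sketch: self_test and its sample data are hypothetical, and record_reader is a compact stand-in for the one in record_jar_reader.py so that the example runs on its own.

```python
import io

def record_reader(flo):
    """Compact stand-in for record_jar_reader.record_reader."""
    record = {}
    for line in flo:
        if line.startswith('%%'):
            yield record
            record = {}
        elif line.strip():  # skip blank lines
            key, value = line.split(':', 1)
            record[key.strip()] = value.strip()
    yield record

def self_test():
    """Hypothetical helper: parse a known input and check the result."""
    sample = io.StringIO("Name: a\nSize: 1\n%%\nName: b\nSize: 2\n")
    records = list(record_reader(sample))
    assert records == [
        {'Name': 'a', 'Size': '1'},
        {'Name': 'b', 'Size': '2'},
    ]
    return len(records)

if __name__ == '__main__':
    # Runs only when the file is executed directly, not when it is imported.
    print("self-test passed for {0} records".format(self_test()))
```

Because the check lives behind the __name__ test, importing the module stays silent, while running the file directly gives a quick pass/fail signal.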

record-jar/planets.py

#!/usr/bin/env python3

import record_jar_reader as rjr

def print_planet_info(planet):
	print("{name} has an orbital radius of {orbital_radius}".format(
		name=planet['Planet'], 
		orbital_radius=planet['Orbital-Radius'])
	     )


if __name__ == '__main__':
	import sys
	
	for planet in rjr.record_reader(sys.stdin):
		print_planet_info(planet)

 $ record-jar/planets.py <files/planets.records
 Mercury has an orbital radius of 57,910,000 km
 Venus has an orbital radius of 108,200,000 km
 Earth has an orbital radius of 149,600,000 km
