Parsing Record Jar Formatted Files
Record Jar
The record-jar format is useful for storing multiple records containing name/value pairs. The example given in TAOUP will serve us well:
Planet: Mercury
Orbital-Radius: 57,910,000 km
Diameter: 4,880 km
Eccentricity: 0.2
Mass: 3.30e23 kg
%%
Planet: Venus
Orbital-Radius: 108,200,000 km
Diameter: 12,103.6 km
Eccentricity: 0.007
Mass: 4.869e24 kg
%%
Planet: Earth
Orbital-Radius: 149,600,000 km
Diameter: 12,756.3 km
Eccentricity: 0.0167
Mass: 5.972e24 kg
Moons: Luna
Individual records are separated by a single line containing %%, and each record contains several name/value pairs. Depending on the specific application, some or all names may be required, or some may be optional.
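Which names are required is up to the application. As a minimal sketch, assuming each record ends up as a Python dict, a check for required names might look like the following. `REQUIRED_NAMES` and `missing_names` are hypothetical names chosen only for illustration.

```python
# Hypothetical set of names that this application treats as mandatory.
REQUIRED_NAMES = {'Planet', 'Orbital-Radius', 'Diameter'}

def missing_names(record, required=REQUIRED_NAMES):
    """Return the required names that are absent from record, sorted."""
    return sorted(required - set(record))

earth = {'Planet': 'Earth', 'Orbital-Radius': '149,600,000 km',
         'Diameter': '12,756.3 km', 'Moons': 'Luna'}
incomplete = {'Planet': 'Mars'}

print(missing_names(earth))       # []
print(missing_names(incomplete))  # ['Diameter', 'Orbital-Radius']
```

Optional names such as Moons simply pass through; only the absence of a required name is reported.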
Here’s one way to read records into a Python dict:
#!/usr/bin/env python3

def print_planet(planet):
    print("{name} has an orbital radius of {orbital_radius}".format(
        name=planet['Planet'], orbital_radius=planet['Orbital-Radius']))

if __name__ == '__main__':
    import sys
    planet = {}
    for line in sys.stdin:
        try:
            # Split on the first colon only, so values may contain colons.
            key, value = line.split(':', 1)
            planet[key.strip()] = value.strip()
        except ValueError:
            # No colon on this line: it is either the %% separator or noise.
            if line.strip() == '%%':
                print_planet(planet)
                planet = {}
    # The last record is not followed by a separator.
    print_planet(planet)
$ record-jar/rjar_reader.py <files/planets.records
Mercury has an orbital radius of 57,910,000 km
Venus has an orbital radius of 108,200,000 km
Earth has an orbital radius of 149,600,000 km
Looking at the __main__ block, is it clear what the purpose of the program is? Not really:
- parsing logic is intermingled with high-level program logic (print information for each planet)
- the loop runs once per input line, not once per planet, so we need to mentally trace the loop logic to determine when we actually have a complete planet object
Generators and the yield keyword
A generator is a type of routine that can be used to easily implement complex iterators that can be used in loops. Python’s yield keyword makes it ridiculously easy to write generators, but keep in mind this is an abstract programming concept that can be implemented in most other common programming languages, including C++.
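To see what yield saves us, here is a sketch of the same kind of iterator written by hand as a class, which is roughly what you would do in a language without generator support. `RecordIterator` is a hypothetical name, and the in-memory sample stands in for a real file.

```python
import io

class RecordIterator:
    """Hand-written iterator: returns one dict per record without using yield."""

    def __init__(self, flo):
        self._lines = iter(flo)   # remember our position in the input
        self._done = False        # set once the final record has been returned

    def __iter__(self):
        return self

    def __next__(self):
        if self._done:
            raise StopIteration
        record = {}
        for line in self._lines:
            if line.strip() == '%%':
                return record     # one record is complete
            if ':' in line:
                key, value = line.split(':', 1)
                record[key.strip()] = value.strip()
        # Input is exhausted: hand back the final record, then stop for good.
        self._done = True
        return record

sample = io.StringIO("Planet: Mercury\n%%\nPlanet: Venus\n")
names = [record['Planet'] for record in RecordIterator(sample)]
print(names)  # ['Mercury', 'Venus']
```

With yield, the bookkeeping (the saved line iterator and the _done flag) disappears: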
#!/usr/bin/env python3

def planet_reader(flo):
    """Yield one dict per record read from the file-like object flo."""
    planet = {}
    for line in flo:
        if line.strip() == '%%':
            yield planet
            planet = {}
        else:
            try:
                # Split on the first colon only, so values may contain colons.
                key, value = line.split(':', 1)
                planet[key.strip()] = value.strip()
            except ValueError:
                pass  # TODO: decide what we want to do on an
    # The last record is not followed by a separator.
    yield planet

def print_planet_info(planet):
    print("{name} has an orbital radius of {orbital_radius}".format(
        name=planet['Planet'],
        orbital_radius=planet['Orbital-Radius'])
    )

if __name__ == '__main__':
    import sys
    for planet in planet_reader(sys.stdin):
        print_planet_info(planet)
The yield keyword works kind of like return: at that point in the code, the current value stored in planet is handed back and control returns to the caller. The difference is that the next time the loop asks the planet_reader generator for a value, control picks up at the line immediately after the yield that last suspended it, with all local variables intact.
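To watch that pause-and-resume behavior directly, the sketch below drives a tiny generator by hand with next(). `count_up_to` is a hypothetical example, not part of the record-jar code.

```python
def count_up_to(limit):
    """Yield 1, 2, ..., limit, printing whenever the body actually runs."""
    n = 1
    while n <= limit:
        print("about to yield", n)
        yield n                      # execution pauses here
        print("resumed after yielding", n)
        n += 1

gen = count_up_to(2)     # creates the generator; nothing in the body has run yet
first = next(gen)        # runs up to the first yield
second = next(gen)       # resumes after the first yield, runs to the second
remaining = list(gen)    # resumes once more; the loop ends, so nothing is left

print(first, second, remaining)  # 1 2 []
```

A for loop does the same thing behind the scenes: it calls next() until the generator finishes.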
If we look closely, we’ll note that, other than the names we use for variables, the function planet_reader doesn’t contain any code specific to the data content; it will work for ANY data stored in the record-jar format.
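As a quick check of that claim, here is a sketch that feeds the same generator logic a stream of book records instead of planets. The sample data and the use of io.StringIO as a stand-in for a file are assumptions made for illustration; planet_reader is repeated in condensed form so the snippet runs on its own.

```python
import io

def planet_reader(flo):
    # Condensed version of the generator, repeated so this example is standalone.
    planet = {}
    for line in flo:
        if line.strip() == '%%':
            yield planet
            planet = {}
        else:
            try:
                key, value = line.split(':', 1)
                planet[key.strip()] = value.strip()
            except ValueError:
                pass  # ignore lines without a colon
    yield planet

# Sample records that have nothing to do with planets.
books = io.StringIO(
    "Title: The Art of Unix Programming\n"
    "Author: Eric S. Raymond\n"
    "%%\n"
    "Title: The C Programming Language\n"
    "Author: Kernighan and Ritchie\n"
)

titles = [book['Title'] for book in planet_reader(books)]
print(titles)  # ['The Art of Unix Programming', 'The C Programming Language']
```

Only the variable names still mention planets, which is a good hint that the function deserves a more general name.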
As a module
Let’s separate the record-jar parsing code from the specifics of the planet content so that we can re-use our parser for any data in the record-jar format.
#!/usr/bin/env python3

def record_reader(flo):
    """read a stream containing record-jar formatted data, yield a
    complete record
    """
    record = {}
    for line in flo:
        if line.startswith('%%'):
            yield record
            record = {}
        else:
            # Split on the first colon only, so values may contain colons.
            key, value = line.split(':', 1)
            record[key.strip()] = value.strip()
    yield record

def load_records(flo):
    """read a complete stream containing record-jar formatted data and
    load all records into memory
    """
    return list(record_reader(flo))

if __name__ == '__main__':
    import sys
    for record in record_reader(sys.stdin):
        print("record: {0}\n".format(record))
Note that I include a simple __main__ block to test the record-jar reader; it does not depend on any particular data being in the records. If I invoke record_jar_reader.py from the command line, the code in the __main__ block will run, but if I import the file as a module, this block will not run. This is useful for providing test code for modules that you write.
#!/usr/bin/env python3
import record_jar_reader as rjr

def print_planet_info(planet):
    print("{name} has an orbital radius of {orbital_radius}".format(
        name=planet['Planet'],
        orbital_radius=planet['Orbital-Radius'])
    )

if __name__ == '__main__':
    import sys
    for planet in rjr.record_reader(sys.stdin):
        print_planet_info(planet)
$ record-jar/planets.py <files/planets.records
Mercury has an orbital radius of 57,910,000 km
Venus has an orbital radius of 108,200,000 km
Earth has an orbital radius of 149,600,000 km