BibTeX tutorial
A tutorial for parsing well known format for bibliographic references.
The word BibTeX stands for a tool and a file format which are used to describe and process lists of references, mostly in conjunction with LaTeX documents.
An example of BibTeX entry is given below.
@article{DejanovicADomain-SpecificLanguageforDefiningStaticStructureofDatabaseApplications2010,
author = "Igor Dejanovi\'{c} and Gordana Milosavljevi\'{c} and Branko Peri\v{s}i\'{c} and Maja Tumbas",
title = "A {D}omain-Specific Language for Defining Static Structure of Database Applications",
journal = "Computer Science and Information Systems",
year = "2010",
volume = "7",
pages = "409--440",
number = "3",
month = "June",
issn = "1820-0214",
doi = "10.2298/CSIS090203002D",
url = "http://www.comsis.org/ComSIS/Vol7No3/RegularPapers/paper2.htm",
type = "M23"
}
Each BibTeX entry starts with @
and a keyword denoting entry type (article
)
in this example. After the entry type is the body of the reference inside curly
braces. The body of the reference consists of elements separated by a comma.
The first element is the key of the entry. It should be unique.
The rest of the entries are fields in the format:
<field_name> = <field_value>
The grammar
Let's start with the grammar.
Create file bibtex.py
, and import arpeggio
.
from arpeggio import *
from arpeggio import RegExMatch as _
Then create grammar rules:
- BibTeX file consists of zero or more BibTeX entries.
def bibfile(): return ZeroOrMore(bibentry), EOF
- Now we define the structure of BibTeX entry.
def bibentry(): return bibtype, "{", bibkey, ",", field, ZeroOrMore(",", field), "}"
- Each field is given as field name, equals char (
=
), and the field value.
def field(): return fieldname, "=", fieldvalue
- Field value can be specified inside braces or quotes.
def fieldvalue(): return [fieldvalue_braces, fieldvalue_quotes]
def fieldvalue_braces(): return "{", fieldvalue_braced_content, "}"
def fieldvalue_quotes(): return '"', fieldvalue_quoted_content, '"'
- Now, let's define field name, BibTeX type and the key. We use regular
expression match for this (
RegExMatch
class).
def fieldname(): return _(r'[-\w]+')
def bibtype(): return _(r'@\w+')
def bibkey(): return _(r'[^\s,]+')
Field name is defined as hyphen or alphanumeric one or more times.
BibTeX entry type is @
char after which must be one or more alphanumeric.
BibTeX key is everything until the first space or comma.
- Field value can be quoted and braced. Let's match the content.
def fieldvalue_quoted_content(): return _(r'((\\")|[^"])*')
def fieldvalue_braced_content(): return Combine(ZeroOrMore(Optional(And("{"), fieldvalue_inner),\
fieldvalue_part))
def fieldvalue_part(): return _(r'((\\")|[^{}])+')
def fieldvalue_inner(): return "{", fieldvalue_braced_content, "}"
Combine decorator
We use Combine
decorator to specify braced content. This decorator
produces a Terminal node in the parse
tree.
The parser
To instantiate the parser we are using ParserPython
Arpeggio's class.
parser = ParserPython(bibfile)
Now, we have our parser. Let's parse some input:
- First load some BibTeX data from a file.
file_name = os.path.join(os.path.dirname(__file__), 'bibtex_example.bib')
with codecs.open(file_name, "r", encoding="utf-8") as bibtexfile:
bibtexfile_content = bibtexfile.read()
We are using codecs
module to load the file using utf-8
encoding.
bibtexfile_content
is now a string with the content of the file.
- Parse the input string
parse_tree = parser.parse(bibtexfile_content)
The parse tree is produced.
Extracting data from the parse tree
Let's suppose that we want our BibTeX file to be transformed to a list of Python dictionaries where each field is keyed by its name and the value is the field value cleaned up from the BibTeX cruft.
Like this:
{ 'author': 'Igor Dejanović and Gordana Milosavljević and Branko Perišić and Maja Tumbas',
'bibkey': 'DejanovicADomain-SpecificLanguageforDefiningStaticStructureofDatabaseApplications2010',
'bibtype': '@article',
'doi': '10.2298/CSIS090203002D',
'issn': '1820-0214',
'journal': 'Computer Science and Information Systems',
'month': 'June',
'number': '3',
'pages': '409--440',
'title': 'A Domain-Specific Language for Defining Static Structure of Database Applications',
'type': 'M23',
'url': 'http://www.comsis.org/ComSIS/Vol7No3/RegularPapers/paper2.htm',
'volume': '7',
'year': '2010'}
The key is stored under a dict key bibkey
while the entry type is stored
under the dict key bibtype
.
After calling the parse
method on the parser our textual data will be parsed
and stored in the parse tree. We could navigate the tree
to extract the data and build the python list of dictionaries but a lot easier
is to use Arpeggio's visitor support.
In this case we shall create BibTeXVisitor
class with visit_*
methods for
each grammar rule whose parse tree node we want to process.
class BibTeXVisitor(PTNodeVisitor):
def visit_bibfile(self, node, children):
"""
Just returns list of child nodes (bibentries).
"""
# Return only dict nodes
return [x for x in children if type(x) is dict]
def visit_bibentry(self, node, children):
"""
Constructs a map where key is bibentry field name.
Key is returned under 'bibkey' key. Type is returned under 'bibtype'.
"""
bib_entry_map = {
'bibtype': children[0],
'bibkey': children[1]
}
for field in children[2:]:
bib_entry_map[field[0]] = field[1]
return bib_entry_map
def visit_field(self, node, children):
"""
Constructs a tuple (fieldname, fieldvalue).
"""
field = (children[0], children[1])
return field
Now, apply the visitor to the parse tree.
ast = visit_parse_tree(parse_tree, BibTeXVisitor())
ast
is now a Python list of dictionaries in the desired format from above.
A full source code for this example can be found in the source code repository.
Note
Example in the repository is actually a fully working parser with the support for BibTeX comments and comment entries. This is out of scope for this tutorial. You can find the details in the source code.