Dialects

gffutils borrows the idea of “dialects” from Python’s csv module. In this context, a dialect is a dictionary that specifies details about the format. For GFF and GTF files, most variation in formatting occurs in the attributes field (column 9), so currently gffutils dialects only describe details in how the attributes are formatted.

The advantage to using dialects is that the lines that are imported into a database can be exactly reconstructed after being retrieved from the database.

While the GFF3 specification and GTF specification documents make recommendations about how files should be formatted, in practice there is wide variation. gffutils tries to accomodate real-world GFF and GTF files by examining the first N lines of a file and figuring out what “dialect” the file is using (N=10 by default, but can be changed with the checklines kwarg to gffutils.create_db()).

The dialect is simply a dictionary of various attributes that have been empirically found to differ among real-world files. You can find the current version of the dialect dictionary, which is used by default and mimics the GFF3 format, in gffutils.constants.dialect.

For example, some files have a trailing semicolon after the last attribute in column 9. In this case, the dialect would specify dialect['trailing semicolon'] = True.

A GTF dialect might look like this:

{'field separator': '; ',
 'fmt': 'gtf',
 'keyval separator': ' ',
 'leading semicolon': False,
 'multival separator': ',',
 'quoted GFF2 values': True,
 'repeated keys': False,
 'trailing semicolon': True}

In contrast, a GFF dialect might look like this:

{'field separator': ';',
 'fmt': 'gff3',
 'keyval separator': '=',
 'leading semicolon': False,
 'multival separator': ',',
 'quoted GFF2 values': False,
 'repeated keys': False,
 'trailing semicolon': False}

As other real-world files are brought to the attention of the developers, it’s likely that more entries will be added to the dialect.