# Formatting Citekeys
- [Formatting Citekeys](#formatting-citekeys)
- [Settings](#settings)
- [Building Patterns](#building-patterns)
- [Ignore Lists and Char Case](#ignore-lists-and-char-case)
- [General Tipps](#general-tipps)
- [Examples](#examples)
`bibiman` offers the possibility to create new citekeys from the fields of
BibLaTeX entries. This is done using an easy but powerful pattern-matching
syntax.
## Settings
All settings for the citekey generation have to be configured in the used config
file. The regular path is `XDG_CONFIG_DIR/bibiman/bibiman.toml`. But it can be
set dynamically with the `-c`/`--config=` global option.
Following values can be set through the config file. A detailed explanation for
all fields follows below:
```toml
[citekey_formatter]
fields = [ "author;2;;-;_", "title;3;6;_;_", "year" ]
case = "lowercase"
ascii_only = true
ignored_chars = [
'?', '!', '\\', '\'', '.', '-', '–', ':', ',', '[', ']', '(', ')', '{', '}', '§', '$', '%',
'&', '/', '`', '´', '#', '+', '*', '=', '|', '<', '>', '^', '°', '_', '"', '»', '«', '‘', '’',
'“', '”',
]
ignored_words = [
"the",
"a",
"an",
"of",
"for",
"in",
"at",
"to",
"and",
"der",
"die",
"das",
"ein",
"eine",
"eines",
"des",
"auf",
"und",
"für",
"vor",
]
```
## Building Patterns
The main aspect for generating citekeys are the field patterns. They can be set
through an array in the config file where every array-item represents a single
BibLaTeX field to be used for generating a part of the citekey.
Every field pattern consists of the following five parts separated by
semicolons. The general pattern looks like this (every subfield is explained
below):
*biblatex field name* **;** *max word count* **;** *max char count* **;** *inner
delimiter* **;** *trailing delimiter*
- **BibLaTeX field**: the first part represents the field name which value
should be used to generate the content part of the citekey. Theoretically, any
BibLaTeX field can be selected by name. But there are some fields which are
much more common than others; e.g. `author`, `editor`, `title`, `year`/`date`
or `entrytype`. Those very common fields are preprocessed; meaning that for
instance LaTeX macros are fully stripped from the strings, or that `editor` is
a fallback value for `author` if the latter is empty (however, setting
`editor` explicitly is still possible). Also using `year` will parse the
`date` field too, to ensure a year number.
- **Max Word**: Defines how many words should maximal be used from the named
field. E.g. if the title consists of five words, and the max counter is set to
`3` only the first three fields will be used.
- **Max Chars/Word**: Defines how many chars, counting from the start, of each
word will be used to build the citekey. If for instance the value is set to
`5`, only the first five chars of any word will be used. Thus, "archaeology"
would be stripped down to "archa".
- **Inner Delimiter**: Sets the delimiter char used between words from the
currently named field; e.g. to separate the words of the `title` field.
- **Trailing Delimiter**: Sets the delimiter which separates the current fields
value from the following. This delimiter is only printed if the following
field has some content.
For example, to use the `title` field, print maximal three words and of those
only the first five chars, single words separated by underscore and the whole
field separated by equal sign, insert the following pattern field into the
`fields` array:
`title;3;5;_;=`
Except the BibLaTeX field name, all other parts of the pattern can be left
blank. If the field name is the only value set, semicolon delimiters are also
not necessary. But if only one of the following parts should be set, all
delimiters need to be used. E.g. those are both valid: `title` or `title;;;_;=`.
The first would print all words of the title, no matter the length, not
separated by any char. The last would also print all words of the title, but
single words separated by underscores and the whole pattern value separated from
the following by an equal sign. This is not valid: `title;;_` since `bibiman`
can't know if the underscore means a delimiter (and which) or the max char
count.
The pattern array inside the config file takes multiple pattern fields like the
predecing. This allows an elaborated citekey pattern which takes into account
multiple fields.
## Ignore Lists and Char Case
Beside the field patterns there are some other options to define how citekeys
should be built.
`ascii_only=`
: If set to `true`, which is the default, non-ascii chars are mapped to their
ascii equivalent. For example, the German `ä` would be mapped to `a`. The
Turkish `ş` or Greek `σ`/`ς` would be mapped to `s`. If set to `false` all are
kept as they are. But this could lead to errors running LaTeX on the file.
`case=`
: If used, sets the case of the chars in the citekey. Valid values are
`uppercase`, `lowercase` or `camelcase`. Both first should be clear, the
latter means typical camel case also beginning the *first word* with an
uppercase letter; also referenced as upper camel case or Pascal case.
`ignored_chars=`
: Defines chars which should be ignored during parsing (meaning not print them).
The default list contains 33 special chars and is part of the default config
file (in out-commented state). Be aware, setting this key will completely
overwrite the default list!
`ignored_words=`
: A list of words which should be ignored parsing field values. The default list
contains about 20 very commonly used words in English and German; like
articles, pronouns or connector words. Like with `ignored_chars` setting this
key will completely overwrite the default list!
## General Tipps
- Most importantly: *always use the **`--dry-run`** option first*! This will
print a list of old and new values for all citekeys in the file without
changing anything. For the test file of this repo and using the pattern from
the [section below](#examples) `--dry-run` produces the following output:
[](https://postimg.cc/bs4pRJmX)
- After finding a good overall pattern, *use the `--output=` option* to create a
new file and don't overwrite your existing file. Thus, your original file
isn't broken if the key formatter produces some unwanted output.
- Its possible to update citekey based PDF and note files directly when
formatting the citekeys using the `-u`/`--update-attachments` option. Thus,
all PDFs and notes are already linked to the correct entries after updating
the citekeys. Since this operation can break things, use it with `--dry-run`
first. As with regular citekeys this will print all changes without processing
anything.
- Even very long patterns are possible, they are not encouraged, since it bloats
the bibfiles.
- The same accounts for *too short* patterns; if the pattern is to unspecific,
it bares the risk of producing doublettes (e.g. single author and year only).
But the citekey generator will not check for doublettes!
- It is possible to keep special chars and use them as delimiters. But this
might cause problems for other programs and CLI tools in particular, since
many special chars are reserved for shell operations. For instance, it will
very likely break the note file feature of `bibiman` which doesn't accept many
special chars.
## Examples
To make the process more clear a few examples might help. Following bibfile is
assumed:
```latex
@article{Bos2023,
title = {{LaTeX}, metadata, and publishing workflows},
author = {Bos, Joppe W. and {McCurley}, Kevin S.},
year = {2023},
month = apr,
journal = {arXiv},
number = {{arXiv}:2301.08277},
doi = {10.48550/arXiv.2301.08277},
url = {http://arxiv.org/abs/2301.08277},
urldate = {2023-08-22},
note = {type: article},
}
@book{Bhambra2021,
title = {Colonialism and \textbf{Modern Social Theory}},
author = {Bhambra, Gurminder K. and Holmwood, John},
location = {Cambridge and Medford},
publisher = {Polity Press},
date = {2021},
```
And the following values set in the config file:
```toml
fields = [
# Just print the whole entrytype and a colon as trailing delimiter
"entrytype;;;;:",
# Print all author names in full length, names separated by dash,
# the whole field by underscore
"author;;;-;_",
# Print first 4 words of title, first 3 chars of every word only. Title words
# separated by equal sign, the whole field by underscore
"title;4;3;=;_",
# Print all words of location, but only first 4 chars of every word. Single words
# separated by colon, whole field by underscore
"location;;4;:;_",
# Just print the whole year
"year",
]
case = "lowercase"
ascii_only = true
```
The combination of those setting will produce the following citekeys:
- **`article:bos-mccurley_lat=met=pub=wor_2023`**
- **`book:bhambra-holmwood_col=mod=soc=the_camb:medf_2021`**
**Personal Note**
I use the following pattern to format the citekeys of my bibfiles:
```toml
[citekey_formatter]
fields = [
"author;1;;;_",
"title;3;7;-;_",
"year;;;;_",
"entrytype;;;;_",
"shorthand",
]
case = "lowercase"
ascii_only = true
```
It produces citekeys with enough information to quickly identify the underlying
work while not being too long; at least in my opinion. The shorthand at the end
is only printed in a few cases, but shows me that the specific work might differ
from standard articles/books etc.