User:WindBOT/Documentation
I am a fully automated bot, but that doesn't mean that I am not flexible. You can tweak me from the wiki itself by editing my filters!
Technical details
I am written in Python (2.6). If you want to modify my code, make sure you are comfortable with Python before continuing.
All code must be placed on User:WindBOT/Filters, and indented by two spaces (for the top-level block). Python uses whitespace to delimit code blocks, so indentation matters. Although it is not required, try to use 4 spaces per indentation level.
Filters
How to disable a filter
Wait! Before you disable a filter, consider why you want to disable it. Did it produce the expected result, but on a page where that result is not appropriate? In that case, blacklist the page instead of disabling the filter. If I am malfunctioning, chances are that the problem lies in one of my filters. Instead of shutting me down completely, it is wiser to disable only the chunk of code that is misbehaving. To make me ignore a certain line, add a "#" in front of it:
# This line will be ignored
If there are multiple lines, wrap them inside triple-quotes:
"""This line will be ignored
and this one as well
and this one is cake
and the previous one was a lie but it was still ignored"""
If all else fails, you can simply delete the block of code from the page. I can't come up with code by myself yet, so I won't do anything. If the problem really is elsewhere, block the bot.
Filter types
I work using filters. They are simply Python functions which take a certain input as argument, and are expected to return a modified version of this input (if the filter changed something) or an identical version (if the filter didn't change anything). There are multiple types of filters:
- Regular filters: These are no-frills, direct filters.
- When to use: When no other filter type is adequate. This type of filter can be very destructive if the function is not careful enough.
- Input/Output: Raw Wikitext of a page
- Implementation details: Your filter is called only once, over the whole content of the page.
- How to use: To register a regular filter, call the addFilter function: addFilter(myFilter)
- Safe filters: These are Wikitext safe filters.
- When to use: Semantics-related filters are a good fit for these. Use them to filter human-readable text.
- Input/Output: Sanitized Wikitext of a page (readable text content), the labels of external and internal wikilinks (labels only), and the target of an internal wikilink only when the target doubles as its label ([[like this]]).
- Implementation details: Your filter is called once over the textual body of the page, then once per link label.
- How to use: To register a safe filter, call the addSafeFilter function: addSafeFilter(myFilter)
- Link filters: These filters act on links within wiki pages.
- When to use: When you want to apply filters on links. Note: Use a safe filter if all you want to do is to modify link labels (unless you don't want to modify the page's body as well).
- Input/Output: A single link instance. The class definition is given at the bottom of this document.
- Implementation details: Your filter is called once per link in the page.
- How to use: To register a link filter, call the addLinkFilter function: addLinkFilter(myFilter)
- Locale filters: These filters act on localization dictionaries.
- When to use: When you want to extract only certain parts of translation files.
- Input/Output: N/A
- Implementation details: The localization dictionary is a huge dictionary whose keys are the string IDs (TF_SPY_BACKSTAB_ENGY_SAP_BUILDING_DESC, etc.) and whose values are themselves dictionaries. Each inner dictionary has language names as keys (english, romanian...) and the actual translated string as value.
- How to use: Call the languagesFilter function: languagesFilter(languages, commonto, prefix, suffix, exceptions) where:
- (Required) languages is the localization dictionary.
- (Optional) commonto filters strings by their translation availability. For example, commonto=['english', 'french'] will only keep strings which are available in both french and english.
- (Optional) prefix filters strings by their string ID, which must start with this string.
- (Optional) suffix filters strings by their string ID, which must end with this string.
- (Optional) exceptions is a list of keys that should be excluded no matter what.
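The selection rules of languagesFilter can be sketched roughly as follows. This is only an illustration, not my actual implementation: the name sketchLanguagesFilter and the sample string IDs are made up, and returning a new dictionary (rather than modifying the one passed in) is an assumption.

```python
def sketchLanguagesFilter(languages, commonto=None, prefix=None, suffix=None, exceptions=None):
    # Hypothetical re-implementation of the documented rules.
    exceptions = exceptions or []
    result = {}
    for stringID, translations in languages.items():
        if stringID in exceptions:
            continue  # excluded no matter what
        if commonto is not None and not all(lang in translations for lang in commonto):
            continue  # not translated in every requested language
        if prefix is not None and not stringID.startswith(prefix):
            continue
        if suffix is not None and not stringID.endswith(suffix):
            continue
        result[stringID] = translations
    return result

# Made-up sample in the documented shape: string ID -> {language name: text}
sample = {
    'TF_EXAMPLE_ONE_DESC': {'english': 'One', 'french': 'Un'},
    'TF_EXAMPLE_TWO_DESC': {'english': 'Two'},
    'TF_EXAMPLE_THREE': {'english': 'Three', 'french': 'Trois'},
}

# Keep only description strings available in both English and French.
kept = sketchLanguagesFilter(sample, commonto=['english', 'french'], suffix='_DESC')
```

Here kept contains only TF_EXAMPLE_ONE_DESC: the second entry lacks a French translation and the third does not end in _DESC.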
Filters themselves may be filtered (yeah, really) so that they are only applied to certain articles:
- Use addSomeTypeOfFilter(myFilter1, myFilter2, myFilter3, ..., language='de') (where addSomeTypeOfFilter is one of the functions described above) to add myFilter1, myFilter2, myFilter3... as filters that will only be applied on German pages (de).
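As a rough sketch, writing and registering a regular filter could look like this. The addFilter defined here is only a stand-in so the example runs outside of me (my real registration code is not shown on this page), and the two filter functions are hypothetical examples.

```python
# Stand-in registry, an assumption for illustration only.
registered = []

def addFilter(*filters, **kwargs):
    # Hypothetical stub: remember each filter with its optional language.
    for f in filters:
        registered.append((f, kwargs.get('language')))

def fixTypo(content):
    # Regular filter: receives the raw wikitext and returns the new wikitext.
    # Returning the input unchanged means "nothing to fix on this page".
    return content.replace('teh ', 'the ')

def expandAbbreviation(content):
    # Only makes sense on German pages, so it is registered with language='de'.
    return content.replace('z.B.', 'zum Beispiel')

addFilter(fixTypo)                           # applied to every page
addFilter(expandAbbreviation, language='de') # applied to German pages only
```

The same pattern would apply to addSafeFilter and addLinkFilter, with the filter's argument being sanitized text or a link instance respectively.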
Filter generators
As previously mentioned, filters are Python functions. However, a lot of filters are similar in function and in purpose. Therefore, declaring a new Python function for each filter would be redundant and cumbersome.
Since functions are first-class objects in Python, you can pass around, edit, and create functions programmatically. This is what filter generators do. They take a few arguments describing your desired filter, and generate a corresponding Python function, which you can then add using the method described above.
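To illustrate the idea, here is a minimal sketch of a function that builds and returns another function. The name makeReplacer is hypothetical and this is not my actual generator code; it only shows the closure technique that such generators rely on.

```python
def makeReplacer(text, replacement):
    # The inner function remembers text and replacement from the enclosing
    # call; this is what lets one generator produce many different filters.
    def generatedFilter(content):
        return content.replace(text, replacement)
    return generatedFilter

# Each call produces a new, independent filter function.
fixSpelling = makeReplacer('recieve', 'receive')
result = fixSpelling('You will recieve a reply.')
```

The returned fixSpelling is an ordinary function, so it can be registered like any hand-written filter.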
Filter generators:
- dumbReplace(text, replacement): This generates a straightforward text replacement filter, which replaces all instances of text by replacement.
- dumbReplaces(stuffToReplace): The bulk version of dumbReplace. Generates a text replacement filter with multiple things to replace. stuffToReplace should be a Python dictionary of the form:
{ 'text1': 'replacement1', 'text2': 'replacement2', 'text3': 'replacement3' }
- regex(regularExpression, replacement): This generates a simple regex filter. To use backreferences in the replacement argument, use $1 for group 1, $2 for group 2, etc.
- regexes(regularExpressions): The bulk version of the regex filter generator. To use backreferences in the replacement argument, use $1 for group 1, $2 for group 2, etc. regularExpressions should be a Python dictionary of the form:
{ 'regex1': 'replacement1', 'regex2': 'replacement2', 'regex3': 'replacement3' }
- wordFilter(correctWord, alternateWord1, alternateWord2, ...): This generates a filter guaranteed to be applied only to whole words (if used as safe filter), and with wikitext aliases. The first argument, correctWord, is the "correct" spelling of the word. The rest of the arguments are regular expressions (that only match whole words! You do not need to check for this) which will be replaced with correctWord. Note that you can (and should) repeat correctWord as one of the alternate spellings, in order to enforce correctWord's capitalization.
- enforceCapitalization(word1, word2, ...): This is effectively the same as wordFilter(word1, word1); wordFilter(word2, word2); ... It adds word1 itself as a spelling of word1, which replaces all instances of word1 by the correctly-capitalized version of it. Note: You do not need to call addSafeFilter on this one. enforceCapitalization automatically calls addSafeFilter by itself, as it is meant to be used only on textual content.
- associateLocaleWordFilters(languages, fromLang, toLang, targetPageLang): This generates word filters for all strings in the localization dictionary languages, going from language fromLang to language toLang. This function automatically adds the generated word filters to the safe filters list. If the targetPageLang argument is provided, the word filters will be applied only on pages in that language (for example, targetPageLang='de' will make the filters only be applied on /de pages).
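The regex and wordFilter generators could be approximated as below. Treat this as a sketch under assumptions rather than my real code: the names sketchRegex and sketchWordFilter are hypothetical, the translation of $1 into Python's \g<1> syntax is one plausible way to support the documented backreferences, and the case-insensitive \b matching in the word filter is a guess based on the capitalization note above. Wikitext aliases are not handled here.

```python
import re

def sketchRegex(regularExpression, replacement):
    # Translate $1, $2, ... into Python's \g<1>, \g<2>, ... replacement syntax.
    pattern = re.compile(regularExpression)
    pyReplacement = re.sub(r'\$(\d+)', r'\\g<\1>', replacement)
    def generatedFilter(content):
        return pattern.sub(pyReplacement, content)
    return generatedFilter

def sketchWordFilter(correctWord, *alternates):
    # Assumption: whole-word matching via \b anchors, ignoring case.
    pattern = re.compile(r'\b(?:' + '|'.join(alternates) + r')\b', re.IGNORECASE)
    def generatedFilter(content):
        # A function replacement inserts correctWord literally.
        return pattern.sub(lambda match: correctWord, content)
    return generatedFilter

# Swap the two words around a hyphen using $-style backreferences.
swap = sketchRegex(r'(\w+)-(\w+)', '$2-$1')
swapped = swap('left-right')

# Repeat the correct word among the alternates to enforce its capitalization.
engineer = sketchWordFilter('Engineer', 'engineer', 'engie')
fixed = engineer('an ENGINEER and an engie')
```

With these definitions, swapped is 'right-left' and fixed is 'an Engineer and an Engineer', while a longer word such as 'engineering' is left alone because of the word boundaries.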
Link class definition
class link:
    def __init__(self, content):
        content = u(content)
        self.joined = False
        self.setBody(content)
        self.setType(u'unknown')
        self.setLabel(None)
        self.setLink(u'')
        self.anchor = None
        self.joined = False
        if len(content) > 2:
            if content[:2] == u'[[' and content[-2:] == u']]':
                split = content[2:-2].split(u'|')
                if len(split) in (1, 2):
                    self.setType(u'internal')
                    lnk = split[0]
                    if lnk.find(u':') == -1:
                        lnk = lnk.replace(u'_', u' ')
                    anchor = None
                    if lnk.find(u'#') != -1:
                        lnk, anchor = lnk.split(u'#', 1)
                        self.setAnchor(anchor)
                    self.setLink(lnk)
                    if len(split) == 2:
                        self.setLabel(split[1])
                    else:
                        self.setLabel(split[0])
                        self.joined = anchor is None
            elif content[0] == u'[' and content[-1] == u']':
                split = content[1:-1].split(u' ', 1)
                self.setType(u'external')
                self.setLink(split[0])
                if len(split) == 2:
                    self.setLabel(split[1])
                else:
                    self.setLabel(None)

    def getType(self):
        return u(self.kind)

    def getBody(self):
        return u(self.body)

    def getLink(self, withAnchor=False):
        if withAnchor and self.getAnchor() is not None:
            return u(self.link) + u'#' + self.getAnchor()
        return u(self.link)

    def getAnchor(self):
        return self.anchor

    def getLabel(self):
        if self.label is None:
            return None
        if self.joined:
            return self.getLink()
        return u(self.label)

    def setType(self, kind):
        self.kind = u(kind)

    def setBody(self, body):
        self.body = u(body)

    def setLink(self, link):
        link = u(link)
        if self.getType() == u'internal' and link.find(u'#') != -1:
            link, anchor = link.split(u'#', 1)
            self.setAnchor(anchor)
        self.link = link
        if self.joined:
            self.label = u(link)

    replaceDots = compileRegex(r'(?:\.[a-f\d][a-f\d])+')

    def _replaceDots(self, g):
        s = ''
        g = g.group(0)
        for i in xrange(0, len(g), 3):
            s += chr(int(g[i + 1:i + 3], 16))
        return s.decode('utf8')

    def setAnchor(self, anchor):
        if self.getType() == u'internal':
            u(anchor).replace(u'_', u' ')
            try:
                anchor = link.replaceDots.sub(self._replaceDots, anchor)
            except:
                pass
        self.anchor = anchor

    def setLabel(self, label):
        if label is None:
            self.label = None
        else:
            self.label = u(label)
            if self.joined:
                self.link = u(label)

    def __str__(self):
        return self.__unicode__()

    def __repr__(self):
        return u'<Link-' + self.getType() + u': ' + self.__unicode__() + u'>'

    def __unicode__(self):
        label = self.getLabel()
        tmpLink = self.getLink(withAnchor=True)
        if self.getType() == u'internal':
            tmpLink2 = tmpLink.replace(u'_', u' ')
            if label in (tmpLink2, tmpLink) or (label and tmpLink and (label[0].lower() == tmpLink[0].lower() and tmpLink[1:] == label[1:]) or (label[0].lower() == tmpLink2[0].lower() and tmpLink2[1:] == label[1:])):
                return u'[[' + label + u']]'
            elif tmpLink and label and len(label) > len(tmpLink) and (label.lower().find(tmpLink2.lower()) == 0 or label.lower().find(tmpLink.lower()) == 0):
                index = max(label.lower().find(tmpLink2.lower()), label.lower().find(tmpLink.lower()))
                badchars = (u' ', u'_')
                nobadchars = True
                for c in badchars:
                    if label[:index].find(c) != -1 or label[index+len(tmpLink):].find(c) != -1:
                        nobadchars = False
                if nobadchars:
                    return label[:index] + u(link(u'[[' + tmpLink + u'|' + label[index:index+len(tmpLink)] + u']]')) + label[index+len(tmpLink):]
            return u'[[' + tmpLink + u'|' + label + u']]'
        if self.getType() == u'external':
            if label is None:
                return u'[' + tmpLink + u']'
            return u'[' + tmpLink + u' ' + label + u']'
        return self.getBody()
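A link filter built on these accessors might look like the sketch below. Because the class above depends on helpers (u, compileRegex) that are not defined on this page, the example uses a small hypothetical stand-in, FakeLink, exposing only a few of the same method names. Returning the link instance from the filter is an assumption based on the Input/Output description of link filters.

```python
class FakeLink(object):
    # Hypothetical stand-in for the link class, for illustration only.
    def __init__(self, kind, target, label):
        self.kind = kind
        self.target = target
        self.label = label

    def getType(self):
        return self.kind

    def getLink(self):
        return self.target

    def setLink(self, target):
        self.target = target

    def getLabel(self):
        return self.label

def retargetOldPage(l):
    # Link filter: called once per link found in the page.
    # Only internal links pointing at 'Old page' are changed.
    if l.getType() == 'internal' and l.getLink() == 'Old page':
        l.setLink('New page')
    return l

internal = retargetOldPage(FakeLink('internal', 'Old page', 'see here'))
external = retargetOldPage(FakeLink('external', 'http://example.com', 'site'))
```

After this runs, internal points at 'New page' with its label untouched, and external is unchanged.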