From the Portal Wiki

I am a fully automated bot, but that doesn't mean that I am not flexible. You can tweak me from the wiki itself by editing my filters!

Technical details

I am written in Python (2.6). If you want to modify my code, make sure you are comfortable with Python before continuing.

All code must be placed on User:WindBOT/Filters and indented by two spaces (for the top-level block). Python uses whitespace to delimit code blocks, so indent consistently. Although this is not a necessity, try to use 4 spaces per indentation level.


How to disable a filter

Wait! Before you disable a filter, consider why you want to disable it. Did it produce an expected result but on a page where such a result is not appropriate? In this case, blacklist this page instead of disabling the filter. If I am malfunctioning, chances are that the problem lies in one of my filters. Thus, instead of completely shutting me down, it would be wiser to disable only the chunk of code that is misbehaving. To make me ignore a certain line, add a "#" in front of it:

 # This line will be ignored

If there are multiple lines, wrap them inside triple-quotes:

 """This line will be ignored
 and this one as well
 and this one is cake
 and the previous one was a lie but it was still ignored"""

If all else fails, you can simply delete the block of code from the page. I can't come up with code by myself yet, so I won't do anything. If the problem really is elsewhere, block the bot.

Filter types

I work using filters. They are simply Python functions which take a certain input as argument and are expected to return a modified version of that input (if the filter changed something) or an identical version (if it didn't). There are multiple types of filters:

  • Regular filters: These are no-frills, direct filters.
    • When to use: When no other filter type is adequate. This type of filter can be very destructive if the function is not careful enough.
    • Input/Output: Raw Wikitext of a page
    • Implementation details: Your filter is called only once, over the whole content of the page.
    • How to use: To register a regular filter, call the addFilter function: addFilter(myFilter)
  • Safe filters: These are Wikitext safe filters.
    • When to use: Semantics-related filters are a good fit for these. Use them to filter human-readable text.
    • Input/Output: Sanitized Wikitext of a page (readable text content), labels of external and internal wikilinks (labels only), and internal wikilink targets only when the target doubles as the label ([[like this]]).
    • Implementation details: Your filter is called once over the textual body of the page, then once per link label.
    • How to use: To register a safe filter, call the addSafeFilter function: addSafeFilter(myFilter)
  • Link filters: These filters act on links within wiki pages.
    • When to use: When you want to apply filters on links. Note: Use a safe filter if all you want to do is modify link labels, since it also covers the page's body; use a link filter when you want to leave the body untouched or need to change the link targets themselves.
    • Input/Output: A single link instance. The class definition is given at the bottom of this document.
    • Implementation details: Your filter is called once per link in the page.
    • How to use: To register a link filter, call the addLinkFilter function: addLinkFilter(myFilter)
  • Locale filters: These filters act on localization dictionaries.
    • When to use: When you want to extract only certain parts of translation files.
    • Input/Output: N/A
    • Implementation details: The localization dictionary is a huge dictionary with keys being the string IDs (TF_SPY_BACKSTAB_ENGY_SAP_BUILDING_DESC, etc.), and each value being another dictionary. This inner dictionary has language names as keys (english, romanian...) and the actual translated string as value.
    • How to use: Call the languagesFilter function: languagesFilter(languages, commonto, prefix, suffix, exceptions) where:
      • (Required) languages is the localization dictionary.
      • (Optional) commonto filters strings by their translation availability. For example, commonto=['english', 'french'] will only keep strings which are available in both English and French.
      • (Optional) prefix filters strings by their string ID, which must contain this string as prefix.
      • (Optional) suffix filters strings by their string ID, which must contain this string as suffix.
      • (Optional) exceptions is a list of keys that should be excluded no matter what.
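As a concrete illustration, a regular filter is just a function from wikitext to wikitext. The sketch below uses a hypothetical local stand-in for the registry (the `addFilter` defined here is an illustration, not WindBOT's real implementation) to show the calling convention:

```python
# Minimal sketch of the filter convention. addFilter and applyFilters below
# are local stand-ins for WindBOT's internals, for illustration only.
_filters = []

def addFilter(*filters, **options):
    # The real addFilter also accepts options such as language='de'.
    _filters.extend(filters)

def fixTehTypo(text):
    # A regular filter: takes the raw wikitext of a page and returns it,
    # modified if something matched, identical otherwise.
    return text.replace('teh ', 'the ')

addFilter(fixTehTypo)

def applyFilters(wikitext):
    # Roughly what the bot does: run every registered filter in order.
    for f in _filters:
        wikitext = f(wikitext)
    return wikitext
```

Safe filters and link filters follow the same shape; only the input they receive (sanitized text, or a link instance) differs.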

Filters themselves may be filtered (yeah, really) so that they are only applied to certain articles:

  • Use addSomeTypeOfFilter(myFilter1, myFilter2, myFilter3, ..., language='de') (where addSomeTypeOfFilter is one of the functions described above) to add myFilter1, myFilter2, myFilter3... as filters that will only be applied on German pages (de).
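The language restriction can be pictured with a small stand-in (the bookkeeping shown here is hypothetical; the real logic lives inside WindBOT):

```python
# Hypothetical stand-in showing how a language-restricted filter behaves.
_safe_filters = []

def addSafeFilter(*filters, **options):
    language = options.get('language')  # None means "apply on all languages"
    for f in filters:
        _safe_filters.append((f, language))

def applySafeFilters(text, pageLanguage):
    # A filter registered with language='de' only runs on /de pages.
    for f, language in _safe_filters:
        if language is None or language == pageLanguage:
            text = f(text)
    return text

def germanQuotes(text):
    # Example filter: replace the first straight opening quote with a
    # German low quotation mark.
    return text.replace('"', '\u201e', 1)

addSafeFilter(germanQuotes, language='de')
```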

Filter generators

As previously mentioned, filters are Python functions. However, a lot of filters are similar in function and in purpose. Therefore, declaring a new Python function for each filter would be redundant and cumbersome.

Since functions are first-class objects in Python, you can pass around, edit, and create functions programmatically. This is what filter generators do. They take a few arguments describing your desired filter's details, and generate a corresponding Python function, which you can then add using the method described above.

Filter generators:

  • dumbReplace(text, replacement): This generates a straightforward text replacement filter, which replaces all instances of text with replacement.
  • dumbReplaces(stuffToReplace): The bulk version of dumbReplace. Generates a text replacement filter with multiple things to replace. stuffToReplace should be a Python dictionary of the form:
    {
        'text1': 'replacement1',
        'text2': 'replacement2',
        'text3': 'replacement3'
    }
  • regex(regularExpression, replacement): This generates a simple regex filter. To use backreferences in the replacement argument, use $1 for group 1, $2 for group 2, etc.
  • regexes(regularExpressions): The bulk version of the regex filter generator. To use backreferences in the replacement argument, use $1 for group 1, $2 for group 2, etc. regularExpressions should be a Python dictionary of the form:
    {
        'regex1': 'replacement1',
        'regex2': 'replacement2',
        'regex3': 'replacement3'
    }
  • wordFilter(correctWord, alternateWord1, alternateWord2, ...): This generates a filter guaranteed to be applied only to whole words (if used as safe filter), and with wikitext aliases. The first argument, correctWord, is the "correct" spelling of the word. The rest of the arguments are regular expressions (matched against whole words only; you do not need to check for this yourself) which will be replaced with correctWord. Note that you can (and should) repeat correctWord as one of the alternate spellings, in order to enforce correctWord's capitalization.
  • enforceCapitalization(word1, word2, ...): This is effectively the same as wordFilter(word1, word1); wordFilter(word2, word2); .... It registers word1 itself as an alternate spelling of word1, which replaces all instances of word1 with its correctly-capitalized version. Note: You do not need to call addSafeFilter on this one. enforceCapitalization automatically calls addSafeFilter by itself, as it is meant to be used only on textual content.
  • associateLocaleWordFilters(languages, fromLang, toLang, targetPageLang): This generates word filters for all strings in the localization dictionary languages, going from language fromLang to language toLang. This function automatically adds the generated word filters to the safe filters list. If the targetPageLang argument is provided, the word filters will be applied only on pages in that language (for example, targetPageLang='de' will make the filters only be applied on /de pages).
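To make the mechanism concrete, here is a rough sketch of what generators such as dumbReplace and dumbReplaces could look like internally (the real implementations are WindBOT's; this is an illustration of the closure-returning pattern, not the actual code):

```python
# A generator builds and returns a new filter function, ready to be
# registered with addFilter. Sketch only; not WindBOT's actual code.
def dumbReplace(text, replacement):
    def generated(pageText):
        # The generated filter replaces every instance of text.
        return pageText.replace(text, replacement)
    return generated

def dumbReplaces(stuffToReplace):
    def generated(pageText):
        # Bulk version: apply every replacement in the dictionary.
        for old, new in stuffToReplace.items():
            pageText = pageText.replace(old, new)
        return pageText
    return generated

fixBritish = dumbReplaces({'colour': 'color', 'armour': 'armor'})
```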

Link class definition

class link:
	def __init__(self, content):
		content = u(content)
		# Defaults; the accessors below rely on these attributes existing.
		# (Several assignments in this constructor were lost in transcription
		# and are reconstructed here from the accessors' behavior.)
		self.kind = u'none'
		self.body = content
	 = None
		self.label = None
		self.anchor = None
		self.joined = False
		if len(content) > 2:
			if content[:2] == u'[[' and content[-2:] == u']]':
				# Internal link: [[target]], [[target|label]], [[target#anchor]]
				self.kind = u'internal'
				split = content[2:-2].split(u'|')
				if len(split) in (1, 2):
					lnk = split[0]
					if lnk.find(u':') == -1:
						lnk = lnk.replace(u'_', u' ')
					anchor = None
					if lnk.find(u'#') != -1:
						lnk, anchor = lnk.split(u'#', 1)
			 = lnk
					self.anchor = anchor
					if len(split) == 2:
						self.label = split[1]
					else:
						# [[like this]]: the target doubles as the label.
						self.label = lnk
						self.joined = anchor is None
			elif content[0] == u'[' and content[-1] == u']':
				# External link: [url] or [url label]
				self.kind = u'external'
				split = content[1:-1].split(u' ', 1)
		 = split[0]
				if len(split) == 2:
					self.label = split[1]
	def getType(self):
		return u(self.kind)
	def getBody(self):
		return u(self.body)
	def getLink(self, withAnchor=False):
		if withAnchor and self.getAnchor() is not None:
			return u( + u'#' + self.getAnchor()
		return u(
	def getAnchor(self):
		return self.anchor
	def getLabel(self):
		if self.label is None:
			return None
		if self.joined:
			return self.getLink()
		return u(self.label)
	def setType(self, kind):
		self.kind = u(kind)
	def setBody(self, body):
		self.body = u(body)
	def setLink(self, link):
		link = u(link)
		if self.getType() == u'internal' and link.find(u'#') != -1:
			link, anchor = link.split(u'#', 1)
			self.setAnchor(anchor)
 = link
		if self.joined:
			self.label = u(link)
	replaceDots = compileRegex(r'(?:\.[a-f\d][a-f\d])+')
	def _replaceDots(self, g):
		# Decode sequences like '.2d.28' (hex-escaped bytes) back to text.
		s = ''
		g =
		for i in xrange(0, len(g), 3):
			s += chr(int(g[i + 1:i + 3], 16))
		return s.decode('utf8')
	def setAnchor(self, anchor):
		if self.getType() == u'internal':
			anchor = u(anchor).replace(u'_', u' ')
			anchor = link.replaceDots.sub(self._replaceDots, anchor)
		self.anchor = anchor
	def setLabel(self, label):
		if label is None:
			self.label = None
		else:
			self.label = u(label)
		if self.joined:
	 = u(label)
	def __str__(self):
		return self.__unicode__()
	def __repr__(self):
		return u'<Link-' + self.getType() + u': ' + self.__unicode__() + u'>'
	def __unicode__(self):
		label = self.getLabel()
		tmpLink = self.getLink(withAnchor=True)
		if self.getType() == u'internal':
			tmpLink2 = tmpLink.replace(u'_', u' ')
			if label in (tmpLink2, tmpLink) or (label and tmpLink and (label[0].lower() == tmpLink[0].lower() and tmpLink[1:] == label[1:]) or (label[0].lower() == tmpLink2[0].lower() and tmpLink2[1:] == label[1:])):
				return u'[[' + label + u']]'
			elif tmpLink and label and len(label) > len(tmpLink) and (label.lower().find(tmpLink2.lower()) == 0 or label.lower().find(tmpLink.lower()) == 0):
				index = max(label.lower().find(tmpLink2.lower()), label.lower().find(tmpLink.lower()))
				badchars = (u' ', u'_')
				nobadchars = True
				for c in badchars:
					if label[:index].find(c) != -1 or label[index+len(tmpLink):].find(c) != -1:
						nobadchars = False
				if nobadchars:
					return label[:index] + u(link(u'[[' + tmpLink + u'|' + label[index:index+len(tmpLink)] + u']]')) + label[index+len(tmpLink):]
			return u'[[' + tmpLink + u'|' + label + u']]'
		if self.getType() == u'external':
			if label is None:
				return u'[' + tmpLink + u']'
			return u'[' + tmpLink + u' ' + label + u']'
		return self.getBody()