Understanding the MatchTree

The basic structure that the filename detection component uses is the MatchTree. A MatchTree is a tree covering the filename, where each node represents a substring of the filename and can have a Guess associated with it that contains the information guessed for that node. Nodes can be further split into subnodes until a proper split has been found.

As a result, concatenating all the leaves gives you back the original filename. But enough theory, let’s look at an example:

>>> import guessit
>>> path = 'Movies/Dark City (1998)/Dark.City.(1998).DC.BDRip.720p.DTS.X264-CHD.mkv'
>>> print(guessit.IterativeMatcher(path).match_tree)
000000 1111111111111111 2222222222222222222222222222222222222222222 333
000000 0000000000111111 0000000000111111222222222222222222222222222 000
                 011112           011112000000000000000000000000111
                                        000000000000000000011112
                                        0000000000111122222
                                        0000111112    01112
Movies/__________(____)/Dark.City.(____).DC._____.____.___.____-___.___
       tttttttttt yyyy             yyyy     fffff ssss aaa vvvv rrr ccc
Movies/Dark City (1998)/Dark.City.(1998).DC.BDRip.720p.DTS.X264-CHD.mkv

The last line contains the filename, which you can use as a reference. The line before it shows the type of property that has been found. The line before that contains the filename with all the found groups blanked out; what remains on this line are the leftover groups that could not be identified.

The lines before that indicate the indices of the groups in the tree.

For instance, the part of the filename ‘BDRip’ is the leaf with index (2, 2, 0, 0, 0, 1) (read from top to bottom), and its meaning is ‘format’ (as shown by the f’s on the last-but-one line).
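To make the indexing concrete, here is a minimal sketch of such a tree. It is not GuessIt's actual MatchTree class: the `Node` class and its `split`, `leaves` and `at` methods are hypothetical names chosen for illustration, and the example covers only the containing folder rather than the whole path.

```python
class Node:
    """Hypothetical tree node covering string[start:end]."""

    def __init__(self, string, start, end):
        self.string = string
        self.span = (start, end)
        self.children = []
        self.guess = {}  # information guessed for this node, if any

    @property
    def value(self):
        start, end = self.span
        return self.string[start:end]

    def split(self, *offsets):
        """Split this node into children at the given absolute offsets."""
        bounds = [self.span[0], *offsets, self.span[1]]
        self.children = [Node(self.string, a, b)
                         for a, b in zip(bounds, bounds[1:])]
        return self.children

    def leaves(self):
        """Return the leaves below this node, from left to right."""
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]

    def at(self, index):
        """Follow a tuple of child positions, read from top to bottom."""
        node = self
        for position in index:
            node = node.children[position]
        return node


root = Node('Dark City (1998)', 0, 16)
title_part, year_part = root.split(10)   # 'Dark City ' and '(1998)'
year_part.split(11, 15)                  # '(', '1998', ')'
root.at((1, 1)).guess['year'] = 1998

print(root.at((1, 1)).value)                         # 1998
print(''.join(leaf.value for leaf in root.leaves())) # Dark City (1998)
```

In this sketch the index (1, 1) plays the same role as (2, 2, 0, 0, 0, 1) does for ‘BDRip’ in the full tree: each number selects a child at the next level down.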

What does the IterativeMatcher do?

The goal of the api/matcher is to take a MatchTree which contains no information (yet!) at the beginning, and apply a succession of rules to try to guess parts of the filename. These rules are called transformations and work in-place on the tree, splitting nodes into new leaves and updating the nodes’ guesses when they find some information.
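The idea of a succession of in-place transformations can be sketched as follows. This is only an illustration of the control flow, under the simplifying assumption that the state is a flat list of leaves rather than a real tree; the function names `split_on_slash`, `strip_spaces` and `run` are hypothetical and not part of GuessIt.

```python
def split_on_slash(leaves):
    """Transformation: replace each leaf by its '/'-separated parts."""
    leaves[:] = [part for leaf in leaves for part in leaf.split('/')]


def strip_spaces(leaves):
    """Transformation: trim surrounding whitespace from every leaf."""
    leaves[:] = [leaf.strip() for leaf in leaves]


def run(transformations, string):
    """Apply each transformation in order to the same shared state."""
    leaves = [string]
    for transform in transformations:
        transform(leaves)  # mutates the list in place, returns nothing
    return leaves


print(run([split_on_slash, strip_spaces], 'Movies/Dark City (1998)'))
# ['Movies', 'Dark City (1998)']
```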

Let’s look at what happens when matching the previous filename.

Splitting into path components

First, we split the filename into folders + basename + extension. This gives us the following tree, which has 4 leaves (from 0 to 3):

000000 1111111111111111 2222222222222222222222222222222222222222222 333
Movies/Dark City (1998)/Dark.City.(1998).DC.BDRip.720p.DTS.X264-CHD.mkv
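A rough way to reproduce this first split with the standard library is shown below. It is a sketch, not the code GuessIt uses: the helper name `split_path_components` is hypothetical, and it assumes a '/'-separated path.

```python
import posixpath


def split_path_components(path):
    """Split a '/'-separated path into folders, basename and extension."""
    stem, ext = posixpath.splitext(path)  # ext is '.mkv' or '' if absent
    parts = stem.split('/')
    if ext:
        parts.append(ext.lstrip('.'))
    return parts


path = 'Movies/Dark City (1998)/Dark.City.(1998).DC.BDRip.720p.DTS.X264-CHD.mkv'
components = split_path_components(path)
for index, component in enumerate(components):
    print(index, component)
```

The four components printed here correspond to the leaves numbered 0 to 3 in the tree above.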

Splitting into explicit groups

Then, we want to split each of those groups into “explicit” groups, i.e., groups which are enclosed in parentheses, square brackets, curly braces, etc.:

000000 1111111111111111 2222222222222222222222222222222222222222222 333
000000 0000000000111111 0000000000111111222222222222222222222222222 000
Movies/Dark City (1998)/Dark.City.(1998).DC.BDRip.720p.DTS.X264-CHD.___
                                                                    ccc
Movies/Dark City (1998)/Dark.City.(1998).DC.BDRip.720p.DTS.X264-CHD.mkv

As you can see, the containing folder has been split into 2 sub-groups, and the basename into 3 groups (separated by the year information).

Note that we also got the information from the extension, as you can see above.
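The split into explicit groups can be approximated with a regular expression. The sketch below assumes that an explicit group is a non-nested run enclosed in (), [] or {}; the pattern and the name `split_explicit_groups` are illustrative and not taken from GuessIt.

```python
import re

# Assumption: explicit groups are not nested inside one another.
EXPLICIT_GROUP = re.compile(r'(\([^()]*\)|\[[^\[\]]*\]|\{[^{}]*\})')


def split_explicit_groups(text):
    """Split text around bracketed groups, keeping the groups themselves."""
    # re.split keeps the delimiters because the pattern is a capturing group.
    return [piece for piece in EXPLICIT_GROUP.split(text) if piece]


print(split_explicit_groups('Dark City (1998)'))
# ['Dark City ', '(1998)']
print(split_explicit_groups('Dark.City.(1998).DC.BDRip.720p.DTS.X264-CHD'))
# ['Dark.City.', '(1998)', '.DC.BDRip.720p.DTS.X264-CHD']
```

This yields 2 sub-groups for the containing folder and 3 for the basename, matching the second line of the tree.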

Finding interesting patterns

Now that this first split has been made, we can start finding some known patterns which we can identify in the filename. That is the main objective of the IterativeMatcher, which runs a series of transformations that identify groups in the filename and annotate the corresponding nodes.

For instance, the year:

000000 1111111111111111 2222222222222222222222222222222222222222222 333
000000 0000000000111111 0000000000111111222222222222222222222222222 000
                 011112           011112
Movies/Dark City (____)/Dark.City.(____).DC.BDRip.720p.DTS.X264-CHD.___
                  yyyy             yyyy                             ccc
Movies/Dark City (1998)/Dark.City.(1998).DC.BDRip.720p.DTS.X264-CHD.mkv

Then, known properties usually found in video filenames:

000000 1111111111111111 2222222222222222222222222222222222222222222 333
000000 0000000000111111 0000000000111111222222222222222222222222222 000
                 011112           011112000000000000000000000000111
                                        000000000000000000011112
                                        0000000000111122222
                                        0000111112    01112
Movies/Dark City (____)/Dark.City.(____).DC._____.____.___.____-___.___
                  yyyy             yyyy     fffff ssss aaa vvvv rrr ccc
Movies/Dark City (1998)/Dark.City.(1998).DC.BDRip.720p.DTS.X264-CHD.mkv

As you can see, this starts to branch pretty quickly, as each found group splits a leaf into further leaves. In this case, that gives us the year (1998), the format (BDRip), the screen size (720p), the audio codec (DTS), the video codec (X264) and the release group (CHD).
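A simplified version of this pattern search is sketched below. The pattern table is hypothetical and covers only what is needed for this one filename; it should not be read as the set of patterns GuessIt actually uses. The `blank` helper reproduces the line of the display where found groups are replaced by underscores.

```python
import re

# Hypothetical pattern table, just large enough for the example filename.
PATTERNS = [
    ('year', re.compile(r'(?<!\d)(?:19|20)\d{2}(?!\d)')),
    ('format', re.compile(r'BDRip', re.IGNORECASE)),
    ('screenSize', re.compile(r'\d{3,4}p', re.IGNORECASE)),
    ('audioCodec', re.compile(r'DTS', re.IGNORECASE)),
    ('videoCodec', re.compile(r'[xh]264', re.IGNORECASE)),
    ('releaseGroup', re.compile(r'(?<=-)\w+$')),
]


def find_properties(text):
    """Return (start, end, property, matched text) tuples sorted by position."""
    found = []
    for name, pattern in PATTERNS:
        for match in pattern.finditer(text):
            found.append((match.start(), match.end(), name, match.group()))
    return sorted(found)


def blank(text, found):
    """Replace every identified span with underscores."""
    chars = list(text)
    for start, end, _name, _value in found:
        chars[start:end] = '_' * (end - start)
    return ''.join(chars)


basename = 'Dark.City.(1998).DC.BDRip.720p.DTS.X264-CHD'
found = find_properties(basename)
for start, end, name, value in found:
    print(start, end, name, value)
print(blank(basename, found))
# Dark.City.(____).DC._____.____.___.____-___
```

Each match would split the leaf it falls in into up to three new leaves (before, match, after), which is why the tree deepens with every property found.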

Using positional rules to find the ‘title’ property

Now that we have found all the known patterns that we could, it is time to try to guess the title of the movie. This is done by looking at which groups in the filename are still unidentified, and trying to guess which one corresponds to the title by looking at their position:

000000 1111111111111111 2222222222222222222222222222222222222222222 333
000000 0000000000111111 0000000000111111222222222222222222222222222 000
                 011112           011112000000000000000000000000111
                                        000000000000000000011112
                                        0000000000111122222
                                        0000111112    01112
Movies/__________(____)/Dark.City.(____).DC._____.____.___.____-___.___
       tttttttttt yyyy             yyyy     fffff ssss aaa vvvv rrr ccc
Movies/Dark City (1998)/Dark.City.(1998).DC.BDRip.720p.DTS.X264-CHD.mkv

In this case, as the containing folder is composed of 2 groups, the second of which is the year, we can (usually) safely assume that the first one corresponds to the movie title.
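The positional rule described above can be written down in a few lines. This sketch covers only that single rule (an unidentified group followed by a year); the function name `guess_title` and the representation of groups as (text, guess) pairs are assumptions made for the example.

```python
def guess_title(groups):
    """Apply one positional rule to the groups of a path component.

    groups is an ordered list of (text, guess) pairs, where guess is a
    dict of properties already found (empty if nothing was identified).
    """
    if len(groups) == 2:
        (first_text, first_guess), (_second_text, second_guess) = groups
        # Unidentified group followed by a year: assume it is the title.
        if not first_guess and 'year' in second_guess:
            return first_text.strip()
    return None


folder = [('Dark City ', {}), ('(1998)', {'year': 1998})]
print(guess_title(folder))  # Dark City
```

When the rule does not apply, the function returns None so that other rules could be tried.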

Merging all the results in a MatchTree to give a final Guess

Once we have matched as many groups as we could, the job is not done yet. Every leaf of the tree that we could identify contains the found property in its guess, but what we want in the end is a single Guess containing all the information.

There are some simple strategies implemented to try to deal with conflicts and/or duplicate properties. In our example, ‘year’ appears twice, but as it has the same value both times, it will be merged into a single ‘year’ property, with a confidence that represents the combined confidence of both guesses. If the properties were conflicting, we would take the one with the highest confidence and lower it accordingly.

Here is the result for our example:

>>> path = 'Movies/Dark City (1998)/Dark.City.(1998).DC.BDRip.720p.DTS.X264-CHD.mkv'
>>> print(guessit.guess_movie_info(path))
{'videoCodec': 'h264', 'container': 'mkv', 'format': 'BluRay',
'title': 'Dark City', 'releaseGroup': 'CHD', 'screenSize': '720p',
'year': 1998, 'type': 'movie', 'audioCodec': 'DTS'}

And that gives you your final guess!
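The merging strategy for duplicate and conflicting properties can be sketched as follows. The text above does not give the exact formulas, so both rules here are assumptions: agreeing guesses are combined as 1 - (1 - c1)(1 - c2), and a conflict keeps the more confident value with its confidence reduced by that of the losing guess. The name `merge_property` is hypothetical.

```python
def merge_property(first, second):
    """Merge two (value, confidence) guesses for the same property."""
    (value1, conf1), (value2, conf2) = first, second
    if value1 == value2:
        # Assumed rule: agreeing guesses reinforce each other.
        return value1, 1 - (1 - conf1) * (1 - conf2)
    # Assumed rule: keep the more confident value, lowered by the other.
    if conf1 >= conf2:
        return value1, conf1 - conf2
    return value2, conf2 - conf1


# Same value found twice: confidence goes up.
print(merge_property((1998, 0.5), (1998, 0.5)))    # (1998, 0.75)
# Conflicting values: the stronger guess wins, with lower confidence.
print(merge_property((1998, 0.75), (1999, 0.25)))  # (1998, 0.5)
```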
