Understanding the MatchTree¶
The basic structure that the filename detection component uses is the
MatchTree
. A MatchTree
is a tree covering the filename, where each
node represent a substring in the filename and can have a Guess
associated with it that contains the information that has been guessed
in this node. Nodes can be further split into subnodes until a proper
split has been found.
This makes it so that all the leaves concatenated will give you back the original filename. But enough theory, let’s look at an example:
>>> path = 'Movies/Dark City (1998)/Dark.City.(1998).DC.BDRip.720p.DTS.X264-CHD.mkv'
>>> print guessit.IterativeMatcher(path).match_tree
000000 1111111111111111 2222222222222222222222222222222222222222222 333
000000 0000000000111111 0000000000111111222222222222222222222222222 000
011112 011112000000000000000000000000111
000000000000000000011112
0000000000111122222
0000111112 01112
Movies/__________(____)/Dark.City.(____).DC._____.____.___.____-___.___
tttttttttt yyyy yyyy fffff ssss aaa vvvv rrr ccc
Movies/Dark City (1998)/Dark.City.(1998).DC.BDRip.720p.DTS.X264-CHD.mkv
The last line contains the filename, which you can use a reference. The previous line contains the type of property that has been found. The line before that contains the filename, where all the found groups have been blanked. Basically, what is left on this line are the leftover groups which could not be identified.
The lines before that indicate the indices of the groups in the tree.
For instance, the part of the filename ‘BDRip’ is the leaf with index
(2, 2, 0, 0, 0, 1)
(read from top to bottom), and its meaning is ‘format’
(as shown by the f
‘s on the last-but-one line).
What does the IterativeMatcher do?¶
The goal of the api/matcher is to take a MatchTree
which
contains no information (yet!) at the beginning, and apply a succession of
rules to try to guess parts of the filename. These rules are called
transformations and work in-place on the tree, splitting into new leaves
and updating the nodes’s guesses when it finds some information.
Let’s look at what happens when matching the previous filename.
Splitting into path components¶
First, we split the filename into folders + basename + extension This gives us the following tree, which has 4 leaves (from 0 to 3):
000000 1111111111111111 2222222222222222222222222222222222222222222 333
Movies/Dark City (1998)/Dark.City.(1998).DC.BDRip.720p.DTS.X264-CHD.mkv
Splitting into explicit groups¶
Then, we want to split each of those groups into “explicit” groups, i.e.: groups which are enclosed in parentheses, square brackets, curly braces, etc.:
000000 1111111111111111 2222222222222222222222222222222222222222222 333
000000 0000000000111111 0000000000111111222222222222222222222222222 000
Movies/Dark City (1998)/Dark.City.(1998).DC.BDRip.720p.DTS.X264-CHD.___
ccc
Movies/Dark City (1998)/Dark.City.(1998).DC.BDRip.720p.DTS.X264-CHD.mkv
As you can see, the containing folder has been split into 2 sub-groups, and the basename into 3 groups (separated by the year information).
Note that we also got the information from the extension, as you can see above.
Finding interesting patterns¶
Now that this first split has been made, we can start finding some known
patterns which we can identify in the filename.
That is the main objective of the IterativeMatcher
, which will run
a series of transformations which can identify groups in the filename and
will annotate the corresponding nodes.
For instance, the year:
000000 1111111111111111 2222222222222222222222222222222222222222222 333
000000 0000000000111111 0000000000111111222222222222222222222222222 000
011112 011112
Movies/Dark City (____)/Dark.City.(____).DC.BDRip.720p.DTS.X264-CHD.___
yyyy yyyy ccc
Movies/Dark City (1998)/Dark.City.(1998).DC.BDRip.720p.DTS.X264-CHD.mkv
Then, known properties usually found in video filenames:
000000 1111111111111111 2222222222222222222222222222222222222222222 333
000000 0000000000111111 0000000000111111222222222222222222222222222 000
011112 011112000000000000000000000000111
000000000000000000011112
0000000000111122222
0000111112 01112
Movies/Dark City (____)/Dark.City.(____).DC._____.____.___.____-___.___
yyyy yyyy fffff ssss aaa vvvv rrr ccc
Movies/Dark City (1998)/Dark.City.(1998).DC.BDRip.720p.DTS.X264-CHD.mkv
As you can see, this starts to branch pretty quickly, as each found group splits a leaf into further leaves. In this case, that gives us the year (1998), the format (BDRip), the screen size (720p), the video codec (x264) and the release group (CHD).
Using positional rules to find the ‘title’ property¶
Now that we found all the known patterns that we could, it is time to try to guess what is the title of the movie. This is done by looking at which groups in the filename are still unidentified, and trying to guess which one corresponds to the title by looking at their position:
000000 1111111111111111 2222222222222222222222222222222222222222222 333
000000 0000000000111111 0000000000111111222222222222222222222222222 000
011112 011112000000000000000000000000111
000000000000000000011112
0000000000111122222
0000111112 01112
Movies/__________(____)/Dark.City.(____).DC._____.____.___.____-___.___
tttttttttt yyyy yyyy fffff ssss aaa vvvv rrr ccc
Movies/Dark City (1998)/Dark.City.(1998).DC.BDRip.720p.DTS.X264-CHD.mkv
In this case, as the containing folder is composed of 2 groups, the second of which is the year, we can (usually) safely assume that the first one corresponds to the movie title.
Merging all the results in a MatchTree to give a final Guess¶
Once that we have matched as many groups as we could, the job is not done yet.
Indeed, every leaf of the tree that we could identify contains the found property
in its guess, but what we want at the end is to have a single Guess
containing
all the information.
There are some simple strategies implemented to try to deal with conflicts and/or duplicate properties. In our example, ‘year’ appears twice, but as it has the same value, so it will be merged into a single ‘year’ property, but with a confidence that represents the combined confidence of both guesses. If the properties were conflicting, we would take the one with the highest confidence and lower it accordingly.
Here:
>>> path = 'Movies/Dark City (1998)/Dark.City.(1998).DC.BDRip.720p.DTS.X264-CHD.mkv'
>>> print guessit.guess_movie_info(path)
{'videoCodec': 'h264', 'container': 'mkv', 'format': 'BluRay',
'title': 'Dark City', 'releaseGroup': 'CHD', 'screenSize': '720p',
'year': 1998, 'type': 'movie', 'audioCodec': 'DTS'}
And that gives you your final guess!