How to create a document categorizer?


#1

Is there a known method to categorize documents against a corpus? Perhaps once the corpus is classified, a document can be categorized into one of the classes?
There’s a way to do this in OpenNLP, but I would rather have a C++ solution.


#2

https://meta-toolkit.org/classify-tutorial.html


#3

I had read that tutorial but it seems woefully incomplete. The steps described appear to create the corpus which will be used for classification, but no method to classify arbitrary documents is given. Which line in config.toml is the input data (the document or documents that you want to actually classify)?
How is the output retrieved, other than as a console output of corpus statistics?

Edit: Okay, I’m figuring it out by looking at the source code. I guess the functions classify() or test() take the input parameters and save() defines the output.
No mention of any of that in the tutorial.


#4

If I were to use the built in “classify” executable, where does this program read input data from? I don’t mean corpus data, but unclassified documents. Is it a setting in config.toml? If so, which one?

Edit: Nice echo chamber in here…here…here. Yeah it looks like I have to compile with those functions mentioned earlier.


#5

So, would I be on the right track with something like this?:

  1. Train a classifier (as per the tutorial)
  2. Tokenise the words of a document into a vector of strings
  3. Create a feature vector using this:
    meta::util::sparse_vector< vector < std::string>, string >::sparse_vector (Iter begin, Iter end );
  4. Pass the created object into this:
    classify(const feature_vector & instance) const;
  5. Enjoy my categorized document.

#6

I assume you’re dealing with a scenario where you do not already have the documents that you eventually want to classify. (If you do already have them, then the easiest thing to do would be to just add them to your corpus, train your classifier on only the labeled part of the corpus, and then use that trained classifier on the remaining unlabeled portion of the corpus. You might need to give the “unlabeled” documents dummy labels, but that isn’t hard.)

In that case, the easiest path would be something pretty close to what you’re suggesting:

  1. Train a classifier on your labeled corpus per the tutorial.

  2. Create a new, unlabeled document from your content like so:

    corpus::document doc;
    doc.content(/* your content from somewhere, as a string */);

  3. Feed said document into the tokenize method of the forward_index used to train the original classifier. (This makes sure that you get the correct feature representation that the classifier expects.) That would look something like:

    auto fwd = index::make_index<index::forward_index>(*config);
    auto doc_vec = fwd->tokenize(doc); // where doc is the document you made above

  4. Pass doc_vec off to classify(const feature_vector&).

  5. Do what you will with the label you obtain.
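
To make the reasoning behind the tokenize step concrete, here is a minimal, self-contained sketch of the same flow using only the standard library. Nothing in it is MeTA's API: toy_index, toy_classifier, split_words and the centroid-style scoring are hypothetical stand-ins invented for illustration. The point it tries to show is that a new document has to be mapped through the same term-to-id vocabulary that was built at training time, otherwise the feature ids mean nothing to the classifier.

```cpp
#include <cassert>
#include <cctype>
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration only; these are NOT MeTA types.
using feature_vector = std::map<std::size_t, double>; // term id -> count

// Lowercase the text and split it on anything that is not a letter or digit.
std::vector<std::string> split_words(const std::string& text) {
    std::vector<std::string> words;
    std::string current;
    for (unsigned char c : text) {
        if (std::isalnum(c)) {
            current += static_cast<char>(std::tolower(c));
        } else if (!current.empty()) {
            words.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) words.push_back(current);
    return words;
}

// Plays the role of the index: it owns the term -> id vocabulary.
struct toy_index {
    std::map<std::string, std::size_t> vocab;

    // Training time: unseen terms are assigned a fresh id.
    feature_vector index_document(const std::string& text) {
        feature_vector fv;
        for (const auto& word : split_words(text)) {
            auto it = vocab.emplace(word, vocab.size()).first;
            fv[it->second] += 1.0;
        }
        return fv;
    }

    // Classification time: reuse the existing ids, drop unknown terms.
    feature_vector tokenize(const std::string& text) const {
        feature_vector fv;
        for (const auto& word : split_words(text)) {
            auto it = vocab.find(word);
            if (it != vocab.end()) fv[it->second] += 1.0;
        }
        return fv;
    }
};

// A deliberately simple classifier: one summed count vector per label,
// scored by dot product divided by the length of that vector.
struct toy_classifier {
    std::map<std::string, feature_vector> centroids;

    void train(const std::string& label, const feature_vector& fv) {
        for (const auto& entry : fv) centroids[label][entry.first] += entry.second;
    }

    std::string classify(const feature_vector& instance) const {
        std::string best;
        double best_score = -1.0;
        for (const auto& labeled : centroids) {
            double dot = 0.0;
            double norm = 0.0;
            for (const auto& entry : labeled.second) {
                norm += entry.second * entry.second;
                auto it = instance.find(entry.first);
                if (it != instance.end()) dot += entry.second * it->second;
            }
            const double score = norm > 0.0 ? dot / std::sqrt(norm) : 0.0;
            if (score > best_score) {
                best_score = score;
                best = labeled.first;
            }
        }
        return best;
    }
};
```

In this sketch you would call index_document on each labeled document and pass the result to train, then call tokenize on the new content and hand that vector to classify. With the real library, the forward_index and the trained classifier take over those two roles.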


#7

Wow thanks for that! The first method might come in useful for continual refinement of a corpus as new relevant content is discovered. Shouldn’t be too hard to automate that either. Well I’m on my way.


#8

I think I’ve found a mistake in the tutorial page about this.

We are instructed to create this object:
classify::multiclass_dataset dataset{f_idx};

And then we should be able to use that to construct this object:
classify::multiclass_dataset_view train{dataset, dataset.begin(),
dataset.begin() + dataset.size() / 2};

However, g++ throws a tantrum about this:
error: no matching function for call to ‘meta::classify::multiclass_dataset_view::multiclass_dataset_view()’

And indeed the Doxygen for the class of the same name shows a constructor that takes iterators as parameters:
multiclass_dataset_view (const multiclass_dataset_view &mdv, iterator begin, iterator end)

We are told to pass a multiclass_dataset but it appears to expect multiclass_dataset_view.

Edit: Created another object with multiclass_dataset_view (const multiclass_dataset &dset) and passed that in. The tutorial doesn’t make this easy.
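
For anyone hitting the same error, the shape of the workaround is: first build a view over the whole dataset, then build the sub-view from that view and a pair of its iterators. The sketch below shows the pattern with plain standard-library code. toy_dataset and toy_dataset_view are hypothetical stand-ins, not the MeTA classes; they only mirror the two constructor signatures quoted above.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical stand-in for a dataset: it just owns the instances.
struct toy_dataset {
    std::vector<std::string> instances;
};

// Hypothetical stand-in for a dataset view: it stores indices into a
// dataset rather than copies of the instances.
class toy_dataset_view {
  public:
    using iterator = std::vector<std::size_t>::const_iterator;

    // View over the whole dataset (mirrors the "const dataset&" constructor).
    explicit toy_dataset_view(const toy_dataset& dset)
        : dset_(&dset), indices_(dset.instances.size()) {
        std::iota(indices_.begin(), indices_.end(), std::size_t{0});
    }

    // Sub-view: takes ANOTHER VIEW plus two iterators into that view
    // (mirrors the "const view&, iterator, iterator" constructor).
    toy_dataset_view(const toy_dataset_view& view, iterator first, iterator last)
        : dset_(view.dset_), indices_(first, last) {}

    iterator begin() const { return indices_.begin(); }
    iterator end() const { return indices_.end(); }
    std::size_t size() const { return indices_.size(); }

    // Look up the i-th instance visible through this view.
    const std::string& operator[](std::size_t i) const {
        return dset_->instances[indices_[i]];
    }

  private:
    const toy_dataset* dset_;
    std::vector<std::size_t> indices_;
};
```

With these stand-ins, a train/test split is written as toy_dataset_view all{dataset}; followed by toy_dataset_view train{all, all.begin(), all.begin() + all.size() / 2};. Passing the dataset itself as the first argument of the three-argument form does not compile, which is the same kind of mismatch the compiler error above points at.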


#9

Yep, that should work. I’ve fixed the discrepancy in this commit on the develop branch which will be merged in the next release.

Then fix it! ;) The documentation is open source (https://github.com/meta-toolkit/meta-toolkit.org), so I’d welcome any pull request there that updates the wording to address any points you found confusing along the way. You are in the best position to do that, I think, since you know where the pain points were.


#10

Hi Philip,

Could you please share the exact code (with dummy values in place of any confidential details)? I tried to follow the process described above, but I could not get it to work.

It will be of great help to me.

Thanks in advance!