Keyword generation tool with MeTA


#1

Hi,

Like many others, I’m a new user of MeTA and need some help. I’m looking for a tool that can take a text and generate keywords for it. (And I really do mean generate, not just extract, because the actual “main” keyword might not appear in the text at all.)

My idea was to take a deep learning framework like TensorFlow and train a model on data from question/answer sites, where the questions are tagged. But then I found the MeTA project, and MeTA might be more helpful than a plain deep learning model.

But I don’t know where to start with extracting the keywords. In the tutorial I found frequency analysis. It’s a good tool to begin with, but I need something better. Let me illustrate with an example.

Let’s imagine I have a text about Bioshock. Using frequency analysis, the word Bioshock would of course be one of the most common words. But “game” or “computer game” should also be among the main keywords, even if “game” or “computer game” never appears in the text, because the machine has learned, from the way the text is written, that it’s about a video game, or already knows that Bioshock is a video game.

Is it possible to implement such keyword generation with MeTA?


#2

If you know ahead of time some fixed list of keywords you want to be able to identify, one possibility is to treat this as a classification problem, where the (binary) label is the presence or absence of a specific keyword. You can then train (admittedly, a lot of) binary classifiers to detect the presence of individual keywords, and then “label” a document with the keywords associated with all of the classifiers that said “positive” for that document.
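
To make that concrete, here is a minimal sketch of the classifier-per-keyword idea. It uses scikit-learn rather than MeTA (MeTA’s own classifier components could fill the same role), and the tiny document/tag lists are invented purely for illustration:

```python
# Sketch: one binary classifier per keyword (scikit-learn stand-in for MeTA).
# Documents and tag sets below are toy data for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = [
    "Bioshock is set in the underwater city of Rapture.",
    "The new graphics card runs every game at high settings.",
    "This recipe uses fresh basil and ripe tomatoes.",
]
tags = [{"game"}, {"game", "hardware"}, {"cooking"}]
keywords = sorted({t for ts in tags for t in ts})

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# Train one independent presence/absence classifier per keyword.
classifiers = {}
for kw in keywords:
    y = [1 if kw in ts else 0 for ts in tags]
    classifiers[kw] = LogisticRegression().fit(X, y)

# "Label" a new document with every keyword whose classifier says positive.
new_doc = vec.transform(["I finished the game on the hardest difficulty."])
print([kw for kw, clf in classifiers.items() if clf.predict(new_doc)[0] == 1])
```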

That is perhaps a bit heavy-handed, though, especially as the number of keywords you want to identify grows. Another option would be to do some combination of an intelligent frequency-based method (perhaps by looking only at tokens that are POS-tagged as nouns, or that occur within an NP node in a constituency parse?) with some external tool like a Wikifier to identify what the noun actually refers to. You could then use the linked Wikipedia article for the nouns you extract as additional text from which to pull more potential keywords.
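
Here is a rough sketch of the noun-filtered frequency part, using NLTK’s tokenizer and tagger as a stand-in (MeTA ships its own analyzers and POS tagger); the wikifier step is only indicated in a comment, and the exact NLTK model package names vary a bit across versions:

```python
# Sketch: keep only noun tokens as keyword candidates, then rank by frequency.
from collections import Counter
import nltk

# Fetch tokenizer/tagger models; package names differ across NLTK versions,
# so we try the common ones (unknown names fail quietly).
for pkg in ("punkt", "punkt_tab",
            "averaged_perceptron_tagger", "averaged_perceptron_tagger_eng"):
    nltk.download(pkg, quiet=True)

text = ("Bioshock takes place in Rapture, a city built on the ocean floor. "
        "The player explores the city and fights its inhabitants.")

tagged = nltk.pos_tag(nltk.word_tokenize(text))

# Keep tokens tagged NN/NNS/NNP/NNPS (common and proper nouns).
nouns = [word.lower() for word, tag in tagged if tag.startswith("NN")]
print(Counter(nouns).most_common(5))

# A wikifier step would go here: link candidates like "Bioshock" to their
# Wikipedia articles and mine that text for extra keywords such as "video game".
```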

A third option might be to do some sort of clustering with something like LDA. Here, you can set a fixed number of “topics” that might occur in your corpus of documents, and then the model will output two main things: (1) a distribution over words that are associated with each of the $k$ topics, and (2) a distribution over the $k$ topics for each document in your corpus. You could then potentially use some of the top words in the top topics for each document as candidate keywords, as the words within each topic will be learned from all documents that exhibit that topic (not just the current one).
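
As a sketch of what that looks like in practice, here is scikit-learn’s LDA on a made-up four-document corpus (MeTA includes its own LDA implementation, so this is just to show the two outputs described above):

```python
# Sketch: LDA over a toy corpus, showing the two outputs described above.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "bioshock rapture shooter game player weapons story",
    "strategy game units resources map multiplayer player",
    "pasta sauce basil garlic tomatoes recipe",
    "oven recipe dough cheese pizza tomatoes",
]

vec = CountVectorizer()
X = vec.fit_transform(docs)
vocab = vec.get_feature_names_out()

k = 2  # fixed number of topics, chosen up front
lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X)

# (1) Per-topic word distributions: show the top words of each topic.
for i, topic in enumerate(lda.components_):
    top = [vocab[j] for j in topic.argsort()[::-1][:4]]
    print(f"topic {i}: {top}")

# (2) Per-document topic distributions: candidate keywords for a document
# come from the top words of its dominant topics.
for doc, dist in zip(docs, lda.transform(X)):
    print(doc[:25], dist.round(2))
```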