Many concepts, entities, and relations are expressed as phrases, yet extracting phrases from the vast amount of text around us remains a challenging problem. Solving it helps with downstream tasks such as topic tracking, query suggestion, and document categorization.
I created a package for phrase extraction with the following benefits:
- simple and effective
- supports multiple languages
- works on as little as a single paragraph
- requires no additional knowledge or training data
Findings
Let’s start with a definition: a phrase is a small group of words standing together as a conceptual unit, typically forming a component of a clause. Phrases are therefore relatively independent of one another, and the statistics at a phrase boundary differ from those inside a phrase. That statistical difference is what reveals where a phrase begins and ends.
Let’s take a sentence from Attention Is All You Need as an example:
…multi-head attention in three different ways…
The word/phrase frequency is
multi-head — 10
multi-head attention — 8
multi-head attention in — 1 <<<Drop
multi-head attention in three — 1
At a boundary between phrases, the n-gram frequency drops sharply. This drop is a good signal for deciding whether a word sequence is a complete phrase.
Sentence generation can be seen as continuously appending words to the end of the text, so the change in frequency can be expressed as the ratio between the frequency of a word sequence and the frequency of that sequence extended by one more word. In other words, we can normalize the drop with a conditional probability.
P(multi-head attention|multi-head)= 0.8
P(multi-head attention in|multi-head attention)= 0.125
P(multi-head attention in three|multi-head attention in)= 1
Under this conditional probability, the smaller the value, the more likely the position is a phrase boundary, and the more likely the preceding word sequence is a complete phrase.
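The ratio above can be computed directly from the n-gram counts. A minimal sketch, using the toy frequencies from the example:

```python
# Toy frequencies copied from the example above.
freq = {
    ("multi-head",): 10,
    ("multi-head", "attention"): 8,
    ("multi-head", "attention", "in"): 1,
    ("multi-head", "attention", "in", "three"): 1,
}

def cond_prob(ngram, freq):
    """P(ngram | its prefix) = freq(ngram) / freq(ngram without last word)."""
    return freq[ngram] / freq[ngram[:-1]]

print(cond_prob(("multi-head", "attention"), freq))                 # 0.8
print(cond_prob(("multi-head", "attention", "in"), freq))           # 0.125
print(cond_prob(("multi-head", "attention", "in", "three"), freq))  # 1.0
```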
Here is a step-by-step walkthrough of the whole process:
Example paragraph
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
1. Split the text into sentences based on punctuation
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder.
The best performing models also connect the encoder and decoder through an attention mechanism.
We propose a new simple network architecture
the Transformer
based solely on attention mechanisms
dispensing with recurrence and convolutions entirely
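The splitting step above can be sketched with a simple regular expression; note that commas count as boundaries too, since a phrase never crosses one:

```python
import re

paragraph = ("We propose a new simple network architecture, the Transformer, "
             "based solely on attention mechanisms, dispensing with recurrence "
             "and convolutions entirely.")

# Split on sentence and clause punctuation, dropping empty pieces.
segments = [s.strip() for s in re.split(r"[.,;!?]", paragraph) if s.strip()]
for s in segments:
    print(s)
```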
2. Calculate n-gram frequency
Input
based solely on attention mechanisms
Output
based — 7
based solely — 1
based solely on — 1
based solely on attention — 1
based solely on attention mechanisms — 1
solely — 1
solely on — 1
solely on attention — 1
solely on attention mechanisms — 1
on — 587
on attention — 2
on attention mechanisms — 1
attention — 97
attention mechanisms — 5
attention mechanisms <END> — 1
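Counting every n-gram (including an `<END>` marker after the last word) can be sketched like this; in this tiny one-segment corpus every count is 1, while a real run feeds all segments of the document:

```python
from collections import Counter

def ngrams(tokens, max_n=5):
    """Yield every n-gram of up to max_n tokens, with an <END> marker appended."""
    tokens = tokens + ["<END>"]
    for i in range(len(tokens)):
        for n in range(1, max_n + 1):
            if i + n <= len(tokens):
                yield " ".join(tokens[i:i + n])

counts = Counter()
for segment in ["based solely on attention mechanisms"]:  # just one segment here
    counts.update(ngrams(segment.split()))

print(counts["attention mechanisms"])        # 1
print(counts["attention mechanisms <END>"])  # 1
```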
3. Drop n-grams with frequency ≤ 1 or fewer than 2 words
on attention — 2
attention mechanisms — 5
4. Keep n-grams whose conditional probability of being extended is < 1
P(on attention mechanisms|on attention) = 0.5
P(attention mechanisms <END>|attention mechanisms) = 0.2
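Steps 3 and 4 together can be sketched as follows. The counts are a toy reconstruction of the statistics above, and the extension rule is my reading of the filter: if an n-gram is always continued by the same next word (conditional probability 1), it is not a complete phrase.

```python
from collections import Counter

# Toy counts standing in for the full-document statistics above.
counts = Counter({
    ("on",): 587, ("attention",): 97,
    ("on", "attention"): 2,
    ("attention", "mechanisms"): 5,
    ("on", "attention", "mechanisms"): 1,
    ("attention", "mechanisms", "<END>"): 1,
})

phrases = []
for ngram, freq in counts.items():
    if len(ngram) < 2 or freq < 2 or "<END>" in ngram:
        continue  # step 3: drop unigrams and n-grams seen only once
    # step 4: if some one-word extension is just as frequent, the n-gram
    # is always continued the same way (conditional probability = 1).
    ext = [c for g, c in counts.items()
           if len(g) == len(ngram) + 1 and g[:len(ngram)] == ngram]
    if ext and max(ext) / freq >= 1:
        continue
    phrases.append((" ".join(ngram), freq))

print(sorted(phrases))  # [('attention mechanisms', 5), ('on attention', 2)]
```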
The final output will be:
[(‘the encoder’, 13), (‘the Transformer’, 13), (‘on the’, 13), (‘machine translation’, 11), (‘from the’, 7), (‘that the’, 7), (‘encoder and’, 6), (‘a new’, 5), (‘our model’, 5), (‘and a’, 4), (‘the encoder and decoder’, 4), (‘attention mechanisms’, 4), (‘recurrence and’, 4), (‘German translation’, 4), (‘the best’, 4), (‘sequence transduction models’, 3), (‘an attention mechanism’, 3), (‘translation tasks’, 3), (‘to train’, 3), (‘over the’, 3), (‘French translation’, 3), (‘fraction of the training’, 3), (‘to other tasks’, 3), (‘are based on’, 2), (‘recurrent or convolutional’, 2), (‘convolutional neural networks’, 2), (‘these models’, 2), (‘to be’, 2), (‘model achieves’, 2), (‘art BLEU score of’, 2), (‘training for’, 2), (‘5 days on’, 2), (‘on eight’, 2), (‘a small’, 2), (‘training costs’, 2), (‘show that’, 2), (‘Transformer generalizes well to’, 2), (‘to English constituency parsing’, 2)]
It works like a charm, but it still outputs many noisy phrases.
To address this, we can drop results that contain prepositions, have fewer than two words, or have low frequency; the right trade-off depends on the application.
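A minimal sketch of such a post-filter, using a hypothetical stoplist and threshold (tune both per application):

```python
# A hypothetical stoplist; in practice, tune it per application.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "on", "in", "to", "that",
             "from", "over", "for", "be", "are", "with"}

results = [("the encoder", 13), ("machine translation", 11),
           ("sequence transduction models", 3), ("attention mechanisms", 4),
           ("are based on", 2)]

def is_clean(phrase, freq, min_freq=3):
    """Keep frequent, multi-word phrases with no stopwords."""
    words = phrase.lower().split()
    return (freq >= min_freq and len(words) >= 2
            and not any(w in STOPWORDS for w in words))

print([r for r in results if is_clean(*r)])
```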
Multi-language support
Since the method relies purely on corpus statistics, supporting another language takes almost nothing:
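The statistics never look inside a word, so the only language-specific piece is tokenization. One possible approach (my assumption for illustration, not necessarily what the package does internally) is to split on whitespace when it exists and fall back to characters for languages written without spaces:

```python
def tokenize(text):
    # Assumption: whitespace-delimited languages split on spaces;
    # languages without spaces (e.g. Chinese) fall back to characters.
    return text.split() if " " in text else list(text)

print(tokenize("attention is all you need"))
print(tokenize("注意力机制"))
```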
Hands On
I developed a tool called Phraseg to do this:
There is also a Colab example:
Application
We can easily build a GitHub daily-trending discovery tool based on this:
Fetch the GitHub daily trending repos and their READMEs into a single text
Extract phrases from the text using the method above and filter them.
In this case, we want longer phrases that carry clear information, so we simply drop results with fewer than two words/characters.
Draw a word cloud of the top 100 phrases
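The filter-and-rank part of these steps can be sketched in pure Python (the phrases and counts below are made up for illustration); the resulting top list can then be passed to a word-cloud library such as the third-party wordcloud package via generate_from_frequencies:

```python
from collections import Counter

# Hypothetical (phrase, count) pairs extracted from the trending READMEs.
extracted = [("machine learning", 9), ("deep learning", 7),
             ("open source", 5), ("pull request", 3), ("the", 40)]

# Keep only informative phrases of two or more words, then take the top 100.
informative = {p: c for p, c in extracted if len(p.split()) >= 2}
top = Counter(informative).most_common(100)
print(top)

# dict(top) can then be fed to
# wordcloud.WordCloud().generate_from_frequencies(dict(top))
```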