An NLP Library Based on the Minimum Entropy Principle: nlp-zero
By 苏剑林 | May 31, 2018
I have written several blog posts on the minimum entropy principle, aimed at some foundational work in unsupervised NLP. To make experimentation easier, I have packaged the algorithms from those articles into a library for interested readers to test and use.
Since it is oriented towards unsupervised NLP scenarios and covers the most basic NLP tasks, it is named nlp-zero.
Links
Github: https://github.com/bojone/nlp-zero
Pypi: https://pypi.org/project/nlp-zero/
It can be installed directly via:
pip install nlp-zero
The entire library is implemented in pure Python with no third-party dependencies, supporting both Python 2.x and 3.x.
Usage
Default Tokenization
The library comes with a built-in dictionary that can be used as a simple tokenization tool.
from nlp_zero import *
tokenizer = Tokenizer()
print(' '.join(tokenizer.tokenize(u'今天天气真好')))
The built-in dictionary includes new words discovered by the new word discovery algorithm, plus some manual curation, so its quality is reasonably good.
Lexicon Construction
Build a lexicon from a large amount of raw corpus text.
First, we need to write an iterable container so that the entire corpus does not have to be loaded into memory at once. How you write this iterable is very flexible. For example, if my data is stored in MongoDB, it would be:
class texts:
    def __iter__(self):
        for i in db.find():
            yield i['text']
If the data is stored in a text file, it would look something like this:
class texts:
    def __iter__(self):
        for l in open('corpus.txt'):
            yield l.strip()
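Either way, as far as I can tell from these two examples, all that matters is an object that yields one piece of text per iteration, so for a quick experiment a plain in-memory list should work just as well:

# a tiny in-memory corpus, only for quick experiments;
# it can be passed anywhere texts() is used below
texts_list = [u'今天天气真好', u'明天天气也不错']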
Then you can execute:
word_count = WordCount()
word_count.count(texts())
word_count.save_words('words.csv')
View the results via Pandas:
import pandas as pd
words = pd.read_csv('words.csv', encoding='utf-8', header=None)
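The CSV has no header row. Assuming the first column is the word and the second its frequency (my reading of the output, not something stated above), you can peek at the most frequent entries like so:

# sort by the (assumed) frequency column and show the top entries
print(words.sort_values(1, ascending=False).head(20))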
Build a tokenization tool directly using the statistical lexicon:
tokenizer = Tokenizer()
tokenizer.load_words('words.csv')
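Once the statistical lexicon is loaded, the tokenizer is used exactly like the default one shown earlier:

# tokenize with the lexicon learned from your own corpus
print(' '.join(tokenizer.tokenize(u'今天天气真好')))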
Sentence Template Construction
As before, you need to write an iterable corpus container, which I won't repeat here.
Because sentence templates are counted on top of tokenized text, a tokenization function is also required. You can use the built-in tokenizer or an external one, such as Jieba (see the sketch after the snippet below).
template_count = TemplateCount()
template_count.count(texts(), tokenize=tokenizer.tokenize)
template_count.save_templates('templates.csv')
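As an illustration, if you prefer Jieba (assuming it is installed; jieba.lcut maps a string to a list of words, the same kind of function the built-in tokenizer provides), the call would look something like this:

import jieba

template_count = TemplateCount()
# pass Jieba's list-returning tokenizer instead of the built-in one
template_count.count(texts(), tokenize=jieba.lcut)
template_count.save_templates('templates.csv')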
View the results via Pandas:
templates = pd.read_csv('templates.csv', encoding='utf-8', header=None)
Each template has been encapsulated as a class.
Hierarchical Decomposition
Parsing sentence structures based on sentence templates.
parser = Parser()
parser.load_templates('templates.csv')
tree = parser.parse(u'今天天气真好')
tree.show()
For convenience in calling and visualizing the results, the output is wrapped in a SentTree class. This class has three attributes: template (the main template of the current node), content (the string covered by that template), and modules (a list of semantic blocks, each of which is again described by a SentTree). Overall, it follows the assumptions about language structure made in the article "Minimum Entropy Principle (III): 'The Flying Elephant Crosses the River' - Sentence Templates and Language Structure".
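As a small illustration of these three attributes (and nothing more; the printing format here is my own, and I assume template and content render sensibly via %s), a recursive walk over a SentTree might look like this:

def walk(tree, depth=0):
    # print the main template of each node and the text it covers,
    # then recurse into its semantic blocks
    print('  ' * depth + u'%s | %s' % (tree.template, tree.content))
    for module in tree.modules:  # each semantic block is itself a SentTree
        walk(module, depth + 1)

walk(tree)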
To be continued
For anything not covered here, please read the source code for answers~ More usage examples will continue to be added here.