An Introduction to Several Natural Language Processing Tools for Python

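Python is popular with developers for its clean, concise syntax, its ease of use and extensibility, and its large ecosystem of libraries. The powerful machine learning and math libraries available for it make Python a natural choice for natural language processing.
So if you use Python for natural language processing, these are the tools you really ought to know.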
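NLTK: NLTK is a leading platform for working with language data in Python. It provides easy-to-use interfaces to lexical resources such as WordNet, along with text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.
Pattern: Pattern's natural language processing tools include a part-of-speech tagger, n-gram search, sentiment analysis, and WordNet. Its machine learning support covers the vector space model, clustering, and support vector machines.
TextBlob: TextBlob is a Python library for processing text data. It offers a simple API for common natural language processing tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, and translation.
Gensim: Gensim provides topic modeling, document indexing, and similarity retrieval for large corpora. It can handle data larger than RAM. Its author describes it as "the most robust, efficient and hassle-free piece of software to realize unsupervised semantic modelling from plain text."
PyNLPl: The full name is Python Natural Language Processing Library (pronounced "pineapple"). It is a collection of modules for various natural language processing tasks. PyNLPl can be used for n-gram search, for computing frequency lists and distributions, and for building language models. It also covers more complex data structures such as priority queues and more complex algorithms such as beam search.
spaCy: This is commercial open-source software. Combining Python and Cython, it delivers industrial-strength natural language processing, and it is billed as the fastest and most advanced tool in the field.
Polyglot: Polyglot supports massive text collections and multilingual processing. It supports tokenization for 165 languages, language detection for 196 languages, named entity recognition for 40 languages, part-of-speech tagging for 16 languages, sentiment analysis for 136 languages, word embeddings for 137 languages, morphological analysis for 135 languages, and transliteration for 69 languages.
MontyLingua: MontyLingua is a free, well-trained, end-to-end processor for English. Feed raw English text into MontyLingua and you get back a semantic interpretation of that text. It is suited to information retrieval and extraction, request processing, and question answering. From English text it can extract subject/verb/object tuples, adjective, noun, and verb phrases, names of people, places, and events, dates and times, and other semantic information.
BLLIP Parser: BLLIP Parser (also known as the Charniak-Johnson parser) is a statistical natural language parser that combines a generative constituent parser with a maximum entropy reranker. It comes with both a command-line interface and a Python interface.
Quepy: Quepy is a Python framework for transforming natural language questions into queries in a database query language. It can be customized easily for different kinds of natural language questions and database query languages, so with Quepy you can build your own natural language database query system by changing just a few lines of code. GitHub: https://github.com/machinalis/quepy
HanLP: HanLP is a Java toolkit made up of a series of models and algorithms, with the goal of spreading the use of natural language processing in production environments. It goes beyond word segmentation and offers a full set of features, including lexical analysis, syntactic parsing, and semantic understanding. HanLP is characterized by complete functionality, efficient performance, clean architecture, up-to-date corpora, and customizability. Usage guides: "Calling the HanLP natural language processing package from Python" and "How beginners can call HanLP".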
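OpenNLP: Chinese named entity recognition. OpenNLP is a full-featured Java natural language processing API from Apache. The rest of this article walks through using OpenNLP for named entity recognition on a Chinese corpus.
The first step is preprocessing. I won't dwell on word segmentation, stop word removal, and so on. In practice it is enough to separate the segmented tokens with spaces, after which OpenNLP can process the corpus the same way it processes English. A few points to watch in character handling are mentioned later.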
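As a minimal sketch of the "separate the tokens with spaces" step, assuming a segmenter has already produced a list of tokens: the class name SegmentedTextJoiner, the method name, and the stop word set are hypothetical and are not part of the original project. The English tokens simply stand in for segmented Chinese words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of OpenNLP or the original project.
class SegmentedTextJoiner {

    /** Drops blank tokens and stop words, then joins the rest with single spaces. */
    static String joinTokens(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            String trimmed = token.trim();
            if (!trimmed.isEmpty() && !stopWords.contains(trimmed)) {
                kept.add(trimmed);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        // Stand-in for the output of a word segmenter.
        List<String> tokens = Arrays.asList("Alice", "works", "at", "the", "office");
        Set<String> stopWords = new HashSet<>(Arrays.asList("the"));
        System.out.println(joinTokens(tokens, stopWords));
    }
}
```

The result is one whitespace-delimited line per sentence, which is the shape a whitespace-based tokenizer expects.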
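Next, prepare a word list for each named entity category. Each word list is stored in a text file whose name (without the extension) is the type name of the category. The two functions below load the words from one category's word list and derive the category name from the file.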
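/**
 * Loads the named entities from a word list file, one entry per line.
 *
 * @param nameListFile the word list file
 * @return the entries, or null if the file is missing or is a directory
 * @throws Exception if reading the file fails
 */
public static List<String> loadNameWords(File nameListFile)
        throws Exception {
    List<String> nameWords = new ArrayList<String>();
    if (!nameListFile.exists() || nameListFile.isDirectory()) {
        System.err.println("File does not exist");
        return null;
    }
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader br = new BufferedReader(new FileReader(nameListFile))) {
        String line = null;
        while ((line = br.readLine()) != null) {
            nameWords.add(line);
        }
    }
    return nameWords;
}

/**
 * Gets the named entity type, i.e. the file name without its extension.
 *
 * @param nameListFile the word list file
 * @return the entity type name
 */
public static String getNameType(File nameListFile) {
    String nameType = nameListFile.getName();
    int dot = nameType.lastIndexOf(".");
    // fall back to the whole name when the file has no extension
    return dot < 0 ? nameType : nameType.substring(0, dot);
}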
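OpenNLP expects the training corpus to look like this:
XXXXXX <START:person> ???? <END> XXXXXXXXX <START:action> ???? <END> XXXXXXX
Each annotated named entity sits between a <START:type> tag and an <END> tag, and the start tag carries the entity's category. Next comes training the named entity recognition model. Here is the code first: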
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.StringReader;
import java.util.Collections;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.featuregen.AggregatedFeatureGenerator;
import opennlp.tools.util.featuregen.PreviousMapFeatureGenerator;
import opennlp.tools.util.featuregen.TokenClassFeatureGenerator;
import opennlp.tools.util.featuregen.TokenFeatureGenerator;
import opennlp.tools.util.featuregen.WindowFeatureGenerator;
/**
 * Training component for a Chinese named entity recognition model.
 *
 * @author ddlovehy
 *
 */
public class NamedEntityMultiFindTrainer {
    // Default parameters
    private int iterations = 80;
    private int cutoff = 5;
    private String langCode = "general";
    private String type = "default";
    // Parameters to be set by the caller
    private String nameWordsPath; // path to the named entity word lists
    private String dataPath; // path to the segmented training corpus
    private String modelPath; // path where the model is stored
    // Getters and setters for the fields above are not shown in this listing.

    public NamedEntityMultiFindTrainer() {
        super();
    }

    public NamedEntityMultiFindTrainer(String nameWordsPath, String dataPath,
            String modelPath) {
        super();
        this.nameWordsPath = nameWordsPath;
        this.dataPath = dataPath;
        this.modelPath = modelPath;
    }

    public NamedEntityMultiFindTrainer(int iterations, int cutoff,
            String langCode, String type, String nameWordsPath,
            String dataPath, String modelPath) {
        super();
        this.iterations = iterations;
        this.cutoff = cutoff;
        this.langCode = langCode;
        this.type = type;
        this.nameWordsPath = nameWordsPath;
        this.dataPath = dataPath;
        this.modelPath = modelPath;
    }
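    /**
     * Builds the custom feature generators.
     *
     * @return the aggregated feature generator
     */
    public AggregatedFeatureGenerator prodFeatureGenerators() {
        // Token and token-class features over a window of 2 tokens on each side,
        // plus features based on previously assigned outcomes.
        AggregatedFeatureGenerator featureGenerators = new AggregatedFeatureGenerator(
                new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2),
                new WindowFeatureGenerator(new TokenClassFeatureGenerator(), 2, 2),
                new PreviousMapFeatureGenerator());
        return featureGenerators;
    }

    /**
     * Writes the model to disk.
     *
     * @param model the trained model
     * @throws Exception if the model cannot be written
     */
    public void writeModelIntoDisk(TokenNameFinderModel model) throws Exception {
        File outModelFile = new File(this.getModelPath());
        // try-with-resources makes sure the output stream is flushed and closed
        try (FileOutputStream outModelStream = new FileOutputStream(outModelFile)) {
            model.serialize(outModelStream);
        }
    }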
    /**
     * Reads the annotated training corpus.
     *
     * @return the annotated training data as one string
     * @throws Exception if the corpus cannot be produced
     */
    public String getTrainCorpusDataStr() throws Exception {
        // TODO consider loading persisted annotated data directly, and incremental training
        String trainDataStr = null;
        trainDataStr = NameEntityTextFactory.prodNameFindTrainText(
                this.getNameWordsPath(), this.getDataPath(), null);
        return trainDataStr;
    }

    /**
     * Trains the model.
     *
     * @param trainDataStr
     *            the whole annotated training data as a single string
     * @return the trained model
     * @throws Exception if training fails
     */
    public TokenNameFinderModel trainNameEntitySamples(String trainDataStr)
            throws Exception {
        ObjectStream<NameSample> nameEntitySample = new NameSampleDataStream(
                new PlainTextByLineStream(new StringReader(trainDataStr)));
        System.out.println("**************************************");
        System.out.println(trainDataStr);
        // Note: this train(...) overload appears to belong to older OpenNLP
        // releases (1.5.x); newer releases expose a different signature.
        TokenNameFinderModel nameFinderModel = NameFinderME.train(
                this.getLangCode(), this.getType(), nameEntitySample,
                this.prodFeatureGenerators(),
                Collections.<String, Object> emptyMap(), this.getIterations(),
                this.getCutoff());
        return nameFinderModel;
    }
    /**
     * Entry point that runs the whole training component.
     *
     * @return true if training and saving succeeded, false otherwise
     */
    public boolean execNameFindTrainer() {
        try {
            String trainDataStr = this.getTrainCorpusDataStr();
            TokenNameFinderModel nameFinderModel = this
                    .trainNameEntitySamples(trainDataStr);
            // System.out.println(nameFinderModel);
            this.writeModelIntoDisk(nameFinderModel);
            return true;
        } catch (Exception e) {
            e.printStackTrace();
            return false;
        }
    }
}
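Notes:
Parameters: iterations is the number of iterations of the training algorithm. Too few and the model is undertrained; too many and it may overfit, so try a few values and compare the results.
cutoff: the minimum number of times a feature must appear in the training data to be kept in the model. Setting it to 5 is generally fine. A lower value keeps more rare features, which makes the model larger and training slower.
langCode and type: the language code and the entity type. Since no code dedicated to Chinese is used here, langCode is set to "general". Because we want to train one model that recognizes several kinds of entities, type is set to "default".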
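To make the cutoff parameter more concrete, here is a minimal sketch of frequency-based feature filtering. It is not OpenNLP's implementation: the class FeatureCutoffSketch, the feature strings, and the choice to keep features seen at least cutoff times are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of a feature cutoff, not OpenNLP code.
class FeatureCutoffSketch {

    /** Counts every observed feature and keeps those seen at least cutoff times. */
    static Map<String, Integer> applyCutoff(List<String> observedFeatures, int cutoff) {
        Map<String, Integer> counts = new HashMap<>();
        for (String feature : observedFeatures) {
            counts.merge(feature, 1, Integer::sum);
        }
        // TreeMap keeps the surviving features in a stable, sorted order.
        Map<String, Integer> kept = new TreeMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= cutoff) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up feature observations collected over a training corpus.
        List<String> observed = Arrays.asList(
                "w=Paris", "w=Paris", "w=Paris", "w=visits", "w=Alice", "w=Alice");
        System.out.println(applyCutoff(observed, 2));
    }
}
```

With a cutoff of 2, the feature seen only once is dropped, which is the trade-off behind the parameter: rare features add model size and noise, but dropping too many loses information.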
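Explanation:
prodFeatureGenerators() builds the custom feature generators, which determine what context each token's features are drawn from. The code uses a window of size 5: features are computed from the two tokens before and the two tokens after the token under consideration (5 including the token itself). There may be a deeper, more precise interpretation, so corrections are welcome.
trainNameEntitySamples() is the core of the training. It first wraps the annotated training string, in the format described earlier, in a character stream, then passes it to the train() method of NameFinderME together with the parameters set above, the custom feature generators, and so on. For the resources map, passing an empty map as the default is enough.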
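The "two tokens before, two tokens after" window can be sketched as follows. This only illustrates the idea of windowed context features; the class WindowFeatureSketch and the feature string format are assumptions and do not reproduce the strings OpenNLP's WindowFeatureGenerator actually emits.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of windowed context features, not OpenNLP code.
class WindowFeatureSketch {

    /**
     * Collects one feature per token inside the window around tokens[index],
     * looking prev tokens back and next tokens ahead, clipped at the sentence edges.
     */
    static List<String> windowFeatures(String[] tokens, int index, int prev, int next) {
        List<String> features = new ArrayList<>();
        for (int offset = -prev; offset <= next; offset++) {
            int position = index + offset;
            if (position >= 0 && position < tokens.length) {
                features.add("w[" + offset + "]=" + tokens[position]);
            }
        }
        return features;
    }

    public static void main(String[] args) {
        String[] tokens = {"a", "b", "c", "d", "e", "f"};
        // Window of 2 + 1 + 2 = 5 tokens around "c".
        System.out.println(windowFeatures(tokens, 2, 2, 2));
    }
}
```

Near the start or end of a sentence the window is simply shorter, since positions outside the token array are skipped.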
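The source code is open-sourced at https://github.com/ailab403/ailab-mltk4j. The test package contains a complete calling demo, and the file folder contains the test corpus and an already trained model.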
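The trainer gets its annotated text from a helper in the project that is not shown in this article. As a rough sketch of what producing the <START:type> ... <END> format could look like, assuming a space-separated, already segmented sentence and a map from word to entity type built from the word lists: the class TrainingTextTagger and its method are hypothetical, and this version only handles single-token entities.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; the project's own helper may work differently.
class TrainingTextTagger {

    /** Wraps every token found in wordToType in <START:type> ... <END> tags. */
    static String tag(String segmentedLine, Map<String, String> wordToType) {
        StringBuilder tagged = new StringBuilder();
        for (String token : segmentedLine.trim().split("\\s+")) {
            if (tagged.length() > 0) {
                tagged.append(' ');
            }
            String type = wordToType.get(token);
            if (type == null) {
                tagged.append(token);
            } else {
                // Tags are separated from the entity by spaces.
                tagged.append("<START:").append(type).append("> ")
                      .append(token).append(" <END>");
            }
        }
        return tagged.toString();
    }

    public static void main(String[] args) {
        // Made-up word lists: word -> entity type (the word list file name).
        Map<String, String> wordToType = new LinkedHashMap<>();
        wordToType.put("Alice", "person");
        wordToType.put("Paris", "location");
        System.out.println(tag("Alice visits Paris", wordToType));
    }
}
```

Each tagged sentence would then go on its own line of the training string that is handed to the trainer.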
