Python Examples of jieba.tokenize - ProgramCreek.com
Python jieba.tokenize() Examples

The following are 30 code examples for showing how to use jieba.tokenize(). These examples are extracted from open source projects. You can vote up the ones you like or vote down the ones you don't like, and go to the original project or source file by following the links above each example. You may check out the related API usage on the sidebar. You may also want to check out all available functions/classes of the module jieba, or try the search function.
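As a quick orientation before the examples: jieba.tokenize() takes a unicode string and yields (word, start, end) tuples, where start and end are character offsets into the original text. It accepts mode='default' or mode='search' and an HMM flag. Below is a minimal sketch using an illustrative sentence; the parameters are as documented by jieba.

# Minimal sketch of the documented jieba.tokenize() API; the sentence is only illustrative.
import jieba

sentence = u'永和服装饰品有限公司'

# Default mode: non-overlapping words with their character offsets.
for word, start, end in jieba.tokenize(sentence):
    print('%s\t\t start: %d \t\t end: %d' % (word, start, end))

# Search mode also emits shorter sub-words inside long tokens, so spans may overlap;
# HMM=False turns off HMM-based discovery of unknown words.
for word, start, end in jieba.tokenize(sentence, mode='search', HMM=False):
    print('%s\t\t start: %d \t\t end: %d' % (word, start, end))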
Example 1
Project: driverlessai-recipes
Author: h2oai
File: tokenize_chinese.py
License: Apache License 2.0
6 votes

def create_data(X: dt.Frame = None) -> Union[str, List[str],
                                              dt.Frame, List[dt.Frame],
                                              np.ndarray, List[np.ndarray],
                                              pd.DataFrame, List[pd.DataFrame]]:
    # exit gracefully if method is called as a data upload rather than data modify
    if X is None:
        return []
    # Tokenize the chinese text
    import jieba
    X = dt.Frame(X).to_pandas()
    # If no columns to tokenize, use the first column
    if len(cols_to_tokenize) == 0:
        cols_to_tokenize.append(X.columns[0])
    for col in cols_to_tokenize:
        X[col] = X[col].astype('unicode').fillna(u'NA')
        X[col] = X[col].apply(lambda x: " ".join([r[0] for r in jieba.tokenize(x)]))
    return dt.Frame(X)
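This is an excerpt: dt (datatable), np, pd, the typing imports, and the cols_to_tokenize list appear to come from the surrounding recipe module. The core transformation is simply joining the words produced by jieba.tokenize() with spaces. A rough standalone equivalent on a plain pandas column might look like this (column name and sample sentences are illustrative):

# Standalone sketch of the space-join step from Example 1 (pandas only; hypothetical column name).
import jieba
import pandas as pd

df = pd.DataFrame({'text': [u'我爱北京天安门', u'他来到了网易杭研大厦']})
# jieba.tokenize() yields (word, start, end) tuples; keep the word and join with spaces.
df['text'] = df['text'].apply(lambda x: ' '.join(tok[0] for tok in jieba.tokenize(x)))
print(df['text'].tolist())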
Example 2
Project: pycorrector
Author: shibing624
File: tokenizer_test.py
License: Apache License 2.0
6 votes

def test_tokenizer():
    txts = ["我不要你花钱,这些路曲近通幽",
            "这个消息不胫儿走",
            "这个消息不径而走",
            "这个消息不胫而走",
            "复方甘草口服溶液限田基",
            "张老师经常背课到深夜,我们要体晾老师的心苦。",
            '新进人员时,知识当然还不过,可是人有很有精神,面对工作很认真的话,很快就学会、体会。',
            ",我遇到了问题怎么办",
            ",我遇到了问题",
            "问题",
            "北川景子参演了林诣彬导演的《速度与激情3》",
            "林志玲亮相网友:确定不是波多野结衣?",
            "龟山千广和近藤公园在龟山公园里喝酒赏花",
            "小牛曲清去蛋白提取物乙"]
    t = Tokenizer()
    for text in txts:
        print(text)
        print('deault', t.tokenize(text, 'default'))
        print('search', t.tokenize(text, 'search'))
        print('ngram', t.tokenize(text, 'ngram'))

Example 3
Project: pycorrector
Author: shibing624
File: tokenizer_test.py
License: Apache License 2.0
6 votes

def test_detector_tokenizer():
    sents = ["我不要你花钱,这些路曲近通幽",
             "这个消息不胫儿走",
             "这个消息不径而走",
             "这个消息不胫而走",
             "复方甘草口服溶液限田基",
             "张老师经常背课到深夜,我们要体晾老师的心苦。",
             '新进人员时,知识当然还不过,可是人有很有精神,面对工作很认真的话,很快就学会、体会。',
             "北川景子参演了林诣彬导演的《速度与激情3》",
             "林志玲亮相网友:确定不是波多野结衣?",
             "龟山千广和近藤公园在龟山公园里喝酒赏花",
             "问题"]
    d = Detector()
    d.check_detector_initialized()
    detector_tokenizer = d.tokenizer
    for text in sents:
        print(text)
        print('deault', detector_tokenizer.tokenize(text, 'default'))
        print('search', detector_tokenizer.tokenize(text, 'search'))

Example 4
Project: jieba_fast
Author: deepcs233
File: analyzer.py
License: MIT License
5 votes

def __call__(self, text, **kargs):
    words = jieba.tokenize(text, mode="search")
    token = Token()
    for (w, start_pos, stop_pos) in words:
        if not accepted_chars.match(w) and len(w) <= 1:
            continue
        token.original = token.text = w
        token.pos = start_pos
        token.startchar = start_pos
        token.endchar = stop_pos
        yield token
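The __call__ above is a Whoosh-style tokenizer: it wraps jieba.tokenize(text, mode="search") and copies each word and its character offsets into a Whoosh Token. jieba ships a ready-made analyzer built this way as jieba.analyse.ChineseAnalyzer; a minimal sketch of plugging it into a Whoosh schema (assuming the whoosh package is installed) could look like this:

# Sketch: using jieba's bundled Whoosh analyzer, which follows the same
# jieba.tokenize(mode="search") pattern as Example 4. Assumes whoosh is installed.
from jieba.analyse import ChineseAnalyzer
from whoosh.fields import ID, TEXT, Schema

analyzer = ChineseAnalyzer()
schema = Schema(path=ID(stored=True),
                content=TEXT(stored=True, analyzer=analyzer))

# The analyzer is callable, so the tokens it produces can be inspected directly.
for token in analyzer(u'我的好朋友是李明;我爱北京天安门'):
    print(token.text)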
Example 5
Project: jieba_fast
Author: deepcs233
File: jieba_test.py
License: MIT License
5 votes

def testTokenize(self):
    for content in test_contents:
        result = jieba.tokenize(content)
        assert isinstance(result, types.GeneratorType), "Test Tokenize Generator error"
        result = list(result)
        assert isinstance(result, list), "Test Tokenize error on content: %s" % content
        for tk in result:
            print("word %s\t\t start: %d \t\t end: %d" % (tk[0], tk[1], tk[2]), file=sys.stderr)
    print("testTokenize", file=sys.stderr)

Example 6
Project: jieba_fast
Author: deepcs233
File: jieba_test.py
License: MIT License
5 votes

def testTokenize_NOHMM(self):
    for content in test_contents:
        result = jieba.tokenize(content, HMM=False)
        assert isinstance(result, types.GeneratorType), "Test Tokenize Generator error"
        result = list(result)
        assert isinstance(result, list), "Test Tokenize error on content: %s" % content
        for tk in result:
            print("word %s\t\t start: %d \t\t end: %d" % (tk[0], tk[1], tk[2]), file=sys.stderr)
    print("testTokenize_NOHMM", file=sys.stderr)

Example 7
Project: jieba_fast
Author: deepcs233
File: test_tokenize_no_hmm.py
License: MIT License
5 votes

def cuttest(test_sent):
    global g_mode
    result = jieba.tokenize(test_sent, mode=g_mode, HMM=False)
    for tk in result:
        print("word %s\t\t start: %d \t\t end: %d" % (tk[0], tk[1], tk[2]))

Example 8
Project: jieba_fast
Author: deepcs233
File: test_tokenize.py
License: MIT License
5 votes

def cuttest(test_sent):
    global g_mode
    result = jieba.tokenize(test_sent, mode=g_mode)
    for tk in result:
        print("word %s\t\t start: %d \t\t end: %d" % (tk[0], tk[1], tk[2]))
Example 9
Project: chinese-support-redux
Author: luoliyan
File: analyzer.py
License: GNU General Public License v3.0
5 votes

def __call__(self, text, **kargs):
    words = jieba.tokenize(text, mode="search")
    token = Token()
    for (w, start_pos, stop_pos) in words:
        if not accepted_chars.match(w) and len(w) <= 1:
            continue
        token.original = token.text = w
        token.pos = start_pos
        token.startchar = start_pos
        token.endchar = stop_pos
        yield token

Example 10
Project: pycorrector
Author: shibing624
File: tokenizer_test.py
License: Apache License 2.0
5 votes

def test_segment():
    """测试疾病名纠错"""
    error_sentence_1 = '这个新药奥美砂坦脂片能治疗心绞痛,效果还可以'  # 奥美沙坦酯片
    print(error_sentence_1)
    print(segment(error_sentence_1))
    import jieba
    print(list(jieba.tokenize(error_sentence_1)))
    import jieba.posseg as pseg
    words = pseg.lcut("我爱北京天安门")  # jieba默认模式
    print('old:', words)
    # jieba.enable_paddle()  # 启动paddle模式。0.40版之后开始支持,早期版本不支持
    # words = pseg.cut("我爱北京天安门", use_paddle=True)  # paddle模式
    # for word, flag in words:
    #     print('new:', '%s %s' % (word, flag))
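Example 10 mixes jieba.tokenize() with part-of-speech tagging from jieba.posseg; the commented-out lines refer to jieba's optional paddle mode (supported from version 0.40 on, per the comment, and requiring the paddlepaddle package). For reference, pseg.lcut() returns a list of pair(word, flag) objects; for the sentence used above, the jieba documentation gives roughly this output:

# Sketch: what pseg.lcut() produces for the sentence used in Example 10.
# Flags follow the jieba documentation (r = pronoun, v = verb, ns = place name).
import jieba.posseg as pseg

words = pseg.lcut(u'我爱北京天安门')
for w in words:
    print('%s/%s' % (w.word, w.flag))
# Expected, per the jieba README: 我/r 爱/v 北京/ns 天安门/ns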
Example 11
Project: Synonyms
Author: huyingxi
File: analyzer.py
License: MIT License
5 votes

def __call__(self, text, **kargs):
    words = jieba.tokenize(text, mode="search")
    token = Token()
    for (w, start_pos, stop_pos) in words:
        if not accepted_chars.match(w) and len(w) <= 1:
            continue
        token.original = token.text = w
        token.pos = start_pos
        token.startchar = start_pos
        token.endchar = stop_pos
        yield token

Example 12
Project: rasa_nlu_gq
Author: GaoQ1
File: jieba_pseg_extractor.py
License: Apache License 2.0
5 votes

def posseg(text):
    # type: (Text) -> List[Token]
    result = []
    for (word, start, end) in jieba.tokenize(text):
        pseg_data = [(w, f) for (w, f) in pseg.cut(word)]
        result.append((pseg_data, start, end))
    return result
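This extractor tags the words inside each tokenize() span, so every entry of the result has the shape ([(word, flag), ...], start, end). A self-contained sketch of the same idea, with the imports the excerpt leaves out and an illustrative sentence:

# Sketch: standalone version of Example 12's posseg() helper (illustrative sentence).
import jieba
import jieba.posseg as pseg

def posseg(text):
    result = []
    for word, start, end in jieba.tokenize(text):
        # Attach POS flags to the words inside each tokenize() span.
        pseg_data = [(w, f) for w, f in pseg.cut(word)]
        result.append((pseg_data, start, end))
    return result

for entry in posseg(u'我爱北京天安门'):
    print(entry)  # e.g. ([('北京', 'ns')], 2, 4) for the span covering 北京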
Example 13
Project: rasa_nlu
Author: weizhenzhao
File: jieba_tokenizer.py
License: Apache License 2.0
5 votes

def train(self,
          training_data: TrainingData,
          config: RasaNLUModelConfig,
          **kwargs: Any) -> None:
    for example in training_data.training_examples:
        example.set("tokens", self.tokenize(example.text))

Example 14
Project: rasa_nlu
Author: weizhenzhao
File: jieba_tokenizer.py
License: Apache License 2.0
5 votes

def process(self, message: Message, **kwargs: Any) -> None:
    message.set("tokens", self.tokenize(message.text))

Example 15
Project: rasa_nlu
Author: weizhenzhao
File: jieba_tokenizer.py
License: Apache License 2.0
5 votes

def tokenize(text: Text) -> List[Token]:
    import jieba
    tokenized = jieba.tokenize(text)
    tokens = [Token(word, start) for (word, start, end) in tokenized]
    return tokens
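Note that this tokenizer keeps only each word and its start offset and discards the end offset. Rasa's actual Token class is not shown on this page; below is a rough, self-contained sketch with a hypothetical stand-in Token, just to make the snippet runnable on its own:

# Sketch: Example 15's tokenize() with a hypothetical stand-in Token class
# (the real Rasa Token class is not shown on this page).
from typing import List, NamedTuple

import jieba


class Token(NamedTuple):
    text: str
    offset: int


def tokenize(text: str) -> List[Token]:
    tokenized = jieba.tokenize(text)
    # Keep each word and its start offset; the end offset is dropped.
    return [Token(word, start) for word, start, end in tokenized]


print(tokenize(u'我想找地方吃饭'))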
Example 16
Project: rasa_nlu
Author: weizhenzhao
File: jieba_pseg_extractor.py
License: Apache License 2.0
5 votes

def posseg(text):
    # type: (Text) -> List[Token]
    import jieba
    import jieba.posseg as pseg
    result = []
    for (word, start, end) in jieba.tokenize(text):
        pseg_data = [(w, f) for (w, f) in pseg.cut(word)]
        result.append((pseg_data, start, end))
    return result

Example 17
Project: rasa_bot
Author: Ma-Dan
File: jieba_tokenizer.py
License: Apache License 2.0
5 votes

def train(self, training_data, config, **kwargs):
    # type: (TrainingData, RasaNLUModelConfig, **Any) -> None
    for example in training_data.training_examples:
        example.set("tokens", self.tokenize(example.text))

Example 18
Project: rasa_bot
Author: Ma-Dan
File: jieba_tokenizer.py
License: Apache License 2.0
5 votes

def process(self, message, **kwargs):
    # type: (Message, **Any) -> None
    message.set("tokens", self.tokenize(message.text))

Example 19
Project: rasa_bot
Author: Ma-Dan
File: jieba_tokenizer.py
License: Apache License 2.0
5 votes

def tokenize(self, text):
    # type: (Text) -> List[Token]
    import jieba
    tokenized = jieba.tokenize(text)
    tokens = [Token(word, start) for (word, start, end) in tokenized]
    return tokens

Example 20
Project: QAbot_by_base_KG
Author: Goooaaal
File: analyzer.py
License: MIT License
5 votes

def __call__(self, text, **kargs):
    words = jieba.tokenize(text, mode="search")
    token = Token()
    for (w, start_pos, stop_pos) in words:
        if not accepted_chars.match(w) and len(w) <= 1:
            continue
        token.original = token.text = w
        token.pos = start_pos
        token.startchar = start_pos
        token.endchar = stop_pos
        yield token
Example 21
Project: python-girlfriend-mood
Author: CasterWx
File: analyzer.py
License: MIT License
5 votes

def __call__(self, text, **kargs):
    words = jieba.tokenize(text, mode="search")
    token = Token()
    for (w, start_pos, stop_pos) in words:
        if not accepted_chars.match(w) and len(w) <= 1:
            continue
        token.original = token.text = w
        token.pos = start_pos
        token.startchar = start_pos
        token.endchar = stop_pos
        yield token

Example 22
Project: rasa-for-botfront
Author: botfront
File: jieba_tokenizer.py
License: Apache License 2.0
5 votes

def tokenize(self, message: Message, attribute: Text) -> List[Token]:
    import jieba
    text = message.get(attribute)
    tokenized = jieba.tokenize(text)
    tokens = [Token(word, start) for (word, start, end) in tokenized]
    return tokens

Example 23
Project: annotated_jieba
Author: ustcdane
File: analyzer.py
License: MIT License
5 votes

def __call__(self, text, **kargs):
    words = jieba.tokenize(text, mode="search")
    token = Token()
    for (w, start_pos, stop_pos) in words:
        if not accepted_chars.match(w) and len(w) <= 1:
            continue
        token.original = token.text = w
        token.pos = start_pos
        token.startchar = start_pos
        token.endchar = stop_pos
        yield token
Example 24
Project: annotated_jieba
Author: ustcdane
File: jieba_test.py
License: MIT License
5 votes

def testTokenize(self):
    for content in test_contents:
        result = jieba.tokenize(content)
        assert isinstance(result, types.GeneratorType), "Test Tokenize Generator error"
        result = list(result)
        assert isinstance(result, list), "Test Tokenize error on content: %s" % content
        for tk in result:
            print("word %s\t\t start: %d \t\t end: %d" % (tk[0], tk[1], tk[2]), file=sys.stderr)
    print("testTokenize", file=sys.stderr)

Example 25
Project: annotated_jieba
Author: ustcdane
File: jieba_test.py
License: MIT License
5 votes

def testTokenize_NOHMM(self):
    for content in test_contents:
        result = jieba.tokenize(content, HMM=False)
        assert isinstance(result, types.GeneratorType), "Test Tokenize Generator error"
        result = list(result)
        assert isinstance(result, list), "Test Tokenize error on content: %s" % content
        for tk in result:
            print("word %s\t\t start: %d \t\t end: %d" % (tk[0], tk[1], tk[2]), file=sys.stderr)
    print("testTokenize_NOHMM", file=sys.stderr)

Example 26
Project: annotated_jieba
Author: ustcdane
File: test_tokenize_no_hmm.py
License: MIT License
5 votes

def cuttest(test_sent):
    global g_mode
    result = jieba.tokenize(test_sent, mode=g_mode, HMM=False)
    for tk in result:
        print("word %s\t\t start: %d \t\t end: %d" % (tk[0], tk[1], tk[2]))

Example 27
Project: annotated_jieba
Author: ustcdane
File: test_tokenize.py
License: MIT License
5 votes

def cuttest(test_sent):
    global g_mode
    result = jieba.tokenize(test_sent, mode=g_mode)
    for tk in result:
        print("word %s\t\t start: %d \t\t end: %d" % (tk[0], tk[1], tk[2]))
Example 28
Project: Malicious_Domain_Whois
Author: h-j-13
File: analyzer.py
License: GNU General Public License v3.0
5 votes

def __call__(self, text, **kargs):
    words = jieba.tokenize(text, mode="search")
    token = Token()
    for (w, start_pos, stop_pos) in words:
        if not accepted_chars.match(w) and len(w) <= 1:
            continue
        token.original = token.text = w
        token.pos = start_pos
        token.startchar = start_pos
        token.endchar = stop_pos
        yield token

Example 29
Project: Malicious_Domain_Whois
Author: h-j-13
File: jieba_test.py
License: GNU General Public License v3.0
5 votes

def testTokenize(self):
    for content in test_contents:
        result = jieba.tokenize(content)
        assert isinstance(result, types.GeneratorType), "Test Tokenize Generator error"
        result = list(result)
        assert isinstance(result, list), "Test Tokenize error on content: %s" % content
        for tk in result:
            print("word %s\t\t start: %d \t\t end: %d" % (tk[0], tk[1], tk[2]), file=sys.stderr)
    print("testTokenize", file=sys.stderr)

Example 30
Project: Malicious_Domain_Whois
Author: h-j-13
File: jieba_test.py
License: GNU General Public License v3.0
5 votes

def testTokenize_NOHMM(self):
    for content in test_contents:
        result = jieba.tokenize(content, HMM=False)
        assert isinstance(result, types.GeneratorType), "Test Tokenize Generator error"
        result = list(result)
        assert isinstance(result, list), "Test Tokenize error on content: %s" % content
        for tk in result:
            print("word %s\t\t start: %d \t\t end: %d" % (tk[0], tk[1], tk[2]), file=sys.stderr)
    print("testTokenize_NOHMM", file=sys.stderr)
Further reading
- 1. Python 结巴分词 (jieba) Tokenize 和 ChineseAnalyzer 的使用及 ... — Explains how to use the Tokenize method of the jieba segmenter in Python to get each word's start and end position in the original text, along with ChineseAnalyzer usage and related sample code.
- 2. [NLP][Python] 中文斷詞最方便的開源工具之一: Jieba — Jieba is one of the best-known open-source Chinese word-segmentation tools for Python; it also supports many other NLP tasks, such as POS tagging and keyword extraction.
- 3. jieba 词性标注 & 并行分词 | 计算机科学论坛 - LearnKu — jieba POS tagging: jieba.posseg.POSTokenizer(tokenizer=None) creates a custom POS tokenizer; the tokenizer argument specifies the internal jieba.Tokenizer to use. ji...
- 4. fxsjy/jieba: 结巴中文分词 — Tokenizer(dictionary=DEFAULT_DICT) creates a custom tokenizer, which makes it possible to use different dictionaries at the same time. jieba.dt is the default tokenizer, and all module-level segmentation functions are mappings onto it. Includes sample code.
- 5. jieba——分詞、添加詞典、詞性標註、Tokenize - 台部落 — Covers segmentation, custom dictionaries, POS tagging and Tokenize. The jieba.cut method accepts three parameters: the string to segment, the cut_all flag controlling full mode, and HMM ...