Python Examples of jieba.tokenize - ProgramCreek.com
Python jieba.tokenize() Examples

The following are 30 code examples for showing how to use jieba.tokenize(). These examples are extracted from open source projects. You can vote up the ones you like or vote down the ones you don't like, and go to the original project or source file by following the links above each example. You may check out the related API usage on the sidebar. You may also want to check out all available functions/classes of the module jieba, or try the search function.
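As a quick orientation before the examples: jieba.tokenize() takes a unicode string and yields (word, start, end) tuples, where start and end are character offsets into the original text. It accepts mode='default' or mode='search' and an HMM flag. Below is a minimal sketch using an illustrative sentence; the parameters are as documented by jieba.

# Minimal sketch of the documented jieba.tokenize() API; the sentence is only illustrative.
import jieba

sentence = u'永和服装饰品有限公司'

# Default mode: non-overlapping words with their character offsets.
for word, start, end in jieba.tokenize(sentence):
    print('%s\t\t start: %d \t\t end: %d' % (word, start, end))

# Search mode also emits shorter sub-words inside long tokens, so spans may overlap;
# HMM=False turns off HMM-based discovery of unknown words.
for word, start, end in jieba.tokenize(sentence, mode='search', HMM=False):
    print('%s\t\t start: %d \t\t end: %d' % (word, start, end))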
Example 1
Project: driverlessai-recipes
Author: h2oai
File: tokenize_chinese.py
License: Apache License 2.0
6 votes

def create_data(X: dt.Frame = None) -> Union[str, List[str],
                                              dt.Frame, List[dt.Frame],
                                              np.ndarray, List[np.ndarray],
                                              pd.DataFrame, List[pd.DataFrame]]:
    # exit gracefully if method is called as a data upload rather than data modify
    if X is None:
        return []
    # Tokenize the chinese text
    import jieba
    X = dt.Frame(X).to_pandas()
    # If no columns to tokenize, use the first column
    if len(cols_to_tokenize) == 0:
        cols_to_tokenize.append(X.columns[0])
    for col in cols_to_tokenize:
        X[col] = X[col].astype('unicode').fillna(u'NA')
        X[col] = X[col].apply(lambda x: " ".join([r[0] for r in jieba.tokenize(x)]))
    return dt.Frame(X)
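This is an excerpt: dt (datatable), np, pd, the typing imports, and the cols_to_tokenize list appear to come from the surrounding recipe module. The core transformation is simply joining the words produced by jieba.tokenize() with spaces. A rough standalone equivalent on a plain pandas column might look like this (column name and sample sentences are illustrative):

# Standalone sketch of the space-join step from Example 1 (pandas only; hypothetical column name).
import jieba
import pandas as pd

df = pd.DataFrame({'text': [u'我爱北京天安门', u'他来到了网易杭研大厦']})
# jieba.tokenize() yields (word, start, end) tuples; keep the word and join with spaces.
df['text'] = df['text'].apply(lambda x: ' '.join(tok[0] for tok in jieba.tokenize(x)))
print(df['text'].tolist())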
Example 2
Project: pycorrector
Author: shibing624
File: tokenizer_test.py
License: Apache License 2.0
6 votes

def test_tokenizer():
    txts = ["我不要你花钱,这些路曲近通幽",
            "这个消息不胫儿走",
            "这个消息不径而走",
            "这个消息不胫而走",
            "复方甘草口服溶液限田基",
            "张老师经常背课到深夜,我们要体晾老师的心苦。",
            '新进人员时,知识当然还不过,可是人有很有精神,面对工作很认真的话,很快就学会、体会。',
            ",我遇到了问题怎么办",
            ",我遇到了问题",
            "问题",
            "北川景子参演了林诣彬导演的《速度与激情3》",
            "林志玲亮相网友:确定不是波多野结衣?",
            "龟山千广和近藤公园在龟山公园里喝酒赏花",
            "小牛曲清去蛋白提取物乙"]
    t = Tokenizer()
    for text in txts:
        print(text)
        print('deault', t.tokenize(text, 'default'))
        print('search', t.tokenize(text, 'search'))
        print('ngram', t.tokenize(text, 'ngram'))

Example 3
Project: pycorrector
Author: shibing624
File: tokenizer_test.py
License: Apache License 2.0
6 votes

def test_detector_tokenizer():
    sents = ["我不要你花钱,这些路曲近通幽",
             "这个消息不胫儿走",
             "这个消息不径而走",
             "这个消息不胫而走",
             "复方甘草口服溶液限田基",
             "张老师经常背课到深夜,我们要体晾老师的心苦。",
             '新进人员时,知识当然还不过,可是人有很有精神,面对工作很认真的话,很快就学会、体会。',
             "北川景子参演了林诣彬导演的《速度与激情3》",
             "林志玲亮相网友:确定不是波多野结衣?",
             "龟山千广和近藤公园在龟山公园里喝酒赏花",
             "问题"]
    d = Detector()
    d.check_detector_initialized()
    detector_tokenizer = d.tokenizer
    for text in sents:
        print(text)
        print('deault', detector_tokenizer.tokenize(text, 'default'))
        print('search', detector_tokenizer.tokenize(text, 'search'))

Example 4
Project: jieba_fast
Author: deepcs233
File: analyzer.py
License: MIT License
5 votes

def __call__(self, text, **kargs):
    words = jieba.tokenize(text, mode="search")
    token = Token()
    for (w, start_pos, stop_pos) in words:
        if not accepted_chars.match(w) and len(w) <= 1:
            continue
        token.original = token.text = w
        token.pos = start_pos
        token.startchar = start_pos
        token.endchar = stop_pos
        yield token
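The __call__ above is a Whoosh-style tokenizer: it wraps jieba.tokenize(text, mode="search") and copies each word and its character offsets into a Whoosh Token. jieba ships a ready-made analyzer built this way as jieba.analyse.ChineseAnalyzer; a minimal sketch of plugging it into a Whoosh schema (assuming the whoosh package is installed) could look like this:

# Sketch: using jieba's bundled Whoosh analyzer, which follows the same
# jieba.tokenize(mode="search") pattern as Example 4. Assumes whoosh is installed.
from jieba.analyse import ChineseAnalyzer
from whoosh.fields import ID, TEXT, Schema

analyzer = ChineseAnalyzer()
schema = Schema(path=ID(stored=True),
                content=TEXT(stored=True, analyzer=analyzer))

# The analyzer is callable, so the tokens it produces can be inspected directly.
for token in analyzer(u'我的好朋友是李明;我爱北京天安门'):
    print(token.text)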
Example 5
Project: jieba_fast
Author: deepcs233
File: jieba_test.py
License: MIT License
5 votes

def testTokenize(self):
    for content in test_contents:
        result = jieba.tokenize(content)
        assert isinstance(result, types.GeneratorType), "Test Tokenize Generator error"
        result = list(result)
        assert isinstance(result, list), "Test Tokenize error on content: %s" % content
        for tk in result:
            print("word %s\t\t start: %d \t\t end: %d" % (tk[0], tk[1], tk[2]), file=sys.stderr)
    print("testTokenize", file=sys.stderr)

Example 6
Project: jieba_fast
Author: deepcs233
File: jieba_test.py
License: MIT License
5 votes

def testTokenize_NOHMM(self):
    for content in test_contents:
        result = jieba.tokenize(content, HMM=False)
        assert isinstance(result, types.GeneratorType), "Test Tokenize Generator error"
        result = list(result)
        assert isinstance(result, list), "Test Tokenize error on content: %s" % content
        for tk in result:
            print("word %s\t\t start: %d \t\t end: %d" % (tk[0], tk[1], tk[2]), file=sys.stderr)
    print("testTokenize_NOHMM", file=sys.stderr)

Example 7
Project: jieba_fast
Author: deepcs233
File: test_tokenize_no_hmm.py
License: MIT License
5 votes

def cuttest(test_sent):
    global g_mode
    result = jieba.tokenize(test_sent, mode=g_mode, HMM=False)
    for tk in result:
        print("word %s\t\t start: %d \t\t end: %d" % (tk[0], tk[1], tk[2]))

Example 8
Project: jieba_fast
Author: deepcs233
File: test_tokenize.py
License: MIT License
5 votes

def cuttest(test_sent):
    global g_mode
    result = jieba.tokenize(test_sent, mode=g_mode)
    for tk in result:
        print("word %s\t\t start: %d \t\t end: %d" % (tk[0], tk[1], tk[2]))
Example 9
Project: chinese-support-redux
Author: luoliyan
File: analyzer.py
License: GNU General Public License v3.0
5 votes

def __call__(self, text, **kargs):
    words = jieba.tokenize(text, mode="search")
    token = Token()
    for (w, start_pos, stop_pos) in words:
        if not accepted_chars.match(w) and len(w) <= 1:
            continue
        token.original = token.text = w
        token.pos = start_pos
        token.startchar = start_pos
        token.endchar = stop_pos
        yield token

Example 10
Project: pycorrector
Author: shibing624
File: tokenizer_test.py
License: Apache License 2.0
5 votes

def test_segment():
    """测试疾病名纠错"""
    error_sentence_1 = '这个新药奥美砂坦脂片能治疗心绞痛,效果还可以'  # 奥美沙坦酯片
    print(error_sentence_1)
    print(segment(error_sentence_1))
    import jieba
    print(list(jieba.tokenize(error_sentence_1)))
    import jieba.posseg as pseg
    words = pseg.lcut("我爱北京天安门")  # jieba默认模式
    print('old:', words)
    # jieba.enable_paddle()  # 启动paddle模式。0.40版之后开始支持,早期版本不支持
    # words = pseg.cut("我爱北京天安门", use_paddle=True)  # paddle模式
    # for word, flag in words:
    #     print('new:', '%s %s' % (word, flag))
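Example 10 mixes jieba.tokenize() with part-of-speech tagging from jieba.posseg; the commented-out lines refer to jieba's optional paddle mode (supported from version 0.40 on, per the comment, and requiring the paddlepaddle package). For reference, pseg.lcut() returns a list of pair(word, flag) objects; for the sentence used above, the jieba documentation gives roughly this output:

# Sketch: what pseg.lcut() produces for the sentence used in Example 10.
# Flags follow the jieba documentation (r = pronoun, v = verb, ns = place name).
import jieba.posseg as pseg

words = pseg.lcut(u'我爱北京天安门')
for w in words:
    print('%s/%s' % (w.word, w.flag))
# Expected, per the jieba README: 我/r 爱/v 北京/ns 天安门/ns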
Example 11
Project: Synonyms
Author: huyingxi
File: analyzer.py
License: MIT License
5 votes

def __call__(self, text, **kargs):
    words = jieba.tokenize(text, mode="search")
    token = Token()
    for (w, start_pos, stop_pos) in words:
        if not accepted_chars.match(w) and len(w) <= 1:
            continue
        token.original = token.text = w
        token.pos = start_pos
        token.startchar = start_pos
        token.endchar = stop_pos
        yield token

Example 12
Project: rasa_nlu_gq
Author: GaoQ1
File: jieba_pseg_extractor.py
License: Apache License 2.0
5 votes

def posseg(text):
    # type: (Text) -> List[Token]
    result = []
    for (word, start, end) in jieba.tokenize(text):
        pseg_data = [(w, f) for (w, f) in pseg.cut(word)]
        result.append((pseg_data, start, end))
    return result
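This extractor tags the words inside each tokenize() span, so every entry of the result has the shape ([(word, flag), ...], start, end). A self-contained sketch of the same idea, with the imports the excerpt leaves out and an illustrative sentence:

# Sketch: standalone version of Example 12's posseg() helper (illustrative sentence).
import jieba
import jieba.posseg as pseg

def posseg(text):
    result = []
    for word, start, end in jieba.tokenize(text):
        # Attach POS flags to the words inside each tokenize() span.
        pseg_data = [(w, f) for w, f in pseg.cut(word)]
        result.append((pseg_data, start, end))
    return result

for entry in posseg(u'我爱北京天安门'):
    print(entry)  # e.g. ([('北京', 'ns')], 2, 4) for the span covering 北京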
Example 13
Project: rasa_nlu
Author: weizhenzhao
File: jieba_tokenizer.py
License: Apache License 2.0
5 votes

def train(self,
          training_data: TrainingData,
          config: RasaNLUModelConfig,
          **kwargs: Any) -> None:
    for example in training_data.training_examples:
        example.set("tokens", self.tokenize(example.text))

Example 14
Project: rasa_nlu
Author: weizhenzhao
File: jieba_tokenizer.py
License: Apache License 2.0
5 votes

def process(self, message: Message, **kwargs: Any) -> None:
    message.set("tokens", self.tokenize(message.text))

Example 15
Project: rasa_nlu
Author: weizhenzhao
File: jieba_tokenizer.py
License: Apache License 2.0
5 votes

def tokenize(text: Text) -> List[Token]:
    import jieba
    tokenized = jieba.tokenize(text)
    tokens = [Token(word, start) for (word, start, end) in tokenized]
    return tokens
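Note that this tokenizer keeps only each word and its start offset and discards the end offset. Rasa's actual Token class is not shown on this page; below is a rough, self-contained sketch with a hypothetical stand-in Token, just to make the snippet runnable on its own:

# Sketch: Example 15's tokenize() with a hypothetical stand-in Token class
# (the real Rasa Token class is not shown on this page).
from typing import List, NamedTuple

import jieba


class Token(NamedTuple):
    text: str
    offset: int


def tokenize(text: str) -> List[Token]:
    tokenized = jieba.tokenize(text)
    # Keep each word and its start offset; the end offset is dropped.
    return [Token(word, start) for word, start, end in tokenized]


print(tokenize(u'我想找地方吃饭'))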
Example 16
Project: rasa_nlu
Author: weizhenzhao
File: jieba_pseg_extractor.py
License: Apache License 2.0
5 votes

def posseg(text):
    # type: (Text) -> List[Token]
    import jieba
    import jieba.posseg as pseg
    result = []
    for (word, start, end) in jieba.tokenize(text):
        pseg_data = [(w, f) for (w, f) in pseg.cut(word)]
        result.append((pseg_data, start, end))
    return result

Example 17
Project: rasa_bot
Author: Ma-Dan
File: jieba_tokenizer.py
License: Apache License 2.0
5 votes

def train(self, training_data, config, **kwargs):
    # type: (TrainingData, RasaNLUModelConfig, **Any) -> None
    for example in training_data.training_examples:
        example.set("tokens", self.tokenize(example.text))

Example 18
Project: rasa_bot
Author: Ma-Dan
File: jieba_tokenizer.py
License: Apache License 2.0
5 votes

def process(self, message, **kwargs):
    # type: (Message, **Any) -> None
    message.set("tokens", self.tokenize(message.text))

Example 19
Project: rasa_bot
Author: Ma-Dan
File: jieba_tokenizer.py
License: Apache License 2.0
5 votes

def tokenize(self, text):
    # type: (Text) -> List[Token]
    import jieba
    tokenized = jieba.tokenize(text)
    tokens = [Token(word, start) for (word, start, end) in tokenized]
    return tokens

Example 20
Project: QAbot_by_base_KG
Author: Goooaaal
File: analyzer.py
License: MIT License
5 votes

def __call__(self, text, **kargs):
    words = jieba.tokenize(text, mode="search")
    token = Token()
    for (w, start_pos, stop_pos) in words:
        if not accepted_chars.match(w) and len(w) <= 1:
            continue
        token.original = token.text = w
        token.pos = start_pos
        token.startchar = start_pos
        token.endchar = stop_pos
        yield token
Example 21
Project: python-girlfriend-mood
Author: CasterWx
File: analyzer.py
License: MIT License
5 votes

def __call__(self, text, **kargs):
    words = jieba.tokenize(text, mode="search")
    token = Token()
    for (w, start_pos, stop_pos) in words:
        if not accepted_chars.match(w) and len(w) <= 1:
            continue
        token.original = token.text = w
        token.pos = start_pos
        token.startchar = start_pos
        token.endchar = stop_pos
        yield token

Example 22
Project: rasa-for-botfront
Author: botfront
File: jieba_tokenizer.py
License: Apache License 2.0
5 votes

def tokenize(self, message: Message, attribute: Text) -> List[Token]:
    import jieba
    text = message.get(attribute)
    tokenized = jieba.tokenize(text)
    tokens = [Token(word, start) for (word, start, end) in tokenized]
    return tokens

Example 23
Project: annotated_jieba
Author: ustcdane
File: analyzer.py
License: MIT License
5 votes

def __call__(self, text, **kargs):
    words = jieba.tokenize(text, mode="search")
    token = Token()
    for (w, start_pos, stop_pos) in words:
        if not accepted_chars.match(w) and len(w) <= 1:
            continue
        token.original = token.text = w
        token.pos = start_pos
        token.startchar = start_pos
        token.endchar = stop_pos
        yield token
Example 24
Project: annotated_jieba
Author: ustcdane
File: jieba_test.py
License: MIT License
5 votes

def testTokenize(self):
    for content in test_contents:
        result = jieba.tokenize(content)
        assert isinstance(result, types.GeneratorType), "Test Tokenize Generator error"
        result = list(result)
        assert isinstance(result, list), "Test Tokenize error on content: %s" % content
        for tk in result:
            print("word %s\t\t start: %d \t\t end: %d" % (tk[0], tk[1], tk[2]), file=sys.stderr)
    print("testTokenize", file=sys.stderr)

Example 25
Project: annotated_jieba
Author: ustcdane
File: jieba_test.py
License: MIT License
5 votes

def testTokenize_NOHMM(self):
    for content in test_contents:
        result = jieba.tokenize(content, HMM=False)
        assert isinstance(result, types.GeneratorType), "Test Tokenize Generator error"
        result = list(result)
        assert isinstance(result, list), "Test Tokenize error on content: %s" % content
        for tk in result:
            print("word %s\t\t start: %d \t\t end: %d" % (tk[0], tk[1], tk[2]), file=sys.stderr)
    print("testTokenize_NOHMM", file=sys.stderr)

Example 26
Project: annotated_jieba
Author: ustcdane
File: test_tokenize_no_hmm.py
License: MIT License
5 votes

def cuttest(test_sent):
    global g_mode
    result = jieba.tokenize(test_sent, mode=g_mode, HMM=False)
    for tk in result:
        print("word %s\t\t start: %d \t\t end: %d" % (tk[0], tk[1], tk[2]))

Example 27
Project: annotated_jieba
Author: ustcdane
File: test_tokenize.py
License: MIT License
5 votes

def cuttest(test_sent):
    global g_mode
    result = jieba.tokenize(test_sent, mode=g_mode)
    for tk in result:
        print("word %s\t\t start: %d \t\t end: %d" % (tk[0], tk[1], tk[2]))
Example 28
Project: Malicious_Domain_Whois
Author: h-j-13
File: analyzer.py
License: GNU General Public License v3.0
5 votes

def __call__(self, text, **kargs):
    words = jieba.tokenize(text, mode="search")
    token = Token()
    for (w, start_pos, stop_pos) in words:
        if not accepted_chars.match(w) and len(w) <= 1:
            continue
        token.original = token.text = w
        token.pos = start_pos
        token.startchar = start_pos
        token.endchar = stop_pos
        yield token

Example 29
Project: Malicious_Domain_Whois
Author: h-j-13
File: jieba_test.py
License: GNU General Public License v3.0
5 votes

def testTokenize(self):
    for content in test_contents:
        result = jieba.tokenize(content)
        assert isinstance(result, types.GeneratorType), "Test Tokenize Generator error"
        result = list(result)
        assert isinstance(result, list), "Test Tokenize error on content: %s" % content
        for tk in result:
            print("word %s\t\t start: %d \t\t end: %d" % (tk[0], tk[1], tk[2]), file=sys.stderr)
    print("testTokenize", file=sys.stderr)

Example 30
Project: Malicious_Domain_Whois
Author: h-j-13
File: jieba_test.py
License: GNU General Public License v3.0
5 votes

def testTokenize_NOHMM(self):
    for content in test_contents:
        result = jieba.tokenize(content, HMM=False)
        assert isinstance(result, types.GeneratorType), "Test Tokenize Generator error"
        result = list(result)
        assert isinstance(result, list), "Test Tokenize error on content: %s" % content
        for tk in result:
            print("word %s\t\t start: %d \t\t end: %d" % (tk[0], tk[1], tk[2]), file=sys.stderr)
    print("testTokenize_NOHMM", file=sys.stderr)
Further reading
- 1. Python 结巴分词 (jieba) Tokenize 和 ChineseAnalyzer 的使用及 ... — Explains how to use the Tokenize method of the jieba segmenter in Python to get each word's start and end position in the original text, along with ChineseAnalyzer usage and related sample code.
- 2. [NLP][Python] 中文斷詞最方便的開源工具之一: Jieba — Jieba is one of the best-known open-source Chinese word-segmentation tools for Python; it also supports many other NLP tasks, such as POS tagging and keyword extraction.
- 3. jieba 词性标注 & 并行分词 | 计算机科学论坛 - LearnKu — jieba POS tagging: jieba.posseg.POSTokenizer(tokenizer=None) creates a custom POS tokenizer; the tokenizer argument specifies the internal jieba.Tokenizer to use. ji...
- 4. fxsjy/jieba: 结巴中文分词 — Tokenizer(dictionary=DEFAULT_DICT) creates a custom tokenizer, which makes it possible to use different dictionaries at the same time. jieba.dt is the default tokenizer, and all module-level segmentation functions are mappings onto it. Includes sample code.
- 5. jieba——分詞、添加詞典、詞性標註、Tokenize - 台部落 — Covers segmentation, custom dictionaries, POS tagging and Tokenize. The jieba.cut method accepts three parameters: the string to segment, the cut_all flag controlling full mode, and HMM ...