Python Examples of jieba.tokenize - ProgramCreek.com


Python jieba.tokenize() Examples

The following are 30 code examples showing how to use jieba.tokenize(). These examples are extracted from open source projects. You can vote up the ones you like or vote down the ones you don't like, and go to the original project or source file via the project, author and file names above each example. You may also want to check out all available functions/classes of the module jieba, or try the search function.
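Before the project-specific examples, a minimal sketch of the call itself may help. jieba.tokenize() takes a unicode string and yields (word, start, end) tuples; the sample sentence and offsets below follow jieba's own documentation:

import jieba

# Default mode: one tuple per segmented word, with character offsets.
for word, start, end in jieba.tokenize(u'永和服装饰品有限公司'):
    print("word %s\t\t start: %d \t\t end: %d" % (word, start, end))
# word 永和        start: 0        end: 2
# word 服装        start: 2        end: 4
# word 饰品        start: 4        end: 6
# word 有限公司    start: 6        end: 10

# mode="search" additionally yields sub-words of long tokens (for index building);
# HMM=False disables the HMM-based discovery of out-of-vocabulary words.
tokens = jieba.tokenize(u'永和服装饰品有限公司', mode='search', HMM=True)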

", '新进人员时,知识当然还不过,可是人有很有精神,面对工作很认真的话,很快就学会、体会。

', ",我遇到了问题怎么办", ",我遇到了问题", "问题", "北川景子参演了林诣彬导演的《速度与激情3》", "林志玲亮相网友:确定不是波多野结衣?", "龟山千广和近藤公园在龟山公园里喝酒赏花", "小牛曲清去蛋白提取物乙"] t=Tokenizer() fortextintxts: print(text) print('deault',t.tokenize(text,'default')) print('search',t.tokenize(text,'search')) print('ngram',t.tokenize(text,'ngram'))Example3Project: pycorrector   Author:shibing624   File:tokenizer_test.py   License:ApacheLicense2.06 votes deftest_detector_tokenizer(): sents=["我不要你花钱,这些路曲近通幽", "这个消息不胫儿走", "这个消息不径而走", "这个消息不胫而走", "复方甘草口服溶液限田基", "张老师经常背课到深夜,我们要体晾老师的心苦。

", '新进人员时,知识当然还不过,可是人有很有精神,面对工作很认真的话,很快就学会、体会。

', "北川景子参演了林诣彬导演的《速度与激情3》", "林志玲亮相网友:确定不是波多野结衣?", "龟山千广和近藤公园在龟山公园里喝酒赏花", "问题" ] d=Detector() d.check_detector_initialized() detector_tokenizer=d.tokenizer fortextinsents: print(text) print('deault',detector_tokenizer.tokenize(text,'default')) print('search',detector_tokenizer.tokenize(text,'search'))Example4Project: jieba_fast   Author:deepcs233   File:analyzer.py   License:MITLicense5 votes def__call__(self,text,**kargs): words=jieba.tokenize(text,mode="search") token=Token() for(w,start_pos,stop_pos)inwords: ifnotaccepted_chars.match(w)andlen(w)<=1: continue token.original=token.text=w token.pos=start_pos token.startchar=start_pos token.endchar=stop_pos yieldtokenExample5Project: jieba_fast   Author:deepcs233   File:jieba_test.py   License:MITLicense5 votes deftestTokenize(self): forcontentintest_contents: result=jieba.tokenize(content) assertisinstance(result,types.GeneratorType),"TestTokenizeGeneratorerror" result=list(result) assertisinstance(result,list),"TestTokenizeerroroncontent:%s"%content fortkinresult: print("word%s\t\tstart:%d\t\tend:%d"%(tk[0],tk[1],tk[2]),file=sys.stderr) print("testTokenize",file=sys.stderr)Example6Project: jieba_fast   Author:deepcs233   File:jieba_test.py   License:MITLicense5 votes deftestTokenize_NOHMM(self): forcontentintest_contents: result=jieba.tokenize(content,HMM=False) assertisinstance(result,types.GeneratorType),"TestTokenizeGeneratorerror" result=list(result) assertisinstance(result,list),"TestTokenizeerroroncontent:%s"%content fortkinresult: print("word%s\t\tstart:%d\t\tend:%d"%(tk[0],tk[1],tk[2]),file=sys.stderr) print("testTokenize_NOHMM",file=sys.stderr)Example7Project: jieba_fast   Author:deepcs233   File:test_tokenize_no_hmm.py   License:MITLicense5 votes defcuttest(test_sent): globalg_mode result=jieba.tokenize(test_sent,mode=g_mode,HMM=False) fortkinresult: print("word%s\t\tstart:%d\t\tend:%d"%(tk[0],tk[1],tk[2]))Example8Project: jieba_fast   Author:deepcs233   File:test_tokenize.py   License:MITLicense5 votes defcuttest(test_sent): globalg_mode result=jieba.tokenize(test_sent,mode=g_mode) fortkinresult: print("word%s\t\tstart:%d\t\tend:%d"%(tk[0],tk[1],tk[2]))Example9Project: chinese-support-redux   Author:luoliyan   File:analyzer.py   License:GNUGeneralPublicLicensev3.05 votes def__call__(self,text,**kargs): words=jieba.tokenize(text,mode="search") token=Token() for(w,start_pos,stop_pos)inwords: ifnotaccepted_chars.match(w)andlen(w)<=1: continue token.original=token.text=w token.pos=start_pos token.startchar=start_pos token.endchar=stop_pos yieldtokenExample10Project: pycorrector   Author:shibing624   File:tokenizer_test.py   License:ApacheLicense2.05 votes deftest_segment(): """测试疾病名纠错""" error_sentence_1='这个新药奥美砂坦脂片能治疗心绞痛,效果还可以'#奥美沙坦酯片 print(error_sentence_1) print(segment(error_sentence_1)) importjieba print(list(jieba.tokenize(error_sentence_1))) importjieba.possegaspseg words=pseg.lcut("我爱北京天安门")#jieba默认模式 print('old:',words) #jieba.enable_paddle()#启动paddle模式。

Example 10
Project: pycorrector   Author: shibing624   File: tokenizer_test.py   License: Apache License 2.0   5 votes

def test_segment():
    """Test disease-name error correction."""
    error_sentence_1 = '这个新药奥美砂坦脂片能治疗心绞痛,效果还可以'  # correct name: 奥美沙坦酯片
    print(error_sentence_1)
    print(segment(error_sentence_1))
    import jieba
    print(list(jieba.tokenize(error_sentence_1)))
    import jieba.posseg as pseg
    words = pseg.lcut("我爱北京天安门")  # jieba default mode
    print('old:', words)
    # jieba.enable_paddle()  # enable paddle mode (supported since version 0.40; earlier versions do not support it)
    # words = pseg.cut("我爱北京天安门", use_paddle=True)  # paddle mode
    # for word, flag in words:
    #     print('new:', '%s %s' % (word, flag))

Example 11
Project: Synonyms   Author: huyingxi   File: analyzer.py   License: MIT License   5 votes

def __call__(self, text, **kargs):
    words = jieba.tokenize(text, mode="search")
    token = Token()
    for (w, start_pos, stop_pos) in words:
        if not accepted_chars.match(w) and len(w) <= 1:
            continue
        token.original = token.text = w
        token.pos = start_pos
        token.startchar = start_pos
        token.endchar = stop_pos
        yield token

Example 12
Project: rasa_nlu_gq   Author: GaoQ1   File: jieba_pseg_extractor.py   License: Apache License 2.0   5 votes

def posseg(text):
    # type: (Text) -> List[Token]
    result = []
    for (word, start, end) in jieba.tokenize(text):
        pseg_data = [(w, f) for (w, f) in pseg.cut(word)]
        result.append((pseg_data, start, end))
    return result

Example 13
Project: rasa_nlu   Author: weizhenzhao   File: jieba_tokenizer.py   License: Apache License 2.0   5 votes

def train(self,
          training_data: TrainingData,
          config: RasaNLUModelConfig,
          **kwargs: Any) -> None:
    for example in training_data.training_examples:
        example.set("tokens", self.tokenize(example.text))

Example 14
Project: rasa_nlu   Author: weizhenzhao   File: jieba_tokenizer.py   License: Apache License 2.0   5 votes

def process(self, message: Message, **kwargs: Any) -> None:
    message.set("tokens", self.tokenize(message.text))

Example 15
Project: rasa_nlu   Author: weizhenzhao   File: jieba_tokenizer.py   License: Apache License 2.0   5 votes

def tokenize(text: Text) -> List[Token]:
    import jieba
    tokenized = jieba.tokenize(text)
    tokens = [Token(word, start) for (word, start, end) in tokenized]
    return tokens

Example 16
Project: rasa_nlu   Author: weizhenzhao   File: jieba_pseg_extractor.py   License: Apache License 2.0   5 votes

def posseg(text):
    # type: (Text) -> List[Token]
    import jieba
    import jieba.posseg as pseg
    result = []
    for (word, start, end) in jieba.tokenize(text):
        pseg_data = [(w, f) for (w, f) in pseg.cut(word)]
        result.append((pseg_data, start, end))
    return result
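Examples 12 and 16 combine jieba.tokenize() with jieba.posseg: the outer call supplies character offsets while pseg.cut() re-tags each span with its part of speech. A self-contained sketch of that pairing (the input sentence is illustrative):

import jieba
import jieba.posseg as pseg

text = u'我爱北京天安门'
for word, start, end in jieba.tokenize(text):
    # Each span is re-segmented with POS flags, e.g. [('北京', 'ns')].
    pseg_data = [(w, f) for w, f in pseg.cut(word)]
    print(pseg_data, start, end)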
Example 17
Project: rasa_bot   Author: Ma-Dan   File: jieba_tokenizer.py   License: Apache License 2.0   5 votes

def train(self, training_data, config, **kwargs):
    # type: (TrainingData, RasaNLUModelConfig, **Any) -> None
    for example in training_data.training_examples:
        example.set("tokens", self.tokenize(example.text))

Example 18
Project: rasa_bot   Author: Ma-Dan   File: jieba_tokenizer.py   License: Apache License 2.0   5 votes

def process(self, message, **kwargs):
    # type: (Message, **Any) -> None
    message.set("tokens", self.tokenize(message.text))

Example 19
Project: rasa_bot   Author: Ma-Dan   File: jieba_tokenizer.py   License: Apache License 2.0   5 votes

def tokenize(self, text):
    # type: (Text) -> List[Token]
    import jieba
    tokenized = jieba.tokenize(text)
    tokens = [Token(word, start) for (word, start, end) in tokenized]
    return tokens

Example 20
Project: QAbot_by_base_KG   Author: Goooaaal   File: analyzer.py   License: MIT License   5 votes

def __call__(self, text, **kargs):
    words = jieba.tokenize(text, mode="search")
    token = Token()
    for (w, start_pos, stop_pos) in words:
        if not accepted_chars.match(w) and len(w) <= 1:
            continue
        token.original = token.text = w
        token.pos = start_pos
        token.startchar = start_pos
        token.endchar = stop_pos
        yield token

Example 21
Project: python-girlfriend-mood   Author: CasterWx   File: analyzer.py   License: MIT License   5 votes

def __call__(self, text, **kargs):
    words = jieba.tokenize(text, mode="search")
    token = Token()
    for (w, start_pos, stop_pos) in words:
        if not accepted_chars.match(w) and len(w) <= 1:
            continue
        token.original = token.text = w
        token.pos = start_pos
        token.startchar = start_pos
        token.endchar = stop_pos
        yield token

Example 22
Project: rasa-for-botfront   Author: botfront   File: jieba_tokenizer.py   License: Apache License 2.0   5 votes

def tokenize(self, message: Message, attribute: Text) -> List[Token]:
    import jieba
    text = message.get(attribute)
    tokenized = jieba.tokenize(text)
    tokens = [Token(word, start) for (word, start, end) in tokenized]
    return tokens

Example 23
Project: annotated_jieba   Author: ustcdane   File: analyzer.py   License: MIT License   5 votes

def __call__(self, text, **kargs):
    words = jieba.tokenize(text, mode="search")
    token = Token()
    for (w, start_pos, stop_pos) in words:
        if not accepted_chars.match(w) and len(w) <= 1:
            continue
        token.original = token.text = w
        token.pos = start_pos
        token.startchar = start_pos
        token.endchar = stop_pos
        yield token

Example 24
Project: annotated_jieba   Author: ustcdane   File: jieba_test.py   License: MIT License   5 votes

def testTokenize(self):
    for content in test_contents:
        result = jieba.tokenize(content)
        assert isinstance(result, types.GeneratorType), "Test Tokenize Generator error"
        result = list(result)
        assert isinstance(result, list), "Test Tokenize error on content: %s" % content
        for tk in result:
            print("word %s\t\t start: %d \t\t end: %d" % (tk[0], tk[1], tk[2]), file=sys.stderr)
    print("testTokenize", file=sys.stderr)

Example 25
Project: annotated_jieba   Author: ustcdane   File: jieba_test.py   License: MIT License   5 votes

def testTokenize_NOHMM(self):
    for content in test_contents:
        result = jieba.tokenize(content, HMM=False)
        assert isinstance(result, types.GeneratorType), "Test Tokenize Generator error"
        result = list(result)
        assert isinstance(result, list), "Test Tokenize error on content: %s" % content
        for tk in result:
            print("word %s\t\t start: %d \t\t end: %d" % (tk[0], tk[1], tk[2]), file=sys.stderr)
    print("testTokenize_NOHMM", file=sys.stderr)

Example 26
Project: annotated_jieba   Author: ustcdane   File: test_tokenize_no_hmm.py   License: MIT License   5 votes

def cuttest(test_sent):
    global g_mode
    result = jieba.tokenize(test_sent, mode=g_mode, HMM=False)
    for tk in result:
        print("word %s\t\t start: %d \t\t end: %d" % (tk[0], tk[1], tk[2]))

Example 27
Project: annotated_jieba   Author: ustcdane   File: test_tokenize.py   License: MIT License   5 votes

def cuttest(test_sent):
    global g_mode
    result = jieba.tokenize(test_sent, mode=g_mode)
    for tk in result:
        print("word %s\t\t start: %d \t\t end: %d" % (tk[0], tk[1], tk[2]))

Example 28
Project: Malicious_Domain_Whois   Author: h-j-13   File: analyzer.py   License: GNU General Public License v3.0   5 votes

def __call__(self, text, **kargs):
    words = jieba.tokenize(text, mode="search")
    token = Token()
    for (w, start_pos, stop_pos) in words:
        if not accepted_chars.match(w) and len(w) <= 1:
            continue
        token.original = token.text = w
        token.pos = start_pos
        token.startchar = start_pos
        token.endchar = stop_pos
        yield token

Example 29
Project: Malicious_Domain_Whois   Author: h-j-13   File: jieba_test.py   License: GNU General Public License v3.0   5 votes

def testTokenize(self):
    for content in test_contents:
        result = jieba.tokenize(content)
        assert isinstance(result, types.GeneratorType), "Test Tokenize Generator error"
        result = list(result)
        assert isinstance(result, list), "Test Tokenize error on content: %s" % content
        for tk in result:
            print("word %s\t\t start: %d \t\t end: %d" % (tk[0], tk[1], tk[2]), file=sys.stderr)
    print("testTokenize", file=sys.stderr)

Example 30
Project: Malicious_Domain_Whois   Author: h-j-13   File: jieba_test.py   License: GNU General Public License v3.0   5 votes

def testTokenize_NOHMM(self):
    for content in test_contents:
        result = jieba.tokenize(content, HMM=False)
        assert isinstance(result, types.GeneratorType), "Test Tokenize Generator error"
        result = list(result)
        assert isinstance(result, list), "Test Tokenize error on content: %s" % content
        for tk in result:
            print("word %s\t\t start: %d \t\t end: %d" % (tk[0], tk[1], tk[2]), file=sys.stderr)
    print("testTokenize_NOHMM", file=sys.stderr)
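Finally, note the HMM flag exercised by Examples 6, 7, 25, 26 and 30: with HMM enabled (the default), jieba can discover words that are not in its dictionary; with HMM=False it falls back to dictionary entries only. The contrast on jieba's documented example sentence:

import jieba

sent = u'他来到了网易杭研大厦'
print([t[0] for t in jieba.tokenize(sent)])             # HMM on: 杭研 is typically recognized as one word
print([t[0] for t in jieba.tokenize(sent, HMM=False)])  # HMM off: 杭 and 研 usually remain separate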


