%global _empty_manifest_terminate_build 0 Name: python-sentence-spliter Version: 2.1.8 Release: 1 Summary: This is a sentence cutting tool, currently support English & Chinese License: MIT URL: https://pypi.org/project/sentence-spliter/ Source0: https://mirrors.aliyun.com/pypi/web/packages/b8/1a/ad39b7a6ca2588352824775570348db8a0db8791ecdef69e6dc9c1c98762/sentence-spliter-2.1.8.tar.gz BuildArch: noarch Requires: python3-attrd Requires: python3-attrdict Requires: python3-attrs Requires: python3-importlib-metadata Requires: python3-loguru Requires: python3-more-itertools Requires: python3-packaging Requires: python3-pluggy Requires: python3-py Requires: python3-pyparsing Requires: python3-pytest Requires: python3-six Requires: python3-wcwidth Requires: python3-zipp %description # sentence-spliter [toc] ## 简介 sentence-spliter 句子切分工具:将一个长句或者段落,切分为若干短句的 List 。支持自然切分,中间切分等。 目前支持语言:中文, 英文,韩语 ## Architechture - 项目结构 ``` . ├── doc # 补充文档 ├── LICENSE # 许可证 ├── MANIFEST.in # 用于setup时包含其他文件 ├── pyproject.toml # 用于构建项目 ├── README.md ├── requirements.txt ├── sentence_spliter │   ├── architect # 存放切句的基本单元 │   ├── cutter4grammar # 语法纠错定制的切句 │   ├── en_cutter # 英文切句 │   ├── test # 单元测试 │   ├── utility # 其他工具函数 │   └── zh_cutter # 中文切句 └── setup.py # setup.py ``` 更详细的目录结构见 [链接](doc/detail.md) ## Setup git 安装 ``` git clone git@git.duowan.com:ai/nlp/sentence-spliter.git pip install -U pip pip install -r requirements.txt ``` PYPI 安装 ``` pip install sentence_spliter ``` ## API ### 请求示例 ``` curl --location --request POST 'https://rosetta-nlp-api.duowan.com/api/v1/sentence-spliter/en-sentence-spliter' \ --header 'Content-Type: application/json' \ --data-raw '{ "paragraphs":["A long time ago..... there is a mountain, and there is a temple in the mountain!!! And here is an old monk in the temple!?...."] , "options": { "max_len": 30, "min_len": 6 } } ' ``` - Request ``` { "paragraphs":["A long time ago..... there is a mountain, and there is a temple in the mountain!!! And here is an old monk in the temple!?...."] , "options": { "max_len": 30, "min_len": 6 } } ``` - Response ``` { "code": 0, "data": { "paragraphs": [ "A long time ago..... there is a mountain, and there is a temple in the mountain!!! And here is an old monk in the temple!?...." ], "sub_sentences": [ [ [ "A long time ago..... there is a mountain, and there is a temple in the mountain!!!" ], [ " And here is an old monk in the temple!?...." ] ] ], "version": "1.0.0" }, "message": "success" } ``` ### 响应参数说明 | **字段名** | **类型** | **说明** | | :------------ | :------- | :----------------- | | paragraphs | String | 需要切分的段落列表 | | sub_sentences | String | 切分完成的子句 | 接口相关更多内容见[接口文档](./doc/interface.md) **特别注意:version字段改动涉及广东部门是否需要重跑流水线  **[链接](./doc/detail#version-number) ## 状态机 ### Data 需要用到的主要辅助数据为以下两个: - 白名单表: /white_list.txt - 权重表:/weights_list.txt ### Format 白名单表格式: ``` Dr. U!S!A! No. abbr. Brig. Ltd. b. N. hr. ``` 每行一个字符串,算法扫描到白名单中被记录字符串中的结束符号将会不计为一种象征结束的标志。 权重表: ``` and 10 or 10 but 10 even 10 however 10 whenever 10 whatever 10 although 10 thought 10 ``` 每行为:`word`+`weight`的格式,表示各个有转折、承接上下文等作用含义的词在需要句内切割时的权重大小。 ### 介绍 以下句子作为样本: ```python sentence = 'I like chicken. I like chicken.' ``` #### ***Sequence*** Sequence模块首先将需要切割的句子转换为某种特殊的序列格式。 ```mermaid graph LR A[I like chicken.] -->B[I] subgraph sequence B -->C[] C -->D[like] D --> E[] E --> F[chicken] F --> G[.] end ``` sequence将直接进入状态机 #### ***Condition*** and ***Operation*** Condition模块表示执行某个动作之前的某个条件或者判断,若满足该条件则执行,否则执行不满足该条件的动作。 Operation模块表示某个动作或者称为操作 ```mermaid graph LR A{Condition} -->|True| B[Operation1] A -->|False| C[Operation2] ``` #### ***Condition&Operation***模块 由一系列上图Condition&Operation组成的模块 表示一连串的判断、动作序列组合叠加 进而 ```mermaid graph LR A{Condition1} -->|True| B[Condition&Operation1] A -->|False| C[Condition&Operation2] B -->D[Condition&Operation3] C -->E[Condition&Operation4] ``` #### ***Logic*** 上述Condition&Operation模块形成了整个Logic 所有的Condition&Operation模块进一步叠加得到整个大的逻辑图 ### 运行 - 导入相关包 ```python from sentence_spliter.en_cutter.en_sequence import Sequence # 导入英文切句框架内的sequence类 from sentence_spliter.en_cutter.logic import SimpleLogic, LongShortLogic # 导入英文切句框架内的logic类 ``` - 加载句子为sequence类 ```python sentence = 'I like chicken. I like chicken.' # 例句 seq = Sequence(sentence) # 转化为sequence simple_logic = SimpleLogic() # 自然切句逻辑 long_logic = LongShortLogic(max_len=max_len, min_len=min_len) # 切割长短句 ``` - 执行切句 ``` simple_result = simple_logic.run(seq, debug=True) long_results = long_logic.run(seq, debug=True) ``` ## 打包上传 - 打开setup.py,修改相应的配置(version等) ```python from setuptools import setup, find_packages setup( name="sentence-spliter", version="X.X.X", author="", author_email="", ... ) ``` - 在项目根目录运行以下命令 ``` ./bin/package.sh ``` - 键入账号和密码 ``` Enter your username: Enter your password: ``` - 等待上传即可 详细教程可见[链接](https://packaging.python.org/en/latest/tutorials/packaging-projects/) %package -n python3-sentence-spliter Summary: This is a sentence cutting tool, currently support English & Chinese Provides: python-sentence-spliter BuildRequires: python3-devel BuildRequires: python3-setuptools BuildRequires: python3-pip %description -n python3-sentence-spliter # sentence-spliter [toc] ## 简介 sentence-spliter 句子切分工具:将一个长句或者段落,切分为若干短句的 List 。支持自然切分,中间切分等。 目前支持语言:中文, 英文,韩语 ## Architechture - 项目结构 ``` . ├── doc # 补充文档 ├── LICENSE # 许可证 ├── MANIFEST.in # 用于setup时包含其他文件 ├── pyproject.toml # 用于构建项目 ├── README.md ├── requirements.txt ├── sentence_spliter │   ├── architect # 存放切句的基本单元 │   ├── cutter4grammar # 语法纠错定制的切句 │   ├── en_cutter # 英文切句 │   ├── test # 单元测试 │   ├── utility # 其他工具函数 │   └── zh_cutter # 中文切句 └── setup.py # setup.py ``` 更详细的目录结构见 [链接](doc/detail.md) ## Setup git 安装 ``` git clone git@git.duowan.com:ai/nlp/sentence-spliter.git pip install -U pip pip install -r requirements.txt ``` PYPI 安装 ``` pip install sentence_spliter ``` ## API ### 请求示例 ``` curl --location --request POST 'https://rosetta-nlp-api.duowan.com/api/v1/sentence-spliter/en-sentence-spliter' \ --header 'Content-Type: application/json' \ --data-raw '{ "paragraphs":["A long time ago..... there is a mountain, and there is a temple in the mountain!!! And here is an old monk in the temple!?...."] , "options": { "max_len": 30, "min_len": 6 } } ' ``` - Request ``` { "paragraphs":["A long time ago..... there is a mountain, and there is a temple in the mountain!!! And here is an old monk in the temple!?...."] , "options": { "max_len": 30, "min_len": 6 } } ``` - Response ``` { "code": 0, "data": { "paragraphs": [ "A long time ago..... there is a mountain, and there is a temple in the mountain!!! And here is an old monk in the temple!?...." ], "sub_sentences": [ [ [ "A long time ago..... there is a mountain, and there is a temple in the mountain!!!" ], [ " And here is an old monk in the temple!?...." ] ] ], "version": "1.0.0" }, "message": "success" } ``` ### 响应参数说明 | **字段名** | **类型** | **说明** | | :------------ | :------- | :----------------- | | paragraphs | String | 需要切分的段落列表 | | sub_sentences | String | 切分完成的子句 | 接口相关更多内容见[接口文档](./doc/interface.md) **特别注意:version字段改动涉及广东部门是否需要重跑流水线  **[链接](./doc/detail#version-number) ## 状态机 ### Data 需要用到的主要辅助数据为以下两个: - 白名单表: /white_list.txt - 权重表:/weights_list.txt ### Format 白名单表格式: ``` Dr. U!S!A! No. abbr. Brig. Ltd. b. N. hr. ``` 每行一个字符串,算法扫描到白名单中被记录字符串中的结束符号将会不计为一种象征结束的标志。 权重表: ``` and 10 or 10 but 10 even 10 however 10 whenever 10 whatever 10 although 10 thought 10 ``` 每行为:`word`+`weight`的格式,表示各个有转折、承接上下文等作用含义的词在需要句内切割时的权重大小。 ### 介绍 以下句子作为样本: ```python sentence = 'I like chicken. I like chicken.' ``` #### ***Sequence*** Sequence模块首先将需要切割的句子转换为某种特殊的序列格式。 ```mermaid graph LR A[I like chicken.] -->B[I] subgraph sequence B -->C[] C -->D[like] D --> E[] E --> F[chicken] F --> G[.] end ``` sequence将直接进入状态机 #### ***Condition*** and ***Operation*** Condition模块表示执行某个动作之前的某个条件或者判断,若满足该条件则执行,否则执行不满足该条件的动作。 Operation模块表示某个动作或者称为操作 ```mermaid graph LR A{Condition} -->|True| B[Operation1] A -->|False| C[Operation2] ``` #### ***Condition&Operation***模块 由一系列上图Condition&Operation组成的模块 表示一连串的判断、动作序列组合叠加 进而 ```mermaid graph LR A{Condition1} -->|True| B[Condition&Operation1] A -->|False| C[Condition&Operation2] B -->D[Condition&Operation3] C -->E[Condition&Operation4] ``` #### ***Logic*** 上述Condition&Operation模块形成了整个Logic 所有的Condition&Operation模块进一步叠加得到整个大的逻辑图 ### 运行 - 导入相关包 ```python from sentence_spliter.en_cutter.en_sequence import Sequence # 导入英文切句框架内的sequence类 from sentence_spliter.en_cutter.logic import SimpleLogic, LongShortLogic # 导入英文切句框架内的logic类 ``` - 加载句子为sequence类 ```python sentence = 'I like chicken. I like chicken.' # 例句 seq = Sequence(sentence) # 转化为sequence simple_logic = SimpleLogic() # 自然切句逻辑 long_logic = LongShortLogic(max_len=max_len, min_len=min_len) # 切割长短句 ``` - 执行切句 ``` simple_result = simple_logic.run(seq, debug=True) long_results = long_logic.run(seq, debug=True) ``` ## 打包上传 - 打开setup.py,修改相应的配置(version等) ```python from setuptools import setup, find_packages setup( name="sentence-spliter", version="X.X.X", author="", author_email="", ... ) ``` - 在项目根目录运行以下命令 ``` ./bin/package.sh ``` - 键入账号和密码 ``` Enter your username: Enter your password: ``` - 等待上传即可 详细教程可见[链接](https://packaging.python.org/en/latest/tutorials/packaging-projects/) %package help Summary: Development documents and examples for sentence-spliter Provides: python3-sentence-spliter-doc %description help # sentence-spliter [toc] ## 简介 sentence-spliter 句子切分工具:将一个长句或者段落,切分为若干短句的 List 。支持自然切分,中间切分等。 目前支持语言:中文, 英文,韩语 ## Architechture - 项目结构 ``` . ├── doc # 补充文档 ├── LICENSE # 许可证 ├── MANIFEST.in # 用于setup时包含其他文件 ├── pyproject.toml # 用于构建项目 ├── README.md ├── requirements.txt ├── sentence_spliter │   ├── architect # 存放切句的基本单元 │   ├── cutter4grammar # 语法纠错定制的切句 │   ├── en_cutter # 英文切句 │   ├── test # 单元测试 │   ├── utility # 其他工具函数 │   └── zh_cutter # 中文切句 └── setup.py # setup.py ``` 更详细的目录结构见 [链接](doc/detail.md) ## Setup git 安装 ``` git clone git@git.duowan.com:ai/nlp/sentence-spliter.git pip install -U pip pip install -r requirements.txt ``` PYPI 安装 ``` pip install sentence_spliter ``` ## API ### 请求示例 ``` curl --location --request POST 'https://rosetta-nlp-api.duowan.com/api/v1/sentence-spliter/en-sentence-spliter' \ --header 'Content-Type: application/json' \ --data-raw '{ "paragraphs":["A long time ago..... there is a mountain, and there is a temple in the mountain!!! And here is an old monk in the temple!?...."] , "options": { "max_len": 30, "min_len": 6 } } ' ``` - Request ``` { "paragraphs":["A long time ago..... there is a mountain, and there is a temple in the mountain!!! And here is an old monk in the temple!?...."] , "options": { "max_len": 30, "min_len": 6 } } ``` - Response ``` { "code": 0, "data": { "paragraphs": [ "A long time ago..... there is a mountain, and there is a temple in the mountain!!! And here is an old monk in the temple!?...." ], "sub_sentences": [ [ [ "A long time ago..... there is a mountain, and there is a temple in the mountain!!!" ], [ " And here is an old monk in the temple!?...." ] ] ], "version": "1.0.0" }, "message": "success" } ``` ### 响应参数说明 | **字段名** | **类型** | **说明** | | :------------ | :------- | :----------------- | | paragraphs | String | 需要切分的段落列表 | | sub_sentences | String | 切分完成的子句 | 接口相关更多内容见[接口文档](./doc/interface.md) **特别注意:version字段改动涉及广东部门是否需要重跑流水线  **[链接](./doc/detail#version-number) ## 状态机 ### Data 需要用到的主要辅助数据为以下两个: - 白名单表: /white_list.txt - 权重表:/weights_list.txt ### Format 白名单表格式: ``` Dr. U!S!A! No. abbr. Brig. Ltd. b. N. hr. ``` 每行一个字符串,算法扫描到白名单中被记录字符串中的结束符号将会不计为一种象征结束的标志。 权重表: ``` and 10 or 10 but 10 even 10 however 10 whenever 10 whatever 10 although 10 thought 10 ``` 每行为:`word`+`weight`的格式,表示各个有转折、承接上下文等作用含义的词在需要句内切割时的权重大小。 ### 介绍 以下句子作为样本: ```python sentence = 'I like chicken. I like chicken.' ``` #### ***Sequence*** Sequence模块首先将需要切割的句子转换为某种特殊的序列格式。 ```mermaid graph LR A[I like chicken.] -->B[I] subgraph sequence B -->C[] C -->D[like] D --> E[] E --> F[chicken] F --> G[.] end ``` sequence将直接进入状态机 #### ***Condition*** and ***Operation*** Condition模块表示执行某个动作之前的某个条件或者判断,若满足该条件则执行,否则执行不满足该条件的动作。 Operation模块表示某个动作或者称为操作 ```mermaid graph LR A{Condition} -->|True| B[Operation1] A -->|False| C[Operation2] ``` #### ***Condition&Operation***模块 由一系列上图Condition&Operation组成的模块 表示一连串的判断、动作序列组合叠加 进而 ```mermaid graph LR A{Condition1} -->|True| B[Condition&Operation1] A -->|False| C[Condition&Operation2] B -->D[Condition&Operation3] C -->E[Condition&Operation4] ``` #### ***Logic*** 上述Condition&Operation模块形成了整个Logic 所有的Condition&Operation模块进一步叠加得到整个大的逻辑图 ### 运行 - 导入相关包 ```python from sentence_spliter.en_cutter.en_sequence import Sequence # 导入英文切句框架内的sequence类 from sentence_spliter.en_cutter.logic import SimpleLogic, LongShortLogic # 导入英文切句框架内的logic类 ``` - 加载句子为sequence类 ```python sentence = 'I like chicken. I like chicken.' # 例句 seq = Sequence(sentence) # 转化为sequence simple_logic = SimpleLogic() # 自然切句逻辑 long_logic = LongShortLogic(max_len=max_len, min_len=min_len) # 切割长短句 ``` - 执行切句 ``` simple_result = simple_logic.run(seq, debug=True) long_results = long_logic.run(seq, debug=True) ``` ## 打包上传 - 打开setup.py,修改相应的配置(version等) ```python from setuptools import setup, find_packages setup( name="sentence-spliter", version="X.X.X", author="", author_email="", ... ) ``` - 在项目根目录运行以下命令 ``` ./bin/package.sh ``` - 键入账号和密码 ``` Enter your username: Enter your password: ``` - 等待上传即可 详细教程可见[链接](https://packaging.python.org/en/latest/tutorials/packaging-projects/) %prep %autosetup -n sentence-spliter-2.1.8 %build %py3_build %install %py3_install install -d -m755 %{buildroot}/%{_pkgdocdir} if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi pushd %{buildroot} if [ -d usr/lib ]; then find usr/lib -type f -printf "\"/%h/%f\"\n" >> filelist.lst fi if [ -d usr/lib64 ]; then find usr/lib64 -type f -printf "\"/%h/%f\"\n" >> filelist.lst fi if [ -d usr/bin ]; then find usr/bin -type f -printf "\"/%h/%f\"\n" >> filelist.lst fi if [ -d usr/sbin ]; then find usr/sbin -type f -printf "\"/%h/%f\"\n" >> filelist.lst fi touch doclist.lst if [ -d usr/share/man ]; then find usr/share/man -type f -printf "\"/%h/%f.gz\"\n" >> doclist.lst fi popd mv %{buildroot}/filelist.lst . mv %{buildroot}/doclist.lst . %files -n python3-sentence-spliter -f filelist.lst %dir %{python3_sitelib}/* %files help -f doclist.lst %{_docdir}/* %changelog * Fri Jun 09 2023 Python_Bot - 2.1.8-1 - Package Spec generated