author    CoprDistGit <infra@openeuler.org>  2023-04-21 12:06:49 +0000
committer CoprDistGit <infra@openeuler.org>  2023-04-21 12:06:49 +0000
commit    16bf5ca02a5f7e7562c3cca1b06849a4f3b2eaf9 (patch)
tree      70f729fc415ed5159d6b1fe4a0b8e2dfb3efe85e
parent    2fb05b8cdfb39ff4e767241fe2205f3bceb3db07 (diff)
automatic import of python-spark-nlp (openeuler20.03)
-rw-r--r--  .gitignore              1
-rw-r--r--  python-spark-nlp.spec 491
-rw-r--r--  sources                 2
3 files changed, 267 insertions, 227 deletions
diff --git a/.gitignore b/.gitignore
index 1419612..30626fb 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1 +1,2 @@
/spark-nlp-4.3.2.tar.gz
+/spark-nlp-4.4.0.tar.gz
diff --git a/python-spark-nlp.spec b/python-spark-nlp.spec
index 42098ea..8fc9255 100644
--- a/python-spark-nlp.spec
+++ b/python-spark-nlp.spec
@@ -1,11 +1,11 @@
%global _empty_manifest_terminate_build 0
Name: python-spark-nlp
-Version: 4.3.2
+Version: 4.4.0
Release: 1
Summary: John Snow Labs Spark NLP is a natural language processing library built on top of Apache Spark ML. It provides simple, performant & accurate NLP annotations for machine learning pipelines that scale easily in a distributed environment.
License: Apache Software License
URL: https://github.com/JohnSnowLabs/spark-nlp
-Source0: https://mirrors.nju.edu.cn/pypi/web/packages/40/a7/6d450edede7a7f54b3a5cd78fe3d521bad33ada0f69de0b542c1ab13f3bd/spark-nlp-4.3.2.tar.gz
+Source0: https://mirrors.nju.edu.cn/pypi/web/packages/2e/aa/19e34297e3cc22a0f361c22cc366418158612bca8f5b9e3959c1f066f747/spark-nlp-4.4.0.tar.gz
BuildArch: noarch
@@ -29,18 +29,16 @@ BuildArch: noarch
<img src="https://static.pepy.tech/personalized-badge/spark-nlp?period=total&units=international_system&left_color=grey&right_color=orange&left_text=pip%20downloads" /></a>
</p>
-Spark NLP is a state-of-the-art Natural Language Processing library built on top of Apache Spark. It provides **simple
-**, **performant** & **accurate** NLP annotations for machine learning pipelines that **scale** easily in a distributed
+Spark NLP is a state-of-the-art Natural Language Processing library built on top of Apache Spark. It provides **simple**, **performant** & **accurate** NLP annotations for machine learning pipelines that **scale** easily in a distributed
environment.
-Spark NLP comes with **11000+** pretrained **pipelines** and **models** in more than **200+** languages.
-It also offers tasks such as **Tokenization**, **Word Segmentation**, **Part-of-Speech Tagging**, Word and Sentence **Embeddings**, **Named Entity Recognition**, **Dependency Parsing**, **Spell Checking**, **Text Classification**, **Sentiment Analysis**, **Token Classification**, **Machine Translation** (+180 languages), **Summarization**, **Question Answering**, **Table Question Answering**, **Text Generation**, **Image Classification**, **Automatic Speech Recognition
-**, and many more [NLP tasks](#features).
+Spark NLP comes with **17000+** pretrained **pipelines** and **models** in **200+** languages.
+It also offers tasks such as **Tokenization**, **Word Segmentation**, **Part-of-Speech Tagging**, Word and Sentence **Embeddings**, **Named Entity Recognition**, **Dependency Parsing**, **Spell Checking**, **Text Classification**, **Sentiment Analysis**, **Token Classification**, **Machine Translation** (+180 languages), **Summarization**, **Question Answering**, **Table Question Answering**, **Text Generation**, **Image Classification**, **Automatic Speech Recognition**, **Zero-Shot Learning**, and many more [NLP tasks](#features).
-**Spark NLP** is the only open-source NLP library in **production** that offers state-of-the-art transformers such as **BERT**, **CamemBERT**, **ALBERT**, **ELECTRA**, **XLNet**, **DistilBERT**, **RoBERTa**, **DeBERTa**, **XLM-RoBERTa**, **Longformer**, **ELMO**, **Universal Sentence Encoder**, **Google T5**, **MarianMT**, **GPT2**, and **Vision Transformers (ViT)** not only to **Python** and **R**, but also to **JVM** ecosystem (**Java**, **Scala**, and **Kotlin**) at **scale** by extending **Apache Spark** natively.
+**Spark NLP** is the only open-source NLP library in **production** that offers state-of-the-art transformers such as **BERT**, **CamemBERT**, **ALBERT**, **ELECTRA**, **XLNet**, **DistilBERT**, **RoBERTa**, **DeBERTa**, **XLM-RoBERTa**, **Longformer**, **ELMO**, **Universal Sentence Encoder**, **Facebook BART**, **Google T5**, **MarianMT**, **OpenAI GPT2**, and **Vision Transformers (ViT)** not only to **Python** and **R**, but also to the **JVM** ecosystem (**Java**, **Scala**, and **Kotlin**) at **scale** by extending **Apache Spark** natively.
## Project's website
-Take a look at our official Spark NLP page: [http://nlp.johnsnowlabs.com/](http://nlp.johnsnowlabs.com/) for user
+Take a look at our official Spark NLP page: [https://sparknlp.org/](https://sparknlp.org/) for user
documentation and examples
## Community support
@@ -151,19 +149,22 @@ documentation and examples
- Longformer for Question Answering
- Table Question Answering (TAPAS)
- Zero-Shot NER Model
+- Zero Shot Text Classification by BERT (ZSL)
- Neural Machine Translation (MarianMT)
- Text-To-Text Transfer Transformer (Google T5)
- Generative Pre-trained Transformer 2 (OpenAI GPT2)
-- Vision Transformer (ViT)
-- Swin Image Classification
+- Seq2Seq for NLG, Translation, and Comprehension (Facebook BART)
+- Vision Transformer (Google ViT)
+- Swin Image Classification (Microsoft Swin Transformer)
+- ConvNext Image Classification (Facebook ConvNext)
- Automatic Speech Recognition (Wav2Vec2)
- Automatic Speech Recognition (HuBERT)
- Named entity recognition (Deep learning)
- Easy TensorFlow integration
- GPU Support
- Full integration with Spark ML functions
-- +9400 pre-trained models in +200 languages!
-- +3200 pre-trained pipelines in +200 languages!
+- +12000 pre-trained models in +200 languages!
+- +5000 pre-trained pipelines in +200 languages!
- Multi-lingual NER models: Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, German, Hebrew, Italian,
Japanese, Korean, Norwegian, Persian, Polish, Portuguese, Russian, Spanish, Swedish, Urdu, and more.
@@ -176,7 +177,7 @@ To use Spark NLP you need the following requirements:
**GPU (optional):**
-Spark NLP 4.3.2 is built with TensorFlow 2.7.1 and the following NVIDIA® software are only required for GPU support:
+Spark NLP 4.4.0 is built with TensorFlow 2.7.1; the following NVIDIA® software is required only for GPU support:
- NVIDIA® GPU drivers version 450.80.02 or higher
- CUDA® Toolkit 11.2
@@ -192,7 +193,7 @@ $ java -version
$ conda create -n sparknlp python=3.7 -y
$ conda activate sparknlp
# spark-nlp by default is based on pyspark 3.x
-$ pip install spark-nlp==4.3.2 pyspark==3.3.1
+$ pip install spark-nlp==4.4.0 pyspark==3.3.1
```
In a Python console or Jupyter `Python3` kernel:
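The quick-start snippet under this line is unchanged by the commit and therefore elided from the diff; a minimal sketch of it, assuming the documented `sparknlp.start()` entry point:

```python
import sparknlp

# start a SparkSession preconfigured for Spark NLP
# (sparknlp.start(gpu=True) selects the GPU package instead)
spark = sparknlp.start()

print(sparknlp.version())
```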
@@ -237,11 +238,12 @@ For more examples, you can visit our dedicated [examples](https://github.com/Joh
## Apache Spark Support
-Spark NLP *4.3.2* has been built on top of Apache Spark 3.2 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, and
+Spark NLP *4.4.0* has been built on top of Apache Spark 3.2 and fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, and
3.3.x:
| Spark NLP | Apache Spark 2.3.x | Apache Spark 2.4.x | Apache Spark 3.0.x | Apache Spark 3.1.x | Apache Spark 3.2.x | Apache Spark 3.3.x |
|-----------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
+| 4.4.x | NO | NO | YES | YES | YES | YES |
| 4.3.x | NO | NO | YES | YES | YES | YES |
| 4.2.x | NO | NO | YES | YES | YES | YES |
| 4.1.x | NO | NO | YES | YES | YES | YES |
@@ -260,22 +262,23 @@ Find out more about `Spark NLP` versions from our [release notes](https://github
## Scala and Python Support
-| Spark NLP | Python 3.6 | Python 3.7 | Python 3.8 | Python 3.9 | Scala 2.11 | Scala 2.12 |
-|-----------|------------|------------|------------|------------|------------|------------|
-| 4.3.x | YES | YES | YES | YES | NO | YES |
-| 4.2.x | YES | YES | YES | YES | NO | YES |
-| 4.1.x | YES | YES | YES | YES | NO | YES |
-| 4.0.x | YES | YES | YES | YES | NO | YES |
-| 3.4.x | YES | YES | YES | YES | YES | YES |
-| 3.3.x | YES | YES | YES | NO | YES | YES |
-| 3.2.x | YES | YES | YES | NO | YES | YES |
-| 3.1.x | YES | YES | YES | NO | YES | YES |
-| 3.0.x | YES | YES | YES | NO | YES | YES |
-| 2.7.x | YES | YES | NO | NO | YES | NO |
+| Spark NLP | Python 3.6 | Python 3.7 | Python 3.8 | Python 3.9 | Python 3.10| Scala 2.11 | Scala 2.12 |
+|-----------|------------|------------|------------|------------|------------|------------|------------|
+| 4.4.x | NO | YES | YES | YES | YES | NO | YES |
+| 4.3.x | YES | YES | YES | YES | YES | NO | YES |
+| 4.2.x | YES | YES | YES | YES | YES | NO | YES |
+| 4.1.x | YES | YES | YES | YES | NO | NO | YES |
+| 4.0.x | YES | YES | YES | YES | NO | NO | YES |
+| 3.4.x | YES | YES | YES | YES | NO | YES | YES |
+| 3.3.x | YES | YES | YES | NO | NO | YES | YES |
+| 3.2.x | YES | YES | YES | NO | NO | YES | YES |
+| 3.1.x | YES | YES | YES | NO | NO | YES | YES |
+| 3.0.x | YES | YES | YES | NO | NO | YES | YES |
+| 2.7.x | YES | YES | NO | NO | NO | YES | NO |
## Databricks Support
-Spark NLP 4.3.2 has been tested and is compatible with the following runtimes:
+Spark NLP 4.4.0 has been tested and is compatible with the following runtimes:
**CPU:**
@@ -301,6 +304,12 @@ Spark NLP 4.3.2 has been tested and is compatible with the following runtimes:
- 11.2 ML
- 11.3
- 11.3 ML
+- 12.0
+- 12.0 ML
+- 12.1
+- 12.1 ML
+- 12.2
+- 12.2 ML
**GPU:**
@@ -314,13 +323,16 @@ Spark NLP 4.3.2 has been tested and is compatible with the following runtimes:
- 11.1 ML & GPU
- 11.2 ML & GPU
- 11.3 ML & GPU
+- 12.0 ML & GPU
+- 12.1 ML & GPU
+- 12.2 ML & GPU
NOTE: Spark NLP 4.x is based on TensorFlow 2.7.x, which is compatible with CUDA 11 and cuDNN 8.0.2. The only Databricks
runtimes supporting CUDA 11 are 9.x and above, as listed under GPU.
## EMR Support
-Spark NLP 4.3.2 has been tested and is compatible with the following EMR releases:
+Spark NLP 4.4.0 has been tested and is compatible with the following EMR releases:
- emr-6.2.0
- emr-6.3.0
@@ -329,6 +341,9 @@ Spark NLP 4.3.2 has been tested and is compatible with the following EMR release
- emr-6.5.0
- emr-6.6.0
- emr-6.7.0
+- emr-6.8.0
+- emr-6.9.0
+- emr-6.10.0
Full list of [Amazon EMR 6.x releases](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-6x.html)
@@ -361,11 +376,11 @@ Spark NLP supports all major releases of Apache Spark 3.0.x, Apache Spark 3.1.x,
```sh
# CPU
-spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.2
+spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.0
-pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.2
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.0
-spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.2
+spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.0
```
The `spark-nlp` package has been published to
@@ -374,11 +389,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s
```sh
# GPU
-spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.3.2
+spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.4.0
-pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.3.2
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.4.0
-spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.3.2
+spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.4.0
```
@@ -388,11 +403,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s
```sh
# AArch64
-spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.3.2
+spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.4.0
-pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.3.2
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.4.0
-spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.3.2
+spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.4.0
```
@@ -402,11 +417,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s
```sh
# M1/M2 (Apple Silicon)
-spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:4.3.2
+spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:4.4.0
-pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:4.3.2
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:4.4.0
-spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:4.3.2
+spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:4.4.0
```
@@ -420,7 +435,7 @@ set in your SparkSession:
spark-shell \
--driver-memory 16g \
--conf spark.kryoserializer.buffer.max=2000M \
- --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.2
+ --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.0
```
## Scala
@@ -438,7 +453,7 @@ coordinates:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.12</artifactId>
- <version>4.3.2</version>
+ <version>4.4.0</version>
</dependency>
```
@@ -449,7 +464,7 @@ coordinates:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu_2.12</artifactId>
- <version>4.3.2</version>
+ <version>4.4.0</version>
</dependency>
```
@@ -460,7 +475,7 @@ coordinates:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-aarch64_2.12</artifactId>
- <version>4.3.2</version>
+ <version>4.4.0</version>
</dependency>
```
@@ -471,7 +486,7 @@ coordinates:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-silicon_2.12</artifactId>
- <version>4.3.2</version>
+ <version>4.4.0</version>
</dependency>
```
@@ -481,28 +496,28 @@ coordinates:
```sbtshell
// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "4.3.2"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "4.4.0"
```
**spark-nlp-gpu:**
```sbtshell
// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-gpu
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "4.3.2"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "4.4.0"
```
**spark-nlp-aarch64:**
```sbtshell
// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-aarch64
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-aarch64" % "4.3.2"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-aarch64" % "4.4.0"
```
**spark-nlp-silicon:**
```sbtshell
// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-silicon
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "4.3.2"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "4.4.0"
```
Maven
@@ -524,7 +539,7 @@ If you installed pyspark through pip/conda, you can install `spark-nlp` through
Pip:
```bash
-pip install spark-nlp==4.3.2
+pip install spark-nlp==4.4.0
```
Conda:
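The Conda command itself is unchanged and elided by the diff; a sketch, assuming the `johnsnowlabs` Anaconda channel the project publishes to:

```bash
conda install -c johnsnowlabs spark-nlp
```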
@@ -553,7 +568,7 @@ spark = SparkSession.builder
.config("spark.driver.memory", "16G")
.config("spark.driver.maxResultSize", "0")
.config("spark.kryoserializer.buffer.max", "2000M")
- .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.2")
+ .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.0")
.getOrCreate()
```
@@ -624,7 +639,7 @@ Use either one of the following options
- Add the following Maven Coordinates to the interpreter's library list
```bash
-com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.2
+com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.0
```
- Add a path to the pre-built jar from [here](#compiled-jars) in the interpreter's library list, making sure the jar is
@@ -635,7 +650,7 @@ com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.2
Apart from the previous step, install the Python module through pip
```bash
-pip install spark-nlp==4.3.2
+pip install spark-nlp==4.4.0
```
Or you can install `spark-nlp` from inside Zeppelin by using Conda:
@@ -663,7 +678,7 @@ launch the Jupyter from the same Python environment:
$ conda create -n sparknlp python=3.8 -y
$ conda activate sparknlp
# spark-nlp by default is based on pyspark 3.x
-$ pip install spark-nlp==4.3.2 pyspark==3.3.1 jupyter
+$ pip install spark-nlp==4.4.0 pyspark==3.3.1 jupyter
$ jupyter notebook
```
@@ -680,7 +695,7 @@ export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
-pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.2
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.0
```
Alternatively, you can use the `--jars` option for pyspark together with `pip install spark-nlp`, as sketched below.
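A hedged sketch of that combination (the jar path is hypothetical; see the Offline section for obtaining the Fat JAR):

```sh
pip install spark-nlp==4.4.0

# load the Fat JAR directly instead of resolving --packages from Maven
pyspark --jars /tmp/spark-nlp-assembly-4.4.0.jar
```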
@@ -707,7 +722,7 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi
# -s is for spark-nlp
# -g will enable upgrading libcudnn8 to 8.1.0 on Google Colab for GPU usage
# by default they are set to the latest
-!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 4.3.2
+!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 4.4.0
```
[Spark NLP quick start on Google Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/quick_start_google_colab.ipynb)
@@ -730,7 +745,7 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi
# -s is for spark-nlp
# -g will enable upgrading libcudnn8 to 8.1.0 on Kaggle for GPU usage
# by default they are set to the latest
-!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 4.3.2
+!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 4.4.0
```
[Spark NLP quick start on Kaggle Kernel](https://www.kaggle.com/mozzie/spark-nlp-named-entity-recognition) is a live
@@ -749,9 +764,9 @@ demo on Kaggle Kernel that performs named entity recognitions by using Spark NLP
3. In the `Libraries` tab of your cluster, follow these steps:
- 3.1. Install New -> PyPI -> `spark-nlp==4.3.2` -> Install
+ 3.1. Install New -> PyPI -> `spark-nlp==4.4.0` -> Install
- 3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.2` -> Install
+ 3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.0` -> Install
4. Now you can attach your notebook to the cluster and use Spark NLP!
@@ -802,7 +817,7 @@ A sample of your software configuration in JSON on S3 (must be public access):
"spark.kryoserializer.buffer.max": "2000M",
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.driver.maxResultSize": "0",
- "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.2"
+ "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.0"
}
}]
```
@@ -811,7 +826,7 @@ A sample of AWS CLI to launch EMR cluster:
```sh
aws emr create-cluster \
---name "Spark NLP 4.3.2" \
+--name "Spark NLP 4.4.0" \
--release-label emr-6.2.0 \
--applications Name=Hadoop Name=Spark Name=Hive \
--instance-type m4.4xlarge \
@@ -875,7 +890,7 @@ gcloud dataproc clusters create ${CLUSTER_NAME} \
--enable-component-gateway \
--metadata 'PIP_PACKAGES=spark-nlp spark-nlp-display google-cloud-bigquery google-cloud-storage' \
--initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh \
- --properties spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,spark:spark.driver.maxResultSize=0,spark:spark.kryoserializer.buffer.max=2000M,spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.2
+ --properties spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,spark:spark.driver.maxResultSize=0,spark:spark.kryoserializer.buffer.max=2000M,spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.0
```
2. On an existing cluster, install the spark-nlp and spark-nlp-display packages from PyPI, as sketched below.
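A sketch of that step, assuming pip is available in the cluster's Python environment:

```bash
pip install spark-nlp spark-nlp-display
```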
@@ -914,7 +929,7 @@ spark = SparkSession.builder
.config("spark.kryoserializer.buffer.max", "2000m")
.config("spark.jsl.settings.pretrained.cache_folder", "sample_data/pretrained")
.config("spark.jsl.settings.storage.cluster_tmp_dir", "sample_data/storage")
- .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.2")
+ .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.0")
.getOrCreate()
```
@@ -928,7 +943,7 @@ spark-shell \
--conf spark.kryoserializer.buffer.max=2000M \
--conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \
--conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \
- --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.2
+ --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.0
```
**pyspark:**
@@ -941,7 +956,7 @@ pyspark \
--conf spark.kryoserializer.buffer.max=2000M \
--conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \
--conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \
- --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.2
+ --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.0
```
**Databricks:**
@@ -1100,7 +1115,7 @@ ResourceDownloader.showPublicPipelines(lang = "en", version = "3.1.0")
*/
```
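As a usage sketch in Python (`explain_document_dl` is the pipeline name used in this README's offline example):

```python
from sparknlp.pretrained import PretrainedPipeline

# the first call downloads the pipeline and caches it locally
pipeline = PretrainedPipeline("explain_document_dl", lang="en")
result = pipeline.annotate("Google has announced a new office in Sweden.")
print(result["entities"])
```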
-#### Please check out our Models Hub for the full list of [pre-trained pipelines](https://nlp.johnsnowlabs.com/models) with examples, demos, benchmarks, and more
+#### Please check out our Models Hub for the full list of [pre-trained pipelines](https://sparknlp.org/models) with examples, demos, benchmarks, and more
### Models
@@ -1190,7 +1205,7 @@ XlnetEmbeddings
*/
```
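A sketch of loading one of these models by name in Python (the model name here is illustrative):

```python
from sparknlp.annotator import BertEmbeddings

# downloads the named model from the Models Hub on first use
embeddings = (
    BertEmbeddings.pretrained("small_bert_L2_128", "en")
    .setInputCols(["sentence", "token"])
    .setOutputCol("embeddings")
)
```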
-#### Please check out our Models Hub for the full list of [pre-trained models](https://nlp.johnsnowlabs.com/models) with examples, demo, benchmark, and more
+#### Please check out our Models Hub for the full list of [pre-trained models](https://sparknlp.org/models) with examples, demos, benchmarks, and more
## Offline
@@ -1201,7 +1216,7 @@ any limitations offline:
- Instead of using the Maven package, you need to load our Fat JAR
- Instead of using PretrainedPipeline for pretrained pipelines or the `.pretrained()` function to download pretrained
- models, you will need to manually download your pipeline/model from [Models Hub](https://nlp.johnsnowlabs.com/models),
+ models, you will need to manually download your pipeline/model from [Models Hub](https://sparknlp.org/models),
extract it, and load it.
Example of a `SparkSession` with the Fat JAR to run Spark NLP offline:
@@ -1213,7 +1228,7 @@ spark = SparkSession.builder
.config("spark.driver.memory", "16G")
.config("spark.driver.maxResultSize", "0")
.config("spark.kryoserializer.buffer.max", "2000M")
- .config("spark.jars", "/tmp/spark-nlp-assembly-4.3.2.jar")
+ .config("spark.jars", "/tmp/spark-nlp-assembly-4.4.0.jar")
.getOrCreate()
```
@@ -1222,7 +1237,7 @@ spark = SparkSession.builder
version (3.0.x, 3.1.x, 3.2.x, and 3.3.x)
- If you are running locally, you can load the Fat JAR from your local FileSystem; however, in a cluster setup you need
to put the Fat JAR on a distributed FileSystem such as HDFS, DBFS, S3, etc. (
- i.e., `hdfs:///tmp/spark-nlp-assembly-4.3.2.jar`)
+ e.g., `hdfs:///tmp/spark-nlp-assembly-4.4.0.jar`)
Example of using pretrained Models and Pipelines offline:
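The example body is elided by the diff; judging from the context line below, a sketch of the offline load (the path assumes a pipeline archive downloaded from the Models Hub and extracted locally):

```python
from pyspark.ml import PipelineModel

# load a manually downloaded and extracted pipeline instead of .pretrained()
pipeline = PipelineModel.load("/tmp/explain_document_dl_en_2.0.2_2.4_1556530585689/")
```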
@@ -1252,13 +1267,13 @@ PipelineModel.load("/tmp/explain_document_dl_en_2.0.2_2.4_1556530585689/")
Need more **examples**? Check out our dedicated [Spark NLP Examples](https://github.com/JohnSnowLabs/spark-nlp/tree/master/examples)
repository, which showcases all Spark NLP use cases!
-Also, don't forget to check [Spark NLP in Action](https://nlp.johnsnowlabs.com/demo) built by Streamlit.
+Also, don't forget to check [Spark NLP in Action](https://sparknlp.org/demo) built by Streamlit.
### All examples: [spark-nlp/examples](https://github.com/JohnSnowLabs/spark-nlp/tree/master/examples)
## FAQ
-[Check our Articles and Videos page here](https://nlp.johnsnowlabs.com/learn)
+[Check our Articles and Videos page here](https://sparknlp.org/learn)
## Citation
@@ -1299,8 +1314,6 @@ Clone the repo and submit your pull-requests! Or directly create issues in this
[http://johnsnowlabs.com](http://johnsnowlabs.com)
-
-
%package -n python3-spark-nlp
Summary: John Snow Labs Spark NLP is a natural language processing library built on top of Apache Spark ML. It provides simple, performant & accurate NLP annotations for machine learning pipelines that scale easily in a distributed environment.
Provides: python-spark-nlp
@@ -1327,18 +1340,16 @@ BuildRequires: python3-pip
<img src="https://static.pepy.tech/personalized-badge/spark-nlp?period=total&units=international_system&left_color=grey&right_color=orange&left_text=pip%20downloads" /></a>
</p>
-Spark NLP is a state-of-the-art Natural Language Processing library built on top of Apache Spark. It provides **simple
-**, **performant** & **accurate** NLP annotations for machine learning pipelines that **scale** easily in a distributed
+Spark NLP is a state-of-the-art Natural Language Processing library built on top of Apache Spark. It provides **simple**, **performant** & **accurate** NLP annotations for machine learning pipelines that **scale** easily in a distributed
environment.
-Spark NLP comes with **11000+** pretrained **pipelines** and **models** in more than **200+** languages.
-It also offers tasks such as **Tokenization**, **Word Segmentation**, **Part-of-Speech Tagging**, Word and Sentence **Embeddings**, **Named Entity Recognition**, **Dependency Parsing**, **Spell Checking**, **Text Classification**, **Sentiment Analysis**, **Token Classification**, **Machine Translation** (+180 languages), **Summarization**, **Question Answering**, **Table Question Answering**, **Text Generation**, **Image Classification**, **Automatic Speech Recognition
-**, and many more [NLP tasks](#features).
+Spark NLP comes with **17000+** pretrained **pipelines** and **models** in **200+** languages.
+It also offers tasks such as **Tokenization**, **Word Segmentation**, **Part-of-Speech Tagging**, Word and Sentence **Embeddings**, **Named Entity Recognition**, **Dependency Parsing**, **Spell Checking**, **Text Classification**, **Sentiment Analysis**, **Token Classification**, **Machine Translation** (+180 languages), **Summarization**, **Question Answering**, **Table Question Answering**, **Text Generation**, **Image Classification**, **Automatic Speech Recognition**, **Zero-Shot Learning**, and many more [NLP tasks](#features).
-**Spark NLP** is the only open-source NLP library in **production** that offers state-of-the-art transformers such as **BERT**, **CamemBERT**, **ALBERT**, **ELECTRA**, **XLNet**, **DistilBERT**, **RoBERTa**, **DeBERTa**, **XLM-RoBERTa**, **Longformer**, **ELMO**, **Universal Sentence Encoder**, **Google T5**, **MarianMT**, **GPT2**, and **Vision Transformers (ViT)** not only to **Python** and **R**, but also to **JVM** ecosystem (**Java**, **Scala**, and **Kotlin**) at **scale** by extending **Apache Spark** natively.
+**Spark NLP** is the only open-source NLP library in **production** that offers state-of-the-art transformers such as **BERT**, **CamemBERT**, **ALBERT**, **ELECTRA**, **XLNet**, **DistilBERT**, **RoBERTa**, **DeBERTa**, **XLM-RoBERTa**, **Longformer**, **ELMO**, **Universal Sentence Encoder**, **Facebook BART**, **Google T5**, **MarianMT**, **OpenAI GPT2**, and **Vision Transformers (ViT)** not only to **Python** and **R**, but also to the **JVM** ecosystem (**Java**, **Scala**, and **Kotlin**) at **scale** by extending **Apache Spark** natively.
## Project's website
-Take a look at our official Spark NLP page: [http://nlp.johnsnowlabs.com/](http://nlp.johnsnowlabs.com/) for user
+Take a look at our official Spark NLP page: [https://sparknlp.org/](https://sparknlp.org/) for user
documentation and examples
## Community support
@@ -1449,19 +1460,22 @@ documentation and examples
- Longformer for Question Answering
- Table Question Answering (TAPAS)
- Zero-Shot NER Model
+- Zero Shot Text Classification by BERT (ZSL)
- Neural Machine Translation (MarianMT)
- Text-To-Text Transfer Transformer (Google T5)
- Generative Pre-trained Transformer 2 (OpenAI GPT2)
-- Vision Transformer (ViT)
-- Swin Image Classification
+- Seq2Seq for NLG, Translation, and Comprehension (Facebook BART)
+- Vision Transformer (Google ViT)
+- Swin Image Classification (Microsoft Swin Transformer)
+- ConvNext Image Classification (Facebook ConvNext)
- Automatic Speech Recognition (Wav2Vec2)
- Automatic Speech Recognition (HuBERT)
- Named entity recognition (Deep learning)
- Easy TensorFlow integration
- GPU Support
- Full integration with Spark ML functions
-- +9400 pre-trained models in +200 languages!
-- +3200 pre-trained pipelines in +200 languages!
+- +12000 pre-trained models in +200 languages!
+- +5000 pre-trained pipelines in +200 languages!
- Multi-lingual NER models: Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, German, Hebrew, Italian,
Japanese, Korean, Norwegian, Persian, Polish, Portuguese, Russian, Spanish, Swedish, Urdu, and more.
@@ -1474,7 +1488,7 @@ To use Spark NLP you need the following requirements:
**GPU (optional):**
-Spark NLP 4.3.2 is built with TensorFlow 2.7.1 and the following NVIDIA® software are only required for GPU support:
+Spark NLP 4.4.0 is built with TensorFlow 2.7.1; the following NVIDIA® software is required only for GPU support:
- NVIDIA® GPU drivers version 450.80.02 or higher
- CUDA® Toolkit 11.2
@@ -1490,7 +1504,7 @@ $ java -version
$ conda create -n sparknlp python=3.7 -y
$ conda activate sparknlp
# spark-nlp by default is based on pyspark 3.x
-$ pip install spark-nlp==4.3.2 pyspark==3.3.1
+$ pip install spark-nlp==4.4.0 pyspark==3.3.1
```
In a Python console or Jupyter `Python3` kernel:
@@ -1535,11 +1549,12 @@ For more examples, you can visit our dedicated [examples](https://github.com/Joh
## Apache Spark Support
-Spark NLP *4.3.2* has been built on top of Apache Spark 3.2 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, and
+Spark NLP *4.4.0* has been built on top of Apache Spark 3.2 and fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, and
3.3.x:
| Spark NLP | Apache Spark 2.3.x | Apache Spark 2.4.x | Apache Spark 3.0.x | Apache Spark 3.1.x | Apache Spark 3.2.x | Apache Spark 3.3.x |
|-----------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
+| 4.4.x | NO | NO | YES | YES | YES | YES |
| 4.3.x | NO | NO | YES | YES | YES | YES |
| 4.2.x | NO | NO | YES | YES | YES | YES |
| 4.1.x | NO | NO | YES | YES | YES | YES |
@@ -1558,22 +1573,23 @@ Find out more about `Spark NLP` versions from our [release notes](https://github
## Scala and Python Support
-| Spark NLP | Python 3.6 | Python 3.7 | Python 3.8 | Python 3.9 | Scala 2.11 | Scala 2.12 |
-|-----------|------------|------------|------------|------------|------------|------------|
-| 4.3.x | YES | YES | YES | YES | NO | YES |
-| 4.2.x | YES | YES | YES | YES | NO | YES |
-| 4.1.x | YES | YES | YES | YES | NO | YES |
-| 4.0.x | YES | YES | YES | YES | NO | YES |
-| 3.4.x | YES | YES | YES | YES | YES | YES |
-| 3.3.x | YES | YES | YES | NO | YES | YES |
-| 3.2.x | YES | YES | YES | NO | YES | YES |
-| 3.1.x | YES | YES | YES | NO | YES | YES |
-| 3.0.x | YES | YES | YES | NO | YES | YES |
-| 2.7.x | YES | YES | NO | NO | YES | NO |
+| Spark NLP | Python 3.6 | Python 3.7 | Python 3.8 | Python 3.9 | Python 3.10| Scala 2.11 | Scala 2.12 |
+|-----------|------------|------------|------------|------------|------------|------------|------------|
+| 4.4.x | NO | YES | YES | YES | YES | NO | YES |
+| 4.3.x | YES | YES | YES | YES | YES | NO | YES |
+| 4.2.x | YES | YES | YES | YES | YES | NO | YES |
+| 4.1.x | YES | YES | YES | YES | NO | NO | YES |
+| 4.0.x | YES | YES | YES | YES | NO | NO | YES |
+| 3.4.x | YES | YES | YES | YES | NO | YES | YES |
+| 3.3.x | YES | YES | YES | NO | NO | YES | YES |
+| 3.2.x | YES | YES | YES | NO | NO | YES | YES |
+| 3.1.x | YES | YES | YES | NO | NO | YES | YES |
+| 3.0.x | YES | YES | YES | NO | NO | YES | YES |
+| 2.7.x | YES | YES | NO | NO | NO | YES | NO |
## Databricks Support
-Spark NLP 4.3.2 has been tested and is compatible with the following runtimes:
+Spark NLP 4.4.0 has been tested and is compatible with the following runtimes:
**CPU:**
@@ -1599,6 +1615,12 @@ Spark NLP 4.3.2 has been tested and is compatible with the following runtimes:
- 11.2 ML
- 11.3
- 11.3 ML
+- 12.0
+- 12.0 ML
+- 12.1
+- 12.1 ML
+- 12.2
+- 12.2 ML
**GPU:**
@@ -1612,13 +1634,16 @@ Spark NLP 4.3.2 has been tested and is compatible with the following runtimes:
- 11.1 ML & GPU
- 11.2 ML & GPU
- 11.3 ML & GPU
+- 12.0 ML & GPU
+- 12.1 ML & GPU
+- 12.2 ML & GPU
NOTE: Spark NLP 4.x is based on TensorFlow 2.7.x, which is compatible with CUDA 11 and cuDNN 8.0.2. The only Databricks
runtimes supporting CUDA 11 are 9.x and above, as listed under GPU.
## EMR Support
-Spark NLP 4.3.2 has been tested and is compatible with the following EMR releases:
+Spark NLP 4.4.0 has been tested and is compatible with the following EMR releases:
- emr-6.2.0
- emr-6.3.0
@@ -1627,6 +1652,9 @@ Spark NLP 4.3.2 has been tested and is compatible with the following EMR release
- emr-6.5.0
- emr-6.6.0
- emr-6.7.0
+- emr-6.8.0
+- emr-6.9.0
+- emr-6.10.0
Full list of [Amazon EMR 6.x releases](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-6x.html)
@@ -1659,11 +1687,11 @@ Spark NLP supports all major releases of Apache Spark 3.0.x, Apache Spark 3.1.x,
```sh
# CPU
-spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.2
+spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.0
-pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.2
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.0
-spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.2
+spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.0
```
The `spark-nlp` package has been published to
@@ -1672,11 +1700,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s
```sh
# GPU
-spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.3.2
+spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.4.0
-pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.3.2
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.4.0
-spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.3.2
+spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.4.0
```
@@ -1686,11 +1714,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s
```sh
# AArch64
-spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.3.2
+spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.4.0
-pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.3.2
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.4.0
-spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.3.2
+spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.4.0
```
@@ -1700,11 +1728,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s
```sh
# M1/M2 (Apple Silicon)
-spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:4.3.2
+spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:4.4.0
-pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:4.3.2
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:4.4.0
-spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:4.3.2
+spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:4.4.0
```
@@ -1718,7 +1746,7 @@ set in your SparkSession:
spark-shell \
--driver-memory 16g \
--conf spark.kryoserializer.buffer.max=2000M \
- --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.2
+ --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.0
```
## Scala
@@ -1736,7 +1764,7 @@ coordinates:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.12</artifactId>
- <version>4.3.2</version>
+ <version>4.4.0</version>
</dependency>
```
@@ -1747,7 +1775,7 @@ coordinates:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu_2.12</artifactId>
- <version>4.3.2</version>
+ <version>4.4.0</version>
</dependency>
```
@@ -1758,7 +1786,7 @@ coordinates:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-aarch64_2.12</artifactId>
- <version>4.3.2</version>
+ <version>4.4.0</version>
</dependency>
```
@@ -1769,7 +1797,7 @@ coordinates:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-silicon_2.12</artifactId>
- <version>4.3.2</version>
+ <version>4.4.0</version>
</dependency>
```
@@ -1779,28 +1807,28 @@ coordinates:
```sbtshell
// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "4.3.2"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "4.4.0"
```
**spark-nlp-gpu:**
```sbtshell
// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-gpu
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "4.3.2"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "4.4.0"
```
**spark-nlp-aarch64:**
```sbtshell
// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-aarch64
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-aarch64" % "4.3.2"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-aarch64" % "4.4.0"
```
**spark-nlp-silicon:**
```sbtshell
// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-silicon
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "4.3.2"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "4.4.0"
```
Maven
@@ -1822,7 +1850,7 @@ If you installed pyspark through pip/conda, you can install `spark-nlp` through
Pip:
```bash
-pip install spark-nlp==4.3.2
+pip install spark-nlp==4.4.0
```
Conda:
@@ -1851,7 +1879,7 @@ spark = SparkSession.builder
.config("spark.driver.memory", "16G")
.config("spark.driver.maxResultSize", "0")
.config("spark.kryoserializer.buffer.max", "2000M")
- .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.2")
+ .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.0")
.getOrCreate()
```
@@ -1922,7 +1950,7 @@ Use either one of the following options
- Add the following Maven Coordinates to the interpreter's library list
```bash
-com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.2
+com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.0
```
- Add a path to the pre-built jar from [here](#compiled-jars) in the interpreter's library list, making sure the jar is
@@ -1933,7 +1961,7 @@ com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.2
Apart from the previous step, install the Python module through pip
```bash
-pip install spark-nlp==4.3.2
+pip install spark-nlp==4.4.0
```
Or you can install `spark-nlp` from inside Zeppelin by using Conda:
@@ -1961,7 +1989,7 @@ launch the Jupyter from the same Python environment:
$ conda create -n sparknlp python=3.8 -y
$ conda activate sparknlp
# spark-nlp by default is based on pyspark 3.x
-$ pip install spark-nlp==4.3.2 pyspark==3.3.1 jupyter
+$ pip install spark-nlp==4.4.0 pyspark==3.3.1 jupyter
$ jupyter notebook
```
@@ -1978,7 +2006,7 @@ export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
-pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.2
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.0
```
Alternatively, you can use the `--jars` option for pyspark together with `pip install spark-nlp`
@@ -2005,7 +2033,7 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi
# -s is for spark-nlp
# -g will enable upgrading libcudnn8 to 8.1.0 on Google Colab for GPU usage
# by default they are set to the latest
-!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 4.3.2
+!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 4.4.0
```
[Spark NLP quick start on Google Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/quick_start_google_colab.ipynb)
@@ -2028,7 +2056,7 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi
# -s is for spark-nlp
# -g will enable upgrading libcudnn8 to 8.1.0 on Kaggle for GPU usage
# by default they are set to the latest
-!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 4.3.2
+!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 4.4.0
```
[Spark NLP quick start on Kaggle Kernel](https://www.kaggle.com/mozzie/spark-nlp-named-entity-recognition) is a live
@@ -2047,9 +2075,9 @@ demo on Kaggle Kernel that performs named entity recognitions by using Spark NLP
3. In the `Libraries` tab of your cluster, follow these steps:
- 3.1. Install New -> PyPI -> `spark-nlp==4.3.2` -> Install
+ 3.1. Install New -> PyPI -> `spark-nlp==4.4.0` -> Install
- 3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.2` -> Install
+ 3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.0` -> Install
4. Now you can attach your notebook to the cluster and use Spark NLP!
@@ -2100,7 +2128,7 @@ A sample of your software configuration in JSON on S3 (must be public access):
"spark.kryoserializer.buffer.max": "2000M",
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.driver.maxResultSize": "0",
- "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.2"
+ "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.0"
}
}]
```
@@ -2109,7 +2137,7 @@ A sample of AWS CLI to launch EMR cluster:
```sh
aws emr create-cluster \
---name "Spark NLP 4.3.2" \
+--name "Spark NLP 4.4.0" \
--release-label emr-6.2.0 \
--applications Name=Hadoop Name=Spark Name=Hive \
--instance-type m4.4xlarge \
@@ -2173,7 +2201,7 @@ gcloud dataproc clusters create ${CLUSTER_NAME} \
--enable-component-gateway \
--metadata 'PIP_PACKAGES=spark-nlp spark-nlp-display google-cloud-bigquery google-cloud-storage' \
--initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh \
- --properties spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,spark:spark.driver.maxResultSize=0,spark:spark.kryoserializer.buffer.max=2000M,spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.2
+ --properties spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,spark:spark.driver.maxResultSize=0,spark:spark.kryoserializer.buffer.max=2000M,spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.0
```
2. On an existing cluster, install the spark-nlp and spark-nlp-display packages from PyPI.
@@ -2212,7 +2240,7 @@ spark = SparkSession.builder
.config("spark.kryoserializer.buffer.max", "2000m")
.config("spark.jsl.settings.pretrained.cache_folder", "sample_data/pretrained")
.config("spark.jsl.settings.storage.cluster_tmp_dir", "sample_data/storage")
- .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.2")
+ .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.0")
.getOrCreate()
```
@@ -2226,7 +2254,7 @@ spark-shell \
--conf spark.kryoserializer.buffer.max=2000M \
--conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \
--conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \
- --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.2
+ --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.0
```
**pyspark:**
@@ -2239,7 +2267,7 @@ pyspark \
--conf spark.kryoserializer.buffer.max=2000M \
--conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \
--conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \
- --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.2
+ --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.0
```
**Databricks:**
@@ -2398,7 +2426,7 @@ ResourceDownloader.showPublicPipelines(lang = "en", version = "3.1.0")
*/
```
-#### Please check out our Models Hub for the full list of [pre-trained pipelines](https://nlp.johnsnowlabs.com/models) with examples, demos, benchmarks, and more
+#### Please check out our Models Hub for the full list of [pre-trained pipelines](https://sparknlp.org/models) with examples, demos, benchmarks, and more
### Models
@@ -2488,7 +2516,7 @@ XlnetEmbeddings
*/
```
-#### Please check out our Models Hub for the full list of [pre-trained models](https://nlp.johnsnowlabs.com/models) with examples, demo, benchmark, and more
+#### Please check out our Models Hub for the full list of [pre-trained models](https://sparknlp.org/models) with examples, demos, benchmarks, and more
## Offline
@@ -2499,7 +2527,7 @@ any limitations offline:
- Instead of using the Maven package, you need to load our Fat JAR
- Instead of using PretrainedPipeline for pretrained pipelines or the `.pretrained()` function to download pretrained
- models, you will need to manually download your pipeline/model from [Models Hub](https://nlp.johnsnowlabs.com/models),
+ models, you will need to manually download your pipeline/model from [Models Hub](https://sparknlp.org/models),
extract it, and load it.
Example of a `SparkSession` with the Fat JAR to run Spark NLP offline:
@@ -2511,7 +2539,7 @@ spark = SparkSession.builder
.config("spark.driver.memory", "16G")
.config("spark.driver.maxResultSize", "0")
.config("spark.kryoserializer.buffer.max", "2000M")
- .config("spark.jars", "/tmp/spark-nlp-assembly-4.3.2.jar")
+ .config("spark.jars", "/tmp/spark-nlp-assembly-4.4.0.jar")
.getOrCreate()
```
@@ -2520,7 +2548,7 @@ spark = SparkSession.builder
version (3.0.x, 3.1.x, 3.2.x, and 3.3.x)
- If you are running locally, you can load the Fat JAR from your local FileSystem; however, in a cluster setup you need
to put the Fat JAR on a distributed FileSystem such as HDFS, DBFS, S3, etc. (
- i.e., `hdfs:///tmp/spark-nlp-assembly-4.3.2.jar`)
+ e.g., `hdfs:///tmp/spark-nlp-assembly-4.4.0.jar`)
Example of using pretrained Models and Pipelines offline:
@@ -2550,13 +2578,13 @@ PipelineModel.load("/tmp/explain_document_dl_en_2.0.2_2.4_1556530585689/")
Need more **examples**? Check out our dedicated [Spark NLP Examples](https://github.com/JohnSnowLabs/spark-nlp/tree/master/examples)
repository, which showcases all Spark NLP use cases!
-Also, don't forget to check [Spark NLP in Action](https://nlp.johnsnowlabs.com/demo) built by Streamlit.
+Also, don't forget to check [Spark NLP in Action](https://sparknlp.org/demo) built by Streamlit.
### All examples: [spark-nlp/examples](https://github.com/JohnSnowLabs/spark-nlp/tree/master/examples)
## FAQ
-[Check our Articles and Videos page here](https://nlp.johnsnowlabs.com/learn)
+[Check our Articles and Videos page here](https://sparknlp.org/learn)
## Citation
@@ -2597,8 +2625,6 @@ Clone the repo and submit your pull-requests! Or directly create issues in this
[http://johnsnowlabs.com](http://johnsnowlabs.com)
-
-
%package help
Summary: Development documents and examples for spark-nlp
Provides: python3-spark-nlp-doc
@@ -2622,18 +2648,16 @@ Provides: python3-spark-nlp-doc
<img src="https://static.pepy.tech/personalized-badge/spark-nlp?period=total&units=international_system&left_color=grey&right_color=orange&left_text=pip%20downloads" /></a>
</p>
-Spark NLP is a state-of-the-art Natural Language Processing library built on top of Apache Spark. It provides **simple
-**, **performant** & **accurate** NLP annotations for machine learning pipelines that **scale** easily in a distributed
+Spark NLP is a state-of-the-art Natural Language Processing library built on top of Apache Spark. It provides **simple**, **performant** & **accurate** NLP annotations for machine learning pipelines that **scale** easily in a distributed
environment.
-Spark NLP comes with **11000+** pretrained **pipelines** and **models** in more than **200+** languages.
-It also offers tasks such as **Tokenization**, **Word Segmentation**, **Part-of-Speech Tagging**, Word and Sentence **Embeddings**, **Named Entity Recognition**, **Dependency Parsing**, **Spell Checking**, **Text Classification**, **Sentiment Analysis**, **Token Classification**, **Machine Translation** (+180 languages), **Summarization**, **Question Answering**, **Table Question Answering**, **Text Generation**, **Image Classification**, **Automatic Speech Recognition
-**, and many more [NLP tasks](#features).
+Spark NLP comes with **17000+** pretrained **pipelines** and **models** in **200+** languages.
+It also offers tasks such as **Tokenization**, **Word Segmentation**, **Part-of-Speech Tagging**, Word and Sentence **Embeddings**, **Named Entity Recognition**, **Dependency Parsing**, **Spell Checking**, **Text Classification**, **Sentiment Analysis**, **Token Classification**, **Machine Translation** (+180 languages), **Summarization**, **Question Answering**, **Table Question Answering**, **Text Generation**, **Image Classification**, **Automatic Speech Recognition**, **Zero-Shot Learning**, and many more [NLP tasks](#features).
-**Spark NLP** is the only open-source NLP library in **production** that offers state-of-the-art transformers such as **BERT**, **CamemBERT**, **ALBERT**, **ELECTRA**, **XLNet**, **DistilBERT**, **RoBERTa**, **DeBERTa**, **XLM-RoBERTa**, **Longformer**, **ELMO**, **Universal Sentence Encoder**, **Google T5**, **MarianMT**, **GPT2**, and **Vision Transformers (ViT)** not only to **Python** and **R**, but also to **JVM** ecosystem (**Java**, **Scala**, and **Kotlin**) at **scale** by extending **Apache Spark** natively.
+**Spark NLP** is the only open-source NLP library in **production** that offers state-of-the-art transformers such as **BERT**, **CamemBERT**, **ALBERT**, **ELECTRA**, **XLNet**, **DistilBERT**, **RoBERTa**, **DeBERTa**, **XLM-RoBERTa**, **Longformer**, **ELMO**, **Universal Sentence Encoder**, **Facebook BART**, **Google T5**, **MarianMT**, **OpenAI GPT2**, and **Vision Transformers (ViT)** not only to **Python** and **R**, but also to the **JVM** ecosystem (**Java**, **Scala**, and **Kotlin**) at **scale** by extending **Apache Spark** natively.
## Project's website
-Take a look at our official Spark NLP page: [http://nlp.johnsnowlabs.com/](http://nlp.johnsnowlabs.com/) for user
+Take a look at our official Spark NLP page: [https://sparknlp.org/](https://sparknlp.org/) for user
documentation and examples
## Community support
@@ -2744,19 +2768,22 @@ documentation and examples
- Longformer for Question Answering
- Table Question Answering (TAPAS)
- Zero-Shot NER Model
+- Zero Shot Text Classification by BERT (ZSL)
- Neural Machine Translation (MarianMT)
- Text-To-Text Transfer Transformer (Google T5)
- Generative Pre-trained Transformer 2 (OpenAI GPT2)
-- Vision Transformer (ViT)
-- Swin Image Classification
+- Seq2Seq for NLG, Translation, and Comprehension (Facebook BART)
+- Vision Transformer (Google ViT)
+- Swin Image Classification (Microsoft Swin Transformer)
+- ConvNext Image Classification (Facebook ConvNext)
- Automatic Speech Recognition (Wav2Vec2)
- Automatic Speech Recognition (HuBERT)
- Named entity recognition (Deep learning)
- Easy TensorFlow integration
- GPU Support
- Full integration with Spark ML functions
-- +9400 pre-trained models in +200 languages!
-- +3200 pre-trained pipelines in +200 languages!
+- +12000 pre-trained models in +200 languages!
+- +5000 pre-trained pipelines in +200 languages!
- Multi-lingual NER models: Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, German, Hebrew, Italian,
Japanese, Korean, Norwegian, Persian, Polish, Portuguese, Russian, Spanish, Swedish, Urdu, and more.
@@ -2769,7 +2796,7 @@ To use Spark NLP you need the following requirements:
**GPU (optional):**
-Spark NLP 4.3.2 is built with TensorFlow 2.7.1 and the following NVIDIA® software are only required for GPU support:
+Spark NLP 4.4.0 is built with TensorFlow 2.7.1; the following NVIDIA® software is required only for GPU support:
- NVIDIA® GPU drivers version 450.80.02 or higher
- CUDA® Toolkit 11.2
@@ -2785,7 +2812,7 @@ $ java -version
$ conda create -n sparknlp python=3.7 -y
$ conda activate sparknlp
# spark-nlp by default is based on pyspark 3.x
-$ pip install spark-nlp==4.3.2 pyspark==3.3.1
+$ pip install spark-nlp==4.4.0 pyspark==3.3.1
```
In a Python console or Jupyter `Python3` kernel:
@@ -2830,11 +2857,12 @@ For more examples, you can visit our dedicated [examples](https://github.com/Joh
## Apache Spark Support
-Spark NLP *4.3.2* has been built on top of Apache Spark 3.2 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, and
+Spark NLP *4.4.0* has been built on top of Apache Spark 3.2 and fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, and
3.3.x:
| Spark NLP | Apache Spark 2.3.x | Apache Spark 2.4.x | Apache Spark 3.0.x | Apache Spark 3.1.x | Apache Spark 3.2.x | Apache Spark 3.3.x |
|-----------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
+| 4.4.x | NO | NO | YES | YES | YES | YES |
| 4.3.x | NO | NO | YES | YES | YES | YES |
| 4.2.x | NO | NO | YES | YES | YES | YES |
| 4.1.x | NO | NO | YES | YES | YES | YES |
@@ -2853,22 +2881,23 @@ Find out more about `Spark NLP` versions from our [release notes](https://github
## Scala and Python Support
-| Spark NLP | Python 3.6 | Python 3.7 | Python 3.8 | Python 3.9 | Scala 2.11 | Scala 2.12 |
-|-----------|------------|------------|------------|------------|------------|------------|
-| 4.3.x | YES | YES | YES | YES | NO | YES |
-| 4.2.x | YES | YES | YES | YES | NO | YES |
-| 4.1.x | YES | YES | YES | YES | NO | YES |
-| 4.0.x | YES | YES | YES | YES | NO | YES |
-| 3.4.x | YES | YES | YES | YES | YES | YES |
-| 3.3.x | YES | YES | YES | NO | YES | YES |
-| 3.2.x | YES | YES | YES | NO | YES | YES |
-| 3.1.x | YES | YES | YES | NO | YES | YES |
-| 3.0.x | YES | YES | YES | NO | YES | YES |
-| 2.7.x | YES | YES | NO | NO | YES | NO |
+| Spark NLP | Python 3.6 | Python 3.7 | Python 3.8 | Python 3.9 | Python 3.10| Scala 2.11 | Scala 2.12 |
+|-----------|------------|------------|------------|------------|------------|------------|------------|
+| 4.4.x | NO | YES | YES | YES | YES | NO | YES |
+| 4.3.x | YES | YES | YES | YES | YES | NO | YES |
+| 4.2.x | YES | YES | YES | YES | YES | NO | YES |
+| 4.1.x | YES | YES | YES | YES | NO | NO | YES |
+| 4.0.x | YES | YES | YES | YES | NO | NO | YES |
+| 3.4.x | YES | YES | YES | YES | NO | YES | YES |
+| 3.3.x | YES | YES | YES | NO | NO | YES | YES |
+| 3.2.x | YES | YES | YES | NO | NO | YES | YES |
+| 3.1.x | YES | YES | YES | NO | NO | YES | YES |
+| 3.0.x | YES | YES | YES | NO | NO | YES | YES |
+| 2.7.x | YES | YES | NO | NO | NO | YES | NO |
## Databricks Support
-Spark NLP 4.3.2 has been tested and is compatible with the following runtimes:
+Spark NLP 4.4.0 has been tested and is compatible with the following runtimes:
**CPU:**
@@ -2894,6 +2923,12 @@ Spark NLP 4.3.2 has been tested and is compatible with the following runtimes:
- 11.2 ML
- 11.3
- 11.3 ML
+- 12.0
+- 12.0 ML
+- 12.1
+- 12.1 ML
+- 12.2
+- 12.2 ML
**GPU:**
@@ -2907,13 +2942,16 @@ Spark NLP 4.3.2 has been tested and is compatible with the following runtimes:
- 11.1 ML & GPU
- 11.2 ML & GPU
- 11.3 ML & GPU
+- 12.0 ML & GPU
+- 12.1 ML & GPU
+- 12.2 ML & GPU
NOTE: Spark NLP 4.x is based on TensorFlow 2.7.x, which is compatible with CUDA 11 and cuDNN 8.0.2. The only Databricks
runtimes supporting CUDA 11 are 9.x and above, as listed under GPU.
## EMR Support
-Spark NLP 4.3.2 has been tested and is compatible with the following EMR releases:
+Spark NLP 4.4.0 has been tested and is compatible with the following EMR releases:
- emr-6.2.0
- emr-6.3.0
@@ -2922,6 +2960,9 @@ Spark NLP 4.3.2 has been tested and is compatible with the following EMR release
- emr-6.5.0
- emr-6.6.0
- emr-6.7.0
+- emr-6.8.0
+- emr-6.9.0
+- emr-6.10.0
Full list of [Amazon EMR 6.x releases](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-6x.html)
@@ -2954,11 +2995,11 @@ Spark NLP supports all major releases of Apache Spark 3.0.x, Apache Spark 3.1.x,
```sh
# CPU
-spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.2
+spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.0
-pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.2
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.0
-spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.2
+spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.0
```
The `spark-nlp` package has been published to
@@ -2967,11 +3008,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s
```sh
# GPU
-spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.3.2
+spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.4.0
-pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.3.2
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.4.0
-spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.3.2
+spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.4.0
```
@@ -2981,11 +3022,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s
```sh
# AArch64
-spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.3.2
+spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.4.0
-pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.3.2
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.4.0
-spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.3.2
+spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.4.0
```
@@ -2995,11 +3036,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s
```sh
# M1/M2 (Apple Silicon)
-spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:4.3.2
+spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:4.4.0
-pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:4.3.2
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:4.4.0
-spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:4.3.2
+spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:4.4.0
```
@@ -3013,7 +3054,7 @@ set in your SparkSession:
spark-shell \
--driver-memory 16g \
--conf spark.kryoserializer.buffer.max=2000M \
- --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.2
+ --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.0
```
## Scala
@@ -3031,7 +3072,7 @@ coordinates:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.12</artifactId>
- <version>4.3.2</version>
+ <version>4.4.0</version>
</dependency>
```
@@ -3042,7 +3083,7 @@ coordinates:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu_2.12</artifactId>
- <version>4.3.2</version>
+ <version>4.4.0</version>
</dependency>
```
@@ -3053,7 +3094,7 @@ coordinates:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-aarch64_2.12</artifactId>
- <version>4.3.2</version>
+ <version>4.4.0</version>
</dependency>
```
@@ -3064,7 +3105,7 @@ coordinates:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-silicon_2.12</artifactId>
- <version>4.3.2</version>
+ <version>4.4.0</version>
</dependency>
```
@@ -3074,28 +3115,28 @@ coordinates:
```sbtshell
// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "4.3.2"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "4.4.0"
```
**spark-nlp-gpu:**
```sbtshell
// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-gpu
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "4.3.2"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "4.4.0"
```
**spark-nlp-aarch64:**
```sbtshell
// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-aarch64
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-aarch64" % "4.3.2"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-aarch64" % "4.4.0"
```
**spark-nlp-silicon:**
```sbtshell
// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-silicon
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "4.3.2"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "4.4.0"
```
Maven
@@ -3117,7 +3158,7 @@ If you installed pyspark through pip/conda, you can install `spark-nlp` through
Pip:
```bash
-pip install spark-nlp==4.3.2
+pip install spark-nlp==4.4.0
```
Conda:
@@ -3146,7 +3187,7 @@ spark = SparkSession.builder
.config("spark.driver.memory", "16G")
.config("spark.driver.maxResultSize", "0")
.config("spark.kryoserializer.buffer.max", "2000M")
- .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.2")
+ .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.0")
.getOrCreate()
```
@@ -3217,7 +3258,7 @@ Use either one of the following options
- Add the following Maven Coordinates to the interpreter's library list
```bash
-com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.2
+com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.0
```
- Add a path to a pre-built jar from [here](#compiled-jars) in the interpreter's library list, making sure the jar is
@@ -3228,7 +3269,7 @@ com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.2
Apart from the previous step, install the Python module through pip:
```bash
-pip install spark-nlp==4.3.2
+pip install spark-nlp==4.4.0
```
Or you can install `spark-nlp` from inside Zeppelin by using Conda:
@@ -3256,7 +3297,7 @@ launch the Jupyter from the same Python environment:
$ conda create -n sparknlp python=3.8 -y
$ conda activate sparknlp
# spark-nlp by default is based on pyspark 3.x
-$ pip install spark-nlp==4.3.2 pyspark==3.3.1 jupyter
+$ pip install spark-nlp==4.4.0 pyspark==3.3.1 jupyter
$ jupyter notebook
```
@@ -3273,7 +3314,7 @@ export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
-pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.2
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.0
```
Alternatively, you can combine the `--jars` option for pyspark with `pip install spark-nlp`.
@@ -3300,7 +3341,7 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi
# -s is for spark-nlp
# -g will enable upgrading libcudnn8 to 8.1.0 on Google Colab for GPU usage
# by default they are set to the latest
-!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 4.3.2
+!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 4.4.0
```
[Spark NLP quick start on Google Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/quick_start_google_colab.ipynb)
@@ -3323,7 +3364,7 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi
# -s is for spark-nlp
# -g will enable upgrading libcudnn8 to 8.1.0 on Kaggle for GPU usage
# by default they are set to the latest
-!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 4.3.2
+!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 4.4.0
```
[Spark NLP quick start on Kaggle Kernel](https://www.kaggle.com/mozzie/spark-nlp-named-entity-recognition) is a live
@@ -3342,9 +3383,9 @@ demo on Kaggle Kernel that performs named entity recognitions by using Spark NLP
3. In the `Libraries` tab of your cluster, follow these steps:
- 3.1. Install New -> PyPI -> `spark-nlp==4.3.2` -> Install
+ 3.1. Install New -> PyPI -> `spark-nlp==4.4.0` -> Install
- 3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.2` -> Install
+ 3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.0` -> Install
4. Now you can attach your notebook to the cluster and use Spark NLP!
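As a quick smoke test once the libraries are attached, you can run a public pipeline in a notebook cell (a minimal sketch; `explain_document_dl` is one of the publicly available English pipelines):

```python
# Minimal sketch: verify the cluster can download and run a pretrained pipeline.
from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline("explain_document_dl", lang="en")
result = pipeline.annotate("Spark NLP ships thousands of pretrained pipelines.")
print(result["entities"])
```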
@@ -3395,7 +3436,7 @@ A sample of your software configuration in JSON on S3 (must be public access):
"spark.kryoserializer.buffer.max": "2000M",
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.driver.maxResultSize": "0",
- "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.2"
+ "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.0"
}
}]
```
@@ -3404,7 +3445,7 @@ A sample of AWS CLI to launch EMR cluster:
```sh
aws emr create-cluster \
---name "Spark NLP 4.3.2" \
+--name "Spark NLP 4.4.0" \
--release-label emr-6.2.0 \
--applications Name=Hadoop Name=Spark Name=Hive \
--instance-type m4.4xlarge \
@@ -3468,7 +3509,7 @@ gcloud dataproc clusters create ${CLUSTER_NAME} \
--enable-component-gateway \
--metadata 'PIP_PACKAGES=spark-nlp spark-nlp-display google-cloud-bigquery google-cloud-storage' \
--initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh \
- --properties spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,spark:spark.driver.maxResultSize=0,spark:spark.kryoserializer.buffer.max=2000M,spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.2
+ --properties spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,spark:spark.driver.maxResultSize=0,spark:spark.kryoserializer.buffer.max=2000M,spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.0
```
2. On an existing cluster, you need to install the spark-nlp and spark-nlp-display packages from PyPI.
@@ -3507,7 +3548,7 @@ spark = SparkSession.builder
.config("spark.kryoserializer.buffer.max", "2000m")
.config("spark.jsl.settings.pretrained.cache_folder", "sample_data/pretrained")
.config("spark.jsl.settings.storage.cluster_tmp_dir", "sample_data/storage")
- .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.2")
+ .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.0")
.getOrCreate()
```
@@ -3521,7 +3562,7 @@ spark-shell \
--conf spark.kryoserializer.buffer.max=2000M \
--conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \
--conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \
- --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.2
+ --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.0
```
**pyspark:**
@@ -3534,7 +3575,7 @@ pyspark \
--conf spark.kryoserializer.buffer.max=2000M \
--conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \
--conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \
- --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.2
+ --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.0
```
**Databricks:**
@@ -3693,7 +3734,7 @@ ResourceDownloader.showPublicPipelines(lang = "en", version = "3.1.0")
*/
```
-#### Please check out our Models Hub for the full list of [pre-trained pipelines](https://nlp.johnsnowlabs.com/models) with examples, demos, benchmarks, and more
+#### Please check out our Models Hub for the full list of [pre-trained pipelines](https://sparknlp.org/models) with examples, demos, benchmarks, and more
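The same listing is available from Python via `ResourceDownloader` (a minimal sketch; the argument names follow the Scala example above and may vary between releases):

```python
# Minimal sketch: list public English pipelines for a given Spark NLP version.
from sparknlp.pretrained import ResourceDownloader

ResourceDownloader.showPublicPipelines(lang="en", version="3.1.0")
```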
### Models
@@ -3783,7 +3824,7 @@ XlnetEmbeddings
*/
```
-#### Please check out our Models Hub for the full list of [pre-trained models](https://nlp.johnsnowlabs.com/models) with examples, demo, benchmark, and more
+#### Please check out our Models Hub for the full list of [pre-trained models](https://sparknlp.org/models) with examples, demos, benchmarks, and more
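Individual models are fetched the same way, through each annotator's `.pretrained()` method (a minimal sketch; `small_bert_L2_768` is one of the public English embeddings models):

```python
# Minimal sketch: download a pretrained embeddings model and wire its columns.
from sparknlp.annotator import BertEmbeddings

embeddings = (
    BertEmbeddings.pretrained("small_bert_L2_768", lang="en")
    .setInputCols(["document", "token"])
    .setOutputCol("embeddings")
)
```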
## Offline
@@ -3794,7 +3835,7 @@ any limitations offline:
- Instead of using the Maven package, you need to load our Fat JAR
- Instead of using PretrainedPipeline for pretrained pipelines or the `.pretrained()` function to download pretrained
- models, you will need to manually download your pipeline/model from [Models Hub](https://nlp.johnsnowlabs.com/models),
+ models, you will need to manually download your pipeline/model from [Models Hub](https://sparknlp.org/models),
extract it, and load it.
Example of a `SparkSession` with the Fat JAR to run Spark NLP offline:
@@ -3806,7 +3847,7 @@ spark = SparkSession.builder
.config("spark.driver.memory", "16G")
.config("spark.driver.maxResultSize", "0")
.config("spark.kryoserializer.buffer.max", "2000M")
- .config("spark.jars", "/tmp/spark-nlp-assembly-4.3.2.jar")
+ .config("spark.jars", "/tmp/spark-nlp-assembly-4.4.0.jar")
.getOrCreate()
```
@@ -3815,7 +3856,7 @@ spark = SparkSession.builder
version (3.0.x, 3.1.x, 3.2.x, and 3.3.x)
- If you are running locally, you can load the Fat JAR from your local FileSystem; however, in a cluster setup you need
to put the Fat JAR on a distributed FileSystem such as HDFS, DBFS, S3, etc. (
- i.e., `hdfs:///tmp/spark-nlp-assembly-4.3.2.jar`)
+ e.g., `hdfs:///tmp/spark-nlp-assembly-4.4.0.jar`)
Example of using pretrained Models and Pipelines offline:
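For instance, after downloading a pipeline archive from the Models Hub and extracting it, you can load it with the standard Spark ML loader (a minimal sketch; the path below matches the extracted `explain_document_dl` folder referenced in this document and will differ for other downloads):

```python
# Minimal sketch: load a manually downloaded and extracted pipeline, offline.
from pyspark.ml import PipelineModel

pipeline_model = PipelineModel.load(
    "/tmp/explain_document_dl_en_2.0.2_2.4_1556530585689/"
)
```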
@@ -3845,13 +3886,13 @@ PipelineModel.load("/tmp/explain_document_dl_en_2.0.2_2.4_1556530585689/")
Need more **examples**? Check out our dedicated [Spark NLP Examples](https://github.com/JohnSnowLabs/spark-nlp/tree/master/examples)
repository, which showcases all Spark NLP use cases!
-Also, don't forget to check [Spark NLP in Action](https://nlp.johnsnowlabs.com/demo) built by Streamlit.
+Also, don't forget to check out [Spark NLP in Action](https://sparknlp.org/demo), built with Streamlit.
### All examples: [spark-nlp/examples](https://github.com/JohnSnowLabs/spark-nlp/tree/master/examples)
## FAQ
-[Check our Articles and Videos page here](https://nlp.johnsnowlabs.com/learn)
+[Check our Articles and Videos page here](https://sparknlp.org/learn)
## Citation
@@ -3892,10 +3933,8 @@ Clone the repo and submit your pull-requests! Or directly create issues in this
[http://johnsnowlabs.com](http://johnsnowlabs.com)
-
-
%prep
-%autosetup -n spark-nlp-4.3.2
+%autosetup -n spark-nlp-4.4.0
%build
%py3_build
@@ -3935,5 +3974,5 @@ mv %{buildroot}/doclist.lst .
%{_docdir}/*
%changelog
-* Mon Apr 10 2023 Python_Bot <Python_Bot@openeuler.org> - 4.3.2-1
+* Fri Apr 21 2023 Python_Bot <Python_Bot@openeuler.org> - 4.4.0-1
- Package Spec generated
diff --git a/sources b/sources
index fbd4e84..217a365 100644
--- a/sources
+++ b/sources
@@ -1 +1 @@
-7811d7ab9c36ce2034abf9f7dcd6e738 spark-nlp-4.3.2.tar.gz
+51b6a6600515e11589aa81da06070cf4 spark-nlp-4.4.0.tar.gz