author CoprDistGit <infra@openeuler.org> 2023-05-10 06:05:29 +0000
committer CoprDistGit <infra@openeuler.org> 2023-05-10 06:05:29 +0000
commit 27d87af662311d2e85111e1248e1f64611f996e4
tree d67e1c847ff22ab29e5956dc25c3a2482f1702ff
parent d7f340d9c3306c73000a10996e76ba72f9e4a977
automatic import of python-simeon
-rw-r--r-- .gitignore 1
-rw-r--r-- python-simeon.spec 1061
-rw-r--r-- sources 1
3 files changed, 1063 insertions, 0 deletions
diff --git a/.gitignore b/.gitignore
index e69de29..29250bc 100644
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1 @@
+/simeon-0.0.24.tar.gz
diff --git a/python-simeon.spec b/python-simeon.spec
new file mode 100644
index 0000000..7d011ad
--- /dev/null
+++ b/python-simeon.spec
@@ -0,0 +1,1061 @@
+%global _empty_manifest_terminate_build 0
+Name: python-simeon
+Version: 0.0.24
+Release: 1
+Summary: A CLI tool to help process research data from edX
+License: MIT LICENSE
+URL: https://github.com/MIT-IR/simeon
+Source0: https://mirrors.nju.edu.cn/pypi/web/packages/41/1c/71f8c37b3b2a2b791e5bb96a6a81475783fb491e96fa976f55a038864c9a/simeon-0.0.24.tar.gz
+BuildArch: noarch
+
+Requires: python3-boto3
+Requires: python3-google-cloud-bigquery
+Requires: python3-google-cloud-storage
+Requires: python3-jinja2
+Requires: python3-dateutil
+Requires: python3-geoip2
+Requires: python3-sphinx
+Requires: python3-tox
+
+%description
+simeon
+~~~~~~
+
+``simeon`` is a CLI tool to help with the processing of edX research
+data. It can ``list``, ``download``, and ``split`` edX data packages. It
+can also ``push`` the output of the ``split`` subcommand to both GCS and
+BigQuery. It is heavily inspired by the
+`edx2bigquery <https://github.com/mitodl/edx2bigquery>`__ package. If
+you’ve used that tool, you should be able to navigate the quirks that
+may come with this one.
+
+Installing with pip
+~~~~~~~~~~~~~~~~~~~
+
+.. code:: sh
+
+ python3 -m pip install simeon
+ # Or with geoip
+ python3 -m pip install simeon[geoip]
+ # Then invoke the CLI tool with
+ simeon --help
+
+Installing with git clone
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code:: sh
+
+ git clone git@github.com:MIT-IR/simeon.git
+ cd simeon && python -m pip install .
+ # Or with geoip
+ cd simeon && python -m pip install .[geoip]
+ # Then invoke the CLI tool with
+ simeon --help
+
+Using Docker
+~~~~~~~~~~~~
+
+.. code:: sh
+
+ docker run -it mitir/simeon:latest
+ simeon --help
+
+Developing
+~~~~~~~~~~
+
+.. code:: sh
+
+ git clone git@github.com:MIT-IR/simeon.git
+ cd simeon
+ # Set up a virtual environment if you don't already have one
+ python3 -m venv venv
+ . venv/bin/activate
+ # pip install the package in an editable way
+ python3 -m pip install -e .[test,geoip]
+ # Invoke the executable
+ simeon --help
+ # Run the tests
+ tox
+ # Write code and tests and submit PRs
+
+Setups and configurations
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+``simeon`` is a glorified downloader and uploader set of scripts. Much
+of the downloading and uploading that it does makes the assumptions that
+you have your AWS credentials configured properly and that you’ve got a
+service account file for GCP services available on your machine. If the
+latter is missing, you may have to authenticate to GCP services through
+the SDK. However, both we and Google recommend you not do that.
+
+Every downloaded file is decrypted either during the download process or
+while it gets split by the ``simeon split`` command. So, this tool
+assumes that you’ve installed and configured ``gpg`` to be able to
+decrypt files from edX.
+
+The following steps may be useful to someone just getting started with
+the edX data package:
+
+1. Credentials from edX
+
+ - Reach out to edX to get your data czar credentials
+ - Configure both AWS and gpg, so your credentials can access the S3
+ buckets and your ``gpg`` key can decrypt the files there
+
+2. Set up a GCP project
+
+ - Create a GCP project
+ - Set up a BigQuery workspace
+ - Create a GCS bucket
+ - Create a service account and download the associated file
+ - Give the service account Admin Role access to both the BigQuery
+ project and the GCS bucket
+
+If the above steps are carried out successfully, then you should be able
+to use ``simeon`` without any issues.
+
+However, if you’ve taken care of the above steps but are still unable to
+get ``simeon`` to work, please open an issue.
+
+Further, ``simeon`` can parse INI formatted configuration files. It, by
+default, looks for files in the user’s home directory, or in the current
+working directory of the running process. The base names that are
+targeted when config files are looked up are: ``simeon.cfg`` or
+``.simeon.cfg`` or ``simeon.ini`` or ``.simeon.ini``. You can also
+provide ``simeon`` with a config file by using the global option
+``--config-file`` or ``-C`` and giving it a path to the file with the
+corresponding configurations.
+
+The following is a sample file content:
+
+.. code:: ini
+
+ # Default section for things like the organization whose data package is processed
+ # You can also set a default site as one of the following: edx, edge, patches
+ [DEFAULT]
+ site = edx
+ org = yourorganizationx
+ clistings_file = /path/to/file/with/course_ids
+
+ # Section related to Google Cloud (project, bucket, service account)
+ [GCP]
+ project = your-gcp-project-id
+ bucket = your-gcs-bucket
+ service_account_file = /path/to/a/service_account_file.json
+ wait_for_loads = True
+ geo_table = your-gcp-project.geocode_latest.geoip
+ youtube_table = your-gcp-project.videos.youtube
+ youtube_token = your-YouTube-API-token
+
+ # Section related to the AWS credentials needed to download data from S3
+ [AWS]
+ aws_cred_file = ~/.aws/credentials
+ profile_name = default
+
+The options in the config file(s) should match the optional arguments of
+the CLI tool. For instance, the ``--service-account-file``,
+``--project`` and ``--bucket`` options can be provided under the ``GCP``
+section of the config file as ``service_account_file``, ``project`` and
+``bucket``, respectively. Similarly, the ``--site`` and ``--org``
+options can be provided under the ``DEFAULT`` section as ``site`` and
+``org``, respectively.
+
+List files
+~~~~~~~~~~
+
+``simeon`` can list files on S3 for your organization based on criteria
+like file type (``sql`` or ``log`` or ``email``), time intervals (begin
+and end dates), and site (``edx`` or ``edge`` or ``patches``).
+
+- Example: List the latest data packages for file types ``sql``,
+ ``email``, and ``log``
+
+ .. code:: sh
+
+ # List the latest SQL bundle
+ simeon list -s edx -o mitx -f sql -L
+ # List the latest email data dump
+ simeon list -s edx -o mitx -f email -L
+ # List the latest tracking log file
+ simeon list -s edx -o mitx -f log -L
+
+Download and split files
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+``simeon`` can download, decrypt and split up files into folders
+belonging to specific courses.
+
+- Example 1: Download, split and push SQL bundles to both GCS and
+ BigQuery
+
+ .. code:: sh
+
+ # Download the latest SQL bundle
+ simeon download -s edx -o mitx -f sql -L -d data/
+
+ # Download SQL bundles dumped any time since 2021-01-01 and
+ # extract the contents for course ID MITx/12.3x/1T2021.
+ # Place the downloaded files in data/ and the output of the split operation
+ # in data/SQL
+ simeon download -s edx -o mitx -c "MITx/12.3x/1T2021" -f sql \
+ -b 2021-01-01 -d data -S -D data/SQL/
+
+ # Push to GCS the split up SQL files inside data/SQL/MITx__12_3x__1T2021
+ simeon push gcs -f sql -p ${GCP_PROJECT_ID} -b ${GCS_BUCKET} \
+ -S ${SAFILE} data/SQL/MITx__12_3x__1T2021
+
+ # Push the files to BigQuery and wait for the jobs to finish
+ # Using -s or --use-storage tells BigQuery to extract the files
+ # to be loaded from Google Cloud Storage.
+ # So, use the option when you've already called simeon push gcs
+ simeon push bq -w -s -f sql -p ${GCP_PROJECT_ID} -b ${GCS_BUCKET} \
+ -S ${SAFILE} data/SQL/MITx__12_3x__1T2021
+
+- Example 2: Download, split and push tracking logs to both GCS and
+ BigQuery
+
+ .. code:: sh
+
+ # Download the latest tracking log file
+ simeon download -s edx -o mitx -f log -L -d data/
+
+ # Download tracking logs dumped any time since 2021-01-01
+ # and extract the contents for course ID MITx/12.3x/1T2021
+ # Place the downloaded files in data/ and the output of the split operation
+ # in data/TRACKING_LOGS
+ simeon download -s edx -o mitx -c "MITx/12.3x/1T2021" -f log \
+ -b 2021-01-01 -d data -S -D data/TRACKING_LOGS/
+
+ # Push to GCS the split up tracking log files inside
+ # data/TRACKING_LOGS/MITx__12_3x__1T2021
+ simeon push gcs -f log -p ${GCP_PROJECT_ID} -b ${GCS_BUCKET} \
+ -S ${SAFILE} data/TRACKING_LOGS/MITx__12_3x__1T2021
+
+ # Push the files to BigQuery and wait for the jobs to finish
+ # Using -s or --use-storage tells BigQuery to extract the files
+ # to be loaded from Google Cloud Storage.
+ # So, use the option when you've already called simeon push gcs
+ simeon push bq -w -s -f log -p ${GCP_PROJECT_ID} -b ${GCS_BUCKET} \
+ -S ${SAFILE} data/TRACKING_LOGS/MITx__12_3x__1T2021
+
+- If you already have downloaded SQL bundles or tracking log files, you
+ can use ``simeon split`` to split them up.
+
+Make secondary/aggregated tables
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+``simeon`` can generate secondary tables based on already loaded data.
+Call ``simeon report --help`` for the expected positional and optional
+arguments.
+
+- Example: Make ``person_course`` for course ID ``MITx/12.3x/1T2021``
+
+ .. code:: sh
+
+ # Make a person course table for course ID MITx/12.3x/1T2021
+ # Provide the -g option to give a geolocation BigQuery table
+ # to fill the ip-to-location details in the generated person course table
+ COURSE=MITx/12.3x/1T2021
+ simeon report -w -g "${GCP_PROJECT_ID}.geocode.geoip" -t "person_course" \
+ -p ${GCP_PROJECT_ID} -S ${SAFILE} ${COURSE}
+
+Notes:
+~~~~~~
+
+1. Please note that SQL bundles are quite large when split up, so
+ consider using the ``-c`` or ``--courses`` option when invoking
+ ``simeon download -S`` or ``simeon split`` to make sure that you
+ limit the splitting to a set of course IDs. You may also use the
+ ``--clistings-file`` option, which expects a txt file of course IDs;
+ one ID per line. If the aforementioned options are not used,
+ ``simeon`` may end up failing to complete the split operation due to
+ exhausted system resources (storage to be specific).
+
+2. ``simeon download`` with file types ``log`` and ``email`` will both
+ download and decrypt the files matching the given criteria. If the
+ latter operations are successful, then the encrypted files are
+ deleted by default. This is to make sure that you don’t exhaust
+ storage resources. If you wish to keep those files, you can always
+ use the ``--keep-encrypted`` option that comes with
+ ``simeon download`` and ``simeon split``. SQL bundles are only
+ downloaded (not decrypted). Their decryption is done during a
+ ``split`` operation.
+
+3. Unless there is an unhandled exception (which should be reported as a
+ bug), ``simeon`` should, by default, print to the standard output
+ both information and errors encountered while processing your files.
+ You can capture those logs in a file by using the global option
+ ``--log-file`` and providing a destination file for the logs.
+
+4. When using multi argument options like ``--tables`` or ``--courses``,
+ you should try not to place them right before the expected positional
+ arguments. This will help the CLI parser not confuse your positional
+ arguments with table names (in the case of ``--tables``) or course
+ IDs (when ``--courses`` is used).
+
+5. Splitting tracking logs is a resource intensive process. The routine
+ that splits the logs generates a file for each course ID encountered.
+ If your logs contain more course IDs than the number of file
+ descriptors the running process is allowed to open, then ``simeon``
+ will put away records it can’t save to disk for a second pass.
+ Putting away the records involves using more memory than normally
+ required. The second pass will only require one file descriptor at a
+ time, so it should be safe in terms of file descriptor limits. To
+ help ``simeon`` not have to do a second pass, you may increase the
+ file descriptor limits of processes from your shell by running
+ something like ``ulimit -n 2000`` before calling ``simeon split`` on
+ Unix machines. For Windows users, you may have to dig into the
+ Windows Registries for a corresponding setting. This should tell your
+ OS kernel to allow OS processes to open up to 2000 file handles.
+
+6. Care must be taken when using ``simeon split`` and ``simeon push`` to
+ make sure that the number of positional arguments passed does not
+ lead to the invoked command exceeding the maximum command-line length
+ allowed for arguments in a command. To avoid errors along those
+ lines, please consider passing the positional arguments as UNIX glob
+ patterns. For instance,
+ ``simeon split --file-type log 'data/TRACKING-LOGS/*/*.log.gz'``
+ tells ``simeon`` to expand the given glob pattern, instead of relying
+ on the shell to do it.
+
+7. The ``report`` subcommand relies on the presence of SQL query files
+ to parse and send to BigQuery to execute. Any errors arising from
+ executing the parsed queries will be shown to the end user through
+ the given log stream. While the ``simeon`` tool ships with query
+ files for most secondary/reporting tables that are based on the
+ ``edx2bigquery`` tool, an end user should be able to point ``simeon``
+ to a different location with SQL query files by using the
+ ``--query-dir`` option that comes with ``simeon report``.
+ Additionally, these query files can contain
+ `jinja2 templated <https://jinja.palletsprojects.com/en/latest/>`__
+ SQL code. Any mentioned variables within these templated queries can
+ be passed to ``simeon report`` by using the ``--extra-args`` option
+ and passing key-value pair items in the format
+ ``var1=value1,var2=value2,var3=value3,...,varn=valuen``. Further,
+ these key-value pair items can also be typed by using the format
+ ``var1:i=value1,var2:s=value2,var3:f=value3,...,varn:s=valuen``. In
+ this format, the type is appended to the key, separated by a colon. The
+ only supported scalar types, so far, are ``s`` for ``str``, ``i`` for
+ ``int``, and ``f`` for ``float``. If any conversion errors occur
+ during value parsing, then those are shown to the end user, and the
+ query won’t get executed. Finally, if you wish to pass an ``array``
+ or ``list`` to the template, you will need to repeat a key multiple
+ times. For instance, if you want to pass a list named ``mylist``
+ containing the integers 1, 2, and 3, you could write something like
+ ``--extra-args mylist:i=1,mylist:i=2,mylist:i=3``. This means that
+ you’ll have a python ``list`` named ``mylist`` within your template,
+ and it should contain ``[1, 2, 3]``.
+
+
+%package -n python3-simeon
+Summary: A CLI tool to help process research data from edX
+Provides: python-simeon
+BuildRequires: python3-devel
+BuildRequires: python3-setuptools
+BuildRequires: python3-pip
+%description -n python3-simeon
+simeon
+~~~~~~
+
+``simeon`` is a CLI tool to help with the processing of edX research
+data. It can ``list``, ``download``, and ``split`` edX data packages. It
+can also ``push`` the output of the ``split`` subcommand to both GCS and
+BigQuery. It is heavily inspired by the
+`edx2bigquery <https://github.com/mitodl/edx2bigquery>`__ package. If
+you’ve used that tool, you should be able to navigate the quirks that
+may come with this one.
+
+Installing with pip
+~~~~~~~~~~~~~~~~~~~
+
+.. code:: sh
+
+ python3 -m pip install simeon
+ # Or with geoip
+ python3 -m pip install simeon[geoip]
+ # Then invoke the CLI tool with
+ simeon --help
+
+Installing with git clone
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code:: sh
+
+ git clone git@github.com:MIT-IR/simeon.git
+ cd simeon && python -m pip install .
+ # Or with geoip
+ cd simeon && python -m pip install .[geoip]
+ # Then invoke the CLI tool with
+ simeon --help
+
+Using Docker
+~~~~~~~~~~~~
+
+.. code:: sh
+
+ docker run -it mitir/simeon:latest
+ simeon --help
+
+Developing
+~~~~~~~~~~
+
+.. code:: sh
+
+ git clone git@github.com:MIT-IR/simeon.git
+ cd simeon
+ # Set up a virtual environment if you don't already have one
+ python3 -m venv venv
+ . venv/bin/activate
+ # pip install the package in an editable way
+ python3 -m pip install -e .[test,geoip]
+ # Invoke the executable
+ simeon --help
+ # Run the tests
+ tox
+ # Write code and tests and submit PRs
+
+Setups and configurations
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+``simeon`` is a glorified downloader and uploader set of scripts. Much
+of the downloading and uploading that it does makes the assumptions that
+you have your AWS credentials configured properly and that you’ve got a
+service account file for GCP services available on your machine. If the
+latter is missing, you may have to authenticate to GCP services through
+the SDK. However, both we and Google recommend you not do that.
+
+Every downloaded file is decrypted either during the download process or
+while it gets split by the ``simeon split`` command. So, this tool
+assumes that you’ve installed and configured ``gpg`` to be able to
+decrypt files from edX.
+
+The following steps may be useful to someone just getting started with
+the edX data package:
+
+1. Credentials from edX
+
+ - Reach out to edX to get your data czar credentials
+ - Configure both AWS and gpg, so your credentials can access the S3
+ buckets and your ``gpg`` key can decrypt the files there
+
+2. Set up a GCP project
+
+ - Create a GCP project
+ - Set up a BigQuery workspace
+ - Create a GCS bucket
+ - Create a service account and download the associated file
+ - Give the service account Admin Role access to both the BigQuery
+ project and the GCS bucket
+
+If the above steps are carried out successfully, then you should be able
+to use ``simeon`` without any issues.
+
+However, if you’ve taken care of the above steps but are still unable to
+get ``simeon`` to work, please open an issue.
+
+Further, ``simeon`` can parse INI formatted configuration files. It, by
+default, looks for files in the user’s home directory, or in the current
+working directory of the running process. The base names that are
+targeted when config files are looked up are: ``simeon.cfg`` or
+``.simeon.cfg`` or ``simeon.ini`` or ``.simeon.ini``. You can also
+provide ``simeon`` with a config file by using the global option
+``--config-file`` or ``-C`` and giving it a path to the file with the
+corresponding configurations.
+
+The following is a sample file content:
+
+.. code:: ini
+
+ # Default section for things like the organization whose data package is processed
+ # You can also set a default site as one of the following: edx, edge, patches
+ [DEFAULT]
+ site = edx
+ org = yourorganizationx
+ clistings_file = /path/to/file/with/course_ids
+
+ # Section related to Google Cloud (project, bucket, service account)
+ [GCP]
+ project = your-gcp-project-id
+ bucket = your-gcs-bucket
+ service_account_file = /path/to/a/service_account_file.json
+ wait_for_loads = True
+ geo_table = your-gcp-project.geocode_latest.geoip
+ youtube_table = your-gcp-project.videos.youtube
+ youtube_token = your-YouTube-API-token
+
+ # Section related to the AWS credentials needed to download data from S3
+ [AWS]
+ aws_cred_file = ~/.aws/credentials
+ profile_name = default
+
+The options in the config file(s) should match the optional arguments of
+the CLI tool. For instance, the ``--service-account-file``,
+``--project`` and ``--bucket`` options can be provided under the ``GCP``
+section of the config file as ``service_account_file``, ``project`` and
+``bucket``, respectively. Similarly, the ``--site`` and ``--org``
+options can be provided under the ``DEFAULT`` section as ``site`` and
+``org``, respectively.
+
+List files
+~~~~~~~~~~
+
+``simeon`` can list files on S3 for your organization based on criteria
+like file type (``sql`` or ``log`` or ``email``), time intervals (begin
+and end dates), and site (``edx`` or ``edge`` or ``patches``).
+
+- Example: List the latest data packages for file types ``sql``,
+ ``email``, and ``log``
+
+ .. code:: sh
+
+ # List the latest SQL bundle
+ simeon list -s edx -o mitx -f sql -L
+ # List the latest email data dump
+ simeon list -s edx -o mitx -f email -L
+ # List the latest tracking log file
+ simeon list -s edx -o mitx -f log -L
+
+Download and split files
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+``simeon`` can download, decrypt and split up files into folders
+belonging to specific courses.
+
+- Example 1: Download, split and push SQL bundles to both GCS and
+ BigQuery
+
+ .. code:: sh
+
+ # Download the latest SQL bundle
+ simeon download -s edx -o mitx -f sql -L -d data/
+
+ # Download SQL bundles dumped any time since 2021-01-01 and
+ # extract the contents for course ID MITx/12.3x/1T2021.
+ # Place the downloaded files in data/ and the output of the split operation
+ # in data/SQL
+ simeon download -s edx -o mitx -c "MITx/12.3x/1T2021" -f sql \
+ -b 2021-01-01 -d data -S -D data/SQL/
+
+ # Push to GCS the split up SQL files inside data/SQL/MITx__12_3x__1T2021
+ simeon push gcs -f sql -p ${GCP_PROJECT_ID} -b ${GCS_BUCKET} \
+ -S ${SAFILE} data/SQL/MITx__12_3x__1T2021
+
+ # Push the files to BigQuery and wait for the jobs to finish
+ # Using -s or --use-storage tells BigQuery to extract the files
+ # to be loaded from Google Cloud Storage.
+ # So, use the option when you've already called simeon push gcs
+ simeon push bq -w -s -f sql -p ${GCP_PROJECT_ID} -b ${GCS_BUCKET} \
+ -S ${SAFILE} data/SQL/MITx__12_3x__1T2021
+
+- Example 2: Download, split and push tracking logs to both GCS and
+ BigQuery
+
+ .. code:: sh
+
+ # Download the latest tracking log file
+ simeon download -s edx -o mitx -f log -L -d data/
+
+ # Download tracking logs dumped any time since 2021-01-01
+ # and extract the contents for course ID MITx/12.3x/1T2021
+ # Place the downloaded files in data/ and the output of the split operation
+ # in data/TRACKING_LOGS
+ simeon download -s edx -o mitx -c "MITx/12.3x/1T2021" -f log \
+ -b 2021-01-01 -d data -S -D data/TRACKING_LOGS/
+
+ # Push to GCS the split up tracking log files inside
+ # data/TRACKING_LOGS/MITx__12_3x__1T2021
+ simeon push gcs -f log -p ${GCP_PROJECT_ID} -b ${GCS_BUCKET} \
+ -S ${SAFILE} data/TRACKING_LOGS/MITx__12_3x__1T2021
+
+ # Push the files to BigQuery and wait for the jobs to finish
+ # Using -s or --use-storage tells BigQuery to extract the files
+ # to be loaded from Google Cloud Storage.
+ # So, use the option when you've already called simeon push gcs
+ simeon push bq -w -s -f log -p ${GCP_PROJECT_ID} -b ${GCS_BUCKET} \
+ -S ${SAFILE} data/TRACKING_LOGS/MITx__12_3x__1T2021
+
+- If you already have downloaded SQL bundles or tracking log files, you
+ can use ``simeon split`` to split them up.
+
+Make secondary/aggregated tables
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+``simeon`` can generate secondary tables based on already loaded data.
+Call ``simeon report --help`` for the expected positional and optional
+arguments.
+
+- Example: Make ``person_course`` for course ID ``MITx/12.3x/1T2021``
+
+ .. code:: sh
+
+ # Make a person course table for course ID MITx/12.3x/1T2021
+ # Provide the -g option to give a geolocation BigQuery table
+ # to fill the ip-to-location details in the generated person course table
+ COURSE=MITx/12.3x/1T2021
+ simeon report -w -g "${GCP_PROJECT_ID}.geocode.geoip" -t "person_course" \
+ -p ${GCP_PROJECT_ID} -S ${SAFILE} ${COURSE}
+
+Notes:
+~~~~~~
+
+1. Please note that SQL bundles are quite large when split up, so
+ consider using the ``-c`` or ``--courses`` option when invoking
+ ``simeon download -S`` or ``simeon split`` to make sure that you
+ limit the splitting to a set of course IDs. You may also use the
+ ``--clistings-file`` option, which expects a txt file of course IDs;
+ one ID per line. If the aforementioned options are not used,
+ ``simeon`` may end up failing to complete the split operation due to
+ exhausted system resources (storage to be specific).
+
+2. ``simeon download`` with file types ``log`` and ``email`` will both
+ download and decrypt the files matching the given criteria. If the
+ latter operations are successful, then the encrypted files are
+ deleted by default. This is to make sure that you don’t exhaust
+ storage resources. If you wish to keep those files, you can always
+ use the ``--keep-encrypted`` option that comes with
+ ``simeon download`` and ``simeon split``. SQL bundles are only
+ downloaded (not decrypted). Their decryption is done during a
+ ``split`` operation.
+
+3. Unless there is an unhandled exception (which should be reported as a
+ bug), ``simeon`` should, by default, print to the standard output
+ both information and errors encountered while processing your files.
+ You can capture those logs in a file by using the global option
+ ``--log-file`` and providing a destination file for the logs.
+
+4. When using multi argument options like ``--tables`` or ``--courses``,
+ you should try not to place them right before the expected positional
+ arguments. This will help the CLI parser not confuse your positional
+ arguments with table names (in the case of ``--tables``) or course
+ IDs (when ``--courses`` is used).
+
+5. Splitting tracking logs is a resource intensive process. The routine
+ that splits the logs generates a file for each course ID encountered.
+ If your logs contain more course IDs than the number of file
+ descriptors the running process is allowed to open, then ``simeon``
+ will put away records it can’t save to disk for a second pass.
+ Putting away the records involves using more memory than normally
+ required. The second pass will only require one file descriptor at a
+ time, so it should be safe in terms of file descriptor limits. To
+ help ``simeon`` not have to do a second pass, you may increase the
+ file descriptor limits of processes from your shell by running
+ something like ``ulimit -n 2000`` before calling ``simeon split`` on
+ Unix machines. For Windows users, you may have to dig into the
+ Windows Registries for a corresponding setting. This should tell your
+ OS kernel to allow OS processes to open up to 2000 file handles.
+
+6. Care must be taken when using ``simeon split`` and ``simeon push`` to
+ make sure that the number of positional arguments passed does not
+ lead to the invoked command exceeding the maximum command-line length
+ allowed for arguments in a command. To avoid errors along those
+ lines, please consider passing the positional arguments as UNIX glob
+ patterns. For instance,
+ ``simeon split --file-type log 'data/TRACKING-LOGS/*/*.log.gz'``
+ tells ``simeon`` to expand the given glob pattern, instead of relying
+ on the shell to do it.
+
+7. The ``report`` subcommand relies on the presence of SQL query files
+ to parse and send to BigQuery to execute. Any errors arising from
+ executing the parsed queries will be shown to the end user through
+ the given log stream. While the ``simeon`` tool ships with query
+ files for most secondary/reporting tables that are based on the
+ ``edx2bigquery`` tool, an end user should be able to point ``simeon``
+ to a different location with SQL query files by using the
+ ``--query-dir`` option that comes with ``simeon report``.
+ Additionally, these query files can contain
+ `jinja2 templated <https://jinja.palletsprojects.com/en/latest/>`__
+ SQL code. Any mentioned variables within these templated queries can
+ be passed to ``simeon report`` by using the ``--extra-args`` option
+ and passing key-value pair items in the format
+ ``var1=value1,var2=value2,var3=value3,...,varn=valuen``. Further,
+ these key-value pair items can also be typed by using the format
+ ``var1:i=value1,var2:s=value2,var3:f=value3,...,varn:s=valuen``. In
+ this format, the type is appended to the key, separated by a colon. The
+ only supported scalar types, so far, are ``s`` for ``str``, ``i`` for
+ ``int``, and ``f`` for ``float``. If any conversion errors occur
+ during value parsing, then those are shown to the end user, and the
+ query won’t get executed. Finally, if you wish to pass an ``array``
+ or ``list`` to the template, you will need to repeat a key multiple
+ times. For instance, if you want to pass a list named ``mylist``
+ containing the integers 1, 2, and 3, you could write something like
+ ``--extra-args mylist:i=1,mylist:i=2,mylist:i=3``. This means that
+ you’ll have a python ``list`` named ``mylist`` within your template,
+ and it should contain ``[1, 2, 3]``.
+
+
+%package help
+Summary: Development documents and examples for simeon
+Provides: python3-simeon-doc
+%description help
+simeon
+~~~~~~
+
+``simeon`` is a CLI tool to help with the processing of edX research
+data. It can ``list``, ``download``, and ``split`` edX data packages. It
+can also ``push`` the output of the ``split`` subcommand to both GCS and
+BigQuery. It is heavily inspired by the
+`edx2bigquery <https://github.com/mitodl/edx2bigquery>`__ package. If
+you’ve used that tool, you should be able to navigate the quirks that
+may come with this one.
+
+Installing with pip
+~~~~~~~~~~~~~~~~~~~
+
+.. code:: sh
+
+ python3 -m pip install simeon
+ # Or with geoip
+ python3 -m pip install simeon[geoip]
+ # Then invoke the CLI tool with
+ simeon --help
+
+Installing with git clone
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code:: sh
+
+ git clone git@github.com:MIT-IR/simeon.git
+ cd simeon && python -m pip install .
+ # Or with geoip
+ cd simeon && python -m pip install .[geoip]
+ # Then invoke the CLI tool with
+ simeon --help
+
+Using Docker
+~~~~~~~~~~~~
+
+.. code:: sh
+
+ docker run -it mitir/simeon:latest
+ simeon --help
+
+Developing
+~~~~~~~~~~
+
+.. code:: sh
+
+ git clone git@github.com:MIT-IR/simeon.git
+ cd simeon
+ # Set up a virtual environment if you don't already have one
+ python3 -m venv venv
+ . venv/bin/activate
+ # pip install the package in an editable way
+ python3 -m pip install -e .[test,geoip]
+ # Invoke the executable
+ simeon --help
+ # Run the tests
+ tox
+ # Write code and tests and submit PRs
+
+Setups and configurations
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+``simeon`` is a glorified downloader and uploader set of scripts. Much
+of the downloading and uploading that it does makes the assumptions that
+you have your AWS credentials configured properly and that you’ve got a
+service account file for GCP services available on your machine. If the
+latter is missing, you may have to authenticate to GCP services through
+the SDK. However, both we and Google recommend you not do that.
+
+Every downloaded file is decrypted either during the download process or
+while it gets split by the ``simeon split`` command. So, this tool
+assumes that you’ve installed and configured ``gpg`` to be able to
+decrypt files from edX.
+
+The following steps may be useful to someone just getting started with
+the edX data package:
+
+1. Credentials from edX
+
+ - Reach out to edX to get your data czar credentials
+ - Configure both AWS and gpg, so your credentials can access the S3
+ buckets and your ``gpg`` key can decrypt the files there
+
+2. Set up a GCP project
+
+ - Create a GCP project
+ - Set up a BigQuery workspace
+ - Create a GCS bucket
+ - Create a service account and download the associated file
+ - Give the service account Admin Role access to both the BigQuery
+ project and the GCS bucket
+
+If the above steps are carried out successfully, then you should be able
+to use ``simeon`` without any issues.
+
+However, if you’ve taken care of the above steps but are still unable to
+get ``simeon`` to work, please open an issue.
+
+``simeon`` can also parse INI-formatted configuration files. By
+default, it looks for them in the user’s home directory, or in the
+current working directory of the running process. The base names it
+looks for are ``simeon.cfg``, ``.simeon.cfg``, ``simeon.ini``, and
+``.simeon.ini``. You can also point ``simeon`` at a specific config
+file with the global option ``--config-file`` or ``-C``, followed by
+the path to the file with the corresponding configurations.
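+
+To see which of those default files exist on your machine, a short
+loop like the one below may help. It is only a sketch that mirrors the
+lookup described above. It makes no claim about which file ``simeon``
+prefers when several are present.
+
+.. code:: sh
+
+ # Check the home directory and the current directory for each base name
+ found=0
+ for dir in "${HOME}" "${PWD}"; do
+     for name in simeon.cfg .simeon.cfg simeon.ini .simeon.ini; do
+         if [ -f "${dir}/${name}" ]; then
+             echo "candidate: ${dir}/${name}"
+             found=$((found + 1))
+         fi
+     done
+ done
+ echo "${found} candidate config file(s) found"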
+
+The following is a sample file content:
+
+.. code:: ini
+
+ # Default section for things like the organization whose data package is processed
+ # You can also set a default site as one of the following: edx, edge, patches
+ [DEFAULT]
+ site = edx
+ org = yourorganizationx
+ clistings_file = /path/to/file/with/course_ids
+
+ # Section related to Google Cloud (project, bucket, service account)
+ [GCP]
+ project = your-gcp-project-id
+ bucket = your-gcs-bucket
+ service_account_file = /path/to/a/service_account_file.json
+ wait_for_loads = True
+ geo_table = your-gcp-project.geocode_latest.geoip
+ youtube_table = your-gcp-project.videos.youtube
+ youtube_token = your-YouTube-API-token
+
+ # Section related to the AWS credentials needed to download data from S3
+ [AWS]
+ aws_cred_file = ~/.aws/credentials
+ profile_name = default
+
+The options in the config file(s) should match the optional arguments of
+the CLI tool. For instance, the ``--service-account-file``,
+``--project`` and ``--bucket`` options can be provided under the ``GCP``
+section of the config file as ``service_account_file``, ``project`` and
+``bucket``, respectively. Similarly, the ``--site`` and ``--org``
+options can be provided under the ``DEFAULT`` section as ``site`` and
+``org``, respectively.
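+
+As a rough illustration of that mapping, the sketch below lists the
+latest SQL bundle without ``-s`` and ``-o``, relying on ``site`` and
+``org`` from the ``DEFAULT`` section instead. The path in ``CFG`` is
+hypothetical, and placing the global ``--config-file`` option before
+the subcommand is an assumption about the CLI parser.
+
+.. code:: sh
+
+ # CFG is a hypothetical path to a file like the sample above
+ CFG="${HOME}/simeon.cfg"
+ if command -v simeon >/dev/null 2>&1 && [ -f "${CFG}" ]; then
+     # site and org are read from [DEFAULT], so -s and -o are omitted
+     simeon --config-file "${CFG}" list -f sql -L
+ else
+     echo "simeon or ${CFG} is not available; skipping"
+ fi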
+
+List files
+~~~~~~~~~~
+
+``simeon`` can list files on S3 for your organization based on criteria
+like file type (``sql`` or ``log`` or ``email``), time intervals (begin
+and end dates), and site (``edx`` or ``edge`` or ``patches``).
+
+- Example: List the latest data packages for file types ``sql``,
+ ``email``, and ``log``
+
+ .. code:: sh
+
+ # List the latest SQL bundle
+ simeon list -s edx -o mitx -f sql -L
+ # List the latest email data dump
+ simeon list -s edx -o mitx -f email -L
+ # List the latest tracking log file
+ simeon list -s edx -o mitx -f log -L
+
+Download and split files
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+``simeon`` can download, decrypt and split up files into folders
+belonging to specific courses.
+
+- Example 1: Download, split and push SQL bundles to both GCS and
+ BigQuery
+
+ .. code:: sh
+
+ # Download the latest SQL bundle
+ simeon download -s edx -o mitx -f sql -L -d data/
+
+ # Download SQL bundles dumped any time since 2021-01-01 and
+ # extract the contents for course ID MITx/12.3x/1T2021.
+ # Place the downloaded files in data/ and the output of the split operation
+ # in data/SQL
+ simeon download -s edx -o mitx -c "MITx/12.3x/1T2021" -f sql \
+ -b 2021-01-01 -d data -S -D data/SQL/
+
+ # Push to GCS the split up SQL files inside data/SQL/MITx__12_3x__1T2021
+ simeon push gcs -f sql -p ${GCP_PROJECT_ID} -b ${GCS_BUCKET} \
+ -S ${SAFILE} data/SQL/MITx__12_3x__1T2021
+
+ # Push the files to BigQuery and wait for the jobs to finish
+ # Using -s or --use-storage tells BigQuery to extract the files
+ # to be loaded from Google Cloud Storage.
+ # So, use the option when you've already called simeon push gcs
+ simeon push bq -w -s -f sql -p ${GCP_PROJECT_ID} -b ${GCS_BUCKET} \
+ -S ${SAFILE} data/SQL/MITx__12_3x__1T2021
+
+- Example 2: Download, split and push tracking logs to both GCS and
+ BigQuery
+
+ .. code:: sh
+
+ # Download the latest tracking log file
+ simeon download -s edx -o mitx -f log -L -d data/
+
+ # Download tracking logs dumped any time since 2021-01-01
+ # and extract the contents for course ID MITx/12.3x/1T2021
+ # Place the downloaded files in data/ and the output of the split operation
+ # in data/TRACKING_LOGS
+ simeon download -s edx -o mitx -c "MITx/12.3x/1T2021" -f log \
+ -b 2021-01-01 -d data -S -D data/TRACKING_LOGS/
+
+ # Push to GCS the split up tracking log files inside
+ # data/TRACKING_LOGS/MITx__12_3x__1T2021
+ simeon push gcs -f log -p ${GCP_PROJECT_ID} -b ${GCS_BUCKET} \
+ -S ${SAFILE} data/TRACKING_LOGS/MITx__12_3x__1T2021
+
+ # Push the files to BigQuery and wait for the jobs to finish
+ # Using -s or --use-storage tells BigQuery to extract the files
+ # to be loaded from Google Cloud Storage.
+ # So, use the option when you've already called simeon push gcs
+ simeon push bq -w -s -f log -p ${GCP_PROJECT_ID} -b ${GCS_BUCKET} \
+ -S ${SAFILE} data/TRACKING_LOGS/MITx__12_3x__1T2021
+
+- If you already have downloaded SQL bundles or tracking log files, you
+ can use ``simeon split`` to split them up.
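+
+A standalone split could look like the sketch below. The
+``data/TRACKING-LOGS`` layout is a hypothetical location for files
+downloaded earlier. The ``-c`` and ``--file-type`` options are the ones
+described in the notes at the end of this document.
+
+.. code:: sh
+
+ # Limit the split to one course to keep the output small
+ COURSE="MITx/12.3x/1T2021"
+ if command -v simeon >/dev/null 2>&1; then
+     # The glob is quoted so that simeon, not the shell, expands it.
+     # --file-type sits between -c and the positional argument so the
+     # parser does not read the glob as another course ID.
+     simeon split -c "${COURSE}" --file-type log \
+         'data/TRACKING-LOGS/*/*.log.gz'
+ else
+     echo "simeon is not installed; skipping"
+ fi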
+
+Make secondary/aggregated tables
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+``simeon`` can generate secondary tables based on already loaded data.
+Call ``simeon report --help`` for the expected positional and optional
+arguments.
+
+- Example: Make ``person_course`` for course ID ``MITx/12.3x/1T2021``
+
+ .. code:: sh
+
+ # Make a person course table for course ID MITx/12.3x/1T2021
+ # Provide the -g option to give a geolocation BigQuery table
+ # to fill the ip-to-location details in the generated person course table
+ COURSE=MITx/12.3x/1T2021
+ simeon report -w -g "${GCP_PROJECT_ID}.geocode.geoip" -t "person_course" \
+ -p ${GCP_PROJECT_ID} -S ${SAFILE} ${COURSE}
+
+Notes:
+~~~~~~
+
+1. Please note that SQL bundles are quite large when split up, so
+ consider using the ``-c`` or ``--courses`` option when invoking
+ ``simeon download -S`` or ``simeon split`` to make sure that you
+ limit the splitting to a set of course IDs. You may also use the
+ ``--clistings-file`` option, which expects a txt file of course IDs;
+ one ID per line. If the aforementioned options are not used,
+ ``simeon`` may end up failing to complete the split operation due to
+ exhausted system resources (storage to be specific).
+
+2. ``simeon download`` with file types ``log`` and ``email`` will both
+ download and decrypt the files matching the given criteria. If the
+ latter operations are successful, then the encrypted files are
+ deleted by default. This is to make sure that you don’t exhaust
+ storage resources. If you wish to keep those files, you can always
+ use the ``--keep-encrypted`` option that comes with
+ ``simeon download`` and ``simeon split``. SQL bundles are only
+ downloaded (not decrypted). Their decryption is done during a
+ ``split`` operation.
+
+3. Unless there is an unhandled exception (which should be reported as a
+ bug), ``simeon`` should, by default, print to the standard output
+ both information and errors encountered while processing your files.
+ You can capture those logs in a file by using the global option
+ ``--log-file`` and providing a destination file for the logs.
+
+4. When using multi argument options like ``--tables`` or ``--courses``,
+ you should try not to place them right before the expected positional
+ arguments. This will help the CLI parser not confuse your positional
+ arguments with table names (in the case of ``--tables``) or course
+ IDs (when ``--courses`` is used).
+
+5. Splitting tracking logs is a resource intensive process. The routine
+ that splits the logs generates a file for each course ID encountered.
+ If you happen to have more course IDs in your logs than the running
+ process can open operating system file descriptors, then ``simeon``
+ will put away records it can’t save to disk for a second pass.
+ Putting away the records involves using more memory than normally
+ required. The second pass will only require one file descriptor at a
+ time, so it should be safe in terms of file descriptor limits. To
+ help ``simeon`` not have to do a second pass, you may increase the
+ file descriptor limits of processes from your shell by running
+ something like ``ulimit -n 2000`` before calling ``simeon split`` on
+ Unix machines. This should tell your OS kernel to allow processes
+ started from that shell to open up to 2000 file handles. For Windows
+ users, you may have to dig into the Windows Registry for a
+ corresponding setting.
+
+6. Care must be taken when using ``simeon split`` and ``simeon push`` to
+ make sure that the number of positional arguments passed does not
+ cause the invoked command to exceed the maximum command-line length
+ that the operating system allows. To avoid errors along those
+ lines, please consider passing the positional arguments as UNIX glob
+ patterns. For instance,
+ ``simeon split --file-type log 'data/TRACKING-LOGS/*/*.log.gz'``
+ tells ``simeon`` to expand the given glob pattern, instead of relying
+ on the shell to do it.
+
+7. The ``report`` subcommand relies on the presence of SQL query files
+ to parse and send to BigQuery to execute. Any errors arising from
+ executing the parsed queries will be shown to the end user through
+ the given log stream. While the ``simeon`` tool ships with query
+ files for most secondary/reporting tables that are based on the
+ ``edx2bigquery`` tool, an end user should be able to point ``simeon``
+ to a different location with SQL query files by using the
+ ``--query-dir`` option that comes with ``simeon report``.
+ Additionally, these query files can contain
+ `jinja2 templated <https://jinja.palletsprojects.com/en/latest/>`__
+ SQL code. Any mentioned variables within these templated queries can
+ be passed to ``simeon report`` by using the ``--extra-args`` option
+ and passing key-value pair items in the format
+ ``var1=value1,var2=value2,var3=value3,...,varn=valuen``. Further,
+ these key-value pair items can also be typed by using the format
+ ``var1:i=value1,var2:s=value2,var3:f=value3,...,varn:s=valuen``. In
+ this format, the type is appended to the key, separated by a colon. The
+ only supported scalar types, so far, are ``s`` for ``str``, ``i`` for
+ ``int``, and ``f`` for ``float``. If any conversion errors occur
+ during value parsing, then those are shown to the end user, and the
+ query won’t get executed. Finally, if you wish to pass an ``array``
+ or ``list`` to the template, you will need to repeat a key multiple
+ times. For instance, if you want to pass a list named ``mylist``
+ containing the integers 1, 2, and 3, you could write something like
+ ``--extra-args mylist:i=1,mylist:i=2,mylist:i=3``. This means that
+ you’ll have a python ``list`` named ``mylist`` within your template,
+ and it should contain ``[1, 2, 3]``.
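+
+The typed ``--extra-args`` format from note 7 can be assembled in the
+shell before calling ``simeon report``. The sketch below builds the
+``mylist`` example one item at a time. The names ``my_queries/`` and
+``my_table`` are hypothetical stand-ins for a custom query directory
+and a table whose query file lives there.
+
+.. code:: sh
+
+ # Build "mylist:i=1,mylist:i=2,mylist:i=3" one typed item at a time
+ extra_args=""
+ for n in 1 2 3; do
+     extra_args="${extra_args:+${extra_args},}mylist:i=${n}"
+ done
+ echo "${extra_args}"
+
+ # Pass the assembled string to a templated query (hypothetical names)
+ COURSE=MITx/12.3x/1T2021
+ if command -v simeon >/dev/null 2>&1; then
+     simeon report -w -t "my_table" --query-dir my_queries/ \
+         --extra-args "${extra_args}" \
+         -p "${GCP_PROJECT_ID}" -S "${SAFILE}" "${COURSE}"
+ fi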
+
+
+%prep
+%autosetup -n simeon-0.0.24
+
+%build
+%py3_build
+
+%install
+%py3_install
+install -d -m755 %{buildroot}/%{_pkgdocdir}
+if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
+if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
+if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
+if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
+pushd %{buildroot}
+if [ -d usr/lib ]; then
+ find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/lib64 ]; then
+ find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/bin ]; then
+ find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+if [ -d usr/sbin ]; then
+ find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
+fi
+touch doclist.lst
+if [ -d usr/share/man ]; then
+ find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
+fi
+popd
+mv %{buildroot}/filelist.lst .
+mv %{buildroot}/doclist.lst .
+
+%files -n python3-simeon -f filelist.lst
+%dir %{python3_sitelib}/*
+
+%files help -f doclist.lst
+%{_docdir}/*
+
+%changelog
+* Wed May 10 2023 Python_Bot <Python_Bot@openeuler.org> - 0.0.24-1
+- Package Spec generated
diff --git a/sources b/sources
new file mode 100644
index 0000000..75c0a7c
--- /dev/null
+++ b/sources
@@ -0,0 +1 @@
+7ef8c4f2ff33b8653c162a638edfd091 simeon-0.0.24.tar.gz