Data scientists are accustomed to writing code rather than using self-service tools for tasks such as data cleaning, data exploration, data modeling, machine learning, and data visualization. Well-known products offering these capabilities include Databricks Data Science Workspace, Zeppelin, and Jupyter. Following these products, Hengshi Sense 1.1 offered a "Data Science" feature that supported writing and executing Python, R, Scala, Markdown, Native SQL, and Spark SQL code. It used Livy as the Spark job management service, so the code actually ran on Spark. Because the implementation overhead was significant and customer demand was low, the feature was not ported to versions after 2.0. For "Data Preparation", however, we were torn between catering to IT personnel or business users: the existing self-service "Data Preparation" is geared toward business users, while the actual customer need is running SQL as in the "Data Science" feature. We therefore decided to bring back "Data Science", starting with SQL support.
A new "Data Science" entry has been added at the same level as "Data Preparation" and the other entries. Only users with the "Data Management" role can see the "Data Science" entry, and all such users can view all "Notes"; there is no isolation between users.
The following will introduce in detail the various concepts and operations of data science.
There are multiple "Notes" in data science, each independent of each other. Each "Note" contains multiple "Paragraphs." Each "Note" supports:
Select a note, open the three-dot menu, and click 'delete' to remove it.
Click a note in the note list to open it for editing.
Each 'Paragraph' has a type; currently only SQL is supported. Each 'Paragraph' supports:
Clicking '+' adds a new paragraph where users can enter their script. Users can only write Native SQL that conforms to the syntax of the selected connection's database.
Clicking the upload icon uploads a text file whose content is inserted into the paragraph.
Each paragraph can set the language, runtime environment, and default schema.
Language: the scripting language used in the paragraph; currently only SQL is supported.
Runtime Environment: the data connection used to run the paragraph.
Default Schema: the schema used when a statement does not explicitly designate one.
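The default-schema behavior above can be sketched as a tiny resolution helper. This is only an illustration of the concept, not product code; the function name `qualify` and the schema and table names are invented for the example:

```python
def qualify(table_ref: str, default_schema: str) -> str:
    """Return a fully qualified table name: keep an explicit schema
    if the reference already has one, otherwise fall back to the
    paragraph's default schema."""
    return table_ref if "." in table_ref else f"{default_schema}.{table_ref}"

print(qualify("orders", "sales"))        # sales.orders  (default schema applied)
print(qualify("hr.employees", "sales"))  # hr.employees  (explicit schema kept)
```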
Clicking the 'open' icon on the right side of the paragraph displays it in full-screen mode, making editing and debugging easier.
Click the 'delete' icon on the right side of each paragraph to remove an individual paragraph.
Click 'settings' to open the 'Runtime Environment Settings' pop-up. As previously mentioned, a runtime environment is the data connection used to run paragraphs.
Clicking 'add' will bring up the data connection selection page. After selecting a data connection, it will be displayed in the 'authorized connection list'.
The connection 'status' is valid if the user has RW permission on the connection; otherwise it is invalid.
In the 'authorized connection list', open the menu for a connection and choose 'authorize' to grant all users of data science access to that connection; all users will be able to use it to run notes, whether or not they have RW permissions.
After authorization, all users can execute notes with the connection, which poses a risk of data leakage. A user who does not want to share a connection can revoke its authorization after each execution; other users will then no longer be able to use it to execute notes.
Clicking 'revoke authorization' causes the connection status to change to 'invalid', and no users can use that connection anymore.
In the 'authorized connection list', open the menu for a connection and choose 'delete' to remove that connection from the list and make it unavailable for executing notes.
Selecting 'Commit transactions as a whole' means that, when running notes, all paragraphs will be treated as one transaction, and if one paragraph fails, all paragraphs will be rolled back.
Selecting 'Commit transactions by paragraph' means that, when running notes, each paragraph will be treated as one transaction, and if a paragraph fails, only that paragraph will roll back, while other transactions will execute and commit as usual.
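The two commit modes can be sketched with Python's sqlite3 standing in for the target database. This is a minimal illustration, not the product's implementation; the three "paragraphs" are invented, and the second one fails on purpose:

```python
import sqlite3

# Stand-in note: three SQL paragraphs, the second of which fails.
paragraphs = [
    "INSERT INTO t VALUES (1)",
    "INSERT INTO no_such_table VALUES (2)",  # deliberately broken
    "INSERT INTO t VALUES (3)",
]

def run_whole(conn):
    """'Commit transactions as a whole': one failure rolls back everything."""
    try:
        with conn:  # a single transaction covering all paragraphs
            for sql in paragraphs:
                conn.execute(sql)
    except sqlite3.OperationalError:
        pass  # the whole transaction was rolled back

def run_by_paragraph(conn):
    """'Commit transactions by paragraph': each paragraph commits alone."""
    for sql in paragraphs:
        try:
            with conn:  # one transaction per paragraph
                conn.execute(sql)
        except sqlite3.OperationalError:
            pass  # only the failing paragraph is rolled back

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.commit()

run_whole(conn)
print(conn.execute("SELECT COUNT(*) FROM t").fetchone()[0])  # 0 — all rolled back
run_by_paragraph(conn)
print(conn.execute("SELECT COUNT(*) FROM t").fetchone()[0])  # 2 — only paragraph 2 lost
```

The `with conn:` context manager commits on success and rolls back on an exception, which maps directly onto the two modes described above.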
Click 'Test Execution' to run the current paragraph without committing the results to the database.
Click 'Submit Execution' to run the current paragraph and commit the results to the database.
If there is an SQL query statement in a paragraph, the 'Result Preview' will display the query results. Otherwise, there will be no content in the result preview.
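The difference between the two execution buttons can be sketched in Python with sqlite3 as a stand-in database: both run the paragraph and can produce a result preview, but a test run rolls the transaction back instead of committing. The helper name `execute_paragraph` and the table are invented for the example:

```python
import sqlite3

def execute_paragraph(conn, sql, commit=False):
    """Run one paragraph; return query results for the preview, if any."""
    cur = conn.execute(sql)
    preview = cur.fetchall() if cur.description else None  # only SELECTs have rows
    if commit:
        conn.commit()    # 'Submit Execution': changes persist
    else:
        conn.rollback()  # 'Test Execution': nothing persists
    return preview

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.commit()

execute_paragraph(conn, "INSERT INTO t VALUES (1)")               # test run
print(conn.execute("SELECT COUNT(*) FROM t").fetchone()[0])       # 0 — rolled back
execute_paragraph(conn, "INSERT INTO t VALUES (1)", commit=True)  # submit
print(conn.execute("SELECT COUNT(*) FROM t").fetchone()[0])       # 1 — committed
print(execute_paragraph(conn, "SELECT x FROM t"))                 # [(1,)] — preview
```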
The 'Execution Log' will display logs from the execution process for users to debug or monitor.
Clicking 'Execute All' will execute all paragraphs in the current note, and the execution results will be submitted to the database.
Clicking the 'Execution Plan' in the upper right corner will pop up the 'Execution Plan' window.
Select 'Enable' to schedule the note for automatic execution according to the set execution plan.
If 'Enable' is unselected, the note will not run automatically and can only be started manually.
A data administrator can open 'Settings' in the main menu to find 'Execution Plan and Task Queue Management'.
Opening it shows two tabs: 'Execution Plan' and 'Execution Queue'.
Click on the menu in the operation column to perform the corresponding actions.
Execute a task immediately.
Modify the execution plan.
Click to view all tasks that have been executed for that plan.
Delete the execution plan. The note's execution plan will be deleted.
This UI is similar to 'View Tasks' but will show all execution tasks here.
Click 'Re-execute' to resubmit a task for execution. A new execution record will appear in the execution queue.
You can view the execution log of the task.