DSS 11: Dataiku strengthens its support for “technical experts”

Although it has sought to democratize analytics, AI, and machine learning since its inception, Dataiku keeps data scientists, data engineers, and developers at the center of its DSS platform.

Previously, these “experts” were invited to migrate their notebooks to the platform or to use the IDE and the version of Jupyter embedded in DSS. The vendor then offered extensions to connect its software to external IDEs, including Visual Studio Code and PyCharm.

However, these tools are installed per workstation, and administrators may lack control over their deployment and use. Configuration issues can also slow project start-up, according to the vendor.

Code Studios: Dataiku’s answer to Codespaces

To “enhance the engagement” of technical teams, DSS 11 integrates “Code Studios”, personal spaces for running IDEs and web applications in the cloud. Each Code Studio launches as a Kubernetes pod hosted on an Elastic AI instance running on AKS, GKE, or EKS. “Each Code Studio is a separate container, and has its own file system. It cannot access the file system of the DSS host,” says the vendor’s documentation.

A Code Studio allows you to edit “recipes” – transformations – written in Python and SQL in Visual Studio Code, in R in RStudio Server, and to debug Python code in JupyterLab. With Code Studios, Dataiku also integrates Streamlit, the Python web application framework acquired by Snowflake. Like GitHub’s Codespaces, the tool is open enough to accommodate the packages technical teams need for their projects. And like Codespaces, Code Studios requires special attention when starting and stopping cloud instances.
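To make this concrete, here is a minimal sketch of a Streamlit app served from a Code Studio; the dataset name “sales” and the presence of a Streamlit block in the Code Studio template are assumptions for illustration, not details from the release.

import dataiku
import streamlit as st

# Read a DSS dataset ("sales" is a hypothetical name) as a pandas DataFrame
df = dataiku.Dataset("sales").get_dataframe()

st.title("Sales explorer")

# Let the user pick a numeric column and plot it
numeric_cols = [c for c in df.columns if df[c].dtype.kind in "if"]
metric = st.selectbox("Metric", numeric_cols)
st.line_chart(df[metric])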

Code Studios directly reflects Dataiku’s strategy: the company wants to attract customers to its cloud offerings. To ease the deployment of Code Studios, the vendor recommends using the templates associated with Cloud Stacks licenses. Note that after AWS, the new version of DSS introduces a Cloud Stacks configuration for Google Cloud.

Simplifying data sharing and collaboration

Code experts aren’t the only ones benefiting from new development environments.

For “data professionals,” a tool called Visual Time Series Forecasting should make it easier to design, train, evaluate, and deploy time series forecasting models “without writing code.” Teams can combine multiple time series and choose from several machine learning and deep learning algorithms.

As for teams building computer vision models, they get a new image-labeling workspace. The interface lets managers invite annotators and supports keyboard shortcuts to speed up labeling. A manager can monitor the annotation process and resolve any conflicts between annotators. The tool is primarily designed to prepare data for classification and object detection algorithms.

And to connect data science teams with other platform users, Dataiku has enhanced the sharing features built into DSS. Administrators can now make their projects “discoverable”: all users can then see what a project is about and, if necessary, request access from its owner. Owners receive notifications and approve or reject the request. Project participants can likewise request to share datasets with other projects; these requests also go through the admin. In addition, a quick-share feature can be enabled to let users exchange datasets without manager intervention.
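For scripted administration, the project permissions behind these sharing workflows can also be read and updated through Dataiku’s public Python client. A minimal sketch, assuming a project key “SALES_FORECAST”, a group “analysts”, and the permission keys shown (the exact dictionary schema may differ by DSS version):

import dataikuapi

client = dataikuapi.DSSClient("https://dss.example.com", "YOUR_API_KEY")
project = client.get_project("SALES_FORECAST")

# Grant read access to a group (the permission keys are illustrative)
perms = project.get_permissions()
perms["permissions"].append({"group": "analysts", "readProjectContent": True})
project.set_permissions(perms)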

Data intelligibility is also a challenge. To address it, Dataiku is improving its data visualization capabilities and adding a pivot table, a feature long requested by Excel regulars.

MLOps: Dataiku sets up an in-house feature store

Above all, Dataiku continues the MLOps efforts begun with version 10.

Thus, the vendor introduces a Feature Store: a space dedicated to sharing the datasets and features needed to build machine learning models. Users can flag datasets containing features of interest as “Feature Groups”, which then appear in the feature store.
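The release does not spell out how the “Feature Group” flag is exposed programmatically; as a purely hypothetical sketch, one could imagine setting it through the generic dataset settings API (the “featureGroup” key below is an assumption, not confirmed by Dataiku):

import dataikuapi

client = dataikuapi.DSSClient("https://dss.example.com", "YOUR_API_KEY")
dataset = client.get_project("CUSTOMER_360").get_dataset("customer_features")

settings = dataset.get_settings()
settings.get_raw()["featureGroup"] = True  # hypothetical key, for illustration only
settings.save()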

Like the equivalents in GCP’s Vertex AI and AWS SageMaker, this feature builds on existing platform capabilities. Features are stored in the various object storage and database instances supported by DSS, and data is ingested using stream recipes. Features for batch (offline) processing are served via join recipes deployed on automation nodes, while features for real-time (online) processing go through the API that allows lookups in datasets. Monitoring and maintenance are then defined via triggers.
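For the online path, a consumer would query the dataset-lookup endpoint of an API node. The sketch below assumes a deployed API service “features” with a lookup endpoint “customer_features”; the URL pattern and payload schema are illustrative, not taken from Dataiku’s documentation.

import requests

API_NODE = "https://api-node.example.com/public/api/v1"  # hypothetical host

# Look up online features for one record (request schema is illustrative)
resp = requests.post(
    f"{API_NODE}/features/customer_features/lookup",
    json={"data": {"customer_id": "C-1042"}},
    timeout=5,
)
resp.raise_for_status()
print(resp.json())  # the online features served for this customer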

In the same vein, the platform supports the MLflow Tracking API, in order to track the parameters, performance metrics, and other metadata needed to evaluate experimental models.
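In practice, experiment tracking against such an endpoint uses the standard MLflow client. A minimal sketch, where the tracking URI, experiment name, and logged values are placeholders for your own deployment:

import mlflow

mlflow.set_tracking_uri("http://dss-host:11200/tracking")  # placeholder URI
mlflow.set_experiment("churn-model-experiments")           # placeholder name

with mlflow.start_run():
    mlflow.log_param("max_depth", 8)            # a hyperparameter
    mlflow.log_metric("auc", 0.91)              # an evaluation metric
    mlflow.set_tag("dataset", "customers_v3")   # arbitrary metadata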

Finally, Dataiku has refined its Model Document Generator. Introduced in DSS 8, this feature automatically generates documentation for trained models. The vendor has extended the capability to the flow, that is, the visual representation of the steps (the famous recipes) that make up a data transformation pipeline. The generated DOCX document details the datasets and the operations carried out during the development of a statistical or artificial intelligence model. On the governance side, Dataiku provides a new editor for managing permissions and sign-offs, as well as time-based traceability of governed objects.

In total, DSS 11 fixes about fifty bugs, and deprecates support for MapR as well as versions 1.x and 2.x of Elasticsearch.
