2025-04-24 –, Helium3
SAP's data often remains locked away, hindering the creation of a complete data picture. This talk presents a hands-on proof of concept leveraging SAP Datasphere, Python and PySpark to bridge an Azure-based, data mesh-inspired open data lake with a centralized SAP BI environment.
This presentation will delve into the architecture of SAP Datasphere and its integration interfaces with Python. It will explore network integration, authentication, authorization and resource management options, as well as data integration patterns. The presentation will summarize the evaluated features and limitations discovered during the PoC.
In many enterprises relying on SAP ERP systems, a wealth of valuable master data remains trapped within a closed ecosystem. This creates significant obstacles when striving for a comprehensive, 360° view, especially when integrating with modern, open data lakes built on platforms like Azure and designed around data mesh principles. This talk presents a practical PoC that tackles this challenge head-on, utilizing SAP Datasphere as the key integration point.
Outline:
- The challenge: navigating SAP's data silos and the pursuit of a unified view
* This section outlines the enterprise data landscape of RATIONAL, where valuable master data resides within SAP’s traditionally closed ecosystem, hindering data democratization and the creation of a comprehensive, 360° operational view. This situation is common, at least among German manufacturing companies.
* The inherent conflict between the open, distributed nature of data lakes (especially those built on data mesh principles) and the centralized, closed nature of traditional SAP BI environments is discussed.
- Solution overview: leveraging SAP Datasphere as the integration layer
* An introduction to SAP Datasphere and its capabilities is provided, with a focus on its ability to connect with non-SAP systems.
* This part explains why SAP Datasphere was chosen as the central integration layer for the proof of concept and its role in enabling bi-directional data flow between SAP and the open data lake.
- Architecture of SAP Datasphere
* Introduction to the architecture of SAP Datasphere and the role of the underlying SAP HANA database
* Explanation of the Open SQL schema as a key integration option
- Security first: exploring network integration, authentication and authorization options
* This section details the evaluation of network connectivity options between SAP Datasphere and Azure services such as Azure Databricks, PostgreSQL and ADLS
* The methods used to authenticate Python and PySpark to SAP Datasphere are explained
* The implementation and evaluation of data authorization mechanisms within SAP Datasphere are described
- Python and PySpark integration
* Available interfaces for Python integration (ODBC/JDBC, OData), their features and limitations
* Explanation of practical data integration patterns implemented within the PoC for extracting data from SAP and loading it into the data lake, covering full and delta load scenarios
- Reflecting on the PoC: summary and key learnings
* This section summarizes the core findings and lessons learned from the PoC, particularly regarding security and software quality best practices
* A pointer to the SAP open data alliance launched in 2023
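
To make the full and delta load patterns from the outline more concrete, here is a minimal sketch of how a PySpark job might read from SAP Datasphere over JDBC. It is not the implementation from the PoC. It assumes that the SAP HANA JDBC driver is on the Spark classpath, that a database user with access to the relevant schema exists, and that the source object exposes a change-timestamp column. The helper names, the `CHANGED_AT` column and every host, user and table name are hypothetical.

```python
from datetime import datetime
from typing import Dict, Optional

# Driver class of the SAP HANA JDBC client; assumed to be on the Spark classpath.
HANA_JDBC_DRIVER = "com.sap.db.jdbc.Driver"


def jdbc_options(host: str, user: str, password: str, port: int = 443) -> Dict[str, str]:
    """Build options for Spark's JDBC reader (hypothetical helper).

    The connection is TLS-encrypted and validates the server certificate.
    """
    return {
        "url": f"jdbc:sap://{host}:{port}/?encrypt=true&validateCertificate=true",
        "driver": HANA_JDBC_DRIVER,
        "user": user,
        "password": password,
    }


def extraction_query(
    table: str,
    watermark_column: Optional[str] = None,
    last_watermark: Optional[datetime] = None,
) -> str:
    """Return a full-load query, or a delta query if a watermark is given.

    `table` and `watermark_column` are interpolated as-is, so they must come
    from trusted configuration, never from user input.
    """
    query = f"SELECT * FROM {table}"
    if watermark_column and last_watermark:
        ts = last_watermark.strftime("%Y-%m-%d %H:%M:%S")
        query += (
            f' WHERE "{watermark_column}" > '
            f"TO_TIMESTAMP('{ts}', 'YYYY-MM-DD HH24:MI:SS')"
        )
    return query


def read_from_datasphere(spark, options: Dict[str, str], query: str):
    """Run the query through Spark's built-in JDBC source and return a DataFrame."""
    return spark.read.format("jdbc").options(**options).option("query", query).load()
```

A full load would call `extraction_query('"MY_SPACE"."V_MATERIAL"')`, while a delta load would additionally pass `"CHANGED_AT"` and the timestamp stored after the previous run. The resulting DataFrame could then be written to the data lake with the usual Spark writers. Keeping the query construction separate from the Spark call makes the load logic testable without a cluster.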
Main takeaways:
* An understanding of SAP Datasphere's architecture and its potential for integrating non-SAP, open-source technologies like Python and PySpark
* Knowledge of current features and limitations of SAP Datasphere in the area of data integration with the open source world
Intermediate
Expected audience expertise: Python: Novice
Rostislaw, a data architect at RATIONAL AG, specializes in distributed databases, the Apache Hadoop ecosystem and Azure cloud. He leverages his expertise to oversee the company's Data & Analytics platform, where his daily work involves reconciling diverse stakeholder perspectives to deliver optimal solutions.