Data Engineer (Fully remote)
Job Location
Johannesburg, South Africa
Job Description
Data Architecture and Design
- Data Modeling:
  o Create normalized and denormalized schemas (3NF, star, snowflake).
  o Design data lakes, warehouses, and marts optimized for analytical or transactional workloads.
  o Incorporate modern paradigms such as data mesh, lakehouse, and delta architecture.
- ETL/ELT Pipelines:
  o Develop end-to-end pipelines for extracting, transforming, and loading data.
  o Optimize pipelines for real-time and batch processing.
- Metadata Management:
  o Implement data lineage, cataloging, and tagging for better discoverability and governance.

Distributed Computing and Big Data Technologies
- Proficiency with big data platforms:
  o Apache Spark (PySpark, sparklyr).
  o Hadoop ecosystem (HDFS, Hive, MapReduce).
  o Apache Iceberg or Delta Lake for versioned data lake storage.
- Manage large-scale, distributed datasets efficiently.
- Use query engines such as Presto, Trino, or Dremio for federated data access.

Data Storage Systems
- Expertise in working with different types of storage systems:
  o Relational databases (RDBMS): SQL Server, PostgreSQL, MySQL, etc.
  o NoSQL databases: MongoDB, Cassandra, DynamoDB.
  o Cloud data warehouses: Snowflake, Google BigQuery, Azure Synapse, Amazon Redshift.
  o Object storage: Amazon S3, Azure Blob Storage, Google Cloud Storage.
- Optimize storage strategies for cost and performance:
  o Partitioning, bucketing, indexing, and compaction.

Programming and Scripting
- Advanced knowledge of programming languages:
  o Python (pandas, PySpark, SQLAlchemy).
  o SQL (window functions, CTEs, query optimization).
  o R (data wrangling, sparklyr for data processing).
  o Java or Scala (for Spark and Hadoop customizations).
- Proficiency in scripting for automation (e.g., Bash, PowerShell).

Real-Time and Streaming Data
- Expertise in real-time data processing:
  o Apache Kafka, Amazon Kinesis, or Azure Event Hubs for event streaming.
  o Apache Flink or Spark Streaming for real-time ETL (an illustrative streaming sketch follows this list).
  o Implement event-driven architectures using message queues.
- Handle time-series data and process live feeds for real-time analytics.

Cloud Platforms and Services
- Experience with cloud environments:
  o AWS: Lambda, Glue, EMR, Redshift, S3, Athena.
  o Azure: Data Factory, Synapse, Data Lake, Databricks.
  o GCP: BigQuery, Dataflow, Dataproc.
- Manage infrastructure as code (IaC) using tools like Terraform or CloudFormation.
- Leverage cloud-native features like auto-scaling, serverless compute, and managed services.

DevOps and Automation
- Implement CI/CD pipelines for data workflows:
  o Tools: Jenkins, GitHub Actions, GitLab CI, Azure DevOps.
- Monitor and automate tasks using orchestration tools (an illustrative Airflow sketch follows this list):
  o Apache Airflow, Prefect, Dagster.
  o Managed services like AWS Step Functions or Azure Data Factory.
- Automate deployment and resource provisioning using containers and orchestrators such as Docker and Kubernetes.

Data Governance, Security, and Compliance
- Data Governance:
  o Implement role-based access control (RBAC) and attribute-based access control (ABAC).
  o Maintain master data and metadata consistency.
- Security:
  o Apply encryption at rest and in transit.
  o Secure data pipelines with IAM roles, OAuth, or API keys.
  o Implement network security (e.g., firewalls, VPCs).
- Compliance:
  o Ensure adherence to regulations such as GDPR, CCPA, HIPAA, or SOC 2.
  o Track and document audit trails for data usage.

Performance Optimization
- Optimize query and pipeline performance:
  o Query tuning (partition pruning, caching, broadcast joins).
  o Reduce I/O costs and bottlenecks with columnar formats like Parquet or ORC.
  o Use distributed computing patterns to parallelize workloads.
- Implement incremental data processing to avoid reprocessing full datasets (an illustrative batch sketch follows this list).

Advanced Data Integration
- Work with API-driven data integration:
  o Consume and build REST/GraphQL APIs.
  o Implement integrations with SaaS platforms (e.g., Salesforce, Twilio, Google Ads).
- Integrate disparate systems using ETL/ELT tools such as:
  o Informatica, Talend, dbt (data build tool), or Azure Data Factory.

Data Analytics and Machine Learning Integration
- Enable data science workflows by preparing data for ML:
  o Feature engineering, data cleaning, and transformations.
- Integrate machine learning pipelines:
  o Use Spark MLlib, TensorFlow, or scikit-learn within ETL pipelines.
- Automate scoring and prediction serving using ML models.

Monitoring and Observability
- Set up monitoring for data pipelines:
  o Tools: Prometheus, Grafana, or the ELK stack.
  o Create alerts for SLA breaches or job failures.
- Track pipeline and job health with detailed logs and metrics.

Business and Communication Skills
- Translate complex technical concepts into business terms.
- Collaborate with stakeholders to define data requirements and SLAs.
- Design data systems that align with business goals and use cases.

Continuous Learning and Adaptability
- Stay current with the latest trends and tools in data engineering:
  o E.g., data mesh architecture, Fabric, and AI-integrated data workflows.
- Actively engage in learning through online courses, certifications, and community contributions:
  o Certifications such as Databricks Certified Data Engineer, AWS Data Analytics Specialty, or Google Professional Data Engineer.
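To illustrate the batch ETL and incremental-processing work described above, the sketch below shows a minimal PySpark job that loads only records newer than a stored watermark and appends them to a partitioned Parquet table. The paths, column names (order_id, event_date, updated_at), and the watermark approach are assumptions made for the example, not details of this role's actual stack.

```python
# Minimal sketch only: incremental batch load with PySpark.
# Paths, column names, and the watermark logic are assumptions for illustration.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("incremental_orders_load").getOrCreate()

source_path = "s3://example-lake/raw/orders"      # hypothetical raw zone
target_path = "s3://example-lake/curated/orders"  # hypothetical curated zone

# Watermark: the latest event_date already present in the curated table.
try:
    watermark = (spark.read.parquet(target_path)
                 .agg(F.max("event_date").alias("wm"))
                 .first()["wm"])
except Exception:
    watermark = None  # first run: nothing loaded yet

incoming = spark.read.parquet(source_path)
if watermark is not None:
    incoming = incoming.filter(F.col("event_date") > F.lit(watermark))

# Deduplicate on the business key, keeping the most recent version of each order.
latest_first = Window.partitionBy("order_id").orderBy(F.col("updated_at").desc())
deduped = (incoming
           .withColumn("rn", F.row_number().over(latest_first))
           .filter(F.col("rn") == 1)
           .drop("rn"))

# Append only the new slice; partitioning by event_date enables partition pruning.
deduped.write.mode("append").partitionBy("event_date").parquet(target_path)
```

Writing the output partitioned by event_date keeps later reads cheap through partition pruning, and appending only the new slice avoids reprocessing the full dataset.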
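On the streaming side, a comparable sketch with Spark Structured Streaming reads a Kafka topic and lands micro-batches as Parquet. The broker address, topic name, and storage paths are invented for illustration, and running it assumes the spark-sql-kafka connector is available.

```python
# Minimal sketch only: Kafka -> Spark Structured Streaming -> Parquet.
# Broker, topic, and paths are hypothetical; requires the spark-sql-kafka connector.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orders_stream").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")  # assumed broker
          .option("subscribe", "orders")                     # assumed topic
          .load())

# Kafka delivers the payload as bytes; cast it to a string for downstream parsing.
parsed = events.select(
    F.col("value").cast("string").alias("payload"),
    F.col("timestamp").alias("event_time"),
)

query = (parsed.writeStream
         .format("parquet")
         .option("path", "s3://example-lake/streaming/orders")             # assumed sink
         .option("checkpointLocation", "s3://example-lake/checkpoints/orders")
         .outputMode("append")
         .start())

query.awaitTermination()
```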
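For orchestration with a tool such as Apache Airflow, a bare-bones daily extract-transform-load DAG could look like the sketch below. The DAG id, schedule, and task bodies are placeholders, and parameter names differ slightly across Airflow releases (for example, older versions use schedule_interval rather than schedule).

```python
# Minimal sketch only: a daily extract -> transform -> load DAG in Apache Airflow 2.x.
# DAG id, schedule, and task bodies are placeholders for illustration.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pull raw files from object storage")       # placeholder step


def transform():
    print("clean and conform the extracted data")     # placeholder step


def load():
    print("publish curated tables to the warehouse")  # placeholder step


with DAG(
    dag_id="daily_orders_pipeline",   # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                # 'schedule_interval' on older Airflow versions
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task
```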
Location: Johannesburg, ZA
Posted Date: 12/20/2024
Contact Information
Contact: Human Resources