CAST

www.castsoftware.com

1 Job

1,253 Employees

About the Company

Businesses move faster using CAST technology to understand, improve, and transform their software. Through semantic analysis of source code, CAST produces 3D maps and dashboards to navigate inside individual applications and across entire portfolios. This intelligence empowers executives and technology leaders to steer, speed, and report on initiatives such as technical debt, GenAI, modernization, and cloud. As the pioneer of the software intelligence field, CAST is trusted by the world’s leading companies and governments, their consultancies and cloud providers. See it all at castsoftware.com.

Listed Jobs

Company Name: CAST
Job Title: Intern Data Engineer / Data Enablement with AI for AI
Job Description:

**Job Title**

Intern Data Engineer / Data Enablement for AI

**Role Summary**

Assist in building the foundational data layer that powers AI systems by aggregating, structuring, and enriching software project data. Use LLMs, embeddings, and NLP tools to automate data cleaning, entity extraction, and semantic annotation, creating ready‑to‑use datasets for fine‑tuning and Retrieval‑Augmented Generation (RAG).

**Expectations**

- Develop and maintain scalable semantic pipelines.
- Apply advanced AI techniques for data curation and quality assurance.
- Deliver reproducible, lineage‑tracked datasets for downstream AI models.
- Work cross‑functionally with AI research and engineering teams.

**Key Responsibilities**

- Aggregate data from software ecosystems (code, APIs, tickets, docs, architecture specs).
- Clean and enrich data using LLMs, embeddings, and NLP (Hugging Face, spaCy, LangChain).
- Extract entities, tag metadata, perform semantic annotation, and prepare tokenized chunks for model input.
- Build and manage pipelines (Airflow, Pandas, PyArrow) that feed RAG and LLM fine‑tuning workflows.
- Format datasets for Agent‑to‑Agent interaction (vector databases, knowledge graphs, APIs).
- Ensure robust data lineage, reproducibility, and version control.
- Collaborate on schema evolution, prompt design, labeling strategies, and evaluation metrics.

**Required Skills**

- Python programming with data‑pipeline libraries (Pandas, PyArrow, regex, Airflow).
- Experience in data engineering, ML data ops, or structured data curation.
- Familiarity with LLMs/NLP frameworks (Hugging Face, spaCy, LangChain).
- Knowledge of tokenization, chunking, and model input preparation.
- Ability to work with software project data (Git repos, APIs, technical docs).

**Required Education & Certifications**

- Current student or recent graduate in Computer Science, Data Science, or a related field.
- Optional certifications in data engineering or machine learning (e.g., Coursera, edX, GCP/AWS data services).

Meudon, France

Hybrid

03-12-2025