- Company Name
- CAST
- Job Title
- Intern Data Engineer / Data Enablement with AI for AI
- Job Description
**Job Title**
Intern Data Engineer / Data Enablement for AI
**Role Summary**
Assist in building the foundational data layer that powers AI systems by aggregating, structuring, and enriching software project data. Use LLMs, embeddings, and NLP tools to automate data cleaning, entity extraction, and semantic annotation, creating ready‑to‑use datasets for fine‑tuning and Retrieval‑Augmented Generation (RAG).
**Expectations**
- Develop and maintain scalable semantic pipelines.
- Apply advanced AI techniques for data curation and quality assurance.
- Deliver reproducible, lineage‑tracked datasets for downstream AI models.
- Work cross‑functionally with AI research and engineering teams.
**Key Responsibilities**
- Aggregate data from software ecosystems (code, APIs, tickets, docs, architecture specs).
- Clean and enrich data using LLMs, embeddings, and NLP (Hugging Face, spaCy, LangChain).
- Extract entities, tag metadata, perform semantic annotation, and prepare tokenized chunks for model input.
- Build and manage pipelines (Airflow, Pandas, PyArrow) that feed RAG and LLM fine‑tuning workflows.
- Format datasets for Agent‑to‑Agent interaction (vector databases, knowledge graphs, APIs).
- Ensure robust data lineage, reproducibility, and version control.
- Collaborate on schema evolution, prompt design, labeling strategies, and evaluation metrics.
**Required Skills**
- Python programming with data‑pipeline tools (Pandas, PyArrow, Airflow) and regular expressions.
- Experience in data engineering, ML data ops, or structured data curation.
- Familiarity with LLMs/NLP frameworks (Hugging Face, spaCy, LangChain).
- Knowledge of tokenization, chunking, and model input preparation.
- Ability to work with software project data (Git repos, APIs, technical docs).
**Required Education & Certifications**
- Current student or recent graduate in Computer Science, Data Science, or a related field.
- Optional: certifications in data engineering or machine learning (e.g., from Coursera or edX, or in GCP/AWS data services).
---