- Company Name
- Storm3
- Job Title
- Research Scientist - Data
- Job Description
-
Job title: Research Scientist – Data
Role Summary: Lead data‑centric research on foundation models, designing large‑scale training corpora, developing automated data pipelines, and creating evaluation frameworks to enhance LLM robustness, scalability, and reasoning. Collaborate with researchers, data scientists, and engineers to publish findings in top AI venues and contribute to open‑source tooling.
Expectations: Deliver independent research on data quality, scaling, and reasoning; publish regularly in leading conferences; contribute to open‑source datasets and benchmarks; maintain high standards of data curation and pipeline reproducibility; engage with the external research community and at conferences.
Key Responsibilities
- Design and lead research on data‑centric approaches for LLMs (pretraining corpus, data valuation, speculative decoding).
- Build and optimize agentic data pipelines (retrieval, self‑curation, multi‑agent feedback).
- Develop scalable data preprocessing and curation pipelines for heterogeneous sources.
- Prototype and deploy evaluation frameworks assessing data quality, coverage, and downstream LLM reasoning impact.
- Collaborate with alignment and reasoning researchers to integrate data‑driven methods.
- Publish studies at NeurIPS, ICLR, ACL, EMNLP, etc.; represent the institute at conferences.
- Contribute tools, datasets, and benchmarks to the open‑source foundation model community.
Required Skills
- Master’s in CS, Data Science, or related field (PhD preferred).
- Proven experience with large‑scale text data collection, multi‑lingual curation, and preprocessing for ML/LLM training.
- Hands‑on expertise in scalable ML infrastructure for training, evaluation, and debugging.
- Strong background in data engineering, tokenization, and tokenizer training.
- Experience with RL/SFT, post‑training, retrieval‑augmented generation, or agentic data pipelines.
- Ability to lead independent research projects and produce high‑impact publications.
- Familiarity with knowledge graphs, semantic search, indexing, and speculative decoding concepts.
Required Education & Certifications
- Master's (B.Sc. acceptable with extensive experience) in Computer Science, Data Science, Machine Learning, or a related technical discipline; Ph.D. strongly preferred.
- No mandatory certifications, but demonstrable contributions to open‑source ML data tools or benchmarks are highly valued.
San Francisco Bay, United States
Hybrid
24-11-2025