Data Engineer
Company: Institute of Foundation Models
Location: Sunnyvale
Posted on: February 15, 2026
Job Description:
About the Institute of Foundation Models

We are a dedicated research lab for building,
understanding, using, and risk-managing foundation models. Our
mandate is to advance research, nurture the next generation of AI
builders, and drive transformative contributions to a
knowledge-driven economy. As part of our team, you’ll have the
opportunity to work on the core of cutting-edge foundation model
training, alongside world-class researchers, data scientists, and
engineers, tackling the most fundamental and impactful challenges
in AI development. You will participate in the development of
groundbreaking AI solutions that have the potential to reshape
entire industries. Strategic and innovative problem-solving skills
will be instrumental in establishing MBZUAI as a global hub for
high-performance computing in deep learning, driving impactful
discoveries that inspire the next generation of AI pioneers.

The Role

As a Data Engineer specializing in Natural Language Processing
(NLP) and large-scale data processing, you will quickly and
effectively gather, curate, and prepare high-quality datasets to
support cutting-edge NLP research. Your role will be instrumental
in enabling researchers by delivering essential data through
efficient and scalable engineering practices, including web
crawling, LLM-generated content refinement, and robust data
pipelines, primarily leveraging Python and related technologies.

Key Responsibilities

* Rapidly collect, curate, and preprocess datasets based on detailed specifications provided by NLP researchers, delivering data within tight timelines.
* Develop and maintain efficient web crawling solutions, APIs, and automated workflows to continuously improve data collection processes.
* Refine and evaluate outputs from Large Language Models (LLMs) to generate structured datasets suitable for model training and benchmarking.
* Implement scalable data pipelines, ensuring efficient data processing, storage, retrieval, and distribution to research teams.
* Collaborate closely with researchers and engineers to ensure collected data meets specified quality and relevance criteria.
* Document data collection methodologies, dataset characteristics, and pipeline architecture clearly and effectively.
* Engage with peer teams and participate in technical reviews to uphold best practices and data quality standards.
* Represent MBZUAI at industry and research forums, showcasing technical capabilities in large-scale data processing and AI data infrastructure.

Academic Qualifications

* Bachelor’s degree in Computer Science, Data Science, Engineering, or a related technical field required.
* Master’s degree, PhD, or equivalent experience in Computer Science, Data Engineering, or a related technical field preferred.

Professional Experience - Required

* Extensive experience in data engineering, data processing, and automation using Python.
* Demonstrated proficiency in designing and deploying web crawling solutions, automated data extraction, and processing pipelines.
* Strong understanding of data structures, algorithms, databases, SQL, and performance optimization.
* Experience working with cloud infrastructure and distributed data processing frameworks (e.g., AWS, Spark, Kafka, Kubernetes).
* Excellent problem-solving abilities, attention to detail, and the capability to rapidly address technical challenges.
* Strong communication and collaboration skills with cross-functional teams.

Professional Experience - Preferred

* Proven track record of supporting NLP or AI research teams with rapid and reliable data delivery.
* Experience working with large language models, including evaluation, efficient inference, and prompt engineering.
* Experience with refining outputs from large-scale AI models, such as LLM-generated data.
* Contributions to open-source projects, coding competitions, or high visibility in coding communities (e.g., GitHub, Stack Overflow).
* Familiarity with the latest advancements in NLP data processing and large language model technologies.

Visa Sponsorship

This position is eligible for visa sponsorship.

Benefits Include

* Comprehensive medical, dental, and vision benefits
* Bonus
* 401K Plan
* Generous paid time off, sick leave and holidays
* Paid Parental Leave
* Employee Assistance Program
* Life insurance and disability

Keywords: Institute of Foundation Models, San Bruno, Data Engineer, Science, Research & Development, Sunnyvale, California