August 31

Staff/Principal Performance Engineer

Remote | CloudSquad

Location: Remote (outside of Russia)
Work format: Remote, Full-time
Company name: CloudSquad
Contacts: @natalia_kurland

Job Title: Staff/Principal Performance Engineer

About the Role:
We are seeking a highly skilled and motivated Staff/Principal Performance Engineer to lead the performance optimization of our cutting-edge Generative AI technology stack. This role is critical in ensuring the scalability, efficiency, and reliability of our Large Language Models (LLMs) and Retrieval Augmented Generation (RAG) systems. You will be a key driver in identifying and resolving performance bottlenecks, optimizing resource utilization, and ensuring a seamless user experience. You will work closely with our AI research, software engineering, and infrastructure teams to deliver world-class AI solutions.

Responsibilities:

📌Performance Leadership:
- Define and implement performance engineering strategies across our full Generative AI stack, including services, applications, LLMs, RAG pipelines, and related infrastructure.
- Lead performance testing, profiling, and analysis efforts to identify and resolve performance bottlenecks.
- Establish and maintain performance benchmarks and SLAs for critical AI services.
- Provide technical leadership and mentorship to performance engineering team members.


📌LLM Capacity and Tuning:
- Analyze and improve LLM inference performance, including latency, throughput, and resource utilization.
- Develop and implement strategies for LLM capacity planning and scaling.
- Collaborate with AI researchers to evaluate and improve LLM architectures and training techniques for performance.
- Optimize LLM inference through techniques such as quantization, distillation, and optimized kernel implementations.


📌RAG Performance Optimization:
- Design and implement performance tests for RAG pipelines, including retrieval, ranking, and generation components.
- Identify and optimize performance bottlenecks in RAG systems, such as database queries, vector search, and document processing.
- Evaluate and optimize RAG system architectures for scalability and efficiency.
- Tune vector databases for optimal recall and latency.


📌Infrastructure Optimization:
- Collaborate with infrastructure teams to optimize hardware and software configurations for AI workloads.
- Evaluate and recommend new technologies and tools for performance monitoring and analysis.
- Develop and maintain performance dashboards and reports to track key metrics.
- Optimize GPU utilization and memory management for LLM inference.


📌Collaboration and Communication:
- Work closely with AI researchers, software engineers, and product managers to ensure performance requirements are met.
- Communicate performance findings and recommendations to stakeholders at all levels.
- Stay up-to-date with the latest developments in Generative AI and performance engineering.

Qualifications:

📌Education:
- Bachelor's degree in Computer Science, Engineering, or a related field (Master's preferred).


📌Experience:
- 10+ years of experience in performance engineering, with a focus on large-scale distributed systems.
- 2+ years of experience working with AI/ML technologies.
- Proven experience in performance testing, profiling, and analysis of complex software systems.
- Deep understanding of NLP architectures, training, and inference.
- Experience with vector databases and search technologies.
- Experience with cloud computing platforms (e.g., AWS, Azure, GCP) and containerization technologies (e.g., Docker, Kubernetes).
- Strong programming skills in Python.
- Experience with performance analysis tools (e.g., profilers, debuggers, monitoring tools).


📌Skills:
- Strong analytical and problem-solving skills.
- Excellent communication and collaboration skills.
- Ability to work in a fast-paced and dynamic environment.
- Passion for AI and a desire to push the boundaries of performance engineering.