Junior Data Scientist
About Theia
Theia Insights is a venture-backed deep tech company. We build future-proof AI solutions in Industry Classification, Risk Factor Models, and Portfolio Analysis for the global finance and investment arenas.
We are a team of PhD scientists, engineers and mathematicians, with decades of combined experience. Our products and solutions are built upon a foundation of academic and proprietary research and the latest developments in AI, machine learning, Natural Language Processing (NLP), and Large Language Model (LLM) technologies.
We are guided by our commitment to sustainably building incredible technology to better serve the needs of a rapidly evolving world and ever-changing investment landscape. We are on a mission to leverage artificial intelligence and revolutionise how information is transformed into insights.
Your role
Your role involves constructing processes for both real-time and batch data, ensuring data integrity through cleaning, transformation, and validation processes. Your focus will be on optimizing performance, scalability, and reliability. Additionally, you will play a vital role in aligning our data with customer expectations, contributing to the development of machine learning models, and shaping engineering processes to enhance product effectiveness in meeting customer intentions.
Your Responsibilities
- Data Analysis: Reporting quantitative and qualitative properties of our data.
- Data Quality: Developing the quality assurance and validation processes surrounding our data, such as building a data annotation pipeline or using LLMs to evaluate our data.
- Model Development: Involvement in developing models based on recent publications and open-source resources, adapted to our specific business needs.
- Pipeline Building: Helping shape the machine learning model and the engineering steps to align products with customer intentions.
Our tech stack
Machine Learning and Data Pipelines: Python and HuggingFace are used for the machine learning aspects of the technology.
Front End: TypeScript and React are used for creating the user interfaces.
Data Storage and Management: AWS stack. Data is stored in Amazon S3 buckets.
Job Orchestration Framework: We use Dagster for managing and orchestrating jobs.
Who you are
- Essential:
- A strong background in data analysis (both textual and numerical) and statistics.
- Strong programming skills in Python.
- Experience with implementing NLP and ML models.
- Experience in QA, data quality, and related areas such as data annotation and validation.
- Our culture is one of integrity, humility, and the pursuit of excellence. We are looking for candidates who share these principles and have a passion for technological innovation and a desire to build what is best for customers. You will be working as part of a multidisciplinary team in a fast-changing start-up environment.
- Desirable:
- Master's degree or higher in data science, CS, ML, Math, or a related field, or equivalent experience.
- Demonstrable record of achievement (e.g. publications, open source contributions, projects) in Machine Learning and NLP.
- Experience with designing a data validation process or annotation pipeline.
- Experience with designing and implementing batch and real-time data pipelines.
- Experience with Amazon SageMaker, Weights & Biases, or similar MLOps tooling.
- Knowledge of finance and economics.