RAG-based Knowledge Extraction Challenge
Context and Motivation
Welcome to the SelectCode RAG-based Knowledge Extraction Challenge! At SelectCode, we're facing a common corporate challenge: making technical knowledge more accessible. We're turning to you, our innovative applicants, to help us address this using Retrieval-Augmented Generation (RAG) techniques.
Challenge Description
Your task is to design and implement a data pipeline that processes technical documentation and uses RAG to let people ask how to perform the technical tasks it describes.
Technical Requirements
- You will be provided with multiple guides in markdown format (realistic, imperfect output from a PDF extractor). You can find them here.
- You do not have to use all of these files. You can just choose the ones you like best.
- You have the freedom to choose your preferred programming languages, frameworks, and tools.
- Focus on RAG/LLM tools that are open source or offer free API keys for evaluation. Using a paid service that might produce better output will not be an advantage.
- Your solution should demonstrate a well-designed data pipeline that incorporates RAG techniques.
- Pay attention to the classic RAG workflow: preprocessing the data if needed, generating embeddings, ingesting chunks into a vector database, and so on.
- Optional techniques that show domain knowledge are also appreciated (e.g. reranking of results).
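The workflow named in the list above (preprocess, embed, ingest, retrieve) can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not a recommended stack: `split_markdown`, `embed`, `cosine`, and `InMemoryIndex` are hypothetical names, the bag-of-words `embed` is a toy stand-in for a real embedding model, and the in-memory list stands in for a vector database.

```python
import math
import re
from collections import Counter


def split_markdown(text, max_chars=500):
    """Split markdown at headings, then cap chunk size at paragraph boundaries."""
    chunks = []
    # A zero-width split keeps each heading together with the text below it.
    for section in re.split(r"(?m)^(?=#{1,6} )", text):
        section = section.strip()
        if not section:
            continue
        current = ""
        for para in re.split(r"\n\s*\n", section):
            if current and len(current) + len(para) + 2 > max_chars:
                chunks.append(current)
                current = para
            else:
                current = f"{current}\n\n{para}" if current else para
        if current:
            chunks.append(current)
    return chunks


def embed(text):
    """Toy embedding (assumption): term counts instead of a learned vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryIndex:
    """Toy stand-in (assumption) for a vector database."""

    def __init__(self):
        self.rows = []  # list of (chunk_text, vector) pairs

    def add(self, chunk):
        self.rows.append((chunk, embed(chunk)))

    def search(self, query, k=3):
        query_vector = embed(query)
        ranked = sorted(
            self.rows, key=lambda row: cosine(query_vector, row[1]), reverse=True
        )
        return [chunk for chunk, _ in ranked[:k]]


if __name__ == "__main__":
    guide = (
        "# Git basics\n\nGit tracks changes in files.\n\n"
        "## Branches\n\nCreate a branch with git branch and switch to it with git switch.\n\n"
        "## Remotes\n\nPush commits to a remote with git push."
    )
    index = InMemoryIndex()
    for chunk in split_markdown(guide):
        index.add(chunk)
    print(index.search("how do I create a branch?", k=1)[0])
```

In a real submission, `embed` would call an embedding model and `InMemoryIndex` would be replaced by a vector store. An optional reranking step would reorder the results of `search` before they are passed to the LLM.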
Core Functionality
- File Processing: Ability to ingest and process markdown guides.
- RAG Implementation: Utilize RAG techniques to analyze and generate insights from the guides.
- Output Generation: Produce some form of output that answers technical questions about the contents of the guides (e.g. "How do I use branches in Git?").
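One way to connect retrieval to output generation is to assemble the retrieved chunks into a grounded prompt. The sketch below is illustrative only: `build_prompt` and `answer` are hypothetical names, the character budget is an arbitrary assumption, and `retrieve` and `generate` are caller-supplied functions standing in for whatever vector store and LLM a submission uses.

```python
def build_prompt(question, chunks, max_context_chars=2000):
    """Assemble a grounded prompt from retrieved chunks within a character budget."""
    parts = []
    used = 0
    for number, chunk in enumerate(chunks, start=1):
        entry = f"[{number}] {chunk.strip()}"
        if used + len(entry) > max_context_chars:
            break  # stop before exceeding the (assumed) context budget
        parts.append(entry)
        used += len(entry)
    context = "\n\n".join(parts) if parts else "(no context retrieved)"
    return (
        "Answer the question using only the numbered context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def answer(question, retrieve, generate, k=3):
    """Retrieve chunks, build the prompt, and delegate to a caller-supplied LLM.

    `retrieve(question, k)` returns a list of chunk strings and
    `generate(prompt)` returns the model's text; both are assumptions.
    """
    chunks = retrieve(question, k)
    return generate(build_prompt(question, chunks))


if __name__ == "__main__":
    print(build_prompt(
        "How do I use branches in Git?",
        ["Create a branch with git branch.", "Switch branches with git switch."],
    ))
```

Numbering the context entries makes it possible to ask the model to cite which chunk supports each part of its answer.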
Implementation Guidelines
- Focus on creating a robust and scalable data pipeline.
- Your code should be of production quality. Pay attention to:
  - Clean, readable, and well-structured code organization
  - Appropriate commenting and documentation
  - Error handling and logging
- Consider best practices for data processing and RAG implementation.
- Be creative! We're interested in novel approaches to making technical knowledge more accessible.
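As an illustration of the error handling and logging guideline above, the ingestion step might skip unusable files instead of aborting the whole run. This is a minimal sketch: `load_guides` and the logger name `rag_pipeline` are hypothetical, and treating empty or undecodable files as skippable is an assumption about the desired behavior.

```python
import logging
from pathlib import Path

logger = logging.getLogger("rag_pipeline")  # hypothetical logger name


def load_guides(directory):
    """Read every .md file in `directory`, skipping files that cannot be used."""
    root = Path(directory)
    if not root.is_dir():
        # A missing input directory is a configuration error, so fail loudly.
        raise FileNotFoundError(f"Guide directory not found: {root}")

    guides = {}
    for path in sorted(root.glob("*.md")):
        try:
            text = path.read_text(encoding="utf-8")
        except (OSError, UnicodeDecodeError) as exc:
            # One bad file should not stop the rest of the ingestion.
            logger.warning("Skipping %s: %s", path.name, exc)
            continue
        if not text.strip():
            logger.warning("Skipping %s: file is empty", path.name)
            continue
        guides[path.name] = text
        logger.info("Loaded %s (%d characters)", path.name, len(text))
    return guides


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    try:
        print(sorted(load_guides("guides")))  # "guides" is a placeholder path
    except FileNotFoundError as exc:
        logger.error("%s", exc)
```

The split between raising and logging is a design choice: problems the caller must fix stop the run, while per-file problems are recorded and skipped.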
Evaluation Criteria
We will evaluate your work based on the following:
- Functional quality: How well does your solution address the problem of extracting and accessing knowledge?
- Code quality: We're looking for clean, readable, and well-structured code.
- Pipeline design: How well-designed and scalable is your data processing pipeline?
- RAG implementation: How effectively have you utilized RAG techniques?
- Documentation quality: We're looking for a well-structured README.md file that explains your approach and how to use your solution.
- Problem solving: We want to see your thought process, so commit early and often with clear, informative commit messages.
Submission
- Provide your code in a public GitHub / GitLab repository.
- Include a README file that contains:
  - An overview of your solution's architecture
  - Installation and usage instructions
  - Explanation of your RAG implementation and how it addresses the challenge
  - Any assumptions you made and why
  - Ideas for future improvements or extensions of your solution
- If you've implemented any particularly unique or innovative features, please highlight them in your README.
- Send all this information via email to Stephan ([email protected]).
Questions?
If you have any questions or need clarification, please don't hesitate to reach out. We're here to help!
Good luck with the challenge! We're excited to see your creative and effective solutions to improve knowledge access.