Project Title: Developing an Intelligent Chinese Knowledge Base System (Document Database and Natural Language Query Implementation)
Project Overview: We are building an intelligent Chinese knowledge base that helps students and teachers learn and teach Chinese more effectively through efficient data collation and classification, semantic search, and assisted learning functions. We are looking for experienced developers or teams to implement the following two core technology modules:
- 1. Document database: Transform existing Chinese materials (texts, exercises, study guides, lesson plans, etc.) into a structured database, including semantic embedding generation and association annotation.
- 2. Natural language query to SQL: Develop a system that can automatically convert users' natural language questions into SQL query statements, retrieve relevant information from the database and generate intelligent answers.
Detailed requirements:
1. Document database
- · Data collation and classification:
1. Classification standard design
Based on the needs of Chinese teaching, materials are classified along the following dimensions:
- · Content Type: Text, Exercises, Study Guide, Lesson Plan.
- · Grades: elementary school, junior high school and senior high school.
- · Topics: Literary knowledge, ancient poetry, reading comprehension, grammar knowledge, etc.
- · Version: Textbook version (People's Education Press, Jiangsu Education Press, etc.).
- o Design and create the corresponding database table structure according to the classification criteria provided, including but not limited to:
- § Text table: stores the content of the text and its unique identifier.
- § Resource table: stores after-class exercises, study guides, lesson plans, etc.
- § Association table: stores the association information between texts and resources (such as association type: strong/weak).
- · Database design:
- o Example table structure:
- § Text table: ID | Text title | Grade | Textbook version | Text content | Embedding vector
- § Resource table: ID | Type (exercises/study guides/lesson plans) | Grade | Textbook version | Content | Embedding vector
- § Association table: Text ID | Resource ID | Association type (strong/weak)
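The example table structure above could be prototyped as follows. This is a minimal sketch using Python's built-in sqlite3 module; the table names, column names, type labels, and the choice to store the embedding as a BLOB are all illustrative assumptions, not requirements of this brief.

```python
import sqlite3

# Hypothetical schema: every table and column name here is an assumption.
SCHEMA = """
CREATE TABLE texts (
    id               INTEGER PRIMARY KEY,
    title            TEXT NOT NULL,
    grade            TEXT NOT NULL,
    textbook_version TEXT NOT NULL,
    content          TEXT NOT NULL,
    embedding        BLOB  -- serialized vector; NULL until generated
);
CREATE TABLE resources (
    id               INTEGER PRIMARY KEY,
    type             TEXT NOT NULL
                     CHECK (type IN ('exercise', 'study_guide', 'lesson_plan')),
    grade            TEXT NOT NULL,
    textbook_version TEXT NOT NULL,
    content          TEXT NOT NULL,
    embedding        BLOB
);
CREATE TABLE associations (
    text_id          INTEGER NOT NULL REFERENCES texts(id),
    resource_id      INTEGER NOT NULL REFERENCES resources(id),
    association_type TEXT NOT NULL
                     CHECK (association_type IN ('strong', 'weak')),
    PRIMARY KEY (text_id, resource_id)
);
"""


def create_database(path=":memory:"):
    """Create the three tables and return an open connection."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # enforce the REFERENCES clauses
    conn.executescript(SCHEMA)
    return conn


conn = create_database()
text_id = conn.execute(
    "INSERT INTO texts (title, grade, textbook_version, content) VALUES (?, ?, ?, ?)",
    ("Spring Dawn", "elementary", "People's Education Press", "(text body)"),
).lastrowid
resource_id = conn.execute(
    "INSERT INTO resources (type, grade, textbook_version, content) VALUES (?, ?, ?, ?)",
    ("exercise", "elementary", "People's Education Press", "(exercise body)"),
).lastrowid
conn.execute(
    "INSERT INTO associations VALUES (?, ?, ?)", (text_id, resource_id, "strong")
)
```

The CHECK constraints reject any association type other than strong or weak at the database level, so annotation tools cannot store an unknown label by accident.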
Import and preprocessing:
- · Multi-format support:
- o Input: PDF, Word, pictures, scans, etc.
- o Output: Normalized text after cleaning.
- · Content extraction and cleaning:
- o OCR tool recognition (Tesseract or PaddleOCR).
- o Automatically clear watermarks, advertisements and redundant characters.
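As a rough illustration of the cleaning step, the sketch below normalizes OCR output line by line. The noise patterns are hypothetical examples of page footers and promotional lines; real watermark and advertisement markers depend on the source files and would need to be collected from them.

```python
import re
import unicodedata

# Hypothetical noise patterns; adjust to the markers found in the real files.
NOISE_PATTERNS = [
    re.compile(r"^第\s*\d+\s*页.*$"),          # page footer such as "第 3 页"
    re.compile(r"^.*(扫码关注|版权所有).*$"),  # promotional or watermark lines
]


def clean_text(raw: str) -> str:
    """Return normalized text with blank and noise lines removed."""
    # NFC keeps full-width Chinese punctuation intact (NFKC would rewrite it).
    text = unicodedata.normalize("NFC", raw)
    cleaned = []
    for line in text.splitlines():
        line = re.sub(r"[\u200b\ufeff]", "", line)         # zero-width characters
        line = re.sub(r"[ \t\u3000]+", " ", line).strip()  # collapse whitespace
        if not line:
            continue
        if any(pattern.match(line) for pattern in NOISE_PATTERNS):
            continue
        cleaned.append(line)
    return "\n".join(cleaned)


sample = "春晓\u3000 孟浩然\u200b\n\n第 3 页\n春眠不觉晓，处处闻啼鸟。\n扫码关注公众号\n"
cleaned_sample = clean_text(sample)
```

Pattern-based removal only covers noise that survives OCR as text; watermarks embedded in images would need to be handled before or during recognition.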
- · Task flow design:
- o Data cleaning -> classified storage -> embedding generation -> association annotation.
- · Semantic embedding generation:
- o Use advanced embedding models (such as BERT, GPT, etc.) to generate vector representations of texts and related resources, and store them in a database that supports vector retrieval (such as Pinecone, Weaviate, etc.).
- o Support fine-tuning to optimize model performance on large-scale Chinese-language data.
- · Relationship annotation:
- o Based on the annotation of teachers or language experts, an accurate mapping relationship between the text and its related resources is established to ensure the relevance and accuracy of the retrieval results.
- o Use semantic embedding similarity to generate preliminary associations between texts and resources (for example, a pair whose cosine similarity exceeds a set threshold is treated as a "strong association").
- o Association type:
(1) Strong association: direct relationship (such as the exercises for a text).
(2) Weak association: auxiliary relationship (such as extended reading for a text).
- o Association direction:
(1) One-way: one-way reference relationship.
(2) Two-way: interrelated.
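The similarity-based preliminary labeling described above can be sketched as follows. The brief only states that a cosine similarity above a set threshold counts as a strong association; the specific threshold values and the second, lower threshold for weak associations are assumptions added for illustration, and the proposals are meant to be reviewed by teachers or language experts.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def propose_associations(text_vecs, resource_vecs,
                         strong_threshold=0.8, weak_threshold=0.5):
    """Return (text_id, resource_id, label, score) proposals for expert review.

    Both threshold defaults are illustrative assumptions, not tuned values.
    """
    proposals = []
    for text_id, text_vec in text_vecs.items():
        for resource_id, resource_vec in resource_vecs.items():
            score = cosine_similarity(text_vec, resource_vec)
            if score > strong_threshold:
                label = "strong"
            elif score > weak_threshold:
                label = "weak"
            else:
                continue  # too dissimilar to propose any association
            proposals.append((text_id, resource_id, label, score))
    return proposals


# Toy 2-dimensional vectors stand in for real model embeddings.
texts = {1: [1.0, 0.0]}
resources = {10: [1.0, 0.1], 11: [1.0, 1.0], 12: [0.0, 1.0]}
proposals = propose_associations(texts, resources)
```

Comparing every text with every resource is quadratic in the collection size; a vector database would normally replace the nested loop with a nearest-neighbor query.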
- · Visual management tools:
Admin interface:
(1) Data entry: batch upload and single editing functions.
(2) Embedding management: View and update semantic embeddings.
(3) Associated annotation: drag-and-drop operation to realize visual associated editing.
Permission control: set different permissions for each role.
(1) Administrator: full management authority.
(2) Teacher: annotate and update associations.
(3) Student: search and query content only.
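One simple way to express these roles is a static mapping from role to allowed actions, as in the sketch below. The action names are hypothetical, and letting teachers search content as well as annotate is an assumption; the actual permission list would be agreed during design.

```python
from enum import Enum


class Role(Enum):
    ADMINISTRATOR = "administrator"
    TEACHER = "teacher"
    STUDENT = "student"


# Hypothetical action names for illustration only.
ALL_ACTIONS = {
    "search",
    "upload",
    "edit_content",
    "manage_embeddings",
    "annotate_association",
    "update_association",
    "manage_users",
}

PERMISSIONS = {
    Role.ADMINISTRATOR: set(ALL_ACTIONS),  # full authority
    Role.TEACHER: {"search", "annotate_association", "update_association"},
    Role.STUDENT: {"search"},              # retrieval and query only
}


def is_allowed(role: Role, action: str) -> bool:
    """Return True if the given role may perform the given action."""
    return action in PERMISSIONS.get(role, set())
```

For example, `is_allowed(Role.STUDENT, "annotate_association")` evaluates to False, while the same call with `Role.TEACHER` evaluates to True.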
2. Natural language query to SQL
- · Semantic search flow:
- o After the user enters keywords or a question, the system must understand the user's intention, automatically generate the corresponding SQL query statements, and retrieve matching texts and related resources from the database. The system should identify the following information:
(1) Query target (such as text content, exercises, study guides, lesson plans).
(2) Query subject (such as the name of the text: "Spring Dawn").
(3) Query conditions (such as grade and textbook version).
- o The returned results need to be sorted according to the degree of relevance, and related resources, such as exercises, study guides, lesson plans, etc., are automatically supplemented. Sort by the following rules:
(1) Directly related content (such as the text of Spring Dawn).
(2) Resources closely related to the query target (such as exercises and lesson plans).
(3) Weakly related content (such as extended reading or study guides).
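The three-tier ordering above can be sketched as a sort on a tier rank followed by a relevance score. The `relation` and `score` field names are assumptions about how results might be represented, and breaking ties inside a tier by similarity score is one possible choice rather than a stated requirement.

```python
# Lower rank sorts first: the text itself, then strong, then weak associations.
TIER_RANK = {"direct": 0, "strong": 1, "weak": 2}


def rank_results(results):
    """Sort results by tier, then by descending similarity score within a tier."""
    return sorted(results, key=lambda r: (TIER_RANK[r["relation"]], -r["score"]))


unsorted_results = [
    {"title": "Extended reading", "relation": "weak", "score": 0.61},
    {"title": "After-class exercises", "relation": "strong", "score": 0.88},
    {"title": "Spring Dawn (text)", "relation": "direct", "score": 1.0},
    {"title": "Lesson plan", "relation": "strong", "score": 0.93},
]
ranked = rank_results(unsorted_results)
ranked_titles = [r["title"] for r in ranked]
```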
- · Example functionality:
- o User search: "Contents and related exercises of Spring Dawn"
- § Returns:
- § Text: The complete content of Spring Dawn.
- § Exercises: After-class exercises corresponding to Spring Dawn.
- § Study guide: Study suggestions for Spring Dawn.
- § Lesson plan: The teacher's approach to explaining the text.
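To make the example concrete, here is a deliberately simple rule-based sketch that turns the sample question into a parameterized SQL query. It is a stand-in for illustration: a production system would more likely use a language model for intent parsing. The keyword tables, the `texts`/`resources`/`associations` table names, and the column names are all hypothetical.

```python
# Hypothetical keyword tables; a real system would need Chinese-language cues.
RESOURCE_KEYWORDS = {
    "exercise": "exercise",
    "study guide": "study_guide",
    "lesson plan": "lesson_plan",
}
GRADE_KEYWORDS = {
    "elementary": "elementary",
    "junior high": "junior_high",
    "senior high": "senior_high",
}


def parse_query(question, known_titles):
    """Extract the query subject, target resource types, and grade condition."""
    q = question.lower()
    title = next((t for t in known_titles if t.lower() in q), None)
    resource_types = sorted({v for k, v in RESOURCE_KEYWORDS.items() if k in q})
    grade = next((v for k, v in GRADE_KEYWORDS.items() if k in q), None)
    return {"title": title, "resource_types": resource_types, "grade": grade}


def build_sql(parsed):
    """Build (sql, params); user values are bound as parameters, never inlined."""
    if parsed["title"] is None:
        raise ValueError("no known text title found in the question")
    sql = (
        "SELECT r.type, r.content, a.association_type "
        "FROM texts t "
        "JOIN associations a ON a.text_id = t.id "
        "JOIN resources r ON r.id = a.resource_id "
        "WHERE t.title = ?"
    )
    params = [parsed["title"]]
    if parsed["resource_types"]:
        placeholders = ", ".join("?" for _ in parsed["resource_types"])
        sql += f" AND r.type IN ({placeholders})"
        params.extend(parsed["resource_types"])
    if parsed["grade"]:
        sql += " AND t.grade = ?"
        params.append(parsed["grade"])
    return sql, params


parsed = parse_query("Contents and related exercises of Spring Dawn", ["Spring Dawn"])
sql, params = build_sql(parsed)
```

Whatever component generates the SQL, binding user-supplied values as parameters instead of concatenating them into the statement keeps generated queries safe from SQL injection.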
- · External data integration:
- o When the data in the database is insufficient to answer the user's question, the system should be able to obtain relevant content through external search or API, and combine it with the existing database content to generate a complete answer.
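The fallback behavior could be structured as in the sketch below, where both search functions are injected as callables. The function names, the `min_results` cutoff, and the `origin` tag are assumptions; no particular search engine or API is implied.

```python
def gather_results(question, search_local, search_external, min_results=1):
    """Query the database first and call the external source only if needed.

    search_local and search_external are hypothetical callables that take the
    question and return a list of content strings.
    """
    results = [
        {"origin": "database", "content": content}
        for content in search_local(question)
    ]
    if len(results) < min_results:
        results += [
            {"origin": "external", "content": content}
            for content in search_external(question)
        ]
    return results


# Stand-in search functions for demonstration.
found_locally = gather_results(
    "Spring Dawn", lambda q: ["local text"], lambda q: ["web page"]
)
found_externally = gather_results(
    "Spring Dawn", lambda q: [], lambda q: ["web page"]
)
```

Tagging each result with its origin lets the answer generation step distinguish curated database content from externally retrieved material.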
Additional requirements:
3. Assisted learning function (optional module)
- · Knowledge points are explained step by step:
- o Use models such as GPT to generate easy-to-understand step-by-step explanations of complex Chinese knowledge points.
- · Supplementary content generation:
- o Automatically generate illustrations, example sentences, and short quizzes to help students consolidate their knowledge.
- · Context-related display:
- o After a student enters a knowledge point, the system displays related extended content and background knowledge to provide comprehensive learning resources.
Skill Requirements:
- · Database design and management:
- o Familiarity with relational databases (e.g. MySQL, PostgreSQL) or non-relational databases (e.g. MongoDB).
- o Experience using vector databases (e.g., Pinecone, Weaviate) for semantic embedding storage and retrieval.
- · Natural language processing (NLP):
- o Familiarity with using NLP models (such as BERT, GPT) for text embedding and semantic understanding.
- o Project experience converting natural language to SQL queries.
- · Programming languages and frameworks:
- o Proficiency in Python or other programming languages suitable for NLP and database operations.
- o Familiarity with using relevant NLP libraries (e.g. spaCy, Transformers) and database interfaces.
- · API development and integration:
- o Ability to develop RESTful APIs for data exchange between the front end and back end.
- o Experience integrating 3rd party APIs, particularly in data acquisition and processing.
- · Project Management and Collaboration:
- o Good communication skills, able to understand project requirements and complete tasks efficiently.
- o Experience with version control tools such as Git.
Project Delivery:
- · Stage 1: Complete the database design and construction, import the initial data, and generate and store semantic embeddings.
- · Stage 2: Develop the natural-language-query-to-SQL module to implement basic query functions.
- · Stage 3: Optimize the retrieval and answer generation logic to ensure the accuracy and relevance of the answers.
- · Stage 4 (optional): Integrate assisted learning functions to improve user experience.
Project budget: Please make a reasonable quotation according to your experience and project needs.
Project duration: It is hoped that the entire project will be completed within [specific time frame, such as 3 months], which can be negotiated according to the development progress.
How to apply: Please provide the following information so that we can better understand your abilities and experience:
- 1. Relevant project experience and case sharing.
- 2. A brief description of the technical proposal, explaining how you will implement the above requirements.
- 3. Project quotation and estimated completion time.
- 4. Other information that you feel will be helpful in the application.
Looking forward to cooperating with you to build an efficient and practical intelligent Chinese knowledge base system!