BLOSSOM V6 SFT Stage2 is a high-quality, diverse large language model fine-tuning dataset designed for the second-stage SFT training of the Blossom V6 model. Its purpose is to further enhance the model's ability to handle complex instructions on more rare real-world problems.
While open-source large language models often release model weights and technical reports, the most advanced open-source models typically withhold their pre-training and post-training data, making it difficult for the community to replicate their capabilities. Blossom is committed to providing researchers with reproducible post-training data for model capability development.
Data Sources: ShareGPT, WildChat, Wizard, Stackoverflow, Math, Magpie, AgentInstruct, InfinityPreference, Code, Flan, Olcc, Ruozhiba, etc.
Synthesis Workflow Overview:
Primarily employs three cost-effective models—Yi-Lightning, Deepseek-V2.5, and Doubao-Pro-32K (denoted as A, B, C)—to regenerate responses under different scenarios using tailored synthesis strategies.
For example:
Additional rule-based filtering is applied, such as:
Further technical details will be released in the future. The data is synthesized by the 🌸BlossomData framework.
Primarily Chinese and English, with a roughly 1:1 ratio of Chinese-to-English data.
Each entry represents a conversational sample with the following fields:
id: Unique identifier combined with metadata.source.type: Always set to chat.metadata: Contains source indicating the data origin.messages: A list of dialogue messages. Each message includes role (user or assistant) and content (text).This dataset is AI-generated. Despite preliminary validation and filtering, it may still contain inaccuracies or severe errors.