JT-Coder-8B-Base

JT-Coder is a series of high-performance, energy-efficient code large language models (LLMs) developed by the JiuTian team. Our core philosophy is that high-quality data matters more than sheer data volume. Thanks to our data-centric framework, JT-Coder is pre-trained on only 1.6T tokens yet comprehensively outperforms multiple models of similar scale trained on roughly 4x as much data, offering a more efficient and reproducible path for developing code LLMs.

"/"

Figure 1: Performance of JT-Coder-8B-Instruct on code generation benchmarks.

Core Features

  • 🚀 State-of-the-Art Performance: JT-Coder achieves or surpasses the performance of existing top open-source models at both 1.5B and 8B scales across multiple code generation and comprehension benchmarks, including EvalPlus, BigCodeBench, LiveCodeBench, and FullstackBench.

  • 🧠 Extreme Data Efficiency: We completed pre-training using only 1.6T high-quality tokens. Compared to similar models that typically use 5-6T tokens, our data efficiency is improved by roughly 4x, demonstrating the immense value of our advanced data processing pipeline.

  • 💡 Innovative Data-Centric Framework:

    • Pre-training Phase: We meticulously cleaned open-source code data, filtering out low-quality and sensitive content. Simultaneously, we recovered and enriched high-value data such as Jupyter Notebooks, and synthesized large-scale, context-rich Q&A data and programming guides (a minimal filtering sketch follows this list).

    • Instruction Tuning Phase: We pioneered the "Instruction Evolution" technique. It reverse-engineers the model's effective outputs for simple instructions, turning implicit characteristics of the code (e.g., algorithm selection, error handling) into explicit, complex instruction constraints, thereby significantly enriching the diversity and complexity of the instruction data (a minimal evolution sketch also follows this list).
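
The exact cleaning heuristics are not published in this card; the snippet below is only a minimal, hypothetical sketch of the kind of rule-based quality and sensitivity filter the pre-training bullet describes. The thresholds, patterns, and function name are illustrative assumptions, not the actual JT-Coder pipeline.

```python
import re

# Hypothetical heuristics; thresholds are illustrative, not the actual JT-Coder pipeline.
MAX_LINE_LENGTH = 1000          # extremely long lines often indicate minified/generated code
MIN_ALPHA_RATIO = 0.25          # files that are mostly non-alphabetic are rarely useful
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                         # AWS access key id
    re.compile(r"-----BEGIN (RSA |EC )?PRIVATE KEY-----"),   # embedded private keys
]

def keep_code_sample(text: str) -> bool:
    """Return True if a code sample passes basic quality and sensitivity checks."""
    if not text.strip():
        return False
    lines = text.splitlines()
    if max(len(line) for line in lines) > MAX_LINE_LENGTH:
        return False
    alpha_ratio = sum(ch.isalpha() for ch in text) / len(text)
    if alpha_ratio < MIN_ALPHA_RATIO:
        return False
    if any(p.search(text) for p in SECRET_PATTERNS):
        return False  # drop samples containing likely secrets
    return True
```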
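
The card also does not show the Instruction Evolution prompts themselves; the sketch below only illustrates the general idea under stated assumptions: inspect an effective response to a simple instruction, detect implicit properties of the code (error handling, documentation, algorithm choice), and rewrite them as explicit constraints on a new, more complex instruction. The `detect_properties` heuristics and the wording of the evolved instruction are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class InstructionPair:
    instruction: str
    response: str  # code the model produced for the simple instruction

def detect_properties(code: str) -> list[str]:
    """Hypothetical heuristics that surface implicit characteristics of the code."""
    constraints = []
    if "try:" in code or "except" in code:
        constraints.append("handle invalid input with explicit error handling")
    if "def " in code and '"""' in code:
        constraints.append("document every function with a docstring")
    if "sorted(" in code or ".sort(" in code:
        constraints.append("state the time complexity of the sorting step")
    return constraints

def evolve_instruction(pair: InstructionPair) -> str:
    """Turn implicit properties of an effective response into explicit constraints."""
    constraints = detect_properties(pair.response)
    if not constraints:
        return pair.instruction
    bullet_list = "\n".join(f"- {c}" for c in constraints)
    return f"{pair.instruction}\n\nAdditionally, the solution must:\n{bullet_list}"

# Example: a simple instruction whose response happens to use try/except and sorting
pair = InstructionPair(
    instruction="Write a Python function that returns the three largest numbers in a list.",
    response=(
        "def top_three(nums):\n"
        '    """Return the three largest numbers."""\n'
        "    try:\n"
        "        return sorted(nums, reverse=True)[:3]\n"
        "    except TypeError:\n"
        "        raise ValueError('nums must contain comparable numbers')\n"
    ),
)
print(evolve_instruction(pair))
```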

Model List

We have released the following pre-trained base models and instruction-tuned models:

Model Name | Type | Size
JT-Coder-8B-Instruct | Instruct | 8B
JT-Coder-8B-Base (You are here!) | Base | 8B
JT-Coder-1.5B-Instruct | Instruct | 1.5B
JT-Coder-1.5B-Base | Base | 1.5B
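
This card does not include a loading snippet; the following is a minimal sketch using the Hugging Face transformers library. The repository id is a placeholder assumption (replace it with the actual Hub path of this model), and the generation settings are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id -- replace with the actual Hub path of this model card.
model_id = "JiuTian/JT-Coder-8B-Base"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # the released weights are BF16
    device_map="auto",
)

# Base-model usage: complete a code prefix rather than follow a chat-style instruction.
prompt = 'def fibonacci(n: int) -> int:\n    """Return the n-th Fibonacci number."""\n'
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```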

License

The source code for this project is licensed under the Apache 2.0 license. The distribution and use of model weights adhere to their respective licensing agreements.

Acknowledgement

Our work is built upon the shoulders of giants in the open-source community, and we wish to express our profound gratitude.

We extend our sincere thanks to the Qwen Team. Adopting the Qwen2.5 tokenizer provided a robust vocabulary foundation that was crucial for our model's powerful multilingual and coding abilities.

Furthermore, we are deeply indebted to the creators and maintainers of pivotal open-source code datasets, including The Stack v2 and Code-Matrix, as well as the instruction data from projects such as OpenCoder. Their monumental efforts in collecting, curating, and sharing these vast resources provided the essential raw material for our data-centric framework. This project would not have been possible without their foundational contributions.

We hold the spirit of open collaboration in the highest regard and are proud to contribute back to the community that has enabled our research.

Disclaimer

JT-Coder is a large language model. While it has undergone rigorous data filtering and training, it may still generate inaccurate, biased, or harmful content. Users are advised to carefully evaluate the model's output and are responsible for any consequences arising from its use.
