codefuse-ai
/

CodeFuse-CGE-Large

Model card Files Files and versions

CodeFuse-CGE-Large / README.md

codefuse-admin's picture

Upload README.md

58c1b6c verified over 1 year ago

|

history blame contribute delete

3.39 kB

	---
	frameworks:
	- Pytorch
	license: other
	tasks:
	- text-embedding
	---


	## CodeFuse-CGE-Large
	<p align="center">
	<img src="https://modelscope.cn/api/v1/models/codefuse-ai/CodeFuse-QWen-14B/repo?Revision=master&FilePath=LOGO.jpg&View=true" width="800"/>
	<p>

	Homepage: 🏡 https://github.com/codefuse-ai/CodeFuse-CGE (Please give us your support with a Star🌟 + Fork🚀 + Watch👀)

	## Model Description
	CodeFuse-CGE-Large is the Large version of the CodeFuse-CGE family which is fine-tuned based on CodeQwen1.5-7B. CodeFuse-CGE-Large is distinguish on text2code task for it's powerful ability of capturing the semantic relationship between code and text.

	This model has the following notable features:
	● Instruction-tuning is enabled for both query and code snippet sides.
	● The model obtains sentence-level and code-level representations through a layer of cross-attention computation module.
	● The model has a smaller dimensional size without significant degradation in performance.

	Model Configuration
	Model Size: 7B
	Embedding Dimension: 1024
	Hidden Layers: 32
	Max Input Tokens: 1024


	Requirements
	```
	flash_attn==2.4.2
	torch==2.1.0
	accelerate==0.28.0
	transformers==4.39.2
	vllm=0.5.3
	```

	## How to Use
	### transformers
	```
	from transformers import AutoTokenizer, AutoModel
	import torch

	model_name_or_path = "codefuse-ai/CodeFuse-CGE-Large"
	model = AutoModel.from_pretrained(model_name_or_path, trust_remote_code=True)
	tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True, truncation_side='right', padding_side='right')

	if torch.cuda.is_available():
	device = 'cuda'
	else:
	device = 'cpu'
	model.to(device)

	prefix_dict = {'python':{'query':'Retrieve the Python code that solves the following query:', 'passage':'Python code:'},
	'java':{'query':'Retrieve the Java code that solves the following query:', 'passage':'Java code:'},
	'go':{'query':'Retrieve the Go code that solves the following query:', 'passage':'Go code:'},
	'c++':{'query':'Retrieve the C++ code that solves the following query:', 'passage':'C++ code:'},
	'javascript':{'query':'Retrieve the Javascript code that solves the following query:', 'passage':'Javascript code:'},
	'php':{'query':'Retrieve the PHP code that solves the following query:', 'passage':'PHP code:'},
	'ruby':{'query':'Retrieve the Ruby code that solves the following query:', 'passage':'Ruby code:'},
	'default':{'query':'Retrieve the code that solves the following query:', 'passage':'Code:'}
	}

	text = ["Writes a Boolean to the stream.",
	"def writeBoolean(self, n): t = TYPE_BOOL_TRUE if n is False: t = TYPE_BOOL_FALSE self.stream.write(t)"]
	text[0] += prefix_dict['python']['query']
	text[1] += prefix_dict['python']['passage']
	embed = model.encode(tokenizer, text)
	score = embed[0] @ embed[1].T
	print("score", score)

	```

	## Benchmark the Performance
	We use MRR metric to evaluate the ability on text2code retrieval tasks: AdvTest, CosQA, CSN


	![result](./result.png)

	## Acknowledgement
	Thanks to the authors of open-sourced datasets, including CSN, Adv, CoSQA.

	## License
	Since CodeFuse-CGE-Large is fine-tuned based on CodeQwen1.5-7B model, our usage license follows the same terms as that of CodeQwen1.5-7B model.