reddy-lab-code-research/XLCoST
Code and data for XLCoST: A Benchmark Dataset for Cross-lingual Code Intelligence
XLCoST is a dataset for training and evaluating machine learning models that work with code across programming languages. It provides aligned code snippets and full programs in 7 languages (C++, Java, Python, C#, JavaScript, PHP, C), paired with English comments and problem descriptions. Researchers and engineers building code intelligence tools can use it for tasks such as code translation, summarization, and search.
No commits in the last 6 months.
Use this if you are building or evaluating AI models for code translation, summarization, or searching across multiple programming languages.
Not ideal if you need a dataset focused on a single programming language or if your task doesn't involve natural language descriptions.
Stars: 91
Forks: 6
Language: C
License: Apache-2.0
Category:
Last pushed: Jan 21, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/reddy-lab-code-research/XLCoST"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
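The same request can be made from a script. The sketch below uses only the Python standard library and mirrors the curl call above. The helper names `build_quality_url` and `fetch_quality` are hypothetical, the `ml-frameworks` path segment is copied from the URL shown, and the response is assumed to be JSON, which this page does not confirm.

```python
import json
import urllib.parse
import urllib.request

# Base path copied from the curl example; everything after it is category/owner/repo.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category: str, repo: str) -> str:
    """Build the endpoint URL for a repository given as 'owner/name' (hypothetical helper)."""
    # quote() leaves '/' unescaped by default, so 'owner/name' keeps its separator.
    return f"{BASE_URL}/{urllib.parse.quote(category)}/{urllib.parse.quote(repo)}"


def fetch_quality(category: str, repo: str, timeout: float = 10.0):
    """Fetch and decode the response body (hypothetical helper).

    Assumption: the endpoint returns JSON. If it does not, json.loads raises
    json.JSONDecodeError and the raw body should be inspected instead.
    """
    url = build_quality_url(category, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `fetch_quality("ml-frameworks", "reddy-lab-code-research/XLCoST")` issues one request, which counts against the daily limit, so a script that polls many repositories may want to cache results locally.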
Higher-rated alternatives
facebookresearch/fairseq2
FAIR Sequence Modeling Toolkit 2
lhotse-speech/lhotse
Tools for handling multimodal data in machine learning projects.
google/sequence-layers
A neural network layer API and library for sequence modeling, designed for easy creation of...
awslabs/sockeye
Sequence-to-sequence framework with a focus on Neural Machine Translation based on PyTorch
OpenNMT/OpenNMT-tf
Neural machine translation and sequence learning using TensorFlow