FutureComputing4AI/KiloGrams
KiloGram algorithm for finding the top-k most frequent n-grams for large values of n quickly with fixed memory.
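To give a feel for how top-k n-gram counting can stay within a fixed memory budget, here is a minimal two-pass sketch. It is an illustration under stated assumptions, not the repository's Java implementation: the function name `top_k_ngrams`, the `buckets` parameter, and the use of CRC32 as the hash are all hypothetical choices. Pass 1 counts hash buckets only; pass 2 counts exactly the n-grams that fall into the heaviest buckets. The result is approximate, because colliding rare n-grams can push a bucket ahead of a truly frequent one, and this sketch does not bound the size of the pass-2 counter.

```python
from collections import Counter
import zlib


def top_k_ngrams(data: bytes, n: int, k: int, buckets: int = 1 << 16):
    """Return up to k (ngram, count) pairs for the most frequent byte n-grams.

    Illustrative sketch only; names and hash choice are assumptions.
    """
    if n <= 0 or k <= 0 or len(data) < n:
        return []

    # Pass 1: count by hash bucket, so memory is `buckets` integers
    # regardless of how many distinct n-grams the data contains.
    table = [0] * buckets
    for i in range(len(data) - n + 1):
        table[zlib.crc32(data[i:i + n]) % buckets] += 1

    # Keep the indices of the k heaviest buckets.
    hot = set(sorted(range(buckets), key=table.__getitem__, reverse=True)[:k])

    # Pass 2: exact counts, but only for n-grams that hash into a hot bucket.
    exact = Counter()
    for i in range(len(data) - n + 1):
        gram = data[i:i + n]
        if zlib.crc32(gram) % buckets in hot:
            exact[gram] += 1
    return exact.most_common(k)
```

For example, `top_k_ngrams(b"a" * 100 + b"bc", 2, 1)` returns `[(b"aa", 99)]`, since the 2-gram `aa` occurs 99 times and dominates its bucket whatever else collides with it.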
This tool helps cybersecurity researchers and data scientists efficiently extract the most frequent N-grams from large collections of files, like malware samples or software executables. You input paths to folders containing your files (e.g., 'goodware' and 'malware'), and it outputs a dataset in a format like libsvm, ready for machine learning analysis. It's designed for those building machine learning models for file classification, particularly in cybersecurity.
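For readers unfamiliar with the libsvm format mentioned above: each line is a label followed by sparse `index:value` pairs in ascending index order. The sketch below shows one way such a line could be built. The helper name `to_libsvm_line`, the label convention (for instance 0 for goodware and 1 for malware), and the mapping of n-grams to feature indices are assumptions for illustration; the tool's actual output conventions may differ.

```python
from typing import Dict


def to_libsvm_line(label: int, features: Dict[int, float]) -> str:
    """Format one sample as a libsvm line: '<label> <index>:<value> ...'.

    Indices are written in ascending order and zero values are omitted,
    which is the usual sparse convention for this format.
    """
    parts = [str(label)]
    for index in sorted(features):
        value = features[index]
        if value != 0:
            parts.append(f"{index}:{value:g}")
    return " ".join(parts)
```

With this helper, `to_libsvm_line(1, {3: 1, 1: 2.0, 2: 0})` yields `1 1:2 3:1`.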
No commits in the last 6 months.
Use this if you need to generate features for machine learning models by identifying the most frequent N-grams, for large values of N, across extensive file datasets, especially for malware analysis or binary classification tasks.
Not ideal if you require a user-friendly application with a graphical interface or if you need ongoing support and warranty for production systems.
Stars: 9
Forks: 4
Language: Java
License: Apache-2.0
Category:
Last pushed: Oct 08, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/FutureComputing4AI/KiloGrams"
Open to everyone: 100 requests/day, no key needed. A free key raises the limit to 1,000/day.
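The same request can be made from a script. The sketch below is a hedged example: the helper names `quality_url` and `fetch_quality` are hypothetical, the `category/owner/repo` path pattern is inferred from the single URL shown above, and since the response format is not stated here the body is returned as plain text rather than parsed.

```python
import urllib.request

# Base path taken from the curl example; the general pattern is an assumption.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category: str, repo: str) -> str:
    """Build the request URL for a repository, e.g. 'nlp' and 'owner/name'."""
    return f"{BASE_URL}/{category}/{repo}"


def fetch_quality(category: str, repo: str, timeout: float = 10.0) -> str:
    """Fetch the raw response body as text (keyless requests are rate-limited)."""
    with urllib.request.urlopen(quality_url(category, repo), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_quality("nlp", "FutureComputing4AI/KiloGrams")` issues the same GET request as the curl command.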
Higher-rated alternatives
facebookresearch/stopes
A library for preparing data for machine translation research (monolingual preprocessing,...
Droidtown/ArticutAPI
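API of Articut Chinese word segmentation (with semantic part-of-speech tagging): word segmentation (「斷詞」, also called 「分詞」) is the foundation of Chinese information processing. Articut uses no machine learning and needs no data model; relying only on the grammar rules of modern vernacular Chinese, it can achieve...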
rkcosmos/deepcut
A Thai word tokenization library using Deep Neural Network
fukuball/jieba-php
"Jieba" (Chinese for "to stutter") Chinese text segmentation: built to be the best PHP component for Chinese word segmentation.
pytorch/text
Models, data loaders and abstractions for language processing, powered by PyTorch