FutureComputing4AI/KiloGrams

The KiloGram algorithm finds the top-k most frequent n-grams, for large values of n, quickly and with fixed memory.
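As a rough illustration of how top-k n-gram counting can stay within a fixed memory budget, the sketch below hashes every byte n-gram into a fixed-size counter table, keeps the k busiest buckets, and then counts exactly only the n-grams that land in those buckets. This is a generic two-pass hashing sketch, not the repository's implementation; the class name `HashedNGramSketch`, the method names, and the simple polynomial hash are all hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration; not the KiloGrams code base.
public class HashedNGramSketch {

    // Map the n bytes starting at offset to a bucket in [0, tableSize).
    static int bucket(byte[] data, int offset, int n, int tableSize) {
        int h = 17;
        for (int i = 0; i < n; i++) {
            h = 31 * h + data[offset + i];
        }
        return Math.floorMod(h, tableSize);
    }

    // Return up to k n-grams, most frequent first.
    static List<String> topK(byte[] data, int n, int tableSize, int k) {
        // Pass 1: count hashed n-grams in a table whose size does not
        // depend on n or on the number of distinct n-grams.
        long[] counts = new long[tableSize];
        for (int i = 0; i + n <= data.length; i++) {
            counts[bucket(data, i, n, tableSize)]++;
        }

        // Keep the k buckets with the highest counts.
        Integer[] order = new Integer[tableSize];
        for (int b = 0; b < tableSize; b++) {
            order[b] = b;
        }
        Arrays.sort(order, (a, b) -> Long.compare(counts[b], counts[a]));
        Set<Integer> keep = new HashSet<>(
                Arrays.asList(order).subList(0, Math.min(k, tableSize)));

        // Pass 2: exact counts, but only for n-grams in a kept bucket.
        Map<String, Long> exact = new HashMap<>();
        for (int i = 0; i + n <= data.length; i++) {
            if (keep.contains(bucket(data, i, n, tableSize))) {
                // ISO-8859-1 maps each byte to exactly one char.
                String gram = new String(data, i, n, StandardCharsets.ISO_8859_1);
                exact.merge(gram, 1L, Long::sum);
            }
        }

        // Sort the surviving n-grams by exact count, descending.
        List<Map.Entry<String, Long>> entries = new ArrayList<>(exact.entrySet());
        entries.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }
}
```

In this sketch the first pass uses memory proportional to `tableSize` only. Hash collisions can let an infrequent n-gram share a busy bucket, which is why the second pass recounts the candidates exactly before ranking them.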

Emerging

This tool helps cybersecurity researchers and data scientists efficiently extract the most frequent N-grams from large collections of files, like malware samples or software executables. You input paths to folders containing your files (e.g., 'goodware' and 'malware'), and it outputs a dataset in a format like libsvm, ready for machine learning analysis. It's designed for those building machine learning models for file classification, particularly in cybersecurity.
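For readers unfamiliar with the libsvm format mentioned above, each line holds a label followed by sparse `index:value` pairs in ascending index order. The sketch below formats one such line. The class name `LibsvmLine`, the label convention (1 for malware, 0 for goodware), and the use of N-gram occurrence counts as feature values are assumptions made for illustration, not details confirmed for this tool.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of the libsvm text format.
public class LibsvmLine {

    // Build "<label> <index>:<value> ..." with indices in ascending order.
    static String format(int label, Map<Integer, Integer> features) {
        StringBuilder sb = new StringBuilder(Integer.toString(label));
        // TreeMap sorts the feature indices; zero values are omitted
        // because the format is sparse.
        for (Map.Entry<Integer, Integer> e : new TreeMap<>(features).entrySet()) {
            if (e.getValue() != 0) {
                sb.append(' ').append(e.getKey()).append(':').append(e.getValue());
            }
        }
        return sb.toString();
    }
}
```

For example, a file labeled 1 in which the N-gram with index 1 occurs twice and the N-gram with index 3 occurs once would be written as `1 1:2 3:1`.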

No commits in the last 6 months.

Use this if you need to generate features for machine learning models by identifying the most frequent N-grams, at very large values of N, across extensive file datasets, especially for malware analysis or binary classification tasks.

Not ideal if you require a user-friendly application with a graphical interface or if you need ongoing support and warranty for production systems.

malware-analysis cybersecurity-research feature-engineering binary-classification data-preparation

Stale (6 months), no package, no dependents

Maintenance 2 / 25

Adoption 5 / 25

Maturity 16 / 25

Community 15 / 25

Language: Java

License: Apache-2.0

Higher-rated alternatives

facebookresearch/stopes

A library for preparing data for machine translation research (monolingual preprocessing,...

Droidtown/ArticutAPI

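API of Articut, a Chinese word segmenter (with semantic part-of-speech tagging): word segmentation, also called tokenization, is the foundation of Chinese information processing. Articut uses no machine learning and needs no data model; relying only on the grammar rules of modern vernacular Chinese, it can achieve...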

rkcosmos/deepcut

A Thai word tokenization library using Deep Neural Network

fukuball/jieba-php

"Jieba" Chinese word segmentation: aiming to be the best PHP component for Chinese word segmentation. / "Jieba" (Chinese for "to stutter") Chinese text segmentation:...

pytorch/text

Models, data loaders and abstractions for language processing, powered by PyTorch
