SMILES Pair Encoding (JCIM) first learns a vocabulary of high frequency SMILES substrings from a large chemical dataset (e.g., ChEMBL) and then tokenizes SMILES based on the learned vocabulary for ...
CodeSim is a research toolkit that implements and benchmarks 23 different unsupervised similarity measures for detecting code clones in Java source code. This work addresses the critical challenge of ...