Add free search for your website. Sign up now! https://webveta.alightservices.com/
Categories
A.I Artificial Intelligence NLP

An introduction to vector embeddings

               Have you ever wondered how machines truly understand the meaning of words? While they lack the nuanced comprehension we humans possess, they’ve learned to represent words as numerical vectors, opening a world of possibilities for tasks like search, translation, and even creative writing.

Cross Post – https://www.linkedin.com/pulse/introduction-vector-embeddings-kanti-kalyan-arumilli-wqg8c

Imagine each word in the English language as a point in a multi-dimensional space. Words with similar meanings cluster together, while those with contrasting meanings reside farther apart. This is the essence of vector embeddings: representing words as numerical vectors, capturing semantic relationships between them.

These vectors aren’t arbitrary; they’re learned through sophisticated machine learning algorithms trained on massive text datasets.

Vector embeddings revolutionize how machines process language because:

Semantic Similarity:  Words with similar meanings have vectors that are close together in the “semantic space.” This allows machines to identify synonyms, antonyms, and even subtle relationships between words.

Contextual Understanding: Capturing the nuanced meaning of a word based on its surrounding words.

Improved Performance: Embedding vectors as input to machine learning models often leads to significant performance gains in tasks like text classification, sentiment analysis, machine translation and neural search.

There are various types such as Dense, Sparse and Late Interaction. In each type there are several models trained on various datasets, fine-tuned on different datasets. The computational expenses and requirements are significantly different. Some models need high cpu, memory yet underperform and some models need lesser cpu and memory and yet perform well. However, based on the dataset and number of tokens used for generation, models trained on same datasets and higher number of tokens usually outperform models trained on same datasets and lower number of tokens.

Here is a very interesting link – https://huggingface.co/spaces/mteb/leaderboard

The above page lists several models, memory requirements, scores for various tasks, size of embeddings generated etc… Most of these models are free under MIT license and some are commercial.

In the past I have written a blog post about https://www.alightservices.com/2024/04/27/how-to-get-text-embeddings-from-meta-llama-using-c-net/ converting llama 2 / 3 into gguf and how to interact using C#.

Most of the free models mentioned in the above leaderboard have gguf and can be directly used from C# or via free HTTP local server for getting embeddings such as ollama, llama.cpp. But some models don’t have gguf, probably some can be converted or some might not. Some models have onnx format available. Some might need python code for generating embeddings. I have tried IronPython. But not suggesting IronPython or any 3rd party wrappers because of less reliability. Here is a blog post mentioning about Python integration from .Net https://www.alightservices.com/2024/04/09/c-net-python-and-nlp-natural-language-processing/

Proud partner of Microsoft for Startups
Proud partner of Microsoft for Startups

Mr. Kanti Kalyan Arumilli

Arumilli Kanti Kalyan, Founder & CEO
Arumilli Kanti Kalyan, Founder & CEO

B.Tech, M.B.A

Facebook

LinkedIn

Threads

Instagram

Youtube

Founder & CEO, Lead Full-Stack .Net developer

ALight Technology And Services Limited

ALight Technologies USA Inc

Youtube

Facebook

LinkedIn

Phone / SMS / WhatsApp on the following 3 numbers:

+91-789-362-6688, +1-480-347-6849, +44-07718-273-964

+44-33-3303-1284 (Preferred number if calling from U.K, No WhatsApp)

kantikalyan@gmail.com, kantikalyan@outlook.com, admin@alightservices.com, kantikalyan.arumilli@alightservices.com, KArumilli2020@student.hult.edu, KantiKArumilli@outlook.com and 3 more rarely used email addresses – hardly once or twice a year.

Categories
Security

Some SSH tips

This blog post is based on what I have learned over the past 1 year after my servers were hacked. This blog post has some tips based on my notes I have noted while reading various other web pages and blogs.

This is not an exhaustive list. I am NOT Linux admin but thought these tips could help some people.

In the /etc/ssh/sshd_config file:

sudo nano /etc/ssh/sshd_config

The following are some important settings:

Include /etc/ssh/sshd_config.d/*.conf

PermitRootLogin no
PasswordAuthentication no
PermitEmptyPasswords no
AllowUsers <user>
Protocol 2
DenyUsers root
MaxSessions 1

Don’t allow password login, don’t allow root, don’t allow empty password, specify users who are allowed, use version 2 of SSH protocol, unless there is a need for multiple people on same server at same time, specify max sessions.

The following can be a seperate file under /etc/ssh/sshd_config.d/ i.e the files get included.

sudo nano /etc/ssh/sshd_config.d/ssh-audit_hardening.conf

The following configuration disables weak ciphers:


Ciphers chacha20-poly1305@openssh.com,aes256-gcm@openssh.com

MACs hmac-sha2-512-etm@openssh.com

HostKeyAlgorithms sk-ssh-ed25519-cert-v01@openssh.com,ssh-ed25519-cert-v01@openssh.com,rsa-sha2-512-cert-v01@openssh.com,sk-ssh-ed25519@openssh.com,ssh-ed25519,rsa-sha2-512 

CASignatureAlgorithms sk-ssh-ed25519@openssh.com,ssh-ed25519,rsa-sha2-512,


HostbasedAcceptedAlgorithms sk-ssh-ed25519-cert-v01@openssh.com,ssh-ed25519-cert-v01@openssh.com,sk-ssh-ed25519@openssh.com,ssh-ed25519,rsa-sha2-512-cert-v01@openssh.com,rsa-sha2-512

PubkeyAcceptedAlgorithms sk-ssh-ed25519-cert-v01@openssh.com,ssh-ed25519-cert-v01@openssh.com,sk-ssh-ed25519@openssh.com,ssh-ed25519,rsa-sha2-512-cert-v01@openssh.com,rsa-sha2-512,rsa-sha2-256-cert-v01@openssh.com

Any algorithms below 256 bit are not used, 256 bit and higher are used in the above configuration.

Mr. Kanti Kalyan Arumilli

Arumilli Kanti Kalyan, Founder & CEO
Arumilli Kanti Kalyan, Founder & CEO

B.Tech, M.B.A

Facebook

LinkedIn

Threads

Instagram

Youtube

Founder & CEO, Lead Full-Stack .Net developer

ALight Technology And Services Limited

ALight Technologies USA Inc

Youtube

Facebook

LinkedIn

Phone / SMS / WhatsApp on the following 3 numbers:

+91-789-362-6688, +1-480-347-6849, +44-07718-273-964

+44-33-3303-1284 (Preferred number if calling from U.K, No WhatsApp)

kantikalyan@gmail.com, kantikalyan@outlook.com, admin@alightservices.com, kantikalyan.arumilli@alightservices.com, KArumilli2020@student.hult.edu, KantiKArumilli@outlook.com and 3 more rarely used email addresses – hardly once or twice a year.