
Big, Brainy, and Bold: The Rise of LLMs

Language Models

ChatGPT

Written by Manu, Abhishek and Jayant

15 May 23

Once upon a time, in the not-so-distant past, language models (LMs) were just learning to crawl. Today, they're sprinting at breakneck speed, leaving us humans behind. Enter the world of Large Language Models (LLMs), massive AI systems that can understand the context and meaning behind words, generate human-like text, answer complex questions, and even write entire articles (unlike this one, which is almost fully written by a human).

The “BIG”

They're like the cool colleagues who can effortlessly write complex code, crack jokes (some of which are even funny!), and even help with your homework and essays (wink wink). But what sets LLMs apart from their smaller counterparts? Size matters, my friends. The “Large” in LLMs is not just hyperbole. GPT-3, the model behind OpenAI's ChatGPT, has 175 billion parameters. Huawei researchers have recently developed LLMs with over a trillion parameters. As for GPT-4, it is so large that OpenAI won't even tell us its parameter count!

These parameters are, roughly speaking, the building blocks of their knowledge, knowledge that allows them to perform tasks like sentiment analysis, language translation, and even power personal-assistant-like chatbots. But the hype around LLMs is not just about what they can do, but also about what they might be able to do. LLMs can be used in fields like healthcare, finance, education, and litigation, where they can help with tasks like medical diagnosis, financial analysis, language learning, and even simplifying complex legal documents into layman's terms. They can also be used to create chatbots, virtual assistants, highly personalized educators, and even video game characters.

The “BRAINY”

Let's dive into the science behind these behemoths (or rather just dip our toes in, as what follows is a hyper-simplification of the inner workings of the most complex machines yet created by humans). Imagine LLMs as giant sponges, soaking up vast amounts of text from the internet and breaking that text down into smaller units called tokens, much like how sea sponges filter large amounts of water for plankton and break it down into simple sugars. These tokens are then fed into a neural network, which learns to predict the probability of the next word in a sequence based on the previous words.
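
Here is a minimal sketch of that next-token prediction, assuming the Hugging Face transformers library and the small, openly available GPT-2 checkpoint (the GPT-3/GPT-4 models discussed in this article are not openly downloadable):

```python
# Hedged sketch: ask a small open LLM for its next-token probabilities.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

text = "Large language models break text into"
inputs = tokenizer(text, return_tensors="pt")   # text -> token ids
with torch.no_grad():
    logits = model(**inputs).logits             # one score per vocabulary token

# Probability distribution over the next token, given everything so far.
probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(probs, 5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(idx)):>12s}  {p:.3f}")
```

Generating text is then just a loop: pick (or sample) one of these candidate tokens, append it to the input, and ask for the next token again.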

There is an ongoing discourse arguing that LLMs are not that impressive, since all they are doing is predicting the next words from the words that came before. That said, it is worth pausing to ask: aren't we humans doing much the same thing?

The “BOLD”

AI philosophy aside, the evolution of LLMs has been nothing short of breathtaking. In just a few years, they have evolved from basic language models that could barely string a sentence together to sophisticated LLMs like GPT-4, which can write entire articles (like this one!). Following the rapid-fire breakthroughs in language models has been akin to watching a baby go from babbling to reciting Shakespeare in a day.

So, what's next for these linguistic titans? Honestly, it's hard to predict. The capabilities of GPT-4, for example, surprised even its researchers, to the point that they decided to fine-tune existing technologies before moving on to developing GPT-5.

Once Spider-Man Said…

“With great power comes great responsibility” and therein lies the problem with LLMs. The potential for misuse of LLMs can be scary to think about. They can be used to spread misinformation, generate fake news, or even create deepfake content. To address these concerns, researchers and developers are working on ways to make LLMs more transparent, accountable, and ethical. It's like teaching these AI prodigies not just to be smart, but also to be good citizens.

In a nutshell, the rise of Large Language Models has been a thrilling roller coaster ride, filled with jaw-dropping advancements and mind-boggling potential. As we continue to explore the possibilities and address the challenges, one thing is certain: LLMs are here to stay, and they're ready to save the world, one word at a time!

  • OpenAI Presents GPT-3, a 175 Billion Parameters Language Model. Accessed May 4, 2023.
  • Huawei has created the world's largest Chinese language model. Accessed May 4, 2023.
  • OpenAI's GPT-4 Is Closed Source and Shrouded in Secrecy - VICE. Accessed May 4, 2023.
  • NEXT-LEVEL NARRATIVES, GAMES AND EXPERIENCES. 13 Apr. 2023.

Worried About Insider Threats?

Cyber Security

Data security

Written by Jayant, Bhavya

15 May 23

In the changing dynamics of the world economy, data and information have become priceless possessions. According to an article in The Economist[1], the world's most valuable resource is no longer oil, but data. With data becoming such a valuable resource, securing it and ensuring that it is not misused has become a matter of grave concern. Hence, it is imperative to stay a step ahead of our adversaries and look out for the security problems associated with storing and handling data.

Cyber security is the convergence of people, processes, and technology to protect organizations, individuals, and networks from digital attacks. It is comparatively easy to prevent conventional cyber attacks, like phishing and malware, but stopping an insider attack is an incredibly daunting task. Insider attacks originate within the organization, and the attackers are generally closely associated with the workplace, whether directly, indirectly, physically, or logically. Insider attacks are among the most underestimated attacks in cybersecurity, yet preventing them is extremely challenging: because such attacks are rare anomalies, the available datasets are heavily imbalanced, and there is simply not enough labeled attack data to train a model on.

Applying machine learning to cybersecurity and data security has always been a challenge, and the scarcity of annotated data aggravates this challenge further. Moreover, the lack of balanced datasets makes machine learning all the more difficult. In the past, techniques such as random oversampling, undersampling, and SMOTE were used to balance the dataset, and synthetic data was created to handle the skew. However, none of those techniques proved fully effective.
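
To make the rebalancing idea concrete, here is a minimal sketch of one of those classic techniques, SMOTE, assuming the scikit-learn and imbalanced-learn libraries; the synthetic 1%-malicious dataset below is invented for illustration.

```python
# Hedged sketch of classic oversampling with SMOTE on an imbalanced dataset.
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic stand-in for an activity-log feature matrix: ~1% "insider" samples.
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.99, 0.01], random_state=0)
print("before:", Counter(y))

# SMOTE synthesizes new minority-class points by interpolating between neighbours.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))
```

SMOTE balances the class counts, but interpolated points do not necessarily look like realistic insider behavior, which is part of what motivates the GAN-based approach described next.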

We, at EZ, work relentlessly to improve and devise new techniques, so that our clients can rest assured about the security of the valuable information they entrust us with. Recently, while reading a paper on cybersecurity and deep learning[2], we found a new way to detect and prevent insider attacks. The proposed solution is split into three parts: behavior extraction, conditional GAN-based data augmentation, and anomaly detection.

In behavior extraction, features are extracted from the dataset. Context-based behavior profiling is used, in which each user is profiled from their entire activity log, with all the features contributing to the user's behavior profile. Then, a Conditional Generative Adversarial Network (CGAN) is used to generate additional data and reduce the negative effect of the skewed distribution. GAN models consist of two parts: a generator and a discriminator. In the network, the discriminator (D) tries to distinguish whether a sample comes from the real distribution, while the generator (G) produces synthetic data and tries to fool the discriminator. The research paper uses fully connected neural networks for both the generator and the discriminator.
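
The following is a minimal PyTorch sketch of that conditional GAN idea. Both networks are fully connected, as in the paper, but the layer sizes, feature dimension, number of classes, and training hyperparameters are illustrative assumptions rather than values taken from the paper.

```python
# Hedged CGAN sketch: fully connected generator and discriminator, both
# conditioned on a one-hot class label (e.g. normal vs. insider behavior).
import torch
import torch.nn as nn

NOISE_DIM, FEAT_DIM, NUM_CLASSES = 64, 40, 2   # invented dimensions

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        # noise + class label -> synthetic behavior feature vector
        self.net = nn.Sequential(
            nn.Linear(NOISE_DIM + NUM_CLASSES, 128), nn.ReLU(),
            nn.Linear(128, FEAT_DIM),
        )
    def forward(self, z, y):
        return self.net(torch.cat([z, y], dim=1))

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        # behavior features + class label -> probability the sample is real
        self.net = nn.Sequential(
            nn.Linear(FEAT_DIM + NUM_CLASSES, 128), nn.LeakyReLU(0.2),
            nn.Linear(128, 1), nn.Sigmoid(),
        )
    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=1))

G, D = Generator(), Discriminator()
bce = nn.BCELoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

# One illustrative training step on a random "real" minority-class batch.
real = torch.randn(32, FEAT_DIM)
labels = nn.functional.one_hot(torch.ones(32, dtype=torch.long), NUM_CLASSES).float()

# Discriminator step: push real samples towards 1 and generated ones towards 0.
fake = G(torch.randn(32, NOISE_DIM), labels).detach()
d_loss = bce(D(real, labels), torch.ones(32, 1)) + bce(D(fake, labels), torch.zeros(32, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: try to make the discriminator believe generated samples are real.
g_loss = bce(D(G(torch.randn(32, NOISE_DIM), labels), labels), torch.ones(32, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

Once trained, the generator can be asked for synthetic samples of the rare insider class, giving the downstream classifier a less skewed training set.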

The final part of the proposed solution is to use multiclass classification instead of binary classification. Anomaly detection based on multiclass classification treats the labeled training samples as multiple normal, non-malicious classes, and the multinomial classifier tries to discriminate anomalous samples from these classes, which helps in building a more robust detector. An additional benefit of multiclass classification is that if a new kind of insider activity emerges, no changes to the existing framework are needed. The paper uses t-distributed Stochastic Neighbor Embedding (t-SNE), a manifold-learning-based visualization method, to perform a qualitative analysis of the generated data. XGBoost, MLP, and 1-D CNN models were evaluated, and XGBoost performed best across all the datasets considered.
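
One plausible reading of that scheme is sketched below with XGBoost and scikit-learn: the normal data is grouped into several "role" classes, and a test sample is flagged when the classifier cannot assign it confidently to any of them. The role classes, the synthetic data, and the 0.5 confidence threshold are illustrative assumptions, not values from the paper.

```python
# Hedged sketch of multiclass-classification-based anomaly scoring.
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Stand-in behavior features, labelled with three *normal* user-role classes.
X, y = make_classification(n_samples=3000, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, _ = train_test_split(X, y, random_state=0)

clf = xgb.XGBClassifier(objective="multi:softprob")
clf.fit(X_train, y_train)

# A sample the model cannot confidently place in any normal class is treated
# as a potential insider anomaly.
proba = clf.predict_proba(X_test)
anomalous = np.max(proba, axis=1) < 0.5
print(f"flagged {anomalous.sum()} of {len(X_test)} samples")
```

A t-SNE projection (for example, scikit-learn's sklearn.manifold.TSNE) of real versus CGAN-generated samples can then serve as the qualitative check the paper describes.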

Intrigued to know more about cyber security and the unconventional ways to prevent insider attacks? Read the reference articles provided below:

  • Mayra Macas, & Chunming Wu. (2020). Review: Deep Learning Methods for Cybersecurity and Intrusion Detection Systems.
  • Gautam Raj Mode, & Khaza Anuarul Hoque. (2020). Crafting Adversarial Examples for Deep Learning-Based Prognostics (Extended Version).
  • Ishai Rosenberg, Asaf Shabtai, Yuval Elovici, & Lior Rokach. (2021). Adversarial Machine Learning Attacks and Defense Methods in the Cyber Security Domain.
  • Li, D., & Li, Q. (2020). Adversarial Deep Ensemble: Evasion Attacks and Defenses for Malware Detection. IEEE Transactions on Information Forensics and Security, 15, 3886–3900.
  • Simran K, Prathiksha Balakrishna, Vinayakumar Ravi, & Soman KP. (2020). Deep Learning-based Frameworks for Handling Imbalance in DGA, Email, and URL Data Analysis.

Is Facial Recognition Biased?

Artificial Intelligence

Big Data

Facial Recognition

Written by Jayant, Bhavya

23 Nov 21

Mankind has witnessed three industrial revolutions, starting with the development of the steam engine, followed by electricity and digital computing. We are on the verge of a fourth industrial revolution, one that will be driven primarily by Artificial Intelligence and Big Data. Artificial Intelligence relies heavily on data to develop algorithms that can reason about, and ultimately make, the decisions we delegate to intelligent computer systems.

Face Recognition: Modern Day Biometric Security Solution

The advent of these advanced technologies has given us various techniques for security solutions that prevent unauthorized access to precious data, providing a sense of security to our clients. However, selecting the appropriate biometric security solution has become a major decision for businesses and enterprises across industries. One biometric security system that has arrived under the umbrella of Artificial Intelligence is facial recognition.

With its ease of implementation and widespread adoption, face recognition is rapidly becoming the go-to choice for modern biometric solutions. Facial recognition is a modern-day biometric technology developed to recognize a human face without requiring any physical contact. Facial recognition algorithms are designed to match the facial features of a person against the images or facial data stored in a database.

Facial Recognition: The Next Big Thing?

Research and studies on facial recognition have been conducted for many years now, but there has been unprecedented growth in its real-world deployment. The technology has become so efficient that we can now unlock our phones with our faces. Countries have also started using facial recognition for surveillance, to track down criminals and prevent crime. Tracking down criminals seems to have become almost too easy: all we need to do is set up a camera in a public space and check whether any known criminal or suspicious person shows up.

Recent Studies Suggest Otherwise

Recent studies and research have suggested that the leading facial recognition software packages are biased. Yes, you read that right! Leading facial recognition packages tend to be more accurate for white, male faces than for people of color or for women.

A 2019 study found that many commercial algorithms currently used for surveillance show a high false positive rate for minority communities. There have been cases around the world where innocent people were arrested because of false positives produced by these surveillance systems. One such incident happened in January 2020 in Detroit, when police used facial recognition technology on surveillance footage of a theft and falsely arrested a Black man.

Let us try to identify what lies at the core of this biased behavior of facial recognition software. Facial recognition applications are broadly divided into two parts: verification and identification.

  • Verification confirms that a captured faceprint matches a single stored faceprint. This is the mode typically used at airports and to unlock your smartphone.
  • The verification part of facial recognition is not biased; in fact it is extremely accurate, and here artificial intelligence is as skillful as the sharpest-eyed humans.
  • The real issue is the identification part of facial recognition, which searches one face against many and is what gets used for surveillance (see the sketch after this list).
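
The toy sketch below illustrates the difference, assuming faces have already been turned into embedding vectors ("faceprints") by some face recognition model; the vectors, the gallery, and the 0.6 similarity threshold are all made up.

```python
# Hedged sketch: verification (one-to-one) vs. identification (one-to-many).
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
enrolled = rng.normal(size=128)                      # faceprint stored at enrolment
probe = enrolled + rng.normal(scale=0.1, size=128)   # new capture of the same person

# Verification: compare the probe against a single stored faceprint.
print("verified:", cosine_similarity(probe, enrolled) > 0.6)

# Identification: search the probe against a whole gallery; false matches
# become far more likely as the gallery (or the crowd) grows.
gallery = {f"person_{i}": rng.normal(size=128) for i in range(1000)}
gallery["enrolled"] = enrolled
best = max(gallery, key=lambda name: cosine_similarity(probe, gallery[name]))
print("best match:", best)
```

Verification answers "is this the same person?" once; identification repeats that comparison across thousands of faces, so even a small false-match rate multiplies quickly.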

Disparate False Positive Rates

A false positive rate of 60 per 10,000 samples for minority groups might not seem like much, but compare it with a false positive rate of under 5 per 10,000 samples for white faces and the gap is clear. The false positive rate of an identification model needs to be minimal, because identification is typically used for crowd surveillance. If you are using facial recognition for crowd surveillance and you are monitoring around 5,000 people a day, you could easily end up with hundreds of people being falsely accused[1].
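
A quick back-of-the-envelope calculation shows why those rates matter; the 5,000-people-a-day figure comes from the paragraph above, while the 30-day extrapolation is an assumption.

```python
# Hedged sketch: expected false matches from the rates quoted above.
people_per_day = 5000
fpr_minority = 60 / 10_000    # false positives per sample, minority groups
fpr_majority = 5 / 10_000     # false positives per sample, white faces

for label, fpr in [("minority", fpr_minority), ("majority", fpr_majority)]:
    per_day = people_per_day * fpr
    print(f"{label}: ~{per_day:.0f} false matches/day, ~{per_day * 30:.0f} per month")
```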

Once the issue was identified, AI researchers started working on solutions to the biases present in these facial recognition models. In June 2020, IBM announced it would no longer offer a facial recognition service, while other providers acknowledged the issue and started working on fixes[2]. There has also been public backlash against crowd surveillance.

The reason there is such a high false positive rate in facial recognition for minority groups is that the data on which these models were trained had an uneven distribution of faces across racial groups.

To avoid such errors, new databases and techniques have been used:

  • Feature-space augmentation of underrepresented classes has been used to make the datasets more balanced (a simple rebalancing sketch follows this list).
  • Recently, Generative Adversarial Networks (GAN) were also trained to generate face features to augment classes with fewer samples.
  • People have also started shifting to more balanced datasets like Racial Faces in the Wild (RFW) and Balanced Faces In the Wild (BFW) to reduce the bias[3].
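
As a simple illustration of rebalancing during training, complementary to the dataset and GAN approaches above, here is a hedged PyTorch sketch using weighted sampling; the group sizes and the random "embeddings" are invented.

```python
# Hedged sketch: oversample the underrepresented group so each batch is balanced.
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Pretend face-embedding dataset: 9,000 samples of group 0, 1,000 of group 1.
features = torch.randn(10_000, 128)
groups = torch.cat([torch.zeros(9_000, dtype=torch.long),
                    torch.ones(1_000, dtype=torch.long)])

# Sample each item with probability inversely proportional to its group size.
counts = torch.bincount(groups).float()
weights = (1.0 / counts)[groups]
sampler = WeightedRandomSampler(weights, num_samples=len(groups), replacement=True)
loader = DataLoader(TensorDataset(features, groups), batch_size=64, sampler=sampler)

batch_feats, batch_groups = next(iter(loader))
print(torch.bincount(batch_groups))   # roughly equal counts per group
```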

There has been great improvement in facial recognition accuracy in the past few years. Researchers have built better models and constructed better datasets to deliver highly accurate, low-bias models. Big service providers have acknowledged the problem and are constantly researching ways to create accurate surveillance models. The future of facial recognition now seems brighter, as awareness among service providers and clients about the drawbacks of this technology has increased.


Known Security Issues in the Python Dependency Management System and How to Tackle Them

Python Programming

Python Package

Written by Jayant, Anjali

23 Nov 21

We, at EZ, believe that the purpose of technology is to assist us, not replace us. Therefore, before becoming dependent on any programming language, we study its flaws and make conscious efforts to overcome them. As a programming language, Python provides us with innumerable libraries and frameworks, a mature and supportive community, versatility, efficiency, reliability, speed, and more. We work with Python so extensively that its security flaws often get overlooked. Read the blog below to learn about the security loopholes found in the PyPI ecosystem, and how we can overcome them.

What is PIP?

  • PIP, the package installer for Python, is the default Python package manager; it provides a platform for developers to share and reuse code written by third-party developers.
  • PIP supports downloading packages from PyPI.org, the package repository for the Python programming language. PyPI helps in finding and installing packages and software for Python.
  • By design, the PyPI ecosystem allows any arbitrary user to share and reuse Python software packages, which, along with their dependencies, are downloaded recursively by PIP.

Security risks during the installation of Python packages

Bagmar et al. provided a detailed study of the security threats in the Python ecosystem, which is largely based on the PyPI repository database.

  • Two Python files give a package the chance to run code on your machine: setup.py is executed when PIP installs the package, and __init__.py is executed when the package is imported.
  • Along with these, arbitrary Python code, which may contain exploits, can be executed at various points of this install-and-import cycle (a deliberately harmless setup.py demonstrating this follows the list).
  • Exploits are delivered in two main modes:
    1. Directly from the source, using editable-mode installation and then importing the malicious package.
    2. Installation using sudo (administrator) privileges.
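
To see why install-time execution matters, here is a deliberately harmless setup.py sketch; the package name is hypothetical, and a real attack would hide its payload rather than print it. For source installs (an sdist, or `pip install .`), this file runs on the installing machine with the installer's privileges, which is exactly why sudo installs are so dangerous; wheel installs skip this step.

```python
# setup.py -- harmless illustration: any module-level code here runs during a
# source install, before the user ever imports the package.
from setuptools import setup

# An attacker could replace this print with code that reads credentials or
# edits startup scripts; installed with sudo, it would run as root.
print("setup.py is executing arbitrary code at install time")

setup(
    name="totally-harmless-demo",   # hypothetical package name
    version="0.0.1",
    py_modules=[],
)
```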

Factors that help determine the impact of exploiting Python packages

There are four main factors that can help us understand the impact of exploiting Python packages, which are given below:

  • Package Reach: the number of other packages that require a given package, directly or transitively. The higher a package's reach, the larger the attack surface if it is compromised or turns malicious (a toy reach computation follows this list).
  • Maintainer Reach: the combined reach of all the packages a maintainer owns. Influential maintainers are prime targets for security attacks.
  • Implicitly Trusted Packages: the number of distinct packages traversed along the longest dependency path from a given starting package. The more packages you implicitly trust, the higher the security risk.
  • Implicitly Trusted Maintainers: a vulnerability score based on the accounts of the maintainers of those implicitly trusted packages.
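
As a toy illustration of the first of these metrics, here is a hedged sketch of computing package reach on an invented dependency graph, assuming the networkx library; a real analysis would build the graph from PyPI metadata.

```python
# Hedged sketch: "reach" of a package on a toy dependency graph.
import networkx as nx

# Edge A -> B means "A depends on B"; packages and edges are invented.
deps = nx.DiGraph([
    ("app1", "requests"), ("app2", "requests"),
    ("requests", "urllib3"), ("requests", "idna"),
    ("app3", "urllib3"),
])

# Reach of a package = everything that depends on it, directly or transitively,
# i.e. everything that would inherit a compromise of that package.
for pkg in ["urllib3", "requests", "idna"]:
    reach = nx.ancestors(deps, pkg)
    print(f"{pkg}: reach {len(reach)} -> {sorted(reach)}")
```

In this toy graph, compromising urllib3 would affect requests and every application that pulls it in, which is why high-reach packages and their maintainers are such attractive targets.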

Most common Python Package Impersonation Attacks

Package impersonation attacks are user-centric attacks that aim at tricking users into downloading a malicious package.

There are various ways of fooling users into downloading malicious packages, some of which are given below:

  • TypoSquatting: registering a name with intentional minor spelling mistakes, imitating a popular package (a toy typosquat check follows this list).
  • Altering Word Order: changing the order of the words in a package's name.
  • Python 3 vs Python 2: adding the number “3” to a package name to imitate the original package's Python 3 variant.
  • Removing Hyphenation: dropping the hyphen from a genuine package's name.
  • Built-In Packages: uploading packages to PyPI that carry the names of Python built-in modules, which users may install even though the built-ins never need to be downloaded; there are multiple instances of such packages on PyPI.
  • Jellyfish Attack: hiding a TypoSquat package as an import inside an otherwise legitimate-looking package, so the malicious code is pulled in indirectly.
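
As a rough illustration of how typosquats can be caught, here is a sketch using only the Python standard library; the list of "popular" names and the 0.85 similarity threshold are invented, and real tooling would work from PyPI download statistics.

```python
# Hedged sketch: flag package names suspiciously close to popular ones.
import difflib

POPULAR = ["requests", "urllib3", "numpy", "python-dateutil", "jellyfish"]

def suspicious(candidate, threshold=0.85):
    """Return the popular package a candidate name imitates, if any."""
    for known in POPULAR:
        ratio = difflib.SequenceMatcher(None, candidate, known).ratio()
        if candidate != known and ratio >= threshold:
            return known, round(ratio, 2)
    return None

for name in ["request", "urlib3", "jeIlyfish", "numpy"]:
    print(f"{name:>10s} -> {suspicious(name)}")
```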

License violations in the PyPI ecosystem

PyPI does not perform any automated checks for OSS license violations. A violation can occur when a package imports another package that has a less permissive license.
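
Since PyPI performs no such check, one small step a team can take is to at least inventory the licenses declared by its installed packages. The sketch below uses only the standard library's importlib.metadata (Python 3.8+); note that the License field is free text, which is precisely the weakness raised in the preventive measures below.

```python
# Hedged sketch: list the declared license of every installed distribution.
from importlib import metadata

for dist in sorted(metadata.distributions(), key=lambda d: d.metadata["Name"] or ""):
    name = dist.metadata["Name"]
    license_field = dist.metadata["License"] or "UNKNOWN"
    print(f"{name}: {license_field.splitlines()[0][:60]}")
```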

Suggested Preventive Measures

  • There should be strict enforcement requiring dependencies to be specified in the metadata of uploaded packages.
  • A permission model, similar to the one used on mobile phones, could be implemented for package installation.
  • Having a trusted-package or trusted-maintainer badge on popular packages might be helpful.
  • Installation statistics could be shown while installing packages.
  • The license field must not be free text.
