AI Breakthrough: Decoding Foundation Models

Artificial Intelligence

Written by Jayant Jha

15 Feb 24

AI is undergoing a paradigm shift: deep neural networks are pre-trained on broad, extensive training data using self-supervised learning and then fine-tuned for a wide range of downstream tasks. These pre-trained models (BERT, BART, DALL-E, or the GPT family) are usually called foundation models. They provide a strong basis for solving downstream tasks such as text classification, textual entailment, image classification, and more.

Due to the emergence of foundation models and their new emergent capabilities, significant interest has developed in Artificial Intelligence in utilizing these models to revolutionize various fields, from natural language processing to computer vision. However, alongside their promises and incentives, foundation models also bring forth a host of challenges and risks that necessitate careful consideration.

Emergence and homogenization are two key concepts that play a significant role in the development and impact of foundation models and AI systems in general. Emergence is a source of both excitement and concern in AI research: it can lead to unexpected and novel capabilities that were not explicitly programmed or anticipated during the design phase. For instance, the ability of GPT-3 to perform a wide range of tasks through natural language prompts is an emergent property it was never explicitly trained for. Homogenization refers to the consolidation of methodologies for building AI systems across different applications and domains, standardizing practices, architectures, or techniques into a unified foundation that can be leveraged for various tasks. However, homogenization also poses the risk of a single point of failure for AI systems.

This blog will outline some of the incentives, opportunities, and risks of utilizing foundation models.

Key Incentives of Using Foundation Models

Task and Domain Adaptation: Foundation models are like building blocks that can be easily adjusted to solve different problems, making AI more versatile and useful in many situations. Unlike "one-size-fits-all" solutions, these models can be adapted to fit the specific needs of different industries and tasks.

Transfer Learning: Knowledge acquired during large-scale pre-training transfers to new problems. Instead of training a model from scratch, businesses can take a foundation model as a pre-trained starting point and fine-tune it on a comparatively small, task-specific dataset (a brief fine-tuning sketch follows this list of incentives).

Cost and Time Savings: Because foundation models come pre-trained, companies can jump-start AI projects without massive compute budgets or months of development time. Adapting an existing model is typically far cheaper and faster than building an equivalent system from scratch.

Data Annotation: Traditional supervised AI is a data hog, needing mountains of labeled examples to learn. Foundation models, thanks to their pre-trained knowledge, need far fewer labeled examples to excel at a new task, which is a game-changer when data is scarce, expensive, or slow to annotate.

Versatility: Instead of building separate AI solutions for each data type, foundation models offer a multimodal powerhouse. They can handle text analysis, image recognition, and even combined tasks, streamlining your AI efforts.
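To make the transfer-learning and adaptation incentives above concrete, the sketch below shows how a pre-trained foundation model can be fine-tuned for a downstream sentiment-classification task. This is a minimal sketch, assuming the Hugging Face transformers and PyTorch libraries are installed; the model name, example texts, labels, and hyperparameters are illustrative placeholders, not a recommended recipe.

```python
# Minimal fine-tuning sketch: adapt a pre-trained BERT model to a
# two-class sentiment task (illustrative data and hyperparameters).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Hypothetical labelled examples for the downstream task.
texts = ["The product works flawlessly.", "Support never replied to me."]
labels = torch.tensor([1, 0])  # 1 = positive, 0 = negative

inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):  # a handful of passes is often enough for adaptation
    optimizer.zero_grad()
    outputs = model(**inputs, labels=labels)  # classification head computes the loss
    outputs.loss.backward()
    optimizer.step()
```

The same pre-trained backbone could be reused for a different downstream task by swapping the head and the labelled examples, which is the core of the cost and time savings described above.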

Key Opportunities of Foundation Models

Enhanced Adaptation Performance: Foundation models signify a paradigm shift where massive amounts of data are utilized to enhance adaptation performance significantly. The overarching principle of "the more data, the better" underscores the potential for improved model adaptation and performance.

Multimodal Integration: Foundation models enable data integration across new modalities, such as robotics and healthcare, expanding the scope of applications and capabilities in diverse domains.

Language Understanding and Generation: Foundation models exhibit exceptional proficiency in understanding and generating human language. This capability empowers applications in translation, summarization, conversational interfaces, and various language-related tasks.

Comprehensive Vision Capabilities: In the realm of computer vision, foundation models have shown promise in leveraging RGB-3D data to comprehend indoor environments, paving the way for advancements in visual understanding and scene analysis.

Improved Content Creation: Foundation models have the potential to generate content that looks like it has been created by humans, enabling the creation of high-quality text and images across a wide range of languages. This capability can be harnessed for various creative and communicative purposes.

Empowering AI Applications: The capabilities of foundation models extend to diverse fields such as law, healthcare, and education, offering opportunities to enhance existing applications and develop innovative solutions in these domains.

Scalability and Efficiency: Foundation models, with their large-scale architecture and training procedures, provide scalability and efficiency in handling complex tasks and datasets, making them suitable for a wide range of applications.

Risks of Using Foundation Models

Bias and Fairness Concerns: Foundation models can adopt biases embedded in the training data, resulting in biased outcomes and perpetuating societal inequalities. Addressing bias and ensuring fairness in model predictions is crucial to prevent discriminatory practices.

Lack of Transparency: Foundation models are often complex and difficult to interpret, making it challenging to explain how they arrive at specific decisions or predictions. This lack of transparency can impede trust in the model's outputs and raise concerns about accountability.

Data Privacy and Security: Foundation models require vast amounts of data for training, raising concerns about data privacy and security. Unauthorized access to sensitive data used in training foundation models can lead to privacy breaches and data misuse.

Environmental Impact: The training and deployment of foundation models consume significant computational resources, contributing to environmental concerns related to energy consumption and carbon emissions. Addressing the environmental impact of foundation models is essential for sustainable AI development.

Robustness and Generalization: Foundation models, due to their complexity and scale, may exhibit vulnerabilities and lack robustness in certain scenarios. Ensuring the robustness and generalization of foundation models across diverse use cases is crucial to prevent unexpected failures.

Ethical Considerations: The widespread deployment of foundation models raises ethical dilemmas related to the responsible use of AI technologies. Ethical considerations such as transparency, accountability, and fairness must be prioritized to mitigate potential harm and ensure ethical AI practices.

Misuse and Misalignment: Foundation models can be susceptible to misuse or optimization for misaligned goals, leading to unintended consequences or ethical dilemmas. Safeguarding against the misuse of foundation models and aligning their objectives with societal values is essential for ethical AI development.

As we navigate the incentives, opportunities, and risks inherent in foundation models, it becomes clear that their development requires a nuanced understanding of training, data, and evaluation methodologies.

  • "[1810.04805] BERT: Pre-training of Deep Bidirectional Transformers ...." 11 Oct. 2018, https://arxiv.org/abs/1810.04805. Accessed 7 Feb. 2024.
  • "BART: Denoising Sequence-to-Sequence Pre-training for Natural ...." 29 Oct. 2019, https://arxiv.org/abs/1910.13461. Accessed 7 Feb. 2024.
  • "DALL·E 3 - OpenAI." https://openai.com/dall-e-3. Accessed 7 Feb. 2024.
  • "GPT-4 - OpenAI." 13 Mar. 2023, https://openai.com/gpt-4. Accessed 7 Feb. 2024.
  • "On the Opportunities and Risks of Foundation Models - arXiv." 16 Aug. 2021, https://arxiv.org/abs/2108.07258. Accessed 7 Feb. 2024.
  • "Are We Modeling the Task or the Annotator? An Investigation ... - arXiv." 21 Aug. 2019, https://arxiv.org/abs/1908.07898. Accessed 7 Feb. 2024.
  • "On the Opportunities and Risks of Foundation Models - arXiv." https://arxiv.org/abs/2108.07258. Accessed 7 Feb. 2024.
  • "Foundation models: Opportunities, risks and mitigations - IBM." https://www.ibm.com/downloads/cas/E5KE5KRZ. Accessed 7 Feb. 2024.

The Rise of Delete Culture and the Need for Machine Unlearning

Rise of Delete Culture

Written by Abhishek and Jayant

07 Jul 23

Introduction

In recent years, there has been a growing trend of people deleting their social media accounts and online profiles in an effort to regain control of their data and privacy. This "delete culture" movement has seen users leaving platforms like Facebook, Reddit, and others due to data leaks, privacy concerns, and unwanted policy changes. For example, after the Cambridge Analytica scandal revealed that up to 87 million Facebook users had their data improperly shared, many decided to #DeleteFacebook. Similarly, changes to Reddit's API and its policies around third-party apps have led many users to abandon the platform.

As people become more aware of the misuse of their personal data in the virtual world, the desire to delete one's digital footprint is understandable. However, deleting an account is not as simple as pressing a button. Our data persists in many forms, from cached copies on other servers to machine learning models that have analyzed our profiles. People need control not just over their data, but also over how it is used to train algorithms. Most important of all is the trust that must be established through the responsible use of AI systems.

Ethical and responsible AI is a growing field of study that focuses on the moral and ethical implications of the design, development, and use of artificial intelligence systems. More empirical and qualitative study is needed on the impact of AI and its applications on individuals, society, and the environment, particularly with respect to open-ended issues such as bias, fairness, transparency, accountability, and privacy. Machine unlearning and responsible design patterns in AI applications are one potential route to preserving privacy in AI and its applications.

Machine Unlearning

Machine unlearning refers to the process of removing data from AI and machine learning models. The goal is to induce "selective amnesia" so that the models forget specific people or types of information, without compromising the model's overall performance. Apart from enforcing data privacy on a deeper, more meaningful level, there are other benefits of this procedure too:

  • Improving data security by eliminating vulnerabilities from machine learning models
  • Increasing trust in AI systems by providing users with more transparency and control over their data.
  • Reducing bias in these systems by addressing data imbalances.
  • Supporting privacy regulations like GDPR's data deletion requirements.

But wait, how can a machine ‘unlearn’? After all, it’s much easier to teach a child something than to make them forget it.

There are actually some intuitive ways to make a model ‘unlearn’ the information it learned from a user’s data. Some of these are:

  • Model retraining: This involves retraining the machine learning model from scratch on the dataset with the deleted data removed. This is computationally expensive but ensures the data is fully removed from the model.
  • Data poisoning: This modifies or removes the deleted data in a way that makes the model's predictions for that data meaningless, essentially corrupting the knowledge it gained from the data.
  • Differential privacy: Noise is added to data before training a model so that no individual's data can be identified, allowing for data deletion at a later point (a toy sketch of the noise-addition step follows below).
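Here is a toy sketch of that noise-addition idea, assuming only NumPy; the feature values, sensitivity, and epsilon are hypothetical and chosen purely for illustration.

```python
# Toy Laplace mechanism: add calibrated noise to a feature before training
# so no single person's exact value is exposed to the model.
import numpy as np

def laplace_mechanism(values, sensitivity, epsilon):
    # Noise scale grows with sensitivity and shrinks as the privacy budget (epsilon) grows.
    scale = sensitivity / epsilon
    return values + np.random.laplace(loc=0.0, scale=scale, size=values.shape)

ages = np.array([23.0, 37.0, 45.0, 29.0])                 # hypothetical raw feature
noisy_ages = laplace_mechanism(ages, sensitivity=1.0, epsilon=0.5)
# The model is trained on noisy_ages rather than the raw values.
```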

However, these relatively simple strategies may not be effective for the gigantically complex models of today. There are several approaches to help these massive machines forget what they've learned from our data:

  • SISA Training: This method strategically limits the influence of each data point in the training procedure, expediting the unlearning process. SISA training reduces the computational overhead associated with unlearning, even in the worst-case scenario where unlearning requests are spread uniformly across the training set (a simplified sketch follows this list).
  • Data Partitioning and Ordering: By taking the distribution of unlearning requests into account, data can be partitioned and ordered so as to further decrease the overhead of unlearning. It's like organizing a messy room, making it easier to find and remove specific items when needed.
  • Transfer Learning: This technique uses pre-trained models to speed up the retraining process after unlearning.
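To see how SISA-style unlearning limits retraining cost, consider the simplified sketch below: the training data is split into shards, one model is trained per shard, predictions are aggregated by majority vote, and deleting a data point only requires retraining the shard that contained it. This is a rough approximation of the idea, assuming scikit-learn and NumPy, with a synthetic dataset standing in for real user data.

```python
# SISA-style sketch: shard the data, train one model per shard, and
# unlearn a point by retraining only the affected shard.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_shard(X, y):
    return LogisticRegression(max_iter=1000).fit(X, y)

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))                 # synthetic stand-in for user data
y = (X[:, 0] + X[:, 1] > 0).astype(int)

n_shards = 3
shard_idx = np.array_split(np.arange(len(X)), n_shards)
models = [train_shard(X[idx], y[idx]) for idx in shard_idx]

def predict(x):
    votes = [m.predict(x.reshape(1, -1))[0] for m in models]
    return int(round(np.mean(votes)))          # majority vote over shard models

def unlearn(point_id):
    # Retrain only the shard that held the deleted point.
    for s, idx in enumerate(shard_idx):
        if point_id in idx:
            keep = idx[idx != point_id]
            shard_idx[s] = keep
            models[s] = train_shard(X[keep], y[keep])
            return

unlearn(42)   # forget training point 42 without retraining everything
```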

Responsible Design Patterns

In addition to machine unlearning, responsible design patterns in machine learning pipelines play a pivotal role in ensuring the ethical and fair use of AI. These patterns involve designing the pipeline in a way that promotes transparency, accountability, and fairness.

For example,

  • Using diverse and representative datasets can help reduce bias in the models.
  • Pushing towards explainable AI allows us to understand how the model makes decisions and identify any potential biases.
  • Regular audits and monitoring of the pipeline can detect and address any unintended consequences or biases that may arise.

By implementing responsible design patterns, we can create machine learning systems that are more reliable, unbiased, and respectful of user privacy and data.

Future Work

It is worth noting that machine unlearning is an active area of research, and there is no one-size-fits-all solution. Different approaches may be more suitable depending on the specific context and requirements. As we look into the future, there's still much work to be done in the field of machine unlearning. Researchers and developers must continue to explore new techniques and refine existing ones to make unlearning more efficient and effective. Additionally, the adoption of Responsible Design Patterns will be crucial in ensuring that AI systems are built with machine unlearning and data privacy in mind from the ground up.

Conclusion

In a world where "delete culture" is on the rise, machine unlearning should be considered an essential skill for AI. The AI revolution is unstoppable, and it’s here to stay. Machine unlearning is the closest thing we have to a delete button on the memory of an intelligent system, allowing them to adapt and forget the data they've learned from us.

But machine unlearning is not a perfect solution to this problem. In an ideal world, there will be no need for a model to “unlearn”, as it wouldn’t learn what it is not supposed to in the first place. The only way to ensure this is to standardize and integrate responsible design patterns in our existing machine learning pipelines.

So, let's praise the undying contribution of machine unlearning – the unsung hero of the delete culture, helping us keep our digital skeletons safely locked away in the closet.

  • "[1912.03817] Machine Unlearning - arXiv." 9 Dec. 2019, https://arxiv.org/abs/1912.03817. Accessed 15 Jun. 2023.
  • "[2209.02299] A Survey of Machine Unlearning - arXiv." 6 Sep. 2022, https://arxiv.org/abs/2209.02299. Accessed 15 Jun. 2023.
  • "Responsible Design Patterns for Machine Learning Pipelines - arXiv." 31 May. 2023, https://arxiv.org/abs/2306.01788. Accessed 15 Jun. 2023.
  • "a Pattern Collection for Designing Responsible AI Systems - arXiv." 2 Mar. 2022, https://arxiv.org/abs/2203.00905. Accessed 15 Jun. 2023.

Big, Brainy, and Bold: The Rise of LLMs

Language Models

ChatGPT

Written by Manu, Abhishek and Jayant

15 May 23

Once upon a time, in the not-so-distant past, language models (LMs) were just learning to crawl. Today, they're sprinting at breakneck speed, leaving human beings behind. Enter the world of Large Language Models (LLMs): massive AI systems that can understand the context and meaning behind words, generate human-like text, answer complex questions, and even write entire articles (unlike this one, which is almost fully written by a human).

The “BIG”

They're like the cool colleagues who can effortlessly write complex code, crack jokes (some of which can even be funny!) , and even help with your homework and essays (wink wink)!. But what sets LLMs apart from their smaller counterparts? Size matters, my friends. The “Large” in LLMs is not just hyperbole. GPT-3, the model supporting OpenAI’s chatGPT, has 175 billion parameters. Huawei Researchers have recently developed LLMs with over a trillion Parameters. In fact, GPT-4 is so large, they won’t even tell us!

These parameters essentially represent the building blocks of their knowledge, knowledge that lets them perform tasks like sentiment analysis and language translation, and power personal-assistant-style chatbots. But the hype around LLMs is not just about what they can do; it's also about what they might be able to do. LLMs can be used in fields like healthcare, finance, education, and litigation, where they can help with tasks like medical diagnosis, financial analysis, language learning, and even simplifying complex legal documents into layman's terms. They can also be used to create chatbots, virtual assistants, highly personalized educators, and even video game characters.

The “BRAINY”

Let's dive into the science behind these behemoths (or rather just dip toes; as what follows is a hyper simplification of the inner workings of the most complex machines yet created by humans). Imagine LLMs as giant sponges, soaking up vast amounts of text from the internet, and breaking down the said text into smaller units called tokens; much like how sea sponges filter large amounts of water for plankton and break it down to simple sugar. These tokens are then fed into a neural network, which learns to predict the probability of the next word in a sequence based on the previous words.

There is an ongoing discourse which argues that LLMs are not that impressive, since all they are doing is predicting the next phrase from the previous ones. That said, it is worth asking: aren't we humans doing much the same thing?

The “BOLD”

AI philosophy aside, the evolution of LLMs has been nothing short of breathtaking. In just a few years, they have evolved from basic language models that could barely string a sentence together to sophisticated LLMs like GPT-4, which can write entire articles. Following the rapid-fire breakthroughs in language models has been akin to watching a baby go from babbling to reciting Shakespeare in a day.

So, what's next for these linguistic titans? Honestly, it's hard to predict. The capabilities of GPT-4, for example, surprised even its researchers, to the point that they decided to fine-tune existing technologies before moving on to developing GPT-5.

Once Spiderman Said…

“With great power comes great responsibility” and therein lies the problem with LLMs. The potential for misuse of LLMs can be scary to think about. They can be used to spread misinformation, generate fake news, or even create deepfake content. To address these concerns, researchers and developers are working on ways to make LLMs more transparent, accountable, and ethical. It's like teaching these AI prodigies not just to be smart, but also to be good citizens.

In a nutshell, the rise of Large Language Models has been a thrilling roller coaster ride, filled with jaw-dropping advancements and mind-boggling potential. As we continue to explore the possibilities and address the challenges, one thing is certain: LLMs are here to stay, and they're ready to save the world, one word at a time!

  • "OpenAI Presents GPT-3, a 175 Billion Parameters Language Model." Accessed May 4, 2023.
  • "Huawei has created the world's largest Chinese language model." Accessed May 4, 2023.
  • "OpenAI's GPT-4 Is Closed Source and Shrouded in Secrecy - VICE." Accessed May 4, 2023.
  • "NEXT-LEVEL NARRATIVES, GAMES AND EXPERIENCES." 13 Apr. 2023.

Worried About Insider Threats?

Cyber Security

Data security

Written by Jayant, Bhavya

15 May 21

In the new changing dynamics of the world economy, data and information have become priceless possessions. According to one of the articles by The Economist[1], the world’s most valuable resource is no longer oil, but data. With data becoming a valuable resource, securing it and ensuring that it is not misused, has become a matter of grave concern. Hence, it is imperative to take a step ahead of our adversaries and look for security problems associated with storing and handling data.

Cyber Security is the convergence of people, processes, and technology to protect organizations, individuals, and networks from digital attacks. It is comparatively easy to defend against common attacks like phishing and malware, but stopping an insider attack is an incredibly daunting task. Insider attacks originate within the organization, and the attackers are generally closely associated with the workplace, whether directly, indirectly, physically, or logically. They are among the most underestimated attacks in cybersecurity, yet among the hardest to prevent: insider incidents are rare anomalies, so the resulting datasets are highly imbalanced and offer too few malicious examples to train a model on.

Applying Machine Learning to cybersecurity and data security has always been a challenge, and the scarcity of annotated data resources aggravates this challenge further. Moreover, the lack of a balanced dataset makes machine learning all the more difficult. In the past, techniques such as random oversampling, undersampling, and SMOTE were used to balance the dataset, and synthetic data was created to handle the skew. However, none of those techniques proved fully effective.

We, at EZ, work relentlessly to improve and devise new techniques, such that our clients rest assured about the security of the valuable information they entrust us with. Recently, while reading a paper on Cybersecurity and Deep Learning[2], we found a new way to detect and prevent insider attacks. The proposed solution is split into three parts, namely, behavior extraction, conditional GAN-based data augmentation, and anomaly detection.

In behavior extraction, feature extraction is done from the dataset. Context-based behavior profiling is used, in which each user is identified as an insider, based on the entire activity log, where all the features contribute to the user behavior. Then, a Conditional Generative Adversarial Network (CGAN) is used to generate data and reduce the negative effect of skewed data. GAN models consist of two parts, namely, generator and discriminator. In the network, the discriminator (D) tries to distinguish whether the data is from the real distribution, and the generator (G) generates synthetic data and tries to fool the discriminator. The research paper uses a fully connected neural network in the generator and discriminator.
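The sketch below illustrates the conditional GAN idea with fully connected networks, roughly along the lines described above. It assumes PyTorch; the layer sizes, noise dimension, and one-hot "insider" label encoding are illustrative choices, not the configuration used in the cited paper.

```python
# Conditional GAN sketch for augmenting the rare "insider" class:
# both networks receive the class label as a condition.
import torch
import torch.nn as nn

NOISE_DIM, LABEL_DIM, FEATURE_DIM = 32, 2, 20   # hypothetical sizes

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NOISE_DIM + LABEL_DIM, 64), nn.ReLU(),
            nn.Linear(64, FEATURE_DIM),
        )
    def forward(self, z, label):
        return self.net(torch.cat([z, label], dim=1))

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(FEATURE_DIM + LABEL_DIM, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),
        )
    def forward(self, x, label):
        return self.net(torch.cat([x, label], dim=1))

G, D = Generator(), Discriminator()
z = torch.randn(8, NOISE_DIM)
insider = torch.tensor([[0.0, 1.0]]).repeat(8, 1)   # one-hot "insider" label
fake_behaviour = G(z, insider)                       # synthetic insider-behaviour features
realism_score = D(fake_behaviour, insider)           # discriminator's judgement
```

In training, the generator learns to produce insider-behaviour features the discriminator cannot distinguish from real ones, which is how the skew in the original dataset gets reduced.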

The final part of the proposed solution is to use multiclass classification instead of binary classification. Anomaly detection based on multiclass classification treats the labeled training samples as multiple normal, non-malicious classes, and the multinomial classifier tries to discriminate anomalous samples from the rest of the classes, which helps in building a more robust classifier. An additional benefit of multiclass classification is that if a new type of insider activity emerges, no changes to the existing framework are needed. The paper uses t-distributed Stochastic Neighbor Embedding (t-SNE), a manifold-learning-based visualization method, to perform a qualitative analysis of the generated data. XGBoost, MLP, and 1-D CNN models were evaluated, and XGBoost performed best across all the datasets.

Intrigued to know more about Cyber Security and the unconventional ways to prevent insider attacks? Read the reference articles provided below:

  • Mayra Macas, & Chunming Wu. (2020). Review: Deep Learning Methods for Cybersecurity and Intrusion Detection Systems.
  • Gautam Raj Mode, & Khaza Anuarul Hoque. (2020). Crafting Adversarial Examples for Deep Learning-Based Prognostics (Extended Version).
  • Ishai Rosenberg, Asaf Shabtai, Yuval Elovici, & Lior Rokach. (2021). Adversarial Machine Learning Attacks and Defense Methods in the Cyber Security Domain.
  • Li, D., & Li, Q. (2020). Adversarial Deep Ensemble: Evasion Attacks and Defenses for Malware Detection. IEEE Transactions on Information Forensics and Security, 15, 3886–3900.
  • Simran K, Prathiksha Balakrishna, Vinayakumar Ravi, & Soman KP. (2020). Deep Learning-based Frameworks for Handling Imbalance in DGA, Email, and URL Data Analysis.

Is Facial Recognition Biased?

Artificial Intelligence

Big Data

Facial Recognition

Written by Jayant, Bhavya

23 Nov 21

Mankind has witnessed three industrial revolutions, starting with the development of the steam engine, followed by electricity and digital computing. We are on the verge of a fourth industrial revolution that will be primarily driven by Artificial Intelligence and Big Data. Artificial Intelligence relies heavily on data for developing the algorithms that drive the reasoning and decision-making of intelligent systems.

Face Recognition: Modern Day Biometric Security Solution

The advent of these advanced technologies has provided us with various techniques for security solutions that prevent unauthorized access to precious data, providing a sense of security to our clients. However, selecting the appropriate biometric security solution has become a major decision for businesses and enterprises across a wide range of industries. One new biometric security system that has arrived under the umbrella of Artificial Intelligence is the face recognition system.

With its ease of implementation and widespread adoption, face recognition is rapidly becoming the go-to choice for modern biometric solutions. Facial recognition is a modern-day biometric solution developed to recognize a human face without any physical contact. Facial recognition algorithms are designed to match a person's facial features against the images or facial data saved in a database.

Facial Recognition: The Next Big Thing?

Research and studies on facial recognition have been conducted for many years now, but there has been unprecedented growth in its actual deployment. The technology has become so efficient that we can now unlock our phones using facial recognition. Countries have also started using facial recognition for surveillance, to track down criminals and prevent crime. Tracking down criminals seems easy with its help: all we need to do is set up cameras in public spaces and check whether any known criminal or suspicious person shows up.

Recent Studies Suggest Otherwise

Recent studies and research have suggested that the leading facial recognition software packages are biased. Yes! You read it right. Leading facial recognition packages tend to be more accurate for white, male faces than for people of color or for women.

A 2019 study found that many commercial algorithms currently used for surveillance show a high false positive rate for minority communities. There have been cases around the world where innocent people were arrested because of false positives produced by these surveillance systems. One such incident happened in January 2020, in Detroit, when police used facial recognition technology on surveillance footage of a theft and falsely arrested a Black man.

Let us try to identify what lies at the core of this biased behaviour of face recognition software. Facial recognition applications are broadly divided into two parts: verification and identification.

  • Verification confirms that a captured faceprint matches the stored faceprint; it is typically used at airports and to unlock your smartphone.
  • The verification part of facial recognition is not biased and is in fact extremely accurate; here, artificial intelligence is as skillful as the sharpest-eyed humans.
  • The real issue is the identification part of facial recognition, which is used for surveillance.

Disparate False Positive Rates

A false positive rate of 60 per 10,000 samples for minority groups might not seem like much, but compare it with a false positive rate of under 5 per 10,000 samples for white people and the difference is clear. The false positive rate of the identification model needs to be minimal because it is usually applied to crowd surveillance: if you are monitoring around 5,000 people a day, you could easily end up with hundreds of people being falsely flagged[1].
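A quick back-of-the-envelope calculation, using the rates quoted above, shows why the gap matters at surveillance scale; the weekly extrapolation is our own illustration, not a figure from the cited study.

```python
# Expected false matches per day at the error rates quoted above.
fpr_minority = 60 / 10_000    # false positives per comparison for minority groups
fpr_majority = 5 / 10_000     # upper bound quoted for white faces
people_per_day = 5_000

daily_minority = fpr_minority * people_per_day   # ~30 false matches per day
daily_majority = fpr_majority * people_per_day   # ~2.5 false matches per day

print(daily_minority, daily_majority)
print(daily_minority * 7)   # over a single week, already ~210 wrongly flagged people
```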

Once the issue was identified, AI researchers started working on finding a solution to the biases available to these facial recognition models. In June 2020, IBM announced it would no longer offer a facial recognition service, while other service providers have acknowledged the issue and started working on finding a solution [2]. There has also been a public backlash against crowd surveillance.

The reason there is such a high false positive rate in facial recognition for minority groups is that the data on which these models were trained had an uneven distribution of faces across racial groups.

To avoid such errors, new databases and techniques have been used:

  • Techniques that augment the feature space of underrepresented classes have been used to make the dataset more balanced.
  • Recently, Generative Adversarial Networks (GANs) have also been trained to generate face features to augment classes with fewer samples.
  • People have also started shifting to more balanced datasets like Racial Faces in the Wild (RFW) and Balanced Faces in the Wild (BFW) to reduce the bias[3].

There has been great improvement in facial recognition accuracy in the past few years. Researchers have built better models and constructed better datasets to deliver highly accurate, low-bias models. Big service providers have acknowledged the problem and are constantly researching ways to create accurate surveillance models. The future of facial recognition seems brighter now, as awareness among service providers and clients about the drawbacks of the technology has increased.


Known Security Issues in the Python Dependency Management System and How to Tackle Them

Python Programming

Python Package

Written by Jayant, Anjali

23 Nov 21

We, at EZ, believe that the purpose of technology is to assist us, and not replace us. Therefore, before becoming dependent on any programming language, we understand its flaws and make conscious efforts to overcome them. As a programming language, Python provides us with innumerable Python libraries and frameworks, a mature and supportive Python Community, versatility, efficiency, reliability, speed, and more. We work with Python so extensively that its security flaws often get ignored. Read the blog below to know about the security loopholes found in the PyPI ecosystem, and how we can overcome them.

What is PIP?

  • PIP, the package installer for Python, is the default Python package manager and provides a platform for developers to share and reuse code written by third-party developers.
  • PIP supports downloading packages from PyPI.org, the Python Package Index, the official repository for the Python programming language. PyPI helps in finding and installing packages and software for Python.
  • By design, the PyPI ecosystem allows any arbitrary user to share and reuse Python software packages, which, along with their dependencies, are downloaded recursively with the help of PIP.

Security risks during the installation of Python packages

Bagmar et al. provide a detailed study of the security threats in the Python ecosystem, largely based on the PyPI repository database.

  • On every pip install invocation, two Python files are executed, namely setup.py (at install time) and __init__.py (when the package is first imported).
  • Along with these, arbitrary Python code, which may contain exploits, can also get executed at varying points.
  • Exploits arrive in two modes, which are given below:
    1. Directly from the source, using editable-mode installation and importing the malicious package.
    2. Installation using sudo (administrator) privileges.

Factors that help us determine the impact of exploiting python packages

There are four main factors that can help us understand the impact of exploiting Python packages, which are given below:

  • Package Reach: The number of other packages that require a given package, directly or transitively. Packages with a high reach present a larger attack surface, because compromising them affects everything that depends on them (a small reach computation is sketched after this list).
  • Maintainer Reach: The combined reach of all of a maintainer's packages. Influential maintainers are potential targets for security attacks.
  • Implicitly Trusted Packages: The number of distinct packages traversed while searching for the longest path from a given starting package. An increase in implicitly trusted packages increases the security risk.
  • Implicitly Trusted Maintainers: This metric gives a vulnerability score based on the accounts of other packages' maintainers.
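The sketch below shows how package reach can be computed on a toy dependency graph: invert the "depends on" edges and count how many packages reach a given one directly or transitively. The graph here is a made-up example, not real PyPI data.

```python
# Compute "package reach" on a hypothetical dependency graph.
from collections import defaultdict

# edges: package -> packages it depends on (illustrative graph)
deps = {
    "app-a": ["requests"],
    "app-b": ["requests", "numpy"],
    "requests": ["urllib3"],
    "numpy": [],
    "urllib3": [],
}

# Invert the graph: package -> packages that require it.
required_by = defaultdict(set)
for pkg, ds in deps.items():
    for d in ds:
        required_by[d].add(pkg)

def package_reach(pkg, seen=None):
    # Count every package that depends on pkg, directly or transitively.
    seen = set() if seen is None else seen
    for parent in required_by[pkg]:
        if parent not in seen:
            seen.add(parent)
            package_reach(parent, seen)
    return len(seen)

print(package_reach("urllib3"))   # 3: requests, app-a and app-b all reach it
```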

Most common Python Package Impersonation Attacks

Package impersonation attacks are user-centric attacks that aim to trick users into downloading a malicious package.

There are various ways of fooling the users, and making them download malicious packages, some of which are given below:

  • TypoSquatting: Intentionally registering names with minor spelling mistakes that imitate popular packages (a simple detection sketch follows this list).
  • Altering Word Order: Changing the order of the words in a package's name so that it still resembles the genuine package.
  • Python3 vs Python2: Adding the number "3" to a package name to imitate the original package while suggesting Python 3 support.
  • Removing Hyphenation: Dropping the hyphen from a genuine package's name.
  • Built-In Packages: Uploading packages to PyPI whose names shadow Python's built-in standard-library modules, which should never need to be installed from PyPI.
  • Jellyfish Attack: A typosquatted package is not installed directly by the victim but is imported as a dependency of another, seemingly legitimate package, hiding the malicious code one level deeper.
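As a simple defence against the typosquatting attacks listed above, a client or registry could compare a requested name against a list of popular packages using a string-similarity ratio. The sketch below uses only the Python standard library; the popular-package list and the 0.85 threshold are illustrative, not a vetted policy.

```python
# Flag package names that are suspiciously similar to popular ones.
from difflib import SequenceMatcher

POPULAR = ["requests", "numpy", "pandas", "urllib3", "python-dateutil"]

def looks_like_typosquat(name, threshold=0.85):
    for real in POPULAR:
        score = SequenceMatcher(None, name.lower(), real.lower()).ratio()
        if name.lower() != real.lower() and score >= threshold:
            return real, round(score, 2)   # likely target and similarity score
    return None

print(looks_like_typosquat("reqeusts"))   # e.g. ('requests', 0.88)
print(looks_like_typosquat("requests"))   # None: exact match, not a squat
```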

License Violations in the PyPI Ecosystem

PyPI does not perform any automated checks for OSS license violations. A violation can occur when a package imports another package that has a less permissive license.

Suggested Preventive Measures

  • Specifying dependencies in the metadata of uploaded packages should be strictly enforced.
  • A permission model, similar to mobile phones, can be implemented while installing packages.
  • Having a trusted or maintainer package badge on a popular package might be helpful.
  • Showing statistics while installing packages.
  • License fields must not be free text.
