Fake Job Posting Detection and Getting Useful Insights from the Job Postings
Detecting whether job postings are real or fake with a suitable model, and extracting useful insights from the job postings in the given dataset.
Nowadays, job scams are common, in part because of unemployment. Many websites connect recruiters with suitable candidates. Sometimes fake recruiters post jobs on these portals with the aim of extracting money from applicants. The problem affects many job portals: when people move to a new portal in search of real jobs, the fake recruiters follow. It is therefore important to distinguish real job postings from fake ones.
The dataset has been taken from Kaggle.
Title: [Real or Fake] Fake JobPosting Prediction
1. job_id → Unique Job ID
2. title → Title of the job
3. location → Geographical location of the job
4. department → Corporate Department
5. salary_range → Range of Expected Salary
6. company_profile → Brief Description of the Company
7. description → Detailed description of the job
8. requirements → Enlisted requirements for the job opening
9. benefits → Enlisted offered benefits by the employer
10. telecommuting → True for telecommuting positions
11. has_company_logo → True if company logo is present
12. has_questions → True if screening questions are present
13. employment_type → Full-time, Part-time, Contract, etc.
14. required_experience → Executive, Entry level, Intern, etc.
15. required_education → Doctorate, Master’s Degree, Bachelor, etc.
16. industry → Automotive, IT, Health care, Real estate, etc.
17. function → Consulting, Engineering, Research, Sales etc.
18. fraudulent → target — Classification attribute (1 for fraudulent, else 0)
Most of the dataset consists of free-text descriptions, so the project is essentially a Natural Language Processing task.
If a human had to judge whether a job posting is real or fake, they would most probably read the description and the other related text. Similarly, the machine needs to process the text in order to predict whether the job posting is real or fake.
We used the missingno library to visualize missing data. Missing data can be visualized with a matrix, a bar plot, or a heatmap.
The matrix representation shows the pattern of missing data.
The bar plot shows the number of missing values in each attribute of the dataset.
The dataset contains 17,880 job postings, of which 866 are fake. Because fraudulent postings are so few, the model may learn to classify every job posting as real and still appear accurate.
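The size of the imbalance can be made concrete with a little arithmetic. The sketch below uses only the two counts quoted above and shows what accuracy a trivial "everything is real" classifier would reach; the variable names are illustrative.

```python
# Counts quoted in the article.
total_postings = 17880
fake_postings = 866
real_postings = total_postings - fake_postings

# Share of the dataset that is fraudulent.
fake_share = fake_postings / total_postings

# A trivial classifier that labels every posting as real is correct on
# all real postings and wrong on all fake ones.
baseline_accuracy = real_postings / total_postings

print(f"fake share:        {fake_share:.2%}")
print(f"baseline accuracy: {baseline_accuracy:.2%}")
```

Any model therefore has to beat roughly 95% accuracy before its accuracy says anything about how well it detects fake postings.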
We plotted various bar graphs to compare real job postings with fake job postings.
Full-time jobs make up most of the dataset, for real jobs as well as fake jobs.
Most of the real jobs require Mid-Senior level experience, whereas most of the fake jobs require Entry level experience.
Most of the real jobs require a Bachelor’s Degree, whereas most of the fake jobs require High School or equivalent education.
By industry, Information Technology and Services has a lot of real jobs, whereas Oil & Energy has the maximum number of fake jobs.
By function, Information Technology has the maximum number of real jobs, whereas the Administrative and Engineering functions have a lot of fake jobs.
Data Visualization on Unmodified Dataset:
The word frequencies of the text in the Original DataFrame, the Real Jobs DataFrame, and the Fake Jobs DataFrame, respectively, are shown in the images below.
Original DataFrame Words:
Real Jobs DataFrame Words:
Fake Jobs DataFrame Words:
We create tokens by splitting the text of each job posting. Tokenization thus converts a string into a list of word tokens.
• URL Removal:
We used a regular expression to remove links (URLs) from the text.
• Removing Stopwords:
Stopwords are common words that carry little meaning by themselves.
E.g.: in, and, the, which, etc.
Stopwords are not required for the analysis, so we removed them from the list of tokens. A standard or custom stop-word dictionary can be used for this purpose.
• Removing Punctuation Marks:
We need words and not punctuation marks. If we keep punctuation marks, they will be treated as separate tokens, so it is better to remove them from the list of tokens.
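The cleaning steps above can be sketched in a few lines of plain Python. This is a minimal illustration and not the authors' code: the stop-word set, the URL pattern, and the function name `preprocess` are assumptions, and the sketch removes URLs before splitting so that a link is dropped as a whole rather than token by token.

```python
import re
import string

# Small illustrative stop-word set (an assumption); a real project would
# use a fuller standard or custom stop-word dictionary.
STOPWORDS = {"a", "an", "and", "the", "in", "of", "to", "which", "is", "for", "at"}

# Matches http(s) links and bare www links up to the next whitespace.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text):
    """Lowercase, strip URLs and punctuation, tokenize, drop stop words."""
    text = URL_PATTERN.sub(" ", text.lower())
    text = text.translate(PUNCT_TABLE)
    tokens = text.split()
    return [tok for tok in tokens if tok not in STOPWORDS]


sample = "Apply now at https://example.com/jobs, and earn $500 a day in the comfort of home!"
print(preprocess(sample))
# ['apply', 'now', 'earn', '500', 'day', 'comfort', 'home']
```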
Data Visualization on Modified Dataset:
The images below show the word frequencies of the Original DataFrame, the Real Jobs DataFrame, and the Fake Jobs DataFrame, respectively, after modifications such as removing stop words and converting the words to lower case.
Original DataFrame Words after Modifications:
Real Jobs DataFrame Words after Modifications:
Fake Jobs DataFrame Words after Modifications:
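The counts behind such word-frequency images can be computed with a simple counter over the cleaned tokens. The sketch below uses a tiny hypothetical set of token lists rather than the real dataset, and the helper name `word_frequencies` is illustrative.

```python
from collections import Counter

# Toy, already-cleaned token lists (hypothetical data, not from the dataset).
real_tokens = [
    ["software", "engineer", "team", "experience"],
    ["sales", "team", "experience", "customers"],
]
fake_tokens = [
    ["earn", "money", "home", "data", "entry"],
    ["data", "entry", "home", "money", "fast"],
]


def word_frequencies(docs):
    """Count how often each token occurs across a list of token lists."""
    counts = Counter()
    for tokens in docs:
        counts.update(tokens)
    return counts


real_freq = word_frequencies(real_tokens)
fake_freq = word_frequencies(fake_tokens)

print(real_freq.most_common(3))
print(fake_freq.most_common(3))
```

Comparing the most common words per class in this way gives a quick feel for which terms are typical of real or fake postings.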
Data Pre-processing (Continued):
• Stemming and Lemmatization:
A stem is the base part of a word to which affixes can be attached to form derivatives. Stemming is the process of converting a word into its stem. It keeps only the base word, thus reducing the total number of distinct words in the corpus. Basically, stemming cuts off the affixes of the word.
E.g.: “Combine” is the stem for combining, combine, and combined.
Lemmatization is similar to stemming, but produces a proper root word that belongs to the language. Lemmatization uses a dictionary to match words to their root word.
E.g.: “Combine” is the lemmatized version of combine, combined, and combining.
We preferred lemmatization over stemming because it finds a meaningful root for each word.
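The difference can be illustrated with a deliberately simplified sketch. The suffix-stripping stemmer and the lookup-table lemmatizer below are toy stand-ins written for this example (the names `toy_stem`, `toy_lemmatize`, `SUFFIXES`, and `LEMMAS` are hypothetical); real stemmers and lemmatizers are considerably more sophisticated.

```python
# Toy suffix-stripping stemmer: illustrative only, far cruder than the
# rule sets used by real stemming algorithms.
SUFFIXES = ("ing", "ed", "e")


def toy_stem(word):
    """Cut off the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Toy lemma dictionary (hypothetical); a real lemmatizer consults a full
# lexicon of the language.
LEMMAS = {"combining": "combine", "combined": "combine", "combine": "combine"}


def toy_lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    return LEMMAS.get(word, word)


words = ["combine", "combined", "combining"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]

print(stems)   # ['combin', 'combin', 'combin']
print(lemmas)  # ['combine', 'combine', 'combine']
```

Rule-based stemming can leave a truncated form such as "combin" that is not a dictionary word, while the dictionary lookup returns "combine", which is the property that motivates the preference for lemmatization.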
• Count Vectorizer:
It is a technique that creates a vector from the given text by counting how often each word occurs.
• TF-IDF Vectorizer:
Term Frequency-Inverse Document Frequency (TF-IDF) is a popular text mining technique. Machine learning models need a numeric representation, and TF-IDF converts text into a numeric table. The TF-IDF output is a table where rows represent documents and columns represent words. Each cell holds a value that indicates the strength of the word with respect to the document.
In both cases, we did a lot of hyperparameter tuning while measuring accuracy, such as changing the number of iterations (max_iter) and comparing unigrams and bigrams.
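A minimal sketch of both vectorizers follows, assuming scikit-learn's `CountVectorizer` and `TfidfVectorizer` (the article does not name the library explicitly). The three-document corpus is hypothetical, and `ngram_range=(1, 2)` is shown only as one way to include bigrams alongside unigrams.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus; each string stands for one cleaned job posting.
docs = [
    "earn money from home",
    "software engineer needed",
    "earn money fast",
]

# Count Vectorizer on unigrams: each cell is a raw word count.
count_vec = CountVectorizer(ngram_range=(1, 1))
counts = count_vec.fit_transform(docs)

# TF-IDF Vectorizer on unigrams and bigrams: each cell is a weight that
# grows with the term's frequency in the document and shrinks for terms
# that appear in many documents.
tfidf_vec = TfidfVectorizer(ngram_range=(1, 2))
tfidf = tfidf_vec.fit_transform(docs)

print(sorted(count_vec.vocabulary_))
print(counts.toarray())
print(tfidf.shape)
```

Both calls return sparse matrices with one row per document, which is the form the classifiers are trained on.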
• Over Sampling:
The dataset was imbalanced because the number of fake job postings was much smaller than the number of real job postings. Hence, the imblearn library was used to balance the data. We applied over-sampling to the training part of the sparse matrices obtained from Count Vectorizer and TF-IDF Vectorizer.
From the above figure, you can see that the training dataset is balanced after over-sampling.
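The idea of over-sampling can be sketched in plain Python. The article used imblearn but does not say which sampler, so the code below is an assumption: simple random over-sampling, similar in spirit to imblearn's `RandomOverSampler`, applied to a hypothetical training split. The function name `random_oversample` and the toy data are illustrative.

```python
import random
from collections import Counter


def random_oversample(features, labels, seed=0):
    """Duplicate minority-class rows at random until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(features, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(rows) for rows in by_class.values())

    out_features, out_labels = [], []
    for label, rows in by_class.items():
        # Draw extra rows with replacement to fill the gap to the target size.
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for row in rows + extra:
            out_features.append(row)
            out_labels.append(label)
    return out_features, out_labels


# Hypothetical training split: 8 real (0) and 2 fake (1) postings.
X_train = [[i] for i in range(10)]
y_train = [0] * 8 + [1] * 2

X_bal, y_bal = random_oversample(X_train, y_train)
print(Counter(y_train), "->", Counter(y_bal))
```

Only the training split is resampled; the test split keeps its original class ratio so that the reported scores reflect realistic data.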
To obtain the results, we first split the data in an 80:20 ratio for training and testing. Text cannot be fed to a machine learning model directly, so it must first be converted into numeric data; we used two text mining procedures for this, Count Vectorizer and TF-IDF Vectorizer. We applied the following models to the vectors produced by Count Vectorizer as well as to those produced by TF-IDF Vectorizer, and recorded accuracy and F1 score:
- Multinomial Naïve Bayes
- Logistic Regression
- Support Vector Classifier
- Multilayer Perceptron Classifier
Accuracy (in percentage) using different models over the matrices produced by CV as well as TF-IDF is as follows:
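The overall workflow (split, vectorize, train, score) can be sketched as follows, assuming scikit-learn and using Logistic Regression as a representative of the models listed above. The ten-document corpus, the stratified split, `random_state=42`, and `max_iter=1000` are assumptions made for this example, not settings reported by the authors.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus; 1 = fraudulent, 0 = real.
texts = [
    "earn money fast from home no experience",
    "work from home and earn cash daily",
    "send fee to receive your job offer",
    "quick cash data entry no interview",
    "wire money to start earning today",
    "software engineer with python experience",
    "sales manager for regional accounts team",
    "registered nurse for hospital night shift",
    "marketing analyst with reporting skills",
    "civil engineer for bridge design projects",
]
labels = [1] * 5 + [0] * 5

# 80:20 split; stratifying keeps both classes in the test set (an assumption).
X_train_txt, X_test_txt, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# Fit the vectorizer on the training text only, then reuse it for the test text.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_txt)
X_test = vectorizer.transform(X_test_txt)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
pred = model.predict(X_test)

accuracy = accuracy_score(y_test, pred)
# average=None returns one F1 score per class, here in the order [real, fake].
per_class_f1 = f1_score(y_test, pred, average=None, labels=[0, 1], zero_division=0)

print("accuracy:", accuracy)
print("F1 per class [real, fake]:", per_class_f1)
```

Swapping `TfidfVectorizer` for `CountVectorizer`, or `LogisticRegression` for another classifier, leaves the rest of the sketch unchanged, which makes the model-by-vectorizer comparison straightforward to run.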
We have used 80:20 split [80% training data and 20% testing data].
Accuracy alone is not a sufficient measure here because the data is imbalanced: real job postings are so numerous that a model labelling every posting as real would still achieve high accuracy. The per-class F1 scores are therefore more informative. The F1 scores of all the machine learning models, separated for real and fake jobs, are shown below, where CV-R denotes count vectorization for the real class and CV-F denotes count vectorization for the fake class.
After analyzing the F1 scores of all the models, it can be seen that the Multilayer Perceptron, i.e., a neural network, performed best with both Count Vectorizer and TF-IDF features. With TF-IDF, the F1 score was 0.99 for real jobs and 0.85 for fake jobs.
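The gap between accuracy and per-class F1 is easy to reproduce. The sketch below computes F1 from first principles for a hypothetical test set of 95 real and 5 fake postings and a model that predicts "real" for everything; the helper name `f1_for_class` is illustrative.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 score treating `positive` as the class of interest."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical test set: 95 real (0) and 5 fake (1) postings.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # a model that labels every posting as real

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1_real = f1_for_class(y_true, y_pred, positive=0)
f1_fake = f1_for_class(y_true, y_pred, positive=1)

print(f"accuracy:  {accuracy:.2f}")   # 0.95
print(f"F1 (real): {f1_real:.3f}")    # 0.974
print(f"F1 (fake): {f1_fake:.3f}")    # 0.000
```

The model looks strong on accuracy and on the real class, yet its F1 for the fake class is zero, which is exactly the failure the per-class scores are meant to expose.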
Confusion Matrix for MLP model:
Deep Learning Model (LSTM):
We also applied a deep learning model, an LSTM (Long Short-Term Memory) network with the Keras embedding layer, and measured accuracy as well as F1 score, which are given below.
Test Loss : 0.26
Test Accuracy : 0.97
Train Loss : 0.016
Train Accuracy: 0.99
F1 Score (Real): 0.99
F1 Score (Fake): 0.74
The model summary is as follows:
We have used embedding layer, LSTM layer as well as Dense layer.
We have trained the model for 50 epochs.
Confusion Matrix for LSTM model:
The confusion matrix clearly shows that the real jobs were classified correctly almost every time, but there was significant error in classifying the fake jobs.
We got the highest accuracy using the MLP (Multi-Layer Perceptron) classifier with the TF-IDF vectorizer.
Accuracy using MLP: 0.9876
F1 Score (Real) using MLP: 0.99
F1 Score (Fake) using MLP: 0.85
Various supervised machine learning algorithms were applied to the job posting data, but several features of the dataset, such as location, department, and salary range, were left out for simplicity. As future work, these additional features could be included to check whether they improve the accuracy of the models.
- Rishabh Bafna, M.Tech. CSE, IIIT Delhi [LinkedIn]
- Vandana Sharma, M.Tech. CSE, IIIT Delhi
- Chetan Sharaf, M.Tech. CSE, IIIT Delhi [LinkedIn]
Under the guidance of:
We are thankful and extend our sincere gratitude to Professor Dr. Tanmoy Chakraborty for his constant support and guidance throughout this Machine learning course project.
Teaching Fellow: Ms. Ishita Bajaj
Teaching Assistants: Nirav Diwan, Pragya Srivastava, Shiv Kumar Gehlot, Chhavi Jain, Vivek Reddy and Shikha Singh.