Have you ever wondered how your email filters out spam messages? Or how autocorrect on your phone knows what you're trying to type? In this post, we'll cover some basics of natural language processing: reading in and creating structure from messy text data, then cleaning and tokenizing that data. From there, we'll move on to more advanced topics such as lemmatizing, stemming, and vectorizing the data; in other words, converting it from text into a numeric matrix. Throughout, the focus is on preparing the data to build a machine learning classifier on top of it. You'll also learn how to build two different types of machine learning models, and how to thoroughly test and evaluate different variations of those models. By the end, you'll have the tools to go from a messy dataset to concise, accurate predictions from a machine learning model, and to deliver solutions to complex business problems.
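To give a flavor of the pipeline described above, here is a minimal sketch of the cleaning, tokenizing, and count-vectorizing steps using only the Python standard library. The function names and sample messages are illustrative assumptions; a real project would typically lean on libraries like NLTK or scikit-learn for these steps.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())

def vectorize(docs):
    """Build a vocabulary and turn each document into a word-count vector."""
    token_lists = [tokenize(d) for d in docs]
    vocab = sorted({t for tokens in token_lists for t in tokens})
    matrix = [[Counter(tokens).get(w, 0) for w in vocab] for tokens in token_lists]
    return vocab, matrix

# Two toy messages: one spammy, one ordinary.
docs = [
    "Free prize! Claim your free prize now",
    "Meeting moved to 3pm, see you there",
]
vocab, matrix = vectorize(docs)
```

Each row of `matrix` is now a numeric vector aligned to `vocab`, which is exactly the kind of input a classifier can be trained on.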
This post describes how to use the Apache Spark platform for machine learning. It starts by reviewing the basics of the DataFrame data structure, then covers pre-processing of both numeric and text data so that it is ready to use with Spark's MLlib machine learning library. The post also describes several algorithms for clustering, classification, and regression, and closes with a brief look at a recommendation system.
This post walks you through configuring a complete Hadoop environment with Spark and other essential components from scratch. It covers the operating system (Ubuntu), Hadoop, Spark, Jupyter, and SSH setup.