Spark: Lighting A Fire Under Hadoop

Also see: Hadoop and Big Data

Hadoop has come a long way since its introduction as an open source project from Yahoo. It is moving into production from pilot/test stages at many firms. And the ecosystem of companies supporting it in one way or another is growing daily.

It has some flaws, however, that are hampering the kinds of Big Data projects people can do with it. The Hadoop ecosystem uses a specialized distributed storage file system, called HDFS, to store large files across multiple servers and keep track of everything.

While this helps manage terabytes of data, processing at the speed of hard drives makes Hadoop prohibitively slow for exceptionally large workloads or anything in real time. Unless you were prepared to go to an all-SSD array – and who has that kind of money? – you were at the mercy of your 7,200 RPM hard drives.

The power of Hadoop is all centered around distributed computing, but Hadoop has primarily been used for batch processing. It uses the framework MapReduce to execute a batch process, oftentimes overnight, to get your answer. Because of this slow process, Big Data might have promised real-time analytics but it often couldn’t deliver.

Enter Spark. It moved the processing part of MapReduce into memory, giving Hadoop a massive speed boost. Developers claim it runs Hadoop workloads up to 100 times faster in certain applications, and in the process it opens Hadoop up to many more kinds of Big Data projects, thanks to the speed and the potential for real-time processing.

Spark started as a project in the University of California, Berkeley's AMPLab in 2009 and was donated as an open source project to the Apache Software Foundation in 2013. A company called Databricks was spun out of AMPLab to lead development of Spark.

Patrick Wendell, co-founder and engineering manager at Databricks, was a part of the team that made Spark at Berkeley. He says that Spark was focused on three things:

1) Speed: MapReduce was based on an old Google technology and is disk-based, while Spark runs in memory.

2) Ease of use: “MapReduce was really hard to program. Very few people wrote programs against it. Developers spent so much time trying to write their program in MapReduce and it was a huge waste of time. Spark has a developer-friendly API,” he said. It supports eight different languages, including Python, Java, and R.

3) Make something broadly compatible: Spark can run on Amazon EC2, Apache Mesos, and various cloud environments. It can read and write data to a variety of databases, like PostgreSQL, Oracle, MySQL, and all Hadoop file formats.

“Many people have moved to Spark because they are performance-sensitive and time is money for them,” said Wendell. “So this is a key selling point. A lot of original Hadoop code was focused on offline batch processing, often run overnight. There, latency and performance don’t matter much.”

Because Spark is not a storage system, you can use your existing storage network, and Spark will plug right into Hadoop and get going. Governance and security are taken care of. “We just speed up the actual crunching of what you are trying to do,” said Wendell. Of course, that’s also predicated on giving your distributed servers all the memory they need to run everything in memory.

Prakash Nanduri, CEO of the analytics firm Paxata, said that Spark made Hadoop feasible for working in real time. “Now you have the ability to focus on real-time analytics at scale. The huge implication is suddenly you go from 10 use cases to 100 use cases and do it at a cost that is significantly lower than for traditional interactive analytic use cases,” he said.

Many of the vendors that offer some kind of Hadoop solution, like Cloudera, Hortonworks, and MapR, are bundling Spark with Hadoop as a standard offering now, said Wendell.

At a recent Spark Summit, Toyota Motor offered an example of the speed Spark offers. The company monitors social media for repair issues in addition to customer surveys. The problem with the latter is that people don’t care about surveys, so it shifted its emphasis to Twitter and Facebook, building an entire system on Spark to monitor social media for keywords.

Its original customer experience app, done as a regular Hadoop batch job, would take 160 hours (nearly a week). The same job rewritten for Spark completes in just four hours. The company also parsed the flood of input from social media, filtering out things like dealer promos and irrelevant material from incident reports involving Toyota products, and reduced the amount of data to process by 50%.

Another use case is log processing and fraud detection, where speed is of the utmost importance, as banks, businesses, and other financial and sales institutions need to move fast to catch fraudulent activity and act on the warnings.

“The business value you achieve is fundamentally derived through the apps. In the case of financial services, you need to be able to detect money laundering cases. You cannot find money laundering signals by running a batch process at night; it has to be in real time,” said Nanduri. “An app built on Spark can work through the entire data set at real-time, interactive speeds and get to the answer much faster.”

But Spark isn’t just about in-memory processing. Wendell said half of the performance gains come from running in memory and the other half from optimizations. “The other systems weren’t designed for latency, so we improved on that a lot,” he said.

There is still more work to be done. Wendell said there is a big initiative underway with Databricks and Apache to further improve Spark performance, but he would not elaborate.

“While it offers a standardized way to build highly distributed and interactive analytical apps, it still has a long way to go,” said Nanduri. “Spark lacks security and needs enhanced support for multiple concurrent users, so there is still some work to do.”

Photo courtesy of Shutterstock.


HyperX Pulsefire Haste 2 Wireless Review


Pros

The solid top back is comfortable and stylish

Its 26,000 DPI sensor can target at lightning speeds

Bluetooth functionality makes it one of the most versatile esports mice you can get


Cons

The compact buttons can feel a bit cramped at times

There’s just the one RGB zone to play with

It’s more expensive than its predecessor

Our Verdict

The HyperX Pulsefire Haste 2 Wireless features a lightning-fast 26,000 DPI sensor, Bluetooth and 2.4GHz wireless connectivity, as well as a dedicated DPI switcher, and still manages to be lighter and more affordable than some rivals. That makes it an outstanding choice for discerning gamers.


The HyperX Pulsefire Haste Wireless really impressed me when it came out. Budget-conscious esports players could scarcely find a better mouse—with its pinpoint accuracy, ultra-light weight, and a price tag so reasonable you could barely believe the value, it was not only suitably primed to cross swords with performance goliaths in competitive matches but wouldn’t make much of a dent in your tournament winnings either.

Based on that history, it was with bated breath that I sized up the pros and cons of the HyperX Pulsefire Haste 2 Wireless, HyperX’s successor. The verdict? While the price is a little bit heftier than before, it’s a very worthy sibling.

Further reading: See our roundup of the best wireless gaming mice to learn about competing products.

HyperX Pulsefire Haste 2 Wireless features

So, what features are worth drooling over this time? At least three things get my attention immediately. First and most notably, it weighs exactly the same as the Haste 1 Wireless—just 61 grams, so it feels as light as a feather in your hand. That’s despite the addition of an upgraded HyperX 26K sensor and a new solid upper shell, which replaces the Haste 1 Wireless’s perforated one.

A mere 61 grams also puts it in good company—making it 2 grams lighter than premium pro-grade esports mice like Logitech’s G Pro X Superlight and Razer’s DeathAdder V3 Pro, both of which cost upwards of $60 more.

I found I had all the speed and accuracy I needed to win out in most spontaneous firefights.

Suffice to say, that light weight means the quick movement we achieved so easily in the Haste 1 Wireless is just as easily reproduced in the Haste 2 Wireless. Merely bump it and your cursor will shoot across the screen like a falling star. Move it quickly for real and you’ll need bionic vision to see the cursor—yes, it’s that quick.

Secondly, the Haste 2 Wireless adds Bluetooth connectivity to its super-speedy 2.4GHz connectivity, a feature that neither the Haste 1 Wireless nor a string of more expensive esports rivals have. That makes the Haste 2 Wireless one of the more versatile esports mice you can currently buy.


The third and most surprising thing to note about it is surely its price. Somehow HyperX has managed to keep it down to just $89.99—admittedly a steep increase over the original’s $59.99 shipping price, but reasonable considering all the upgraded technology onboard, and when you compare it to the price of rivals like the $149.99 Razer DeathAdder V3 Pro.

Once again, esports players who give it a go will likely be more than satisfied with what they’re getting for their money.

HyperX Pulsefire Haste 2 Wireless design and build

The Haste 1 Wireless’s design is written all over the Haste 2 Wireless. In fact, from its symmetrical shape to its six compact programmable buttons, it looks very similar to before. Still, there are some refinements worth noting.

Additionally, while there are no surprises in the dimensions—the depth, height, and width being virtually identical to before at 4.9 x 1.5 x 2.6 inches—the Haste 2 Wireless’s solid top makes it functionally superior. How so? For one, it won’t cause the dotty indentations claw grippers were prone to get on their fingertips from pressing against the Haste 1’s perforated top.


That’s not even accounting for how much more comfortable it now feels too. While palm and fingertip grippers are unlikely to notice much difference in snugness levels, claw grippers will no longer feel like they’re pressing down on a porcupine—a small point which is sure to win over a legion of new fans.

Regardless of grip type, though, the Haste 2 Wireless’s matte plastic finish is a most welcome upgrade. In fact, the thousands of tiny bumps covering its body do a much better job of holding your hand firmly in place, providing a greater sense of control than previously. They also extinguish any sign of your fingerprints, so this mouse always looks sharp next to your rig.

The underside of the Haste 1 Wireless’s successor has had a slight upgrade too. In addition to the four virgin-grade PTFE skates we saw in the Haste 1 Wireless, you now also get a full ring of PTFE around the sensor, so there’s a noticeable absence of friction on tabletops.

Unlike its predecessor, the HyperX Pulsefire Haste 2 Wireless has a solid band of virgin-grade PTFE around its sensor.

Dominic Bayley / IDG

The Haste 2 Wireless also comes with a useful set of accessories. A peek in the box reveals extra grip tapes, as well as additional PTFE skates. But you also get a 5.9-foot braided USB A to USB C paracord, which doubles as both a wired connection and a charger. HyperX claims the Haste 2 Wireless’s battery lasts a very respectable 100 hours from a single charge.

Only black and white color options are available, but they each look remarkably stylish. The only visible signs breaking up my white review unit’s visage were two attractive HyperX logos—one on top and the other on the left-hand side, each accentuating rather than spoiling its appearance. Clearly, this is a mouse worth showing off to your friends at every opportunity.

The one caveat to that is if your friends are diehard fans of RGB. That’s because you still only get the one RGB zone located in the mouse wheel, which for an esports mouse of this caliber is a little underwhelming. Still, this is just a trifle that won’t make any difference to your gameplay.

How does the HyperX Pulsefire Haste 2 Wireless perform?

The Pulsefire Haste 2 Wireless’s sensor tracks movement at a maximum resolution of 26,000 DPI and a maximum tracking speed of 650 inches per second. A maximum acceleration of 50G and a polling rate of 1,000Hz round out the rest of the sensor’s performance specifications.

To test the sensor’s performance, I loaded up a game of Insurgency, trying out a range of character types from Riflemen to Snipers and switching randomly between the mouse’s four preset DPI settings. Here, I made maneuvers ranging from quick short jerks to larger crisscross motions and circular movements, watching all the while for glitches or stuttering in my mouse’s movement.

Indeed, the mouse handled superbly in just about every situation, no matter what kind of movement I used or which DPI setting it was in. The sensor tracked my movements perfectly and fluidly, without even the slightest hint of smoothing or stuttering. Its quick responsiveness even helped even out the playing field when I joined overseas servers and was lumped with tedious 300-millisecond pings.

While the wired version of the Haste 2 has an outstanding 8,000Hz polling rate, which allows it to send up to eight times more data per second and reduce latency to 1/8th of a millisecond, the Wireless version gets by with a 1,000Hz polling rate. Even so, 1,000Hz is still perfectly fine for ultra-fast competitive play—I found I had all the speed and accuracy I needed to win out in most spontaneous firefights.

While the HyperX Pulsefire Haste 2 Wireless’s design is quite compact it will comfortably fit most small- to medium-sized hands. 

Dominic Bayley / IDG

The Haste 2 Wireless’s symmetry and compactness were also a big asset to my play. I was delighted to find it slid around my mat with the same laser-like precision as the Haste 1 Wireless, which made pinpointing targets exceptionally quick. This was especially noticeable after ditching my mouse pad and playing on tabletops, the extra skates really making an impact on how smoothly it moved.

But while I could pinpoint targets as quick as lightning, there seemed a bit of a disconnect in my firing. Don’t get me wrong, that’s not because the Haste 2 Wireless’s buttons lack responsiveness: they’re impressively quick. It’s just that they’re not as well served by the mouse’s design as some rivals—a point that deserves some explanation…

Although not a huge deal, that did occasionally mean that instead of hitting my target square-on at the kill point, I hit them somewhere less lethal. Could this drawback have affected my K:D ratio? Quite possibly.


Still, these are just my personal musings. In fact, there will be tons of players that will actually prefer the Haste 2 Wireless’s compact buttons and more symmetrical design.

In choosing the Haste 2 Wireless over mice with asymmetrical designs, the upside is that you can apply any of the three main grip styles without worrying about losing out on performance. Plus, left-handers will find it a whole lot easier, too.

HyperX’s Ngenuity software

HyperX’s Ngenuity is the Haste 2 Wireless’s go-to companion app, which you can download for free in just a minute from the Microsoft Store. In it you can assign functions to your six programmable buttons, adjust your DPI settings, and make color and effects changes to the mouse’s RGB zone, too.

I mainly used Ngenuity to change DPI settings. You can simply add an additional DPI setting by hitting the + icon at the bottom of the app window, while changing the values of the four default DPI settings is as straightforward as moving sliders. The app also conveniently prompts you when you need firmware updates so there’s no need to sift through HyperX’s official website looking for them.

HyperX’s Ngenuity app showing a visual representation of the HyperX Pulsefire Haste 2 Wireless.

Should you buy the HyperX Pulsefire Haste 2 Wireless?

HyperX’s original Pulsefire Haste Wireless was one of the best-value esports mice you could buy at the time of its release, offering gamers a taste of pro-grade functionality at a reasonable price. A year on, the company has delivered a sequel that, while costing a little more than before, packs in some excellent upgrades whilst amazingly keeping the weight exactly the same.

Unless you need a particular shape for your trusty esports rodent or can’t live without tons of RGB or an 8,000Hz polling rate, this mouse has everything you could want. It’ll also keep a ton of change in your hip pocket.

IBM Speeds Up DB2 10.5, Remolds It as a Hadoop Killer

Developed by the IBM Research and Development Labs, BLU (a development code name that stood for Big data, Lightning fast, Ultra easy) is a bundle of novel techniques for columnar processing, data deduplication, parallel vector processing, and data compression.

The focus of BLU was to enable databases to be “memory optimized,” Vincent said. “It will run in memory, but you don’t have to put everything in memory.” The BLU technology can also eliminate the need for a lot of hand-tuning of SQL queries to boost performance.

Faster data analysis

Because of BLU, DB2 10.5 could speed data analysis by 25 times or more, IBM claimed. This improvement could eliminate the need to purchase a separate in-memory database—such as Oracle’s TimesTen—for speedy data analysis and transaction processing jobs. “We’re not forcing you from a cost model perspective to size your database so everything fits in memory,” Vincent said.

On the Web, IBM provided an example of how a 32-core system using BLU technologies could execute a query against a 10TB data set in less than a second.

“In that 10TB, you’re [probably] interacting with 25 percent of that data on day-to-day operations. You’d only need to keep 25 percent of that data in memory,” Vincent said. “You can buy today a server with a terabyte of RAM and 5TB of solid state storage for under $35,000.”

IBM’s BLU acceleration technology speeds DB2 queries against large data sets.

Also, using DB2 could cut the labor costs of running a separate data warehouse, given that the pool of available database administrators is generally larger than that of data warehouse experts. In some cases, it could even serve as an easier-to-maintain alternative to the Hadoop data processing platform, Vincent said.

Among the new technologies is a compression algorithm that stores data in such a way that, in some cases, the data does not need to be decompressed before being read. Vincent explained that the data is compressed in the order in which it is stored, which means predicate operations, such as adding a WHERE clause to a query, can be executed without decompressing the dataset.

Another time-saving trick: the software keeps a metadata table that lists the high and low key values for each data page, or column of data. So when a query is executed, the database can check to see if any of the sought values are on the data page.

“If the page is not in memory, we don’t have to read it into memory. If it is in memory, we don’t have to bring it through the bus to the CPU and burn CPU cycles analyzing all the values on the page,” Vincent said. “That allows us to be much more efficient on our CPU utilization and bandwidth.”

With columnar processing, a query can pull in just the selected columns of a database table, rather than all the rows, which would consume more memory. “We’ve come up with an algorithm that is very efficient in determining which columns and which ranges of columns you’d want to cache in memory,” Vincent said.
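The page-skipping trick Vincent describes is often called a zone map. The general idea can be sketched in a few lines of Scala (a toy illustration with made-up names, not IBM's implementation): each page carries min/max metadata, and a range predicate consults the metadata to skip pages that cannot possibly match.

```scala
// Toy model of metadata-based data skipping: each page records the
// min and max of its values, so a "WHERE value BETWEEN lo AND hi"
// predicate can skip pages whose range cannot contain a match.
case class Page(min: Int, max: Int, values: Vector[Int])

def query(pages: Seq[Page], lo: Int, hi: Int): (Vector[Int], Int) = {
  var pagesRead = 0
  val hits = pages.flatMap { p =>
    if (p.max < lo || p.min > hi) Vector.empty[Int] // skipped via metadata alone
    else { pagesRead += 1; p.values.filter(v => v >= lo && v <= hi) }
  }.toVector
  (hits, pagesRead)
}

val pages = Seq(
  Page(1, 10, Vector(1, 5, 10)),
  Page(11, 20, Vector(11, 15, 20)),
  Page(21, 30, Vector(21, 25, 30))
)
val (hits, pagesRead) = query(pages, 12, 18)
// Only the middle page is actually read; the other two are skipped.
```

The fewer pages whose values are actually touched, the fewer reads and CPU cycles the query costs, which is the efficiency Vincent points to.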

On the hardware side, the software comes with parallel vector processing capabilities, a way of issuing a single instruction to multiple processors using the SIMD (Single Instruction Multiple Data) instruction set available on Intel and PowerPC chips. The software can then run a single query against as many columns as the system can place on a register. “The register is the most efficient memory utilization aspect of the system,” Vincent said.

Competitors rally

IBM is not alone in investigating new ways of cramming large databases into the server memory. Last week, Microsoft announced that its SQL Server 2014 would also come with a number of techniques, collectively called Hekaton, to maximize the use of working memory, as well as a columnar processing technique borrowed from Excel’s PowerPivot technology.

Database analyst Curt Monash, of Monash Research, has noted that with IBM’s DB2 10.5 release, Oracle is “now the only major relational DBMS vendor left without a true columnar story.”

IBM itself is using the BLU components of DB2 10.5 as a cornerstone for its DB2 SmartCloud infrastructure as a service (IaaS), to add computational heft for data reporting and analysis jobs. It may also insert the BLU technologies into other IBM data store and analysis products, such as Informix.

Mastering 3D Lighting In Blender

Making images in Blender 3D has a lot in common with photography. In fact, if you have any photographic skills, these will transfer nicely into 3D software like Blender.

In previous articles we have discussed how to light a scene on a basic level. But how can you use all the different kinds of lamps for something approaching real cinematography?

Types of Lamp

Cinematography is all about choosing the right lights. In the virtual world of a 3D program the lights are all computed rather than real, but they perform the same function as real world lights. To get good lighting in 3D graphics images, you need to have a grasp of lighting in the real world, so a good tip is to learn how to light photographs from photography tutorials out there on the Internet.

The basic types of lamps in Blender are as follows: Point, Sun, Spot, Hemi and Area.


Point

Point lights are tiny balls of light which are omnidirectional – that is to say, scattering light in all directions like a lightbulb. Shadows fan out from the source centre in radiating lines.


Sun

Sun lights emulate the light you get from the sun; the light comes down from the source in parallel lines. Shadows cast straight down from the source and are soft.


Spot

Spotlights have a point source, but they fan out at a particular angle set in the properties, and they have a soft transition from the middle to the outer radius, the same as a real spotlight. Shadows are hard-edged and follow the angle of the beam.


Hemi

Hemi lights are like spotlights, but the difference is that the source is a half sphere and the light focuses in straight lines, like a lighting brolly. Shadows are hard-edged.


Area

Area lights are flat planes which cast light like a softbox or light reflected from a large reflective surface. Shadows are sharp when the objects are close to a surface but softer when they are distant.

Emission Surfaces

Another kind of light you can have in Blender is an object turned into a light by giving its surface an Emission material. The surface emits light, meaning you can make a ball, cube, or plane a light emitter. The light is soft and the shadows smooth.

You can turn objects into lights, the benefit being that you can see the lights. The standard lights in Blender are invisible to the camera, but lights which are objects can be seen. The only light sources in this scene are the objects themselves.

Basic Setup

The basic lighting setup taught by all photography courses is to have a key, fill and rim or edge light.

The key light is either a strong, sun-like light or spotlight shining on the front of the object being lit. This casts light on the front and top of the object and shadows on the surface over any undercuts. In this example we used a sun light above and to the right of the camera. Strength is set to 700.

The fill light is positioned opposite to the key light to fill in any shadows. In this example, an Area light is positioned below the camera and to the left pointing up at the object. Strength is set to 75.

The rim or edge light is positioned behind the object pointing towards the object and the camera to highlight the edge of the object to separate it from its background. In this example, a Hemi light is positioned above, to the left and behind the skull pointing forwards towards the camera. The Strength is set to 2.

And that is how you light something perfectly.

Lighting Tips

The main tip for setting up lights and even textures in Blender is to use a rendered viewport. This makes a draft-quality rendering of the light that you can see updated in real time to allow you to position lights and shadows perfectly while seeing the effects of your light positions live on the screen.


Learn as much as you can about real world lighting for photography and transfer that knowledge to the 3D virtual world of Blender for fantastic lighting.

Image Credit: Cole Harris

Phil South

Phil South has been writing about tech subjects for over 30 years. Starting out with Your Sinclair magazine in the 80s, and then MacUser and Computer Shopper. He’s designed user interfaces for groundbreaking music software, been the technical editor on film making and visual effects books for Elsevier, and helped create the MTE YouTube Channel. He lives and works in South Wales, UK.


Introduction To Aggregation Functions In Apache Spark

This article was published as a part of the Data Science Blogathon.


Aggregation is the process of combining rows of data into summary values, and it is an important concept in big data analytics. In an aggregation, you define a key or grouping along with an aggregation function that specifies how the transformation should be performed on the columns. Given multiple input values, the aggregation function generates one result for each group. Spark’s aggregation capabilities are sophisticated and mature, with a variety of different use cases and possibilities. Aggregations are generally used to get a summary of the data: you can count, add, and find the product of values, and Spark can also aggregate any kind of value into a set, list, etc. We will see this in “Aggregating to Complex Types”.

We have several categories of aggregations.

Simple Aggregations

The simplest grouping is to get a summary of a given data frame by using an aggregation function in a select statement.

Grouping Aggregations

A “group by” allows you to specify one or more keys and one or more aggregation functions to transform the columns.

Window functions

A “window” provides the functionality to specify one or more keys as well as one or more aggregation functions to transform the value columns. However, the input rows to the aggregation function are rows somehow related to the current row.
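To make "rows related to the current row" concrete, here is a plain-Scala sketch of a sliding-window average (an illustration of the concept only; Spark's actual window functions use Window specifications on a data frame). Each output value aggregates a frame made of the current row and the two rows before it:

```scala
// Plain-Scala illustration of a window aggregation: each row's result
// is computed from a frame of related rows (here, the current row and
// up to two preceding rows), not from the whole data set at once.
def slidingAvg(values: Seq[Double], frame: Int): Seq[Double] =
  values.indices.map { i =>
    val window = values.slice(math.max(0, i - frame + 1), i + 1)
    window.sum / window.size
  }

val result = slidingAvg(Seq(10.0, 20.0, 30.0, 40.0), frame = 3)
// result: Seq(10.0, 15.0, 20.0, 30.0) — each entry averages its frame
```

Unlike a simple aggregation, which collapses everything to one row, a window keeps one output row per input row.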

All these aggregations in Spark are implemented via built-in functions.

In this article, I am going to discuss simple aggregations.


Here, I am using Apache Spark 3.0.3 version and Hadoop 2.7 version. It can be downloaded here.

I am also using Eclipse Scala IDE. You can download it here.

I am using a CSV data file. You can find it on the github page.

The data set contains the following columns.

station_id, name, lat, long, dockcount, landmark, and installation.

This is bike station data.

Importing Functions

I am importing all functions here because aggregation is all about using aggregate functions and window functions.

This can be done by using

import org.apache.spark.sql.functions._

Now I am reading the data file into a data frame. Assuming the CSV file has a header row, it can be loaded like this (the path stands in for wherever you saved the data set):

val df = spark.read.option("header", "true").option("inferSchema", "true").csv("stations.csv")

Simple Aggregations

Now, we are ready to do some aggregations. Let’s start with the simplest one.

The simplest form of aggregation is to summarize the complete data frame and it is going to give you a single row in the result. For example, you can count the number of records in this data frame and it will return you a single row with the count of records.

Now, we start with the data frame, use the select() method, and apply the count() function. You can also give an alias to the summary column, add another summary column for the sum of the dockcount column, and compute the average.

We also have the countDistinct() function. Here, I am counting the unique values of the landmark column: countDistinct() gives the number of unique landmarks in this data frame. There is also approx_count_distinct(). countDistinct() groups the distinct values and counts them, which can take time on a huge dataset with millions of rows. In that case, we can use approx_count_distinct(), which returns an approximate count. It is not 100% accurate, but it is useful when speed is more important than accuracy. When you want the sum of a distinct set of values, you can use the sumDistinct() function.

It can be implemented like this.

df.select(
count("*").as("Count *"),
sum("dockcount").alias("Total Dock"),
avg("dockcount").alias("avg dock"),
countDistinct("landmark").alias("landmark count"),
approx_count_distinct("station_id").alias("app station"),
sumDistinct("station_id").alias("station_id")
).show()

The select method will return a new data frame and you can show it.

Let me run this.

Running this produces a single row with all six summary columns.

So, as expected, we summarized the whole data frame and got one single row in the result.


We have many other aggregation functions, like first() and last(), which return the first and last values in a data frame. We can get the minimum and maximum values using the min() and max() functions, respectively.

This can be done in Scala like this.

df.select(
first("station_id").alias("first"),
last("station_id").alias("last"),
min("dockcount").alias("min"),
max("dockcount").alias("max")
).show()

When we execute this, we get one row with the first, last, minimum, and maximum values.

Now, I am going to use selectExpr(), where we can pass SQL-like expressions.

df.selectExpr( "mean(dockcount) as mean_count" ).show()

Here, I am calculating the mean of the dockcount column.

The mean value is displayed.

Variance and Standard Deviation

Let’s look into other aggregate functions like variance and standard deviation. As we all know, variance is the average of squared differences from the mean, and standard deviation is the square root of variance.
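The population/sample distinction is exactly what separates var_pop from var_samp: the population version divides by n, the sample version by n - 1. To make the formulas concrete, here is a plain-Scala sketch of what these functions compute (an illustration only, not Spark code):

```scala
// Population variance divides by n; sample variance divides by n - 1.
// Standard deviation is the square root of the corresponding variance.
def varPop(xs: Seq[Double]): Double = {
  val m = xs.sum / xs.size
  xs.map(x => (x - m) * (x - m)).sum / xs.size
}

def varSamp(xs: Seq[Double]): Double = {
  val m = xs.sum / xs.size
  xs.map(x => (x - m) * (x - m)).sum / (xs.size - 1)
}

def stddevPop(xs: Seq[Double]): Double = math.sqrt(varPop(xs))

val xs = Seq(1.0, 2.0, 3.0, 4.0)
// varPop(xs) == 1.25, while varSamp(xs) is 5/3 ≈ 1.667
```

On a data frame, var_pop/var_samp and stddev_pop/stddev_samp apply these same two conventions to a column.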

They can be calculated by

df.select(
var_pop("dockcount"),
var_samp("dockcount"),
stddev_pop("dockcount"),
stddev_samp("dockcount")
).show()

Running this displays the population and sample variance and standard deviation of the dockcount column.

Skewness and Kurtosis

Skewness measures the degree of distortion from the normal distribution, i.e., how asymmetric the data is; it may be positive or negative. Kurtosis is all about the tails of the distribution and is often used to find outliers in the data.
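Both statistics are built from standardized moments of the data. A plain-Scala sketch of the usual population formulas (an illustration of the math, not Spark code; check Spark's documentation for the exact normalization its built-ins use):

```scala
// Skewness: third standardized moment (0 for symmetric data).
// Excess kurtosis: fourth standardized moment minus 3, so a normal
// distribution scores 0 and lighter tails score below 0.
def skewAndKurtosis(xs: Seq[Double]): (Double, Double) = {
  val n = xs.size
  val m = xs.sum / n
  def moment(k: Int): Double = xs.map(x => math.pow(x - m, k)).sum / n
  val (m2, m3, m4) = (moment(2), moment(3), moment(4))
  val skew = m3 / math.pow(m2, 1.5)
  val excessKurtosis = m4 / (m2 * m2) - 3.0
  (skew, excessKurtosis)
}

val (skew, kurt) = skewAndKurtosis(Seq(1.0, 2.0, 3.0, 4.0, 5.0))
// Symmetric data: skew == 0.0; this flat sample has negative excess kurtosis
```

Values far from 0 in either statistic are a hint that the column's distribution is lopsided or heavy-tailed, which is why kurtosis is handy for outlier hunting.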

They can be calculated by

df.select(
skewness("dockcount"),
kurtosis("dockcount")
).show()

Running this displays the skewness and kurtosis of the dockcount column.

Covariance and Correlation

Next, we will see covariance and correlation. Covariance measures how much two columns (or features, or variables) vary together. Correlation measures how strongly they are related to each other, on a scale from -1 to 1.
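Correlation is just covariance normalized by the two standard deviations, which is easy to see in a plain-Scala sketch of the population versions (an illustration, not Spark code):

```scala
// Population covariance, and Pearson correlation derived from it.
def covarPop(xs: Seq[Double], ys: Seq[Double]): Double = {
  val mx = xs.sum / xs.size
  val my = ys.sum / ys.size
  xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum / xs.size
}

def corr(xs: Seq[Double], ys: Seq[Double]): Double =
  covarPop(xs, ys) / math.sqrt(covarPop(xs, xs) * covarPop(ys, ys))

val x = Seq(1.0, 2.0, 3.0)
val y = Seq(2.0, 4.0, 6.0) // perfectly linear in x
// corr(x, y) ≈ 1.0 — maximal positive correlation
```

Because of the normalization, correlation is unit-free, whereas covariance scales with the units of the two columns.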

They can be calculated like this:

df.select(
  corr("station_id", "dockcount"),
  covar_samp("station_id", "dockcount"),
  covar_pop("station_id", "dockcount")
).show()

The output is

Aggregating to complex types

Next, we will see how to aggregate to complex types. If you want to collect the values of a column into a list, or only its unique values, you can use collect_list() or collect_set(). collect_set() stores only the unique values, while collect_list() contains all the elements.
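The distinction mirrors plain Scala collections, which may make it easier to remember. The landmark values below are made up for illustration:

```scala
// collect_list vs. collect_set, sketched with plain Scala collections.
object CollectSketch {
  // like collect_list("landmark"): keeps duplicates, preserves order
  def collectList(xs: Seq[String]): List[String] = xs.toList

  // like collect_set("landmark"): unique values only
  def collectSet(xs: Seq[String]): Set[String] = xs.toSet

  def main(args: Array[String]): Unit = {
    val landmarks = Seq("San Jose", "Palo Alto", "San Jose", "San Francisco")
    println(collectList(landmarks)) // List(San Jose, Palo Alto, San Jose, San Francisco)
    println(collectSet(landmarks))  // 3 unique values
  }
}
```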

Here is the implementation.

df.agg(collect_set("landmark"), collect_list("landmark")).show(false)

The output is

Complete Code

Here is the entire implementation.

import org.apache.spark.sql.functions._
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

object demo extends App {

  val conf = new SparkConf().setAppName("Demo").setMaster("local[1]")
  val sc = new SparkContext(conf)
  val spark = org.apache.spark.sql.SparkSession.builder.master("local[1]").appName("Demo").getOrCreate()

  // df is the station data frame loaded earlier in the article;
  // its creation is not shown in this excerpt.

  df.select(
    count("*").as("Count *"),
    sum("dockcount").alias("Total Dock"),
    avg("dockcount").alias("avg dock"),
    countDistinct("landmark").alias("landmark count"),
    approx_count_distinct("station_id").alias("app station"),
    sumDistinct("station_id").alias("station_id")
  ).show()

  df.select(
    first("station_id").alias("first"),
    last("station_id").alias("last"),
    min("dockcount").alias("min"),
    max("dockcount").alias("max")
  ).show()

  df.selectExpr(
    "mean(dockcount) as mean_count"
  ).show()

  df.select(
    var_pop("dockcount"),
    var_samp("dockcount"),
    stddev_pop("dockcount"),
    stddev_samp("dockcount")
  ).show()

  df.select(
    skewness("dockcount"),
    kurtosis("dockcount")
  ).show()

  df.select(
    corr("station_id", "dockcount"),
    covar_samp("station_id", "dockcount"),
    covar_pop("station_id", "dockcount")
  ).show()

  df.agg(collect_set("landmark"), collect_list("landmark")).show(false)
}

End notes

So, these are all simple aggregations. A simple aggregation always gives you a one-line summary. Sometimes you may want a more detailed summary, for example combining two or more columns and applying aggregations per group. This can be done simply with Spark SQL, but you can also do the same with data frame expressions, using the concept of grouping aggregations. I will discuss grouping aggregations in another article. You can find it here.
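As a preview of that idea, a grouping aggregation can be sketched in plain Scala: group the rows by a key column, then aggregate within each group. In Spark this would be something like df.groupBy("landmark").agg(sum("dockcount")); the Station case class and the rows below are made up for illustration:

```scala
// Group-then-aggregate, sketched with plain Scala collections.
object GroupingSketch {
  case class Station(landmark: String, dockcount: Int)

  // One result row per landmark, instead of one row for the whole data set
  def totalDocksPerLandmark(rows: Seq[Station]): Map[String, Int] =
    rows.groupBy(_.landmark).map { case (lm, rs) => lm -> rs.map(_.dockcount).sum }

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Station("San Jose", 15),
      Station("San Jose", 11),
      Station("Palo Alto", 23)
    )
    println(totalDocksPerLandmark(rows)) // Map(San Jose -> 26, Palo Alto -> 23)
  }
}
```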

The media shown in this article are not owned by Analytics Vidhya and are used at the Author's discretion.


How To Build A 4K Gaming PC For Under $1500

Most people are upgrading to 4K devices these days, especially in gaming, where the graphics cards available today are capable of handling games at 4K resolution. We have tons of 4K monitors and TVs to experience this level of detail in all its glory. Go back a couple of years and you had to buy a computer worth thousands of dollars to play a game in 4K. But now, prices have come down and 4K gaming systems are finally affordable for most of us. So, if you're in the market looking to upgrade to a high-end gaming PC that can play games at 4K resolution on a budget, we've got you covered. It's not that expensive either. Today we're going to tell you which components you should go for and where to get them, so that you'll know how to build a 4K gaming PC for under $1500:

The Components for a 4K Gaming PC

We're going to discuss each of the components you'll need to build the PC separately, in order to avoid any confusion. You can get all of these components on Amazon, and the links to purchase them are provided under each component. So, let's take a look at all the components required.

Note: This $1500 build includes all the components you'll need to power 4K displays. The cost of the 4K monitor is not included in the price of the build and you'll have to buy one based on your personal needs, because even the least expensive ones cost above $300. Come on! You cannot build a similarly spec'd rig with a monitor for under $1500 at this point in time. If you are looking for a 4K monitor, you can check out our list of the best gaming monitors you can buy.

1. Processor

As this component is basically the most important part of your computer and decides the overall performance, we will not be sacrificing performance just to bring down the price. For this build, we will be going with Intel's latest Kaby Lake i7-7700K desktop processor, which has a base clock of 4.2 GHz and a boost clock of up to 4.5 GHz. It's a processor widely preferred for gaming. The i7-7700K is an unlocked processor, which means you will be able to easily overclock it if you need more performance. It currently costs $329 on Amazon, though prices may vary slightly at any time.

Buy from Amazon: ($329.99)

2. Graphics Card

The GPU your PC runs on completely determines the gaming performance you're going to get. We aren't gonna make any compromises here, so we will be going for NVIDIA's top-tier Pascal card, the GTX 1080, which was released back in May 2016. This graphics card can handle 4K and VR on almost all games with absolute ease. There are several variants of the GTX 1080 made by different manufacturers like Asus, Zotac, MSI, Gigabyte and EVGA, but we will be going with the Gigabyte GTX 1080 G1 Gaming graphics card, which has a triple-fan setup that is essential to keep temperatures down, especially if you're living in a hot country like India. If you want a slight boost in performance, you can easily overclock it with the MSI Afterburner software as well.

Buy from Amazon: ($502)

3. RAM

Buy from Amazon: ($119.99)

4. Case

5. Motherboard

This will be the first component you install inside your Corsair SPEC-ALPHA case, and it will house and connect all of the components of your computer. Considering you're using an unlocked i7-7700K processor that is capable of overclocking, we decided to go for a motherboard that easily allows overclocking as well. We'll be using the MSI Z170A Gaming M5 motherboard for this build, as it's not expensive and provides almost all the features you'll need. It's priced at just under $130 on Amazon.

Buy from Amazon: ($129.99)

6. Power Supply Unit

Do not underestimate the importance of the Power Supply Unit (PSU). It's just as important as all the other components mentioned above. The PSU powers your entire system, and without it, your PC is incomplete. We chose the PSU for this build based on three things: power output, efficiency and manufacturer. EVGA is a well-known manufacturer and their customer support is impressive. That's not the only reason we chose the EVGA 600 B1 PSU for this build: it's 80 PLUS Bronze certified and has 85% efficiency, which is quite important in a gaming rig. We chose the 600-watt PSU to give you some overclocking headroom, just in case you decide to overclock your CPU and GPU for a performance boost. We could've gone for a fully modular PSU for better cable management at a higher price, but like I said, we're on a budget and we're going for performance, not looks.

Buy from Amazon: ($49.99)

7. Storage

Don't imagine building a PC without an SSD these days. We'll be using two storage drives in our build. One will be an SSD, our primary boot drive, to speed up Windows. The other will be a traditional HDD to satisfy all your space requirements. You can also install important applications that you frequently use on your SSD, in order to speed up their load times. As far as the SSD is concerned, we'll be going for a SanDisk SSD PLUS 120GB SATA solid state drive. Sorry if you were thinking we were gonna add an NVMe or M.2 SSD, as that won't fit our budget. For the HDD, we chose Western Digital's 1 TB Caviar Blue SATA 6 Gb/s 7200 RPM drive, which is pretty much loved by almost everyone on Amazon.

Buy from Amazon: SSD ($49.99) and HDD ($49.99)

8. CPU Cooler

Buy from Amazon: ($29.99)

9. Keyboard & Mouse

This is the year of RGB components, as the craze for RGB is at its peak right now. If you haven't read the news, there's even an RGB chair for gamers. We didn't want to disappoint you in this regard. We wanted to choose an RGB keyboard and mouse combo for under $100, and guess what? We just did it. We went for Logitech's G213 Prodigy Gaming Keyboard with RGB lighting, but make no mistake, this is not a mechanical keyboard; it has rubber-dome switches. The keys have a 4mm travel distance and the keyboard also has an armrest, so you won't be disappointed on that front.

Buy from Amazon: ($34.99)

Installation of Components

Note: If you don’t know what you’re doing, we highly recommend you to get the assistance of an expert technician to build your PC. We will not be responsible for any damage that you may cause to your system during this process.

Setting Up Motherboard

We don’t recommend mounting the motherboard inside the case before installing the processor, CPU cooler and RAM sticks, as doing it separately on the outside gives you more room to work with.

Firstly, you need to mount the CPU in the motherboard. You can do this by pulling the lever on the CPU socket backwards to lift it up. Now place the CPU in the socket and close the lever to secure the processor in place. Make sure your CPU is in the right orientation by aligning the tiny arrow at the bottom-left corner of your processor with the one on the motherboard.

Secondly, you need to insert the RAM sticks in two of the four slots right next to your CPU socket. For a dual-channel setup, insert both of your RAM sticks in either the first and third slots or the second and fourth slots.

Finally, mount the CPU cooler on to the top of your CPU socket using the mounting bracket and screws provided along with the cooler. Attached to the Hyper 212 Evo is a power cable which you can use to connect to the CPU fan header which is located on the motherboard, right above the cooler. Make sure you read the instructions booklet inside the box to avoid making any mistakes.

Before you proceed to mount the motherboard inside your case, make sure you install the I/O shield that was provided with your motherboard to the rear of the PC’s case. It will easily snap right into place, if you’re doing it correctly.

Mounting The Rest Of The Components

Once you've mounted and lined up your motherboard with the I/O shield on your case, it's time to get the rest of the components inside as well. But before that, note that your Corsair Carbide SPEC-ALPHA case has pre-installed fans with cables hanging inside. Connect them to the fan headers on your motherboard for power. There are also cables inside your case that connect to the front I/O ports. You will have to connect these cables to the connectors located at the bottom of your motherboard in order to get those ports to work. Now, let's proceed to mount the other components.

Firstly, let’s mount your SSD and HDD. The Western Digital HDD can be mounted inside the hard drive cage and the Sandisk SSD can be mounted on one of the 2.5 inch slots located above the hard drive cage. Connect the SSD and HDD to your motherboard with the SATA cables that came with your motherboard.

Secondly, mount your GTX 1080 graphics card on the motherboard. To do this, you need to remove the top two PCIe slot covers on the back of your case using a screwdriver. Now, you can easily install your graphics card in the top PCIe slot, located right below your CPU cooler. The process is quite similar to how you installed the RAM; the graphics card should snap right into place if you're doing it correctly. Now, just screw it in to secure the card in place.

Finally, we've reached the last step of the building process. Mount the Power Supply Unit (PSU) at the bottom of the case. Your EVGA 600 B1 PSU comes with a lot of power connectors. Connect the 20+4 pin connector to the header located on the right side of your motherboard; this is what the motherboard mainly uses to draw power. To supply power to your graphics card, connect the 8-pin PCIe power cable from your PSU. Next, supply power to your CPU by connecting an 8-pin cable to the header located at the top right of the motherboard. At last, use the SATA power cables on your PSU to connect the SSD and HDD so that they can draw power.

Great, we're almost done here. Connect your keyboard and mouse to the USB ports on the rear of your case. Also, connect the monitor to the graphics card using a DVI, HDMI or DisplayPort cable. Now, use the power cord that came with your Power Supply Unit to plug it into the wall. Turn it on and see if you boot right into the motherboard BIOS, so that you can install Windows. If you managed to reach this far, then you did a good job. If not, you probably messed something up and you'll need assistance from an expert immediately.


Ready to Build Your Very Own 4K Gaming PC?
