from pyspark.ml.regression import RandomForestRegressor

# Every record contains a label and feature vector
df = spark.createDataFrame(data, ["label", "features"])
# Split the data into train/test datasets
train_df, test_df = df.randomSplit([0.80, 0.20], seed=42)
# Set hyperparameters for the algorithm
rf = RandomForestRegressor(numTrees=100)
# Fit the model to the training data
model = rf.fit(train_df)
# Generate predictions on the test dataset
model.transform(test_df).show()
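The snippet above assumes an active SparkSession bound to spark and a data collection of (label, feature-vector) pairs, neither of which is shown. A minimal sketch of that setup, with hypothetical toy rows:

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors

# Hypothetical setup: a local session and a few toy (label, features) rows
spark = SparkSession.builder.appName("rf-example").getOrCreate()
data = [
    (1.0, Vectors.dense([0.0, 1.1, 0.1])),
    (0.0, Vectors.dense([2.0, 1.0, -1.0])),
    (1.0, Vectors.dense([2.0, 1.3, 1.0])),
    (0.0, Vectors.dense([0.0, 1.2, -0.5])),
]

To put a number on the fit, the held-out predictions can be scored with RegressionEvaluator from pyspark.ml.evaluation; a brief sketch using the model trained above:

from pyspark.ml.evaluation import RegressionEvaluator

# Score test-set predictions with root-mean-square error
predictions = model.transform(test_df)
evaluator = RegressionEvaluator(metricName="rmse")
print("Test RMSE:", evaluator.evaluate(predictions))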
df=spark.read.csv("accounts.csv",header=True)# Select subset of features and filter for balance > 0
filtered_df=df.select("AccountBalance","CountOfDependents").filter("AccountBalance > 0")# Generate summary statistics
filtered_df.summary().show()
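With header=True alone, spark.read.csv reads every column as a string; the standard inferSchema option asks Spark to scan the file and assign proper numeric types instead. A variant of the read under that option:

# Infer column types so AccountBalance and CountOfDependents are numeric
df = spark.read.csv("accounts.csv", header=True, inferSchema=True)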
Run now: the official spark Docker image bundles the Spark SQL command-line shell, and the command below starts it in an interactive, throwaway container (--rm removes the container on exit).
$ docker run -it --rm spark /opt/spark/bin/spark-sql
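The same image ships the full Spark distribution under /opt/spark, so the PySpark shell can be launched the same way to try the DataFrame examples above interactively; a hypothetical invocation with the same flags:

$ docker run -it --rm spark /opt/spark/bin/pyspark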